I am using python-docx to process Word files. With larger files (50+ pages), the paragraph.text property returns a string that is inconsistent with the contents of my file.

from docx import Document

# f is the path to (or a file-like object for) the .docx file
document = Document(f)
paratext = []
for paragraph in document.paragraphs:
    paratext.append(paragraph.text)
print(paratext[30])  # index 30, i.e. the 31st paragraph

Ideally this should print the paragraph at index 30. But the output is distorted: in some cases the first few characters are missing and the printed text starts somewhere in the middle of the actual paragraph. However, it works fine if I copy a few adjacent paragraphs into a fresh MS Word document (1 page only) and run the same code, changing only the index into paratext. For example, I copied 3 adjacent paragraphs into a new document and used print(paratext[2]), and the output was correct. How do I get rid of this inconsistency? I have to work with larger documents.

1 Answer

I expect this means that the missing text is in runs that are "enclosed" in some other XML element, like perhaps a field or a hyperlink.

The quickest way to discover specifically what's happening might be to modify your short script to temporarily capture the paragraph XML.

from docx import Document

document = Document(f)
# Serialized XML of each paragraph element, for inspection
p_xml = [paragraph._element.xml for paragraph in document.paragraphs]
print(p_xml[30])

Your choices at that point are likely to be editing the Word documents to remove the offending "enclosure" or to process XML for each paragraph yourself using lxml calls.

That might be easier than it sounds if you use the .xpath() method available on paragraph._element. In any case, that would be a separate question in which you show the XML you find with the method above.
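
To illustrate the kind of XML processing meant here, the following is a minimal sketch, not a definitive fix. It assumes the hypothesis above holds, i.e. that the missing text sits in runs wrapped in another element such as w:hyperlink. It works on the XML string you would get from paragraph._element.xml and uses only the standard library; the function names and the sample XML are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in .docx XML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def direct_run_text(p_xml):
    """Text from runs that are direct children of the paragraph only."""
    p = ET.fromstring(p_xml)
    return "".join(
        t.text or ""
        for r in p.findall(f"{{{W_NS}}}r")
        for t in r.findall(f"{{{W_NS}}}t")
    )


def all_run_text(p_xml):
    """Text from every w:t element, however deeply it is nested."""
    p = ET.fromstring(p_xml)
    return "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))


# Hypothetical paragraph XML: the first run is enclosed in a w:hyperlink.
SAMPLE_P_XML = (
    f'<w:p xmlns:w="{W_NS}">'
    "<w:hyperlink><w:r><w:t>Beginning text</w:t></w:r></w:hyperlink>"
    '<w:r><w:t xml:space="preserve"> and the rest.</w:t></w:r>'
    "</w:p>"
)

print(repr(direct_run_text(SAMPLE_P_XML)))  # ' and the rest.'
print(repr(all_run_text(SAMPLE_P_XML)))     # 'Beginning text and the rest.'
```

With a real document you would call all_run_text(paragraph._element.xml) for each paragraph. Staying inside python-docx, something like "".join(t.text or "" for t in paragraph._element.xpath(".//w:t")) should collect the same text. Note that this sketch ignores tabs and line breaks (w:tab, w:br), and it would also pick up text from anything nested inside the paragraph, so check the result against the XML you actually find.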