
I'm trying to read text from a Word document (.docx) and find all the text that has been highlighted in yellow, but I get an error message.

import docx
document = docx.Document(r'C:/Users/devff/Documents/Prac2.docx')
rs = document._element.xpath("//w:r")
WPML_URI = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'
for word in rs:
    for rPr in word.findall(tag_rPr):
        high = rPr.findall(tag_highlight)
        for hi in high:
            if hi.attribute[tag_val] == 'yellow':  ##here is the problem
                print(word.find(tag_t).text.encode('utf-8').lower())

Ideally it should print the text that's been highlighted in yellow, but instead it just gives me:

AttributeError: 'CT_Highlight' object has no attribute 'attribute'

1 Answer


I think you're looking for .attrib, not .attribute.

Fixing that will get you to the next step, but indexing .attrib directly is fragile: it raises a KeyError if no val attribute is present. I recommend _Element.get() (https://lxml.de/api/lxml.etree._Element-class.html), which simply returns None if no attribute with the requested name is present:

if hi.get(tag_val) == 'yellow':
    ...
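
For completeness, here is a rough sketch of how the whole loop might look with .get(). To keep it runnable without your Prac2.docx, it parses a small hand-written XML fragment with the standard library's xml.etree.ElementTree instead of python-docx; the function name highlighted_text and the SAMPLE fragment are made up for illustration. lxml elements (which is what python-docx gives you) expose the same iter(), find(), findall() and get() calls, so I'd expect the function to work unchanged on document._element, but I haven't run it against your file.

```python
import xml.etree.ElementTree as ET

# Same namespace prefix as WPML_URI in the question (Clark notation).
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def highlighted_text(root, color='yellow'):
    """Return the text of every run (w:r) highlighted in `color`.

    Hypothetical helper, not part of python-docx or lxml.
    """
    found = []
    for run in root.iter(W + 'r'):
        rPr = run.find(W + 'rPr')
        if rPr is None:
            continue  # run has no formatting properties at all
        hi = rPr.find(W + 'highlight')
        # .get() returns None when w:val is missing, so nothing is raised here.
        if hi is not None and hi.get(W + 'val') == color:
            # A run can hold zero or several w:t elements; join whatever is there.
            found.append(''.join(t.text or '' for t in run.findall(W + 't')))
    return found


# Made-up fragment standing in for a real document body (assumption).
SAMPLE = '''<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>marked</w:t></w:r>
    <w:r><w:rPr><w:highlight w:val="green"/></w:rPr><w:t>other</w:t></w:r>
    <w:r><w:rPr><w:highlight/></w:rPr><w:t>no val</w:t></w:r>
    <w:r><w:t>plain</w:t></w:r>
  </w:p>
</w:body>'''

print(highlighted_text(ET.fromstring(SAMPLE)))  # ['marked']
```

With your document, the call would presumably be highlighted_text(document._element). Note that the third run in the sample has a highlight element with no val attribute: with .attrib[tag_val] that run would raise a KeyError, whereas .get() just skips it.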