I am working with a number of Word documents in which some of the text (words) is highlighted in different colors (e.g. yellow, blue, gray). I want to extract the highlighted words associated with each color. I am programming in Python. Here is what I have done so far:
I opened the Word document with [python-docx][1] and then got to the <w:r> tags, which contain the tokens (words) in the document. I used the following code:
#!/usr/bin/env python2.6
# -*- coding: ascii -*-
from docx import opendocx

# Parse the .docx file into an XML tree
document = opendocx('test.docx')
# Select every run (<w:r>) element in the document
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    print word
Now I am stuck at the part where I check whether each word has a <w:highlight> tag, extract the color code from it, and, if it matches yellow, print the text inside the <w:t> tag. I would really appreciate it if someone could point me towards extracting the words from the parsed file.
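
To make the goal concrete, here is a minimal sketch of the check described above. It assumes the highlight color is stored in the w:val attribute of a <w:highlight> element inside the run's <w:rPr>, which is how WordprocessingML normally records it. The function name `highlighted_text_by_color` and the inline `SAMPLE` XML are hypothetical, and the sketch uses only the standard library on a hand-written fragment rather than a real .docx file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML namespace bound to the w: prefix in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def highlighted_text_by_color(root):
    """Group run text by the w:val of each run's <w:highlight> (hypothetical helper)."""
    found = defaultdict(list)
    for run in root.iter(W + "r"):
        # Assumption: the highlight sits in the run properties,
        # i.e. <w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr>...</w:r>
        highlight = run.find(W + "rPr/" + W + "highlight")
        if highlight is None:
            continue  # run is not highlighted
        color = highlight.get(W + "val")
        # A run may hold several <w:t> elements; join their text.
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if text:
            found[color].append(text)
    return dict(found)


# Hand-written sample standing in for the XML of a real document.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t>plain </w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="yellow"/></w:rPr><w:t>alpha</w:t></w:r>'
    '<w:r><w:rPr><w:highlight w:val="blue"/></w:rPr><w:t>beta</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)

result = highlighted_text_by_color(ET.fromstring(SAMPLE))
print(result.get("yellow", []))
```

Since lxml elements expose the same `iter`, `find`, and `get` methods, the tree returned by `opendocx` could probably be passed to this function directly, though I have not verified that. Note also that Word seems to record gray highlights as `lightGray` or `darkGray` rather than `gray`, so the keys may not match the color names I listed above.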