In this post, @mikemaccana describes how to use python-docx to extract raw text from an MS Word document from within Python. I'd like to go one step further. Instead of simply extracting the raw text, can I also use this module to harvest information about font face (e.g. bold versus italic) or font size (e.g. 12pt versus 18pt)? The closest I was able to come was this post asking about using this module to extract highlighted text.
It seemed a little abstract, and I'm not totally sure what's going on there. Is there a more straightforward way to extract formatting information from a Word doc in Python? By way of a quick document template:
Here the first line is a large header with one sentence.
The second line is slightly smaller. It also has two sentences.
Even smaller. But that's not all. This line has three sentences.
And finally here's a regular line of unbolded text.
If we call these four lines my Word document, I'd like to write a parsing function, call it doc_parser, that returns something like the following:
>>> doc_data = doc_parser(path_to_example_doc)
>>> print(doc_data)
[{'font': 18, 'face': 'bold', 'n_sentence': 1},
 {'font': 16, 'face': 'bold', 'n_sentence': 2},
 {'font': 14, 'face': 'bold', 'n_sentence': 3},
 {'font': 12, 'face': 'plain', 'n_sentence': 1}]
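
For concreteness, here is a rough sketch of the kind of thing I have in mind. It is not a tested solution. It assumes python-docx exposes formatting per run (`run.bold`, `run.italic`, `run.font.size`) and that a value of `None` means "inherited", in which case I fall back to the paragraph style's font. The helper names (`count_sentences`, `describe_paragraph`, `_resolve`) are my own inventions, and the sentence counter is deliberately naive.

```python
import re


def count_sentences(text):
    """Naive count: split on runs of '.', '!' or '?' (abbreviations will fool it)."""
    return len([part for part in re.split(r"[.!?]+", text) if part.strip()])


def _resolve(value, fallback):
    """python-docx reports None for inherited settings; use the fallback then."""
    return fallback if value is None else value


def describe_paragraph(paragraph):
    """Summarize one paragraph as {'font': size_in_pt, 'face': ..., 'n_sentence': ...}."""
    style_font = paragraph.style.font
    # Ignore runs that contain only whitespace.
    runs = [run for run in paragraph.runs if run.text.strip()]

    # Assumption: the first non-empty run is representative of the paragraph's size.
    size = _resolve(runs[0].font.size, style_font.size) if runs else style_font.size

    # Call the paragraph bold/italic only if every non-empty run is.
    bold = bool(runs) and all(_resolve(r.bold, style_font.bold) for r in runs)
    italic = bool(runs) and all(_resolve(r.italic, style_font.italic) for r in runs)
    face = " ".join(name for name, on in (("bold", bold), ("italic", italic)) if on)

    return {
        "font": size.pt if size is not None else None,  # Length -> points
        "face": face or "plain",
        "n_sentence": count_sentences(paragraph.text),
    }


def doc_parser(path):
    """Return one summary dict per non-empty paragraph of the .docx at `path`."""
    from docx import Document  # python-docx, imported here so the helpers work without it

    document = Document(path)
    return [describe_paragraph(p) for p in document.paragraphs if p.text.strip()]
```

One caveat I'm aware of: the size or boldness may be set further up the style hierarchy (a base style or the document defaults), in which case this sketch would still report `None` for the size or 'plain' for the face, so the fallback logic is probably incomplete.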