How to extract text from .docx with tables in Python

Question

My .docx file contains tables, headers, and other elements, and I would like to extract the text from it. The only example code I could find reads paragraphs, and it doesn't work with my file.

Here is the code:

    doc = docx.Document(self.filename)
    fullText = []
    for para in doc.paragraphs:
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    self.text = '\n'.join(fullText)

When I run this code, I get this error:

      File "annotatorConnections.py", line 75, in openFile
        self.text = '\n'.join(fullText)
    TypeError: sequence item 0: expected str instance, bytes found

Abhishek Kulkarni · Accepted Answer · 2020-04-12T05:03:18

Because para.text.encode('ascii', 'ignore') returns bytes rather than str, fullText holds bytes objects, and the str separator '\n' cannot join them. Joining with a bytes separator instead makes the code run:

    doc = docx.Document(self.filename)
    fullText = []
    for para in doc.paragraphs:
        txt = para.text.encode('ascii', 'ignore')  # bytes, not str
        fullText.append(txt)
    # The b prefix makes the separator a bytes object, matching the items.
    self.text = b'\n'.join(fullText)
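
Note that joining with a bytes separator leaves self.text as bytes, and iterating over doc.paragraphs still skips table content, because that property only covers body paragraphs. Below is a minimal sketch of one way to keep everything as str and also pick up table and header text. It assumes python-docx is installed and recent enough to expose section.header. The names collect_text and read_docx_text are hypothetical helpers, not part of the library.

```python
def collect_text(doc):
    """Return body paragraphs, table cells, and section headers as one str.

    `doc` is assumed to be a python-docx Document (or any object exposing
    the same .paragraphs, .tables, and .sections attributes).
    """
    lines = []

    # Body paragraphs; text inside tables is not included here.
    for para in doc.paragraphs:
        lines.append(para.text)

    # Table text, one line per row, with cells separated by tabs.
    for table in doc.tables:
        for row in table.rows:
            lines.append('\t'.join(cell.text for cell in row.cells))

    # Header paragraphs of each section.
    for section in doc.sections:
        for para in section.header.paragraphs:
            lines.append(para.text)

    # Every item is a str, so a plain str separator works.
    return '\n'.join(lines)


def read_docx_text(path):
    """Open a .docx file with python-docx and return its text as a str."""
    import docx  # python-docx; imported here so collect_text has no hard dependency
    return collect_text(docx.Document(path))
```

In the question's method this would be used as self.text = read_docx_text(self.filename). The sketch emits all body paragraphs first, then tables, then headers, so it does not preserve the order in which they appear in the document, and merged table cells may show up more than once.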
