1
votes

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.

I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.

How can I go about doing this? I tried a simple python script using python-docx:

import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
    currentDoc=Document(os.path.join(folderPath,xmlFileName))
    docxFileName=xmlFileName.replace('.xml','.docx')
    currentDoc.save(os.path.join(folderPath,docxFileName))
    currentDoc.close()

This gives:

docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml

I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.

Am I simply using python-docx incorrectly? If not, is there are more suitable package or technique I should be using?

1
When the file is not a valid or an empty docx file, that kind of exception is thrownLudisposed

1 Answers

1
votes

It's not clear just what sort of files those .xml files are, but I suspect they are a transitional format used I think in Word 2003, which was XML-based, but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.

python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.

If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-pptx. I would start by seeing if Word could load the .xml file and go from there.

You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.

If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.