5
votes

Word lets you save a document in the Word Open XML format as a single XML file. I've seen posts about opening the docx file as a zip archive and extracting parts from it, but what I really want is a way to turn the docx into a single XML file, exactly like the "Save As" action in MS Office does. How can I do this?

And how can this be done for the .doc format?

Note: I would like to do this programmatically, preferably on Linux with PHP. If that's not possible, other languages will do. As a last resort, I can consider spinning up a Windows server for this.

3 Answers

9
votes

Sorry to resuscitate a dead thread, but I just found an answer for DOCX files. A DOCX file is just a ZIP archive of XML files. So to extract the contents of one of its files, e.g. word/document.xml, under a Linux environment, you can run unzip:

unzip -q -c myfile.docx word/document.xml

To capture the output of this command in the $xml variable of a PHP script, you can use:

$xml = shell_exec("unzip -q -c myfile.docx word/document.xml");
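The same extraction can also be done without calling an external unzip binary. Here is a minimal sketch in Python using the standard zipfile module; the function name read_docx_part is made up for illustration, and it assumes the part is UTF-8 encoded, which is the usual case for Word output:

```python
import zipfile


def read_docx_part(docx_path, part="word/document.xml"):
    """Return one part of a .docx package as text (hypothetical helper)."""
    # A .docx is a ZIP archive, so the stdlib zipfile module can open it.
    with zipfile.ZipFile(docx_path) as zf:
        # Assumption: the part is UTF-8 encoded XML.
        return zf.read(part).decode("utf-8")
```

Calling read_docx_part("myfile.docx") would give the same text as the unzip command above. Note that this is still only one part of the package, not the single-file XML that Word's "Save As" produces.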

I hope this helps for DOCX files. Better late than never.

For DOC files, this method does not work, because a .doc file is a binary compound file rather than a ZIP archive.
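If you need to tell the two kinds of file apart before trying to unzip, you can look at the first bytes. This is only a sketch in Python: the function name is made up, and the signatures identify the container (ZIP or OLE2 compound file), not Word specifically, so other Office formats will match as well:

```python
ZIP_MAGIC = b"PK\x03\x04"                          # start of a ZIP local file header
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # OLE2 compound file header


def sniff_container(path):
    """Guess the container type from the file signature (hypothetical helper)."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"     # e.g. .docx, but also .xlsx, .odt, plain .zip
    if head == OLE2_MAGIC:
        return "ole2"    # e.g. legacy .doc, but also .xls, .ppt
    return "unknown"
```

Only files reported as "zip" are candidates for the unzip approach.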

3
votes

Eric White explains how to do this for docx in C# at transforming-open-xml-documents-to-flat-opc-format

You can also do it using docx4j (which I work on), the 'j' being Java.
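To give an idea of what such a conversion involves, here is a rough sketch in Python using only the standard library. It is not Eric White's code and not the docx4j API; the function name docx_to_flat_opc is made up. It wraps every part of the ZIP package in a pkg:part element, looks up content types in [Content_Types].xml, embeds XML parts inline and base64-encodes everything else. It assumes the XML parts are UTF-8 encoded and leaves out optional attributes such as pkg:padding and pkg:compression, so treat it as a starting point, not something guaranteed to open in every version of Word:

```python
import base64
import re
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"


def docx_to_flat_opc(docx_path):
    """Return the package at docx_path as one Flat OPC XML string (sketch)."""
    with zipfile.ZipFile(docx_path) as zf:
        # [Content_Types].xml maps part names and extensions to content types.
        types = ET.fromstring(zf.read("[Content_Types].xml"))
        overrides = {o.get("PartName"): o.get("ContentType")
                     for o in types.iter(CT_NS + "Override")}
        defaults = {d.get("Extension").lower(): d.get("ContentType")
                    for d in types.iter(CT_NS + "Default")}

        out = ['<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
               '<?mso-application progid="Word.Document"?>',
               '<pkg:package xmlns:pkg="%s">' % PKG_NS]
        for name in zf.namelist():
            # The content-types file is folded into the pkg:contentType
            # attributes, and directory entries carry no data.
            if name == "[Content_Types].xml" or name.endswith("/"):
                continue
            part_name = "/" + name
            ext = name.rsplit(".", 1)[-1].lower()
            ctype = (overrides.get(part_name)
                     or defaults.get(ext, "application/octet-stream"))
            data = zf.read(name)
            out.append("<pkg:part pkg:name=%s pkg:contentType=%s>"
                       % (quoteattr(part_name), quoteattr(ctype)))
            if ctype.endswith("xml"):
                # Embed XML parts inline, minus their own XML declaration.
                # Assumption: the part is UTF-8 (a BOM, if any, is dropped).
                text = data.decode("utf-8-sig")
                text = re.sub(r"^\s*<\?xml[^>]*\?>\s*", "", text)
                out.append("<pkg:xmlData>%s</pkg:xmlData>" % text)
            else:
                # Binary parts (images, fonts) become base64 in 76-char lines.
                b64 = base64.encodebytes(data).decode("ascii")
                out.append("<pkg:binaryData>%s</pkg:binaryData>" % b64)
            out.append("</pkg:part>")
        out.append("</pkg:package>")
    return "\n".join(out)
```

The XML parts are copied as text instead of being re-serialized through a parser, so that the namespace prefixes Word wrote are kept exactly as they were. Writing the returned string to a file with an .xml extension, encoded as UTF-8, gives the single-file form.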

2
votes

In Word: File | Save As | Word XML Document (*.xml) gives you the Open XML format you want, as a single XML file.

In code using Interop: use the Document object's SaveAs method with WdSaveFormat.wdFormatFlatXML as the save format, which writes the whole Open XML package as a single XML file (wdFormatXMLDocument would produce a regular .docx instead). You should also call the Document.Convert method first to update the document's compatibility to the MS Office version installed.

So, not necessarily a complete demo, but this should give you the right idea:

// Upgrade the document to the newest format supported by the installed Word.
ActiveDocument.Convert();

// wdFormatFlatXML saves the Open XML package as one XML file.
WdSaveFormat myNewSaveFormat = WdSaveFormat.wdFormatFlatXML;
ActiveDocument.SaveAs(newFilePath, myNewSaveFormat); // newFilePath is a string with the full path of the new file, including the .xml extension