2
votes

My girlfriend is writing a Word document for a homework assignment. She's using the old .doc format, as required by her teacher ( :'( ). At some point, the .doc file went from 150 kB to 2.6 MB with no noticeable change (as seen in the Dropbox history; sadly, Word's comparison function is no help because Word crashes). From that point on, she was unable to save her document without crashing Word...

I converted the .doc to .docx, unzipped it, and found an 18 MB document.xml file! I can't even pretty-print the XML because it crashes Notepad++, but I can see that the file is filled with the same XML tag repeated over and over:

<w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>

Do you have any idea what could cause this?

EDIT: Here's the docx

EDIT2: The motivation for this question is more curiosity than looking for a fix. Thanks for your answers though.

To the downvoter and close-voter: do you mind telling me why you think this question is inappropriate, and pointing me to the right community? Thx... - Julien
I just downloaded the document and it seems to be working properly. Did you manage to get it fixed up? - scanny
Yes, it works, but it's heavy as hell; try saving it as .doc. I finally managed to display the XML properly. The offending tags were all in a single text area. I deleted and re-created it and the tags were gone. - Julien

2 Answers

1
votes

If you're willing to edit the XML directly, you can just delete all the empty <w:p> tags and rezip.
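
For what it's worth, that edit can be scripted rather than done by hand in an editor that chokes on 18 MB. Below is a rough sketch using only the Python standard library. It is a slightly more conservative variant of "delete them all": it collapses each run of consecutive empty paragraphs down to one, so a container that held nothing but empty paragraphs still keeps a paragraph. The function names are made up for illustration, and it assumes document.xml is UTF-8 and that the empty paragraphs are self-closing tags like the one in the question. Work on a copy of the file.

```python
import re
import zipfile

# Matches a self-closing, empty paragraph such as <w:p w:rsidR="..."/>.
# Requiring whitespace or "/>" right after "<w:p" keeps it from matching
# other tags that merely start with the same letters, e.g. <w:pPr/>.
EMPTY_P = r'<w:p(?:\s[^<>]*)?/>'
EMPTY_RUN = re.compile(rf'({EMPTY_P})(?:\s*{EMPTY_P})+')


def collapse_empty_paragraphs(xml: str) -> str:
    """Replace each run of consecutive empty <w:p/> tags with its first tag."""
    return EMPTY_RUN.sub(r'\1', xml)


def clean_docx(src_path, dst_path) -> None:
    """Copy a .docx archive, rewriting only word/document.xml."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, 'w', zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == 'word/document.xml':
                # Assumption: the part is UTF-8 encoded.
                text = data.decode('utf-8')
                data = collapse_empty_paragraphs(text).encode('utf-8')
            dst.writestr(item, data)
```

Usage would be something like `clean_docx('homework.docx', 'homework-clean.docx')` (file names are placeholders). A regex is used instead of a real XML parser only because the target is a fixed, self-closing tag; anything fancier should go through a proper parser.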

If you're good with Python, you might give python-docx a try and use it to delete all empty paragraphs.

Hopefully that will at least recover the work she's done so far.

Not sure how this would happen, or whether it matters much. The only thing I can think of is a sticking Return key on the keyboard inserting a huge number of carriage returns, each of which creates a new paragraph. I've actually had that happen occasionally on a Windows virtual machine running on my Mac. No clue why it does it, though.

-1
votes

The <w:p> tag you are talking about is part of the OpenXML format used to build Word documents. OpenXML stores the document as a zipped file, and I am afraid what you are seeing is the unzipped document.xml file. If you want to keep working with the document, just convert the .doc file to .docx. Don't unzip it.