7
votes

I have been using Apache POI to manipulate Microsoft Word .docx files, i.e. I open a document that was originally created in Microsoft Word, modify it, and save it to a new document.

I notice that new paragraphs created by Apache POI are missing a Revision Save ID, often known as an RSID or rsidR. Word uses this to identify changes made to a document in one session, say between saves. It is optional (users can turn it off in Microsoft Word if they want), but in reality almost everyone has it on, so almost every document is full of RSIDs. Read this excellent explanation of RSIDs for more about that.

In a Microsoft Word document, word/document.xml contains paragraphs like this:

<w:p w:rsidR="007809A1" w:rsidRDefault="007809A1" w:rsidP="00191825">
  <w:r>
    <w:t>Paragraph of text here.</w:t>
  </w:r>
</w:p>

However the same paragraph created by POI will look like this in word/document.xml:

<w:p>
  <w:r>
    <w:t>Paragraph of text here.</w:t>
  </w:r>
</w:p>

I've figured out that I can force POI to add an RSID to each paragraph using code like this:

    byte[] rsid = ???;
    XWPFParagraph paragraph = document.createParagraph();
    paragraph.getCTP().setRsidR(rsid);
    paragraph.getCTP().setRsidRDefault(rsid);

However I don't know how I should be generating the RSIDs.

Does POI have a way to generate and/or keep track of RSIDs? If not, is there any way I can ensure that an RSID that I generate doesn't conflict with one that's already in the document?

1
From the article you referenced: "They are completely random, and are only used for seeing where things match. So they aren't of much use unless you are merging with another document that also has RSIDs." So you can generate appropriate random numbers. As to conflicts, a list of them is stored in one of the properties parts. Do you really need to add them? They only improve certain compare/diff cases. - JasonPlutext

1 Answer

4
votes

It looks like the list of valid rsid entries is held in word/settings.xml in the <w:rsids> entry. XWPF should be able to give you access to that already.
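To make the layout concrete, here is a minimal sketch that reads the `<w:rsid>` values out of a settings.xml stream using only the JDK's XML parser rather than XWPF. The class and method names (`RsidReader`, `readRsids`) are made up for illustration and are not part of POI; the sketch assumes each entry is a `<w:rsid w:val="..."/>` element in the standard WordprocessingML namespace, and it ignores the separate `<w:rsidRoot>` element.

```java
import java.io.InputStream;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a POI class.
class RsidReader {
    // Namespace bound to the "w" prefix in WordprocessingML parts.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the w:val of every <w:rsid> element, normalised to upper case.
    static Set<String> readRsids(InputStream settingsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder().parse(settingsXml);

            Set<String> rsids = new LinkedHashSet<>();
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "rsid");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element rsid = (Element) nodes.item(i);
                String val = rsid.getAttributeNS(W_NS, "val");
                if (!val.isEmpty()) {
                    rsids.add(val.toUpperCase(Locale.ROOT));
                }
            }
            return rsids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse settings.xml", e);
        }
    }
}
```

Since a .docx is a zip archive, the stream could come from `java.util.zip.ZipFile` by opening the `word/settings.xml` entry.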

You'd probably want to generate a random number that is 8 hex digits long, check whether it's already in that list, and re-generate if it is. Once you have a unique one, add it to the list, then tag your paragraphs with it.
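The generate-check-retry loop could look roughly like the sketch below. It is plain Java with no POI calls; `RsidGenerator`, `newUniqueRsid` and `toBytes` are hypothetical names, and it assumes the RSIDs already in the document are available as a set of upper-case 8-digit hex strings. It draws from the full 32-bit range, which is an assumption; the examples in the question both happen to start with 00.

```java
import java.util.Locale;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not a POI class.
class RsidGenerator {

    // Picks a random 8-hex-digit RSID that is not in 'existing' and records it there.
    static String newUniqueRsid(Set<String> existing, Random random) {
        String candidate;
        do {
            // %08X renders the int as 8 upper-case hex digits (two's complement).
            candidate = String.format(Locale.ROOT, "%08X", random.nextInt());
        } while (!existing.add(candidate)); // add() returns false on a collision
        return candidate;
    }

    // Converts an 8-hex-digit RSID into the 4 bytes a hexBinary setter expects.
    static byte[] toBytes(String rsid) {
        if (!rsid.matches("[0-9A-Fa-f]{8}")) {
            throw new IllegalArgumentException("Not an 8-digit hex RSID: " + rsid);
        }
        long value = Long.parseLong(rsid, 16);
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16),
            (byte) (value >>> 8), (byte) value
        };
    }
}
```

The result of `toBytes` is what would be passed to the `setRsidR` and `setRsidRDefault` calls shown in the question. The new value would still need to be written back to the `<w:rsids>` list in word/settings.xml, which this sketch does not do.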

What I'd suggest is that you join the poi dev list (mailing list details), and we can give you a hand on working up a patch for it. I think the things to do are:

  • A wrapper around the RSids entry in word/settings.xml, to let you easily fetch the list and generate a new (unique) one
  • A wrapper around the different RSid entries on a paragraph and a run
  • Methods on paragraphs and runs to get the RSid wrapper, add a new one, or clear the existing one

We should take this to the dev list though :)