
I was trying to save a Word file in XML format and then perform some operations on that XML file after parsing it.

The data in my Word document was split across different tags.

For example, if I have $date in my Word document, it is split into $ and date in two separate tags. Similarly, tlyadd is split into two tags, tly and add, whereas tlyabcd stays in a single tag.

In another document these values are not split across different tags.

I don't understand on what basis these values are put in different tags.

I couldn't find anything about the Word XML format on MSDN.

Can someone explain why, and on what basis, this is done?

Here is the document containing these values

Let me know if anything is unclear and needs more explanation.

Different versions of Word use different XML formats; see e.g. Introducing the Office (2007) Open XML File Formats, msdn.microsoft.com/en-us/library/aa338205%28v=office.12%29.aspx. - Jukka K. Korpela
@JukkaK.Korpela I agree, but I am not asking about .docx, which is a zipped XML format. The MSDN link covers many things, but not the basis on which data is divided among tags. - manu_dilip_shah

1 Answer


You shouldn't make any assumptions about whether text is in one run or several. There are no rules restricting the circumstances in which text may be split.

That said, there are various things which will force your text to be split across runs:

  • spelling/grammar checking (probably what is happening with $date), which you can turn off
  • formatting, for example if half the word is bold
  • revisions (different people changing the document at different times, recorded in rsid attributes)
  • change tracking, etc.
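
To make the spell-check case concrete, here is a minimal sketch of what such a split can look like and how reading text at paragraph level sidesteps it. The XML fragment is hand-written and simplified, not taken from the asker's document. It assumes the WordprocessingML 2006 namespace (older Word XML formats use a different namespace URI), and `paragraph_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: WordprocessingML 2006 namespace; adjust for other Word XML flavours.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written, simplified paragraph: "$date" split in two by proofing markers.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>$</w:t></w:r>
  <w:proofErr w:type="spellStart"/>
  <w:r><w:t>date</w:t></w:r>
  <w:proofErr w:type="spellEnd"/>
</w:p>"""


def paragraph_text(p):
    """Concatenate every w:t in the paragraph, ignoring run boundaries."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


p = ET.fromstring(SAMPLE)
print(paragraph_text(p))
```

Searching for a placeholder in the concatenated paragraph text finds it regardless of how Word happened to split the runs.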

You can, and probably should, preprocess your document to join up adjacent runs. See, for example, docx4j's VariablePrepare.java.
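
As a rough illustration of that kind of preprocessing (not a port of docx4j's code), the sketch below merges adjacent text runs whose formatting is identical. All function names are hypothetical, and it makes several assumptions: the WordprocessingML 2006 namespace, that dropping `w:proofErr` markers is acceptable, and that differing `rsid` attributes on runs can be ignored. Runs containing anything other than `w:rPr` and `w:t` are left alone.

```python
import xml.etree.ElementTree as ET

# Assumption: WordprocessingML 2006 namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
R, RPR, T, PROOF = (f"{{{W}}}{name}" for name in ("r", "rPr", "t", "proofErr"))


def _key(el):
    """Structural fingerprint of an element: tag, attributes, children."""
    return (el.tag, tuple(sorted(el.attrib.items())), tuple(_key(c) for c in el))


def formatting(run):
    """Fingerprint of the run's w:rPr, or None if it has no run properties."""
    rpr = run.find(RPR)
    return None if rpr is None else _key(rpr)


def is_text_run(el):
    """A w:r whose children are only w:rPr and w:t, with at least one w:t."""
    return (el.tag == R
            and all(c.tag in (RPR, T) for c in el)
            and el.find(T) is not None)


def merge_runs(paragraph):
    """Merge neighbouring text runs with equal formatting, in place."""
    prev = None
    for child in list(paragraph):
        if child.tag == PROOF:
            # Assumption: proofing markers can be discarded.
            paragraph.remove(child)
        elif is_text_run(child):
            if prev is not None and formatting(prev) == formatting(child):
                target = prev.findall(T)[-1]
                extra = "".join(t.text or "" for t in child.findall(T))
                target.text = (target.text or "") + extra
                # Keep leading/trailing spaces in the merged text significant.
                target.set(XML_SPACE, "preserve")
                paragraph.remove(child)  # rsid attributes of this run are lost
            else:
                prev = child
        else:
            prev = None  # anything else (bookmarks, fields, ...) ends the streak
    return paragraph


# Hand-written sample: "$date" split by spell checking, then a bold run.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r w:rsidR="00AB1234"><w:t>$</w:t></w:r>
  <w:proofErr w:type="spellStart"/>
  <w:r w:rsidR="00CD5678"><w:t>date</w:t></w:r>
  <w:proofErr w:type="spellEnd"/>
  <w:r><w:rPr><w:b/></w:rPr><w:t> bold</w:t></w:r>
</w:p>"""

p = merge_runs(ET.fromstring(SAMPLE))
runs = ["".join(t.text or "" for t in r.findall(T)) for r in p.findall(R)]
print(runs)
```

The bold run stays separate because its formatting differs, which matches the second bullet above: formatting changes are a legitimate reason for a split, while spell-check and revision splits can usually be undone.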