I am using the Open XML SDK 2.5 to read .docx files in my console application.
There appears to be some discrepancy between how Word displays the document and how the document is represented in XML when opened with the Open XML SDK.
Here is my example as seen in Word with whitespace visible:
In my application I hold a reference to this paragraph as a DocumentFormat.OpenXml.Wordprocessing.Paragraph object. After browsing the Open XML documentation, it became clear to me that the XML format has no representation of a "line". The closest approximation I can find is the Run object, and the Paragraph node in this example contains a collection of 6 Run objects. Here is the InnerXml property of the Paragraph:
<w:pPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:pStyle w:val="PlainText" /><w:numPr><w:ilvl w:val="0" /><w:numId w:val="17" /></w:numPr><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /><w:b /></w:rPr></w:pPr>
<w:r w:rsidRPr="000558F8" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /></w:rPr><w:t>Should we use the term “Verify” instead of “Confirm”</w:t></w:r>
<w:r w:rsidRPr="000558F8" w:rsidR="00F5335C" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /></w:rPr><w:t xml:space="preserve"> as per work instruction</w:t></w:r>
<w:r w:rsidRPr="000558F8" w:rsidR="00411638" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /></w:rPr><w:t>?</w:t></w:r>
<w:r w:rsidR="000558F8" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /></w:rPr><w:br /><w:t>Med</w:t></w:r>
<w:r w:rsidRPr="000558F8" w:rsidR="003E76BD" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /><w:b /></w:rPr><w:br /><w:t xml:space="preserve">JD: </w:t></w:r>
<w:r w:rsidRPr="000558F8" w:rsidR="00A118AB" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" /><w:b /></w:rPr><w:t>Done.</w:t></w:r>
All I see are the paragraph properties node and the 6 run nodes, and the run nodes clearly don't equate to lines. Looking at my example in Word, the paragraph contains 2 carriage returns, so I would expect it to be represented as 3 "lines". In the XML I instead get 6 runs, which roughly approximate the 3 lines, except that some lines are split across several runs, seemingly arbitrarily.
The REAL issue is that I don't see any way of interpreting the run nodes in a way that I could reconstruct the line structure I have in the example in Word. For instance, nothing indicates to me that runs 1, 2, and 3 together make up line 1.
I need to parse over 300 Word documents that depend on the line breaks for formatting. I NEED the line breaks. How can I get them? Is this possible with the Open XML SDK?
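One possible lead, offered as a rough sketch rather than a confirmed answer: the XML above does contain two <w:br /> elements, one at the start of the fourth run and one at the start of the fifth, and they seem to line up with the two breaks visible in Word. Assuming those elements are what mark the breaks, the lines could be rebuilt by concatenating the w:t text across runs and starting a new line at each w:br. The snippet below uses Python's standard library instead of the Open XML SDK, purely to illustrate the idea; the function name paragraph_lines, the SAMPLE string (a trimmed copy of the paragraph above, wrapped in a w:p element), and the decision to ignore page and column breaks are my own assumptions.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as it appears in the XML above.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_lines(paragraph_xml):
    """Return the text of one <w:p> element, split into lines at <w:br/>."""
    paragraph = ET.fromstring(paragraph_xml)
    lines = [""]
    for run in paragraph.iter(f"{{{W}}}r"):
        # Walk the run's children in document order so that a break placed
        # before or after the text ends up on the correct side of it.
        for child in run:
            if child.tag == f"{{{W}}}t":
                lines[-1] += child.text or ""
            elif child.tag == f"{{{W}}}br":
                # Assumption: only text-wrapping breaks (the default when
                # w:type is absent) count as line breaks here.
                if child.get(f"{{{W}}}type") in (None, "textWrapping"):
                    lines.append("")
    return lines


# Trimmed copy of the example paragraph (run properties and rsid attributes
# removed), wrapped in a <w:p> root so it parses as a single element.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:pPr><w:pStyle w:val="PlainText" /></w:pPr>'
    '<w:r><w:t>Should we use the term “Verify” instead of “Confirm”</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> as per work instruction</w:t></w:r>'
    '<w:r><w:t>?</w:t></w:r>'
    '<w:r><w:br /><w:t>Med</w:t></w:r>'
    '<w:r><w:br /><w:t xml:space="preserve">JD: </w:t></w:r>'
    '<w:r><w:t>Done.</w:t></w:r>'
    '</w:p>'
)

if __name__ == "__main__":
    for number, line in enumerate(paragraph_lines(SAMPLE), start=1):
        print(number, line)
```

If this reading is right, runs 1, 2, and 3 belong to line 1 simply because no w:br occurs before the fourth run. Note that this only recovers explicit breaks stored in the file; lines that Word wraps automatically at the page margin are computed at layout time and would not show up this way.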
Thanks in advance.
