0
votes

I am using Word Interop and C# to build a program at work, and one of its features is getting a word count.

This can't be Word's own word count, as I need to emulate the word count of a CAT tool used at work.

One of the issues I found is that the CAT tool uses text formatting to split up words. This means that if I have the word "1st" with "st" superscripted, Word will count one word (as there is nothing separating the two), while the CAT tool counts two words because of the format change.

The thing is, the CAT tool keeps track of the format changes, and that information breaks the word.

So I could go word by word, character by character, and check all the possibilities (font, bold, italic, etc.), but that would be really slow when working with multiple documents, each with thousands of words.

Does anyone know a better solution?

1
Can you check the different styles applied to a document and where they are? - Ignacio Soler Garcia
Which version of Word - doc or docx? If docx, you can try parsing the xml. - sq33G
Interop seems my best choice - know a better one? - 537mfb
doc, docx and rtf - could be any coming from client - 537mfb
Any pointers on how to check styles? I can't find any information on that - 537mfb

1 Answer

2
votes

Well, Cindy from the MSDN forums gave me the answer on this one:

http://social.msdn.microsoft.com/Forums/en-US/worddev/thread/16fc1fb9-4713-45e5-ae00-76bbaafe0a56

then the approach I'd look at would be to use Document.Content.WordOpenXML to extract the content into a string. The content will be in the Office Open XML "flat package" format, meaning it should contain everything.

You should then be able to "parse" the string to get the information you need.

If you look at such a string, you should see that all the text is in w:t elements. If there's formatting, then it will break the w:t into parts - one part for each formatting change. So all that you'd need to do in addition to extracting all the w:t elements would be to check for the punctuation and spaces that otherwise delineate "words" in the text.
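
For anyone wanting to try this, here is a minimal sketch of the idea, not a tested match for any particular CAT tool. It parses the string returned by Document.Content.WordOpenXML with LINQ to XML, collects every w:t element inside w:body, and counts the whitespace-separated tokens in each one, so a word split across two runs counts twice. The class and method names (RunWordCounter, CountWords) are made up for illustration, and splitting on whitespace only is an assumption; the punctuation rules would need to be adapted to whatever the CAT tool actually does. Note also that Word sometimes splits runs for reasons other than visible formatting (for example revision or proofing marks), so the result may overcount.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: the names here are illustrative, not from any library.
public static class RunWordCounter
{
    // WordprocessingML main namespace (the "w:" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts whitespace-separated tokens per w:t element, so that a
    // run boundary inside a word (e.g. "1" + superscript "st") splits it.
    public static int CountWords(string wordOpenXml)
    {
        XDocument doc = XDocument.Parse(wordOpenXml);

        return doc.Descendants(W + "body")
                  .Descendants(W + "t")
                  .Sum(t => t.Value
                             // A null separator array means "split on whitespace".
                             .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                             .Length);
    }
}
```

With Interop, the call would look something like `int count = RunWordCounter.CountWords(doc.Content.WordOpenXML);`, where `doc` is the open `Document`. Since the whole document is fetched as one string, this avoids the per-character Interop calls that made the naive approach slow.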