I have a stream of doc/docx documents that I need to get the word count of.
The procedure so far is to manually open the document and write down the word count offered by MS Word itself, and I am trying to automate it using R.
This is what I tried:
library(textreadr)
library(stringr)
myDocx = read_docx(myDocxFile)
docText = str_c(myDocx , collapse = " ")
wordCount = str_count(test, "\\s+") + 1
Unfortunately, wordCount
is NOT what MS Word suggests.
For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr
does not even import them.
Is there a workaround? I don't mind trying something in Python, too, although I'm less experienced there.
Any help would be greatly appreciated.