I have a stream of doc/docx documents and need the word count of each one.

So far, I have been opening each document manually and writing down the word count that MS Word itself reports. I am trying to automate this in R.

This is what I tried:

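library(textreadr)
library(stringr)
# read the document as a character vector, one element per paragraph
myDocx = read_docx(myDocxFile)
# join the paragraphs into a single string
docText = str_c(myDocx, collapse = " ")
# count whitespace runs and add one to get the number of words
wordCount = str_count(docText, "\\s+") + 1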

Unfortunately, wordCount does NOT match the count that MS Word reports.

For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr does not even import them.

Is there a workaround? I don't mind trying something in Python, too, although I'm less experienced there.

Any help would be greatly appreciated.

1 Answer

This should be possible with the tidytext package in R.

library(textreadr)
library(tidytext)
library(dplyr)

# read in a Word file that is not password protected
x <- read_docx(myDocxFile)
# convert the character vector to a data frame, one row per paragraph
text_df <- tibble(line = seq_along(x), text = x)
# tokenize the data frame so that each row holds a single word
words_df <- text_df %>%
  unnest_tokens(word, text)
# the number of rows is the number of words in the document
word_count <- nrow(words_df)
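
If the tokenizer's idea of a word turns out to differ from MS Word's, a plain whitespace-based count may be worth comparing. The following is only a minimal sketch in base R: `count_words` and the sample `paragraphs` vector are hypothetical names standing in for the output of read_docx(), and it assumes MS Word treats each run of non-whitespace characters as one word. Counting runs of non-whitespace, rather than whitespace runs plus one, avoids overcounting when a paragraph is empty or has leading or trailing spaces. It cannot recover list numbers that were never imported.

```r
# Count words as runs of non-whitespace characters, summed over paragraphs.
count_words <- function(paragraphs) {
  # gregexpr() reports -1 when nothing matches, so regmatches() returns
  # character(0) for that paragraph and it contributes 0 words.
  matches <- regmatches(paragraphs, gregexpr("\\S+", paragraphs, perl = TRUE))
  sum(lengths(matches))
}

# Hypothetical stand-in for the character vector returned by read_docx()
paragraphs <- c("  First item in the list ", "", "Second item, with punctuation.")
count_words(paragraphs)  # 9
```

On the same sample, joining the paragraphs with spaces and using str_count(..., "\\s+") + 1 gives 10, because the leading whitespace is counted as a separator.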