# Loading required libraries


# Set up logistics such as reading in data and setting up corpus

```{r}

# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"

# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")

# Truncate file names so it is only showing "FirstLast-Term"
prez.out=substr(speeches, 6, nchar(speeches)-4)

# Create a vector of NAs with one entry per speech
length.speeches=rep(NA, length(speeches))

# Create a corpus
ff.all<-Corpus(DirSource(folder.path))
```

# Clean the data

```{r}

# Use tm_map to collapse whitespace to single spaces, convert to lower case, and remove stop words, empty strings, and punctuation.
ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

# The problem line:

ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom")

ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)

# tdm.all = a term-document matrix
tdm.all<-TermDocumentMatrix(ff.all)
```

I am trying to replace similar words with a single root word, for example replacing "free" with "freedom", in a text mining project.

I learned this line from a YouTube tutorial: ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom"). Without this line, the code runs.

With this line added, RStudio gives the error "Error: inherits(doc, "TextDocument") is not TRUE" when it executes this line: tdm.all<-TermDocumentMatrix(ff.all)

I think this should be a relatively simple issue, but I could not find a solution on Stack Overflow.

1 Answer

Using tm's built-in crude data set, I was able to fix your problem by wrapping gsub in a content_transformer call, like so:

ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))

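In my experience, tm_map does weird things to the returned object when it is given a custom function. So while your original line ran, tm_map did not return a true "Corpus" afterwards, and that is what causes the error. Most likely gsub hands back plain character vectors, so the documents are no longer TextDocument objects, which is exactly what the inherits(doc, "TextDocument") check in the error message complains about.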
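If you want to check this yourself, here is a minimal sketch. It assumes the tm package is installed and uses its bundled crude corpus; the names broken and fixed are only illustrative. It compares what the first document looks like after a bare gsub call and after the wrapped one:

```r
library(tm)
data(crude)

# Bare gsub: tm_map stores whatever gsub returns for each document
broken <- tm_map(crude, gsub, pattern = "free", replacement = "freedom")

# Wrapped gsub: content_transformer edits the content and returns the document
fixed <- tm_map(crude, content_transformer(function(x) gsub("free", "freedom", x)))

# Inspect the class of the first document in each corpus
class(broken[[1]])
class(fixed[[1]])

# This is the same condition that TermDocumentMatrix checks for every document
inherits(broken[[1]], "TextDocument")
inherits(fixed[[1]], "TextDocument")
```

If the first inherits call comes back FALSE on your setup, that confirms the bare gsub call is what strips the document class.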

As a side note:

This line seems to do nothing: ff.all<-tm_map(ff.all, removeWords, character(0))

The same goes for the "" in ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

My full example:

library(tm)
data(crude)
ff.all <- crude

ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))

ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)

# tdm.all = a term-document matrix
tdm.all<-TermDocumentMatrix(ff.all)
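
One caveat about the pattern itself, separate from the error: a bare "free" also matches inside longer words, so any text that already contains "freedom" would be rewritten too. Below is a minimal base R sketch, assuming whole-word replacement is what you actually want. The helper name replace_whole_word is made up for illustration and is not part of tm.

```r
# Hypothetical helper: replace a word only where it stands alone.
# Assumes `word` contains no regex metacharacters.
replace_whole_word <- function(x, word, replacement) {
  # \\b marks a word boundary, so "free" inside "freedom" is left untouched
  gsub(paste0("\\b", word, "\\b"), replacement, x)
}

text <- "freedom is not free"

# Bare pattern: also rewrites the "free" inside "freedom"
gsub("free", "freedom", text)
# "freedomdom is not freedom"

# Word-boundary pattern: only the standalone "free" changes
replace_whole_word(text, "free", "freedom")
# "freedom is not freedom"
```

The helper can be passed through content_transformer in the same way as the gsub call above, for example content_transformer(function(x) replace_whole_word(x, "free", "freedom")).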