R - How to tune text classifier created with RTextTools

Question

I am trying to create a text classifier using the RTextTools library in R. The training and testing data frames are in the same format: each has two columns, the first holding the text and the second the label.

Minimal reproducible example (substituted data) of my program so far:

# Packages
## Install
install.packages(c('e1071', 'RTextTools'))
## Import
library(e1071)
library(RTextTools)

data.train <- data.frame("content" = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry.", "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.", "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."), "label" = c("yes", "yes", "no"))
data.test <- data.frame("content" = c("It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout.", "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English.", "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."), "label" = c("no", "yes", "yes"))

# Process training dataset
data.train.dtm <- create_matrix(data.train$content, language = "english", weighting = tm::weightTfIdf, removePunctuation = TRUE, removeNumbers = TRUE, removeSparseTerms = 0, removeStopwords = TRUE,  stemWords = TRUE, stripWhitespace = TRUE, toLower = TRUE)
data.train.container <- create_container(data.train.dtm, data.train$label, trainSize = 1:nrow(data.train), virgin = FALSE)

# Create linear SVM model
model.linear <- train_model(data.train.container, "SVM", kernel = "linear", cost = 10, gamma = 1^-2)

# Process testing dataset
data.test.dtm <- create_matrix(data.test$content, originalMatrix = data.train.dtm)
data.test.container <- create_container(data.test.dtm, labels = rep(0, nrow(data.test)), testSize = 1:nrow(data.test), virgin = FALSE)

# Classify testing dataset
model.linear.results <- classify_model(data.test.container, model.linear)
model.linear.results.table <- table(Predicted = model.linear.results$SVM_LABEL, Actual = data.test$label) 
model.linear.results.table

The code I have so far works and produces a table comparing the predicted values with the actual values. The results are highly inaccurate, though, and it is clear to me that the model needs to be tuned.

I know that the e1071 library (which RTextTools is based on) has a tune.svm function that searches for the cost and gamma values giving the best results. The problem is that the data parameter of tune.svm expects a data frame, but since I am building a text classifier, what I pass to the SVM is a document-term matrix rather than a simple data frame.

To no avail, I tried reading the DTM in as a dataframe like this:

model.tuned <- tune.svm(label~., data = as.data.frame(data.train.dtm), gamma = 10^(-6:-1), cost = 10^(-1:1))

I'm completely lost and any insight would be appreciated.

hongsy · Accepted Answer · 2017-07-31T03:06:36

You can look at the code in train_model (press F2 in RStudio) to see how it calls svm() with the container (in your case, data.train.container). By default, train_model uses

cross=0 (don't perform cross validation on training data)
cost=100 (cost of constraints violation)
probability=TRUE (model should allow for probability predictions)
kernel="radial" (radial kernel used for SVM training)

as parameters to be passed into svm().
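As a rough sketch of what a direct svm() call with those values looks like, here is a version that runs on its own. The iris data is only a stand-in (an assumption, so the example does not need a document-term matrix); with RTextTools the x and y arguments would come from the container instead.

```r
library(e1071)

# Stand-in data (assumption): iris replaces the document-term matrix.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# The same parameter values that train_model is described as using for SVM.
model <- svm(x = x, y = y,
             cross = 0,           # no cross-validation on the training data
             cost = 100,          # cost of constraint violation
             probability = TRUE,  # allow probability predictions
             kernel = "radial")   # radial kernel

# probability = TRUE at fit time is what makes this call possible.
pred <- predict(model, x, probability = TRUE)
head(attr(pred, "probabilities"))
```

Passing kernel or cost to train_model, as in the question, overrides the corresponding value.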

To actually answer your question, the container returned by create_container() has the slots training_matrix and training_codes, which you can pass to tune.svm directly as x and y:

# Add any other svm() parameters as extra arguments after cost
model.tuned <- tune.svm(x = data.train.container@training_matrix,
                        y = data.train.container@training_codes,
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:1))
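Once tuning has run, the result can be inspected as in the sketch below. Again iris is a stand-in (an assumption) so that the code runs by itself; replace x and y with the container slots for the text data. Note that tune.svm uses 10-fold cross-validation by default, so a training set with fewer rows than folds, such as the three-row example in the question, needs a smaller value passed through tune.control(cross = ...).

```r
library(e1071)

# Stand-in data (assumption): iris replaces the document-term matrix.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

set.seed(42)  # the cross-validation folds are drawn at random
model.tuned <- tune.svm(x = x, y = y,
                        gamma = 10^(-6:-1),
                        cost = 10^(-1:1),
                        tunecontrol = tune.control(cross = 5))

# One row per gamma/cost combination that was tried (6 x 3 = 18 here)
nrow(model.tuned$performances)

# Best gamma/cost pair and its cross-validated error
model.tuned$best.parameters
model.tuned$best.performance

# SVM refitted on all training rows with the best parameters
best.svm <- model.tuned$best.model
table(Predicted = predict(best.svm, x), Actual = y)
```

summary(model.tuned) prints the same information in one table, which helps to see whether the best values sit at the edge of the grid and the search range should be widened.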
