4
votes

I have a dataframe with two columns. One column contains text: each row holds a piece of text belonging to one of three classes (skill, qualification, experience). The other column holds the corresponding class labels.

[Figure: snapshot of the dataframe]

How can I apply svm from the e1071 package to this data? How do I convert the text column into numeric scores? I thought of converting the text column into a document-term matrix. Is there any other way? And how do I build a document-term matrix?


3 Answers

5
votes

You can use the RTextTools package to create a document-term matrix. Use the create_matrix function:

library(RTextTools)

# Create the document-term matrix, assuming the text column is named v1
dtMatrix <- create_matrix(data["v1"])

Then you can train your SVM model using this:

# Configure the training data
container <- create_container(dtMatrix, data$label, trainSize=1:102, virgin=FALSE)
 
# Train an SVM model
model <- train_model(container, "SVM", kernel="linear", cost=1)

For information, RTextTools uses the e1071 package internally to train the models.

For more details, please refer to the RTextTools and e1071 documentation.

0
votes

You could use the tm package in R. You will have to preprocess the text before forming the document-term matrix. This includes removing stop words, punctuation and numbers, normalizing variants (e.g. USA = U.S.A), stemming, etc. You can also weight the document-term matrix (e.g. with tf-idf) to give more importance to significant terms.

Once you are done with these steps, you can use svm() from e1071 to train the classifier:

 fit <- svm(x, y, kernel = "linear") 

Here,

  x = the document-term matrix

  y = a factor holding the corresponding class labels

Use the model to predict the classes for your test data (make sure your test data is preprocessed in the same way).
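The steps above could be put together roughly as follows. This is only a minimal sketch under some assumptions: the toy texts and labels are made up for illustration, clean_corpus is a hypothetical helper name, stemming is left out to avoid an extra dependency, and plain term-frequency weighting is used (tf-idf could be requested through the control argument with weighting = weightTfIdf). The test matrix is restricted to the training vocabulary so that its columns line up with what the model was trained on.

```r
library(tm)
library(e1071)

# Made-up toy data standing in for the real text and label columns
train_text  <- c("java and sql programming",
                 "python programming and machine learning",
                 "five years of work in banking",
                 "three years of work as a team lead")
train_label <- factor(c("skill", "skill", "experience", "experience"))
test_text   <- c("sql and python programming",
                 "two years of work in retail")

# Hypothetical helper: the same preprocessing for training and test text
clean_corpus <- function(text) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

# Build the training matrix, then reuse its terms for the test matrix
dtm_train <- DocumentTermMatrix(clean_corpus(train_text))
dtm_test  <- DocumentTermMatrix(clean_corpus(test_text),
                                control = list(dictionary = Terms(dtm_train)))

x_train <- as.matrix(dtm_train)
x_test  <- as.matrix(dtm_test)[, colnames(x_train), drop = FALSE]

# scale = FALSE avoids warnings about constant columns in sparse text data
fit  <- svm(x_train, train_label, kernel = "linear", scale = FALSE)
pred <- predict(fit, x_test)
```

Converting to a dense matrix with as.matrix is fine for small data; for a large vocabulary it may use a lot of memory.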

0
votes

I also considered using RTextTools. It is relatively easy to use. However, it is of little use if your data has a class imbalance, because it doesn't let you do a stratified split when building the container.

container <- create_container(dtMatrix, data$label, trainSize=1:102, virgin=FALSE)

With the "trainSize=1:102" argument you have no control over the proportion of class labels that ends up in the training rows. The package is also no longer being maintained. So, I would avoid using it.
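One way around this would be to do the stratified split by hand in base R before training. A minimal sketch, assuming a made-up label vector in place of data$label and a 70/30 split (both are illustrative choices, not something from the question):

```r
set.seed(42)

# Made-up, imbalanced labels standing in for data$label
labels <- factor(rep(c("skill", "qualification", "experience"),
                     times = c(60, 30, 10)))

# Within each class, sample 70% of that class's row indices for training
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

# Everything not sampled goes to the test set
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions in the training set now mirror the full data
prop.table(table(labels[train_idx]))
```

The resulting train_idx and test_idx can be used to subset the document-term matrix and the labels directly when calling svm() from e1071.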