How to test data with R e1071 SVM multiclass

Question

This is my first time using R, the e1071 package, and multiclass SVM, so I'm quite confused. The goal is: a sentence containing "sunny" should be classified as a "yes" sentence, a sentence containing "cloud" as "maybe", and a sentence containing "rainy" as "no". The real goal is to do some text classification for my research.

I have two files:

train.csv: a file with two columns (variables): one holds the data, the other the label.

Example:

                  V1    V2
1              sunny   yes
2        sunny sunny   yes
3  sunny rainy sunny   yes
4  sunny cloud sunny   yes
5              rainy    no
6        rainy rainy    no
7  rainy sunny rainy    no
8  rainy cloud rainy    no
9              cloud maybe
10       cloud cloud maybe
11 cloud rainy cloud maybe
12 cloud sunny cloud maybe

test.csv: this file contains the new data to be classified.

Example:

      V1
1  sunny
2  rainy
3  hello
4  cloud
5      a
6      b
7  cloud
8      d
9      e
10     f
11     g
12 hello

Following the examples for the iris dataset (https://cran.r-project.org/web/packages/e1071/e1071.pdf and http://rischanlab.github.io/SVM.html), I created my model and then tested it on the training data in this way:

> library(e1071)
> train <- read.csv(file="C:/Users/Stef/Desktop/train.csv", sep = ";", header = FALSE)
> test <- read.csv(file="C:/Users/Stef/Desktop/test.csv", sep = ";", header = FALSE)
> attach(train)
> x <- subset(train, select=-V2)
> y <- V2
> model <- svm(V2 ~ ., data = train, probability=TRUE)
> summary(model)

Call:
svm(formula = V2 ~ ., data = train, probability = TRUE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.08333333 

Number of Support Vectors:  12

 ( 4 4 4 )


Number of Classes:  3 

Levels: 
 maybe no yes

> pred <- predict(model,x)
> system.time(pred <- predict(model,x))
   user  system elapsed 
      0       0       0 
> table(pred,y)
       y
pred    maybe no yes
  maybe     4  0   0
  no        0  4   0
  yes       0  0   4
> pred
    1     2     3     4     5     6     7     8     9    10    11    12 
  yes   yes   yes   yes    no    no    no    no maybe maybe maybe maybe 
Levels: maybe no yes

I think it's OK so far. Now the question is: what about the test data? I couldn't find anything about the test data, so I thought that maybe I should run the model on it. I did this:

> test
      V1
1  sunny
2  rainy
3  hello
4  cloud
5      a
6      b
7  cloud
8      d
9      e
10     f
11     g
12 hello
> z <- subset(test, select=V1)
> pred <-predict(model,z)
Error in predict.svm(model, z) : test data does not match model !

What is wrong here? Can you please explain how I can test new data using the previously trained model? Thank you.

EDIT

These are the first 5 rows of each .csv file:

> head(train,5)
                 V1  V2
1             sunny yes
2       sunny sunny yes
3 sunny rainy sunny yes
4 sunny cloud sunny yes
5             rainy  no
> head(test,5)
     V1
1 sunny
2 rainy
3 hello
4 cloud
5     a

can you please provide something like head train.csv and head test.csv? i'm having trouble reconciling your statement that there are two columns with the output you pasted (eg sunny,rainy,sunny,yes) — 3pitt

1.618 · Accepted Answer · 2017-09-03T21:01:14

The factor levels of V1 differ between the train and test datasets, so you need to make them match before predicting.
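
To see the mismatch concretely, here is a minimal base-R sketch on assumed toy data (the values and the V1_aligned column name are illustrative, not taken from the asker's files). It compares the level sets of the two V1 columns and shows one possible way to re-encode the test column with the training levels; values never seen in training, such as "hello", then become NA.

```r
# Assumed toy data mirroring the question; stringsAsFactors = TRUE makes V1 a
# factor, which was the read.csv() default before R 4.0.
train <- data.frame(V1 = c("sunny", "rainy", "cloud", "sunny sunny"),
                    V2 = c("yes", "no", "maybe", "yes"),
                    stringsAsFactors = TRUE)
test <- data.frame(V1 = c("sunny", "rainy", "hello", "cloud"),
                   stringsAsFactors = TRUE)

# The two V1 columns carry different level sets.
same_levels <- identical(levels(train$V1), levels(test$V1))

# Values in the test data that the training data never contained.
unseen <- setdiff(levels(test$V1), levels(train$V1))

# Re-encode the test column with the training levels; unseen values become NA.
test$V1_aligned <- factor(test$V1, levels = levels(train$V1))
```

As far as I can tell, predict() builds its design matrix from the levels of the new data, so a different level set gives a different number of columns, and that is what triggers "test data does not match model !".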

library(e1071)
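#sample data; stringsAsFactors = TRUE keeps V1 and V2 as factors on R >= 4.0,
#where data.frame() no longer converts strings to factors by default
train_data <- data.frame(V1 = c("sunny","sunny sunny","rainy","rainy rainy","cloud","cloud cloud"),
                         V2 = c("yes","yes","no","no","maybe","maybe"),
                         stringsAsFactors = TRUE)
test_data <- data.frame(V1 = c("sunny","rainy","hello","cloud"),
                        stringsAsFactors = TRUE)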

#fix levels in train_data & test_data dataset before running model
train_data$ind <- "train"
test_data$ind <- "test"
merged_data <- rbind(train_data[,-grep("V2", colnames(train_data))],test_data)
#train data
train <- merged_data[merged_data$ind=="train",]
train$V2 <- train_data$V2
train <- train[,-grep("ind", colnames(train))]
#test data
test <- merged_data[merged_data$ind=="test",]
test <- data.frame(V1 = test[,-grep("ind", colnames(test))])

#svm model
svm_model <- svm(V2 ~ ., data = train, probability=TRUE)
summary(svm_model)
train_pred <- predict(svm_model,train["V1"])
table(train_pred,train$V2)

#prediction on test data
test$test_pred <- predict(svm_model,test)
test
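
Since the stated goal is to classify sentences by the words they contain, an alternative worth considering is to turn each sentence into word counts over a vocabulary fixed from the training data, rather than treating the whole sentence as one factor level. The following is only a sketch under assumptions: count_words is a hypothetical helper, the toy sentences are copied from the question, and the linear kernel with scale = FALSE is an arbitrary choice rather than a recommendation.

```r
# Assumed toy sentences taken from the question's examples.
train_text  <- c("sunny", "sunny rainy sunny", "rainy", "rainy cloud rainy",
                 "cloud", "cloud sunny cloud")
train_label <- factor(c("yes", "yes", "no", "no", "maybe", "maybe"))
test_text   <- c("sunny", "rainy", "hello", "cloud")

# The vocabulary is fixed from the training sentences only.
vocab <- sort(unique(unlist(strsplit(train_text, " ", fixed = TRUE))))

# count_words() is a hypothetical helper: one row per sentence, one column per
# vocabulary word, each cell holding how often that word occurs.
count_words <- function(text, vocab) {
  tokens <- strsplit(text, " ", fixed = TRUE)
  counts <- t(vapply(tokens,
                     function(tok) as.numeric(table(factor(tok, levels = vocab))),
                     numeric(length(vocab))))
  colnames(counts) <- vocab
  counts
}

x_train <- count_words(train_text, vocab)
x_test  <- count_words(test_text, vocab)

# Sentences containing no known word give an all-zero row.
no_known_word <- rowSums(x_test) == 0

# Fit and predict only if e1071 is installed; both matrices share the same columns.
if (requireNamespace("e1071", quietly = TRUE)) {
  bow_model <- e1071::svm(x_train, train_label, kernel = "linear", scale = FALSE)
  bow_pred  <- predict(bow_model, x_test)
}
```

A sentence made only of unknown words, such as "hello", becomes an all-zero row. The model will still assign it some class, so you may want to use no_known_word to flag such rows instead of trusting the label.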

Hope this helps!
