This is my first time using R, the e1071 package, and multiclass SVM, so I'm quite confused. The goal is: a sentence containing "sunny" should be classified as "yes", a sentence containing "cloud" as "maybe", and a sentence containing "rainy" as "no". The real goal is to do some text classification for my research.

I have two files:

  • train.csv: a file with two columns (variables): one holds the data, the other the label.

Example:

                  V1    V2
1              sunny   yes
2        sunny sunny   yes
3  sunny rainy sunny   yes
4  sunny cloud sunny   yes
5              rainy    no
6        rainy rainy    no
7  rainy sunny rainy    no
8  rainy cloud rainy    no
9              cloud maybe
10       cloud cloud maybe
11 cloud rainy cloud maybe
12 cloud sunny cloud maybe
  • test.csv: a file containing the new data to be classified.

Example:

      V1
1  sunny
2  rainy
3  hello
4  cloud
5      a
6      b
7  cloud
8      d
9      e
10     f
11     g
12 hello

Following the examples for the iris dataset (https://cran.r-project.org/web/packages/e1071/e1071.pdf and http://rischanlab.github.io/SVM.html), I created my model and then tested it on the training data like this:

> library(e1071)
> train <- read.csv(file="C:/Users/Stef/Desktop/train.csv", sep = ";", header = FALSE)
> test <- read.csv(file="C:/Users/Stef/Desktop/test.csv", sep = ";", header = FALSE)
> attach(train)
> x <- subset(train, select=-V2)
> y <- V2
> model <- svm(V2 ~ ., data = train, probability=TRUE)
> summary(model)

Call:
svm(formula = V2 ~ ., data = train, probability = TRUE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.08333333 

Number of Support Vectors:  12

 ( 4 4 4 )


Number of Classes:  3 

Levels: 
 maybe no yes

> pred <- predict(model,x)
> system.time(pred <- predict(model,x))
   user  system elapsed 
      0       0       0 
> table(pred,y)
       y
pred    maybe no yes
  maybe     4  0   0
  no        0  4   0
  yes       0  0   4
> pred
    1     2     3     4     5     6     7     8     9    10    11    12 
  yes   yes   yes   yes    no    no    no    no maybe maybe maybe maybe 
Levels: maybe no yes

I think everything is fine up to this point. Now the question is: what about the test data? I couldn't find anything about predicting on test data, so I tried running the model on it like this:

> test
      V1
1  sunny
2  rainy
3  hello
4  cloud
5      a
6      b
7  cloud
8      d
9      e
10     f
11     g
12 hello
> z <- subset(test, select=V1)
> pred <-predict(model,z)
Error in predict.svm(model, z) : test data does not match model !

What is wrong here? Can you please explain how I can classify new data using the model I already trained? Thank you.

EDIT

These are the first 5 rows of each .csv file:

> head(train,5)
                 V1  V2
1             sunny yes
2       sunny sunny yes
3 sunny rainy sunny yes
4 sunny cloud sunny yes
5             rainy  no
> head(test,5)
     V1
1 sunny
2 rainy
3 hello
4 cloud
5     a

can you please provide something like head train.csv and head test.csv? i'm having trouble reconciling your statement that there are two columns with the output you pasted (eg sunny,rainy,sunny,yes) - 3pitt
@MikePalmice I just edited the question for you :) - KeyPi
Shouldn't train and test have the same number of columns? - 3pitt

2 Answers

1
votes

The factor levels of V1 differ between the train and test datasets, so you need to align them before predicting.
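To see the mismatch without fitting anything, here is a minimal base R sketch. It assumes toy data shaped like the question's (not the asker's actual CSV files) and uses model.matrix directly as a stand-in for what a formula interface does with a factor predictor: one indicator column per level. As far as I can tell, the "test data does not match model !" message is raised when the column counts disagree, but treat that as my reading rather than a statement about the package internals.

```r
# Toy data mirroring the question (assumed, not the real train.csv/test.csv)
train <- data.frame(
  V1 = c("sunny", "sunny sunny", "rainy", "rainy rainy", "cloud", "cloud cloud"),
  V2 = c("yes", "yes", "no", "no", "maybe", "maybe"),
  stringsAsFactors = TRUE
)
test <- data.frame(V1 = c("sunny", "rainy", "hello", "cloud"),
                   stringsAsFactors = TRUE)

# Expand V1 into one indicator column per factor level (no intercept)
train_cols <- colnames(model.matrix(~ V1 - 1, data = train))
test_cols  <- colnames(model.matrix(~ V1 - 1, data = test))

length(train_cols)              # 6: one column per distinct training sentence
length(test_cols)               # 4: one column per distinct test sentence
setdiff(test_cols, train_cols)  # "V1hello": a column the model never saw
```

The two design matrices have different widths and different column names, which is why the levels have to be made identical first.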

library(e1071)
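#sample data
#stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
#converted to factors by default; the level merging below relies on factors
train_data <- data.frame(V1 = c("sunny","sunny sunny","rainy","rainy rainy","cloud","cloud cloud"),
                         V2= c("yes","yes","no","no","maybe","maybe"),
                         stringsAsFactors = TRUE)
test_data <- data.frame(V1 = c("sunny","rainy","hello","cloud"),
                        stringsAsFactors = TRUE)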

#fix levels in train_data & test_data dataset before running model
train_data$ind <- "train"
test_data$ind <- "test"
merged_data <- rbind(train_data[,-grep("V2", colnames(train_data))],test_data)
#train data
train <- merged_data[merged_data$ind=="train",]
train$V2 <- train_data$V2
train <- train[,-grep("ind", colnames(train))]
#test data
test <- merged_data[merged_data$ind=="test",]
test <- data.frame(V1 = test[,-grep("ind", colnames(test))])

#svm model
svm_model <- svm(V2 ~ ., data = train, probability=TRUE)
summary(svm_model)
train_pred <- predict(svm_model,train["V1"])
table(train_pred,train$V2)

#prediction on test data
test$test_pred <- predict(svm_model,test)
test
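A shorter variant of the same idea, as a sketch only: instead of merging and splitting the data frames, build the combined level set with union() and apply it to both V1 columns. The variable name all_levels is mine, and the data is the same toy sample as above.

```r
library(e1071)

# Same toy sample as above; V1 is kept as plain character at first
train <- data.frame(
  V1 = c("sunny", "sunny sunny", "rainy", "rainy rainy", "cloud", "cloud cloud"),
  V2 = factor(c("yes", "yes", "no", "no", "maybe", "maybe")),
  stringsAsFactors = FALSE
)
test <- data.frame(V1 = c("sunny", "rainy", "hello", "cloud"),
                   stringsAsFactors = FALSE)

# Give V1 exactly the same levels, in the same order, in both data frames
all_levels <- union(train$V1, test$V1)
train$V1 <- factor(train$V1, levels = all_levels)
test$V1  <- factor(test$V1,  levels = all_levels)

# Fit on train, then predict on test; the design matrices now line up
model <- svm(V2 ~ V1, data = train)
test$pred <- predict(model, test)
test
```

Note that a value such as "hello", which never occurs in the training data, still receives one of the three labels. The model has no information about it, so that particular prediction should not be trusted.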

Hope this helps!

0
votes

I think the problem may be with your select argument to the subset function - what happens if you just execute pred<-predict(model,test)? It's a bit hard to tell whether your original data has two columns (V1,V2) or up to four. Since you trained/initialized the model with data=train, I think predicting on test instead of subset(test,) should resolve the issue.

predict will work on SVMs even if the number of rows in the test set differs from the number of rows the SVM was trained on, so it should be trivial. Something like:

# assumes test also has a label column V2 to compare the predictions against
test.preds <- predict(some.svm, test)
misclassification.rate <- mean(test.preds != test$V2)
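To make the misclassification rate concrete, here is a small base R sketch with made-up values. The truth and preds vectors are hypothetical stand-ins for a labelled test set and for the output of predict; no model is fitted here.

```r
# Hypothetical labelled hold-out: true labels and stand-in predictions
lv    <- c("maybe", "no", "yes")
truth <- factor(c("yes", "no", "maybe", "yes"), levels = lv)
preds <- factor(c("yes", "no", "yes",   "yes"), levels = lv)

# Confusion table: rows are predictions, columns are true labels
confusion <- table(pred = preds, truth = truth)
confusion

# Share of rows where the prediction disagrees with the true label
misclassification_rate <- mean(as.character(preds) != as.character(truth))
misclassification_rate  # 0.25: one of the four rows is wrong
```

Comparing as character avoids an error when the two factors happen to carry different level sets.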