This is my first time using R, the e1071 package, and multiclass SVM, so I'm quite confused. The goal is: a sentence with "sunny" should be classified as a "yes" sentence, a sentence with "cloud" as "maybe", and a sentence with "rainy" as "no". The real goal is to do some text classification for my research.
I have two files:
- train.csv: a file with two columns (variables): one is the data, the other is the label.
Example:
                  V1    V2
1              sunny   yes
2        sunny sunny   yes
3  sunny rainy sunny   yes
4  sunny cloud sunny   yes
5              rainy    no
6        rainy rainy    no
7  rainy sunny rainy    no
8  rainy cloud rainy    no
9              cloud maybe
10       cloud cloud maybe
11 cloud rainy cloud maybe
12 cloud sunny cloud maybe
- test.csv: this file contains the new data to be classified.
Example:
V1
1 sunny
2 rainy
3 hello
4 cloud
5 a
6 b
7 cloud
8 d
9 e
10 f
11 g
12 hello
Following the examples for the iris dataset (https://cran.r-project.org/web/packages/e1071/e1071.pdf and http://rischanlab.github.io/SVM.html), I created my model and then tested it on the training data like this:
> library(e1071)
> train <- read.csv(file="C:/Users/Stef/Desktop/train.csv", sep = ";", header = FALSE)
> test <- read.csv(file="C:/Users/Stef/Desktop/test.csv", sep = ";", header = FALSE)
> attach(train)
> x <- subset(train, select=-V2)
> y <- V2
> model <- svm(V2 ~ ., data = train, probability=TRUE)
> summary(model)
Call:
svm(formula = V2 ~ ., data = train, probability = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.08333333
Number of Support Vectors: 12
( 4 4 4 )
Number of Classes: 3
Levels:
maybe no yes
> pred <- predict(model,x)
> system.time(pred <- predict(model,x))
   user  system elapsed
      0       0       0
> table(pred,y)
       y
pred    maybe no yes
  maybe     4  0   0
  no        0  4   0
  yes       0  0   4
> pred
    1     2     3     4     5     6     7     8     9    10    11    12
  yes   yes   yes   yes    no    no    no    no maybe maybe maybe maybe
Levels: maybe no yes
I think everything is fine up to this point. Now the question is: what about the test data? I couldn't find anything about handling test data, so I thought I should run the model on the test data. This is what I did:
> test
V1
1 sunny
2 rainy
3 hello
4 cloud
5 a
6 b
7 cloud
8 d
9 e
10 f
11 g
12 hello
> z <- subset(test, select=V1)
> pred <-predict(model,z)
Error in predict.svm(model, z) : test data does not match model !
What is wrong here? Can you please explain how I can classify new data using the previously trained model? Thank you.
EDIT
These are the first 5 rows of each .csv file:
> head(train,5)
                 V1  V2
1             sunny yes
2       sunny sunny yes
3 sunny rainy sunny yes
4 sunny cloud sunny yes
5             rainy  no
> head(test,5)
V1
1 sunny
2 rainy
3 hello
4 cloud
5 a
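
A possible explanation, offered as a sketch rather than a confirmed diagnosis: with R versions before 4.0, read.csv turns text columns into factors by default, so V1 would be a factor with 12 levels, one per training sentence. The gamma of 0.08333333 (1/12) in the summary is consistent with 12 dummy columns. The test file contains different sentences, so its columns cannot line up with the ones the model was trained on. One common way around this is to turn each sentence into word counts over a fixed vocabulary and use those counts as features for both files. In the sketch below, count_words and vocab are hypothetical names introduced for illustration, and the data frames are typed in by hand from the examples above instead of being read from the .csv files.

```r
library(e1071)

# Toy data copied from the examples in this post (assumption: typed in by hand,
# not read from train.csv / test.csv)
train <- data.frame(
  V1 = c("sunny", "sunny sunny", "sunny rainy sunny", "sunny cloud sunny",
         "rainy", "rainy rainy", "rainy sunny rainy", "rainy cloud rainy",
         "cloud", "cloud cloud", "cloud rainy cloud", "cloud sunny cloud"),
  V2 = factor(rep(c("yes", "no", "maybe"), each = 4)),
  stringsAsFactors = FALSE
)
test <- data.frame(V1 = c("sunny", "rainy", "hello", "cloud"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: count how often each vocabulary word occurs in each sentence.
# Words that are not in the vocabulary (e.g. "hello") are ignored.
count_words <- function(sentences, vocab) {
  tokens <- strsplit(as.character(sentences), " ", fixed = TRUE)
  counts <- matrix(0, nrow = length(tokens), ncol = length(vocab),
                   dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    counts[i, ] <- as.vector(table(factor(tokens[[i]], levels = vocab)))
  }
  counts
}

# The vocabulary is built from the training sentences only
vocab <- sort(unique(unlist(strsplit(train$V1, " ", fixed = TRUE))))

# Train and test matrices share the same columns, in the same order
x_train <- count_words(train$V1, vocab)
x_test  <- count_words(test$V1, vocab)

model <- svm(x_train, train$V2)   # factor labels give C-classification
pred  <- predict(model, x_test)
print(data.frame(sentence = test$V1, predicted = pred))
```

Note that a sentence such as "hello", which shares no word with the training data, becomes a row of zeros, and a C-classification SVM will still assign it one of the three known labels. For real text, a package that builds a document-term matrix would likely replace the hand-written helper.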