Why does SVM work when using the comma delimited form but not the formula form? R

Question

I have a data set with 218 rows, and I'm working through [this][1] example ([git here][2]). I've split my data into train (nrow = 163; ~75%) and test (nrow = 55; ~25%).

When I get to the step "pred <- predict(model_svm, test)" and convert pred into a data frame, it has 163 rows instead of 55 (when using the formula form of the svm call). Is this normal because 163 rows were used for training? Or should it have only 55 rows, since I'm predicting on the test set?

When I use the 'formula' form of svm, predict returns the wrong number of rows:

model_svm <- svm(trainlabel ~ as.matrix(train) )

But when I use the 'traditional' form, predict on the test data works fine:

model_svm <- svm(as.matrix(train), trainlabel)

Any idea why this is?

Some fake data:

featuredata_all <- matrix(rexp(218 * 23, rate=.1), ncol=23) #218 rows x 23 columns

Some of the code:


library(data.table)
library(e1071) #provides svm()

pt1 <- scale(featuredata_all[,1:22],center=T)
pt2 <- as.character(featuredata_all[,23]) #since the label is a string I kept it separate 

ft<-cbind.data.frame(pt1,pt2) #to preserve the label in text
colnames(ft)[23]<- "Cluster"

## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)

train <- ft[train_ind,1:22] #163 reads
test  <- ft[-train_ind,1:22] #55 reads

trainlabel<- ft[train_ind,23] #163 labels
testlabel <- ft[-train_ind,23] #55 labels

#Support Vector Machine for classification
model_svm <- svm(trainlabel ~ as.matrix(train) )
summary(model_svm)

#Use the predictions on the data
pred <- predict(model_svm, test) 


 [1]: https://iamnagdev.com/2018/01/02/sound-analytics-in-r-for-animal-sound-classification-using-vector-machine/
 [2]: https://github.com/nagdevAmruthnath

PleaseHelp PleaseHelp · Accepted Answer · 2020-05-20T22:25:27

You are correct: with the formula version, pred has one entry per training row, when it should have one entry per test row. I think the problem is that the formula is written with as.matrix(). The term as.matrix(train) names the object train rather than its columns, so predict() most likely has nothing in test to match against and ends up evaluating train again. If you look at the contents of pred, you'll see there are actually a bunch of NAs.
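To see why the row count follows the training set, it can help to reproduce the lookup with base R's lm(), which I believe resolves formula terms against newdata in a comparable way. This is only a sketch on made-up data: the names train, test, y, fit_matrix and fit_cols are illustrative, and lm() stands in for svm() so that no extra package is needed. It is not taken from e1071's source.

```r
set.seed(1)
train <- data.frame(a = rnorm(30), b = rnorm(30))  # 30 training rows
test  <- data.frame(a = rnorm(10), b = rnorm(10))  # 10 test rows
y     <- rnorm(30)                                 # training response

# The term as.matrix(train) refers to the object `train`, not to columns.
# At predict time R looks for `train` inside newdata, does not find it,
# and falls back to the `train` object in the workspace.
fit_matrix <- lm(y ~ as.matrix(train))
p_matrix   <- suppressWarnings(predict(fit_matrix, newdata = test))
length(p_matrix)  # 30, the number of training rows

# Naming columns and passing data = lets predict() use the columns of newdata.
fit_cols <- lm(y ~ a + b, data = train)
p_cols   <- predict(fit_cols, newdata = test)
length(p_cols)    # 10, the number of test rows
```

Without suppressWarnings(), the first predict() call warns that newdata had 10 rows but the variables found have 30, which is the same symptom as in the question.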

Here's the correct way to use the formula, shown with the built-in beaver2 data set:

library(caret) #provides createDataPartition()
library(e1071) #provides svm()

#Create training and testing sets

set.seed(123)
intrain<-createDataPartition(y=beaver2$activ,p=0.8,list=FALSE)
train<-beaver2[intrain,] #80 rows, 4 variables
test<-beaver2[-intrain,] #20 rows, 4 variables

train$activ <- as.factor(train$activ) #a factor outcome makes svm() do classification

svm_beaver2 <- svm(activ ~ ., data=train)

pred <- predict(svm_beaver2, test) #20 responses, the same as the length of test set

Your outcome just has to be a factor. Even if it is stored as a string, you can convert it with train$outcome <- as.factor(train$outcome) and then use the formula as above.
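Applied to the layout in the question, that could look roughly like the sketch below. It assumes the label stays in the same data frame as the 22 features under the name Cluster. The simulated features and the three label values "A", "B" and "C" are made up for illustration, so the predictions themselves are meaningless; only the row counts matter here.

```r
library(e1071)  # provides svm()

set.seed(123)

# Simulated stand-in for the real features: 218 rows, 22 scaled numeric columns.
features <- as.data.frame(scale(matrix(rexp(218 * 22, rate = 0.1), ncol = 22)))

# Hypothetical string labels, converted to a factor so svm() does classification.
features$Cluster <- as.factor(sample(c("A", "B", "C"), 218, replace = TRUE))

# 75% / 25% split that keeps the label column in both parts.
train_ind <- sample(seq_len(nrow(features)), size = floor(0.75 * nrow(features)))
train <- features[train_ind, ]   # 163 rows
test  <- features[-train_ind, ]  # 55 rows

# Formula interface: the terms are column names, resolved via data =.
model_svm <- svm(Cluster ~ ., data = train)
pred <- predict(model_svm, test)

length(pred)  # one prediction per test row
table(predicted = pred, actual = test$Cluster)
```

Because the formula refers to columns by name, predict() can find the same columns in test, and pred has 55 entries instead of 163.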
