8
votes

I am probably making a very simple (and stupid) mistake here but I cannot figure it out. I am playing with some data from Kaggle (Digit Recognizer) and trying to use SVM with the caret package to do some classification. If I just plug the label values into the function as type numeric, caret's train function defaults to regression and performance is quite poor. So what I tried next was to convert the labels to a factor with factor() and run SVM classification. Here is some code where I generate some dummy data and then plug it into caret:

library(caret)
library(doMC)
registerDoMC(cores = 4)

ytrain <- factor(sample(0:9, 1000, replace=TRUE))
xtrain <- matrix(runif(252 * 1000,0 , 255), 1000, 252)

preProcValues <- preProcess(xtrain, method = c("center", "scale"))
transformerdxtrain <- predict(preProcValues, xtrain)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
svmFit <- train(transformerdxtrain[1:10,], ytrain[1:10], method = "svmradial")

I get this error:

Error in kernelMult(kernelf(object), newdata, xmatrix(object)[[p]], coef(object)[[p]]) : 
  dims [product 20] do not match the length of object [0]
In addition: Warning messages:
1: In train.default(transformerdxtrain[1:10, ], ytrain[1:10], method = "svmradial") :
  At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1, X2, X3, X4, X5, X6, X7, X8, X9
2: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method,  :
  There were missing values in resampled performance measures.

Can somebody tell me what I am doing wrong? Thank you!

2
The error message is pretty self-explanatory, isn't it? Call your factor levels something other than 0, 1, ..., 9. – joran
@joran the warning message, isn't it? – agstudy
@agstudy Yes, thank you. That's certainly an embarrassing warning (oops!, I mean error!) on my part! :) – joran
@mchangun it is better to update your answer than doing it in the comment. – agstudy
This may be just a toy example, but resampling from only 10 cases when you have 10 classes seems like trouble. And, in fact, if I reduce it to two classes, it runs fine. Adding labels where ytrain is defined also runs fine for me. Keeping 10 cases and classes but changing to another classifier (rpart, cforest) also works. So my guess is that train can't combine the output of whatever svm function in kernlab is getting run if the different outputs have different numbers of classes. This is just a guess though. – MattBagg

2 Answers

2
votes

You have 10 different classes, yet you pass only 10 cases to train(). This means that when train() resamples, individual resamples will frequently be missing some of the 10 classes, and train() has difficulty combining the results of SVMs fit on differing sets of classes.

You can fix this by some combination of increasing the number of cases, decreasing the number of classes, or even using a different classifier.
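As a minimal sketch of the first two fixes (this reuses the dummy data from the question, but trains on all 1000 cases so every resample contains all 10 classes, and renames the factor levels so they are syntactically valid, which also silences the warning; note the method name is "svmRadial" with a capital R):

```r
library(caret)

set.seed(1)
ytrain <- factor(sample(0:9, 1000, replace = TRUE))
# "0" -> "X0", "1" -> "X1", ... : valid R variable names for the class levels
levels(ytrain) <- make.names(levels(ytrain))
xtrain <- matrix(runif(252 * 1000, 0, 255), 1000, 252)

fitControl <- trainControl(method = "cv", number = 5)
svmFit <- train(xtrain, ytrain, method = "svmRadial", trControl = fitControl)
```

With all 1000 cases there is no resample that drops a class, so the kernelMult error from the 10-case subset should not occur.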

0
votes

I found it challenging to use caret with the digit recognition use case. I think part of the problem is that the label data is numeric: when caret tries to create variable names from the labels, they start with a digit, which is not a valid R variable name.

In my case, I got around it by recoding the label data using dplyr. This assumes your training data is in a data frame called "train".

# recode the numeric label into a word label, label2
train$label2 <- dplyr::recode(train$label,
  `0` = "zero", `1` = "one", `2` = "two", `3` = "three", `4` = "four",
  `5` = "five", `6` = "six", `7` = "seven", `8` = "eight", `9` = "nine")

# rearrange the columns so you can see the new label2 alongside the original label
train <- train[, c(1, 786, 2:785)]
head(train)

# replace label with the factorized version of the recoded label2
train$label <- factor(train$label2)

# drop label2 since it was a temporary variable
train$label2 <- NULL

# view the result
head(train)
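With the labels recoded like this, train() can then be called roughly as in the question (a sketch, assuming the "train" data frame built above with a factor label column and 784 pixel columns):

```r
library(caret)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
svmFit <- train(label ~ ., data = train, method = "svmRadial",
                trControl = fitControl)
```

Because the class levels are now words like "zero" and "one" rather than digits, caret no longer warns about invalid class-level names when class probabilities are generated.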