R e1071 cross-validation accuracy is not the same

Question

I was trying to reproduce the example on page 10 of the libsvm guide "A Practical Guide to Support Vector Classification". The data set I used, "train.2", can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/.

In order to parse the data and test the classification accuracy, I wrote the following code:

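library(e1071)
rm(list = ls(all = TRUE))  # start from a clean workspace

root <- "C:/Users/administrator/Documents/RProjects/libsvm"
bioDataFile <- sprintf("%s/data/train.2", root)

# Each row is "label index:value index:value ..." (libsvm sparse format).
bioData <- read.delim(bioDataFile, header = FALSE, sep = " ", stringsAsFactors = FALSE)
# Drop columns 2, 3 and the last one (presumably empty fields left by extra separators).
bioData <- bioData[, c(-2, -3, -ncol(bioData))]

# Strip the "index:" prefix from every feature and rebuild the data one row at a time.
bioData <- lapply(1:nrow(bioData), function(n) {
  reformData <- bioData[n, -1, drop = FALSE]
  reformData <- sapply(1:ncol(reformData), function(m) {
    as.numeric(unlist(strsplit(reformData[, m], ":"))[2])
  })
  data.frame(Type = factor(bioData[n, 1]), t(reformData))
})
bioData <- do.call("rbind", bioData)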

Then I performed the test:

bioData.model <- svm(Type~., data=bioData, cross=5)

However, I ran into two problems: 1. I couldn't get the same results as shown in the guide; 2. the k-fold cross-validation accuracy (either mean(bioData.model$accuracies) or bioData.model$tot.accuracy) is different each time I run the command.

I ran the same test with the svm-train.exe provided in the libsvm package. It produced the same results as shown in the guide, and no matter how many times I ran it, it always gave me the same k-fold cross-validation accuracies.

Can anyone tell me why? Any help would be much appreciated.

For reproducibility of results you need to set your random seed prior to running cross validation. — Alex A.

ibreznik · Accepted Answer · 2015-04-27T16:12:38

If you look at the documentation, you'll see that the function you are using relies on random numbers. "Random" is a loose term in computing: in practice, an algorithm generates what are called pseudo-random numbers. In basic terms, that algorithm takes one parameter, the random seed, which tells it where to start, and it produces the same sequence every time it is given the same seed. Incidentally, some cryptographic schemes, such as stream ciphers, rely on this same property: the sequence is always the same for a given seed.
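To make the "same seed, same sequence" property concrete, here is a minimal base-R sketch. It uses set.seed() to fix the starting point. The assign_folds() helper is hypothetical: it only mimics how a k-fold routine might shuffle rows into folds and is not the code e1071 actually runs.

```r
# Same seed, same sequence: draws made after identical set.seed() calls match.
set.seed(3)
first_run <- runif(5)

set.seed(3)
second_run <- runif(5)

print(identical(first_run, second_run))

# Hypothetical helper: randomly assign n rows to k equally sized folds,
# roughly what a k-fold cross-validation routine has to do internally.
assign_folds <- function(n, k) sample(rep_len(seq_len(k), n))

set.seed(3)
folds_a <- assign_folds(20, 5)

set.seed(3)
folds_b <- assign_folds(20, 5)

# Identical fold assignments mean identical per-fold accuracies downstream.
print(identical(folds_a, folds_b))
```

Without the set.seed() calls, the two fold assignments would generally differ, which is why the cross-validation accuracy moves from run to run.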

To set the random seed in R, use:

set.seed(3)

Here 3 can be replaced by any integer you like. Once the seed is set, every random number you generate takes the next value in the pseudo-random sequence. So if you set the seed, draw a few random numbers, and then run your code, you should not expect the same result as when you run the code immediately after setting the seed. To get the same cross-validation accuracy every time, call set.seed() right before each svm() call.
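A short sketch of that behaviour, and of re-seeding right before the call. The iris data set is only a stand-in for your bioData, cv_accuracy() is a hypothetical wrapper, and the svm() part is skipped if e1071 is not installed.

```r
# Result obtained immediately after setting the seed.
set.seed(3)
right_after_seed <- sample(100, 5)

# Same seed, but other random draws happen first and advance the generator,
# so this sample would be expected to differ from the one above.
set.seed(3)
invisible(runif(10))
after_other_draws <- sample(100, 5)

# Re-seeding immediately before the call reproduces the first result.
set.seed(3)
reseeded <- sample(100, 5)
print(identical(right_after_seed, reseeded))

# The same pattern around svm(): seed, then cross-validate.
if (requireNamespace("e1071", quietly = TRUE)) {
  # Hypothetical wrapper: resets the seed, then returns the 5-fold CV accuracy.
  cv_accuracy <- function(seed) {
    set.seed(seed)
    e1071::svm(Species ~ ., data = iris, cross = 5)$tot.accuracy
  }
  acc_first <- cv_accuracy(3)
  acc_second <- cv_accuracy(3)
  print(identical(acc_first, acc_second))
}
```

Wrapping the seed and the model fit in one function keeps unrelated random draws from slipping in between them.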

Hope this helps!
