3 votes

I'm using LibSVM in a 5x2 cross-validation to classify a very large amount of data: I have 47k samples for training and 47k samples for testing, in 10 different configurations.

I usually use LibSVM's script easy.py to classify the data, but it's taking so long that I've been waiting for results for more than 3 hours with nothing to show, and I still have to repeat this procedure 9 more times!

Does anybody know how to use LibSVM faster with a very large amount of data? Do the C++ LibSVM functions work faster than the Python functions?


3 Answers

6 votes

LibSVM's training algorithm doesn't scale up to this kind of dataset; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if that still doesn't help, switch to a linear SVM implementation such as LIBLINEAR.
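If a linear decision function is acceptable, here's a minimal sketch of that scale-then-train workflow. It assumes scikit-learn is available: LinearSVC is scikit-learn's LIBLINEAR wrapper, not part of LibSVM's easy.py, and the synthetic data stands in for your 47k-sample splits.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import LinearSVC   # wraps LIBLINEAR, trains in roughly linear time

    # Synthetic placeholder data; substitute your own features and labels here.
    X, y = make_classification(n_samples=94000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    # Scale every feature to [0, 1]; fit the scaler on the training split only.
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # A linear SVM: only C needs tuning, no gamma grid as with the RBF kernel.
    clf = LinearSVC(C=1.0)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))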

3 votes

As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.

The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.

  • You could randomly sample your data and work on a small subset of it.
  • You could project your data into a lower-dimensional space with something like PCA (see the sketch after this list).
  • Depending on your data type, you could look into different kernels. Would a histogram intersection kernel work for your data? Are you using an RBF kernel when you really only need a linear decision function?
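A rough sketch of the first two options, assuming scikit-learn and NumPy; the array shapes, the 200-feature dimensionality, and the subset size are made up for illustration, so substitute your own data loaded from the LibSVM files:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(47000, 200))    # placeholder for your training features
    y = rng.integers(0, 2, size=47000)   # placeholder labels

    # Option 1: train on a random subset (here 10k of the 47k samples).
    idx = rng.choice(len(X), size=10000, replace=False)
    X_sub, y_sub = X[idx], y[idx]

    # Option 2: project onto the top 50 principal components before training.
    pca = PCA(n_components=50)
    X_reduced = pca.fit_transform(X)
    print(X_sub.shape, X_reduced.shape)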

Hope this helps! One of the toughest problems in machine learning is coping with the sheer magnitude of data required at times.

0 votes

easy.py is a script for training and evaluating a classifier; it does a meta-training of the SVM parameters via a grid search with grid.py. grid.py has a parameter "nr_local_worker" that defines the number of worker threads. You might wish to increase it (and check the processor load).
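For example, a sketch of that edit; where exactly nr_local_worker is defined (and its default value) depends on your LibSVM version, so check your own copy of grid.py:

    # Near the top of grid.py (exact line and default may differ between versions):
    nr_local_worker = 1   # default: one local worker process

    # Bump it to roughly the number of CPU cores, e.g.:
    nr_local_worker = 4   # runs four parameter evaluations in parallel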