0 votes

I am new to machine learning.
I am using a Support Vector Machine (SVM) with a polynomial kernel for multi-class classification. My dataset has shape (56010395, 4), in the form (number of samples, number of features). However, my machine has been training for the past week and the training is still not finished. My code is really basic, so I don't understand what the problem is. I can't subsample my dataset. My RAM is 15 GB and I am using an Intel i7 CPU.

I have already tried an SVM with a linear classifier, and that training finished in 3 hours with 75% accuracy. The data is scaled using MinMaxScaler.

import time

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X_data, y_labels, test_size=0.3, random_state=0)

print('start training')
start = time.time()
svm_model_linear = SVC(kernel='poly', degree=3, C=1.0, gamma='auto').fit(X_train, y_train)
print('training_finished')
end = time.time()
print('time: ', end - start)
svm_predictions = svm_model_linear.predict(X_test)
Try to do feature selection by reducing your vocabulary. – krits
This is already a vocabulary with reduced features. I can't reduce it further. – Simran Agarwal
If it takes that much time, it does not necessarily mean there is a problem in your code. Linear SVC scales much better, so I wouldn't be too surprised there is such a big time difference. Your RAM is quite large, so one option could be to use the cache_size argument for your kernel. Maybe you can try cache_size=500 (or larger values)? Try this out on a subsample of your data and see if it gives you any improvement. – MaximeKan
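For reference, that suggestion is just an extra argument to SVC; cache_size is in MB, and the value below is illustrative rather than tuned:

from sklearn.svm import SVC

# Larger kernel cache (in MB); 500-1000 is plausible with 15 GB of RAM
svm_model = SVC(kernel='poly', degree=3, C=1.0, gamma='auto', cache_size=1000)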
How many classes do you have? – Jon Nordby
Why can't you subsample the data? – Jon Nordby

1 Answer

0 votes

SVM training time scales quadratically with the number of samples, or worse. For O(n^2) behaviour the time is proportional to c * n^2. Your model configuration takes about 20 seconds on 100k samples on my machine, which gives a constant of roughly c = 2e-9. So the expected training time for 56,010,395 samples is around 72 days, and probably significantly more.

So either subsample your dataset, or use another classifier. You can use a small Multilayer Perceptron to get expressiveness similar to an SVM with a polynomial kernel. It can be trained with mini-batches using SGD, and using a hinge loss gives the same kind of loss as an SVM uses. A rough sketch of both options is shown below.
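The sketch below uses scikit-learn only. The subsample size, hidden-layer sizes, batch size and iteration count are illustrative, not tuned for this dataset. Note that MLPClassifier optimizes log-loss; if you specifically want a hinge loss, one option is SGDClassifier(loss='hinge') on top of a kernel approximation such as Nystroem.

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Option 1: stratified subsample (keeps the class proportions)
X_sub, _, y_sub, _ = train_test_split(X_train, y_train, train_size=100_000,
                                      stratify=y_train, random_state=0)

# Option 2: small MLP trained with mini-batch SGD on the full data
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), solver='sgd',
                    batch_size=1024, learning_rate_init=0.01,
                    max_iter=20, random_state=0)
mlp.fit(X_train, y_train)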

Btw, you basically always need to optimize the hyperparameter C for an SVM. The best-practice way is 5-fold cross-validation in a grid search, so you should plan on training at least 50 models...
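For example, a 5-fold grid search over C might look like the sketch below; the grid of C values is illustrative, and given the timings that follow you would run it on a subsample.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 10 candidate values of C x 5 folds = 50 model fits
param_grid = {'C': [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300]}
search = GridSearchCV(SVC(kernel='poly', degree=3, gamma='auto'),
                      param_grid, cv=5)
search.fit(X_sub, y_sub)  # on a subsample, as above
print(search.best_params_)

The timing experiment behind the 72-day estimate is below: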


import time

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import numpy
import pandas

def time_training(estimator, n_samples):
    X, y = make_moons(n_samples=n_samples, noise=0.1, random_state=1)
    X = numpy.concatenate([X, X], axis=1)
    assert (X.shape[1] == 4), X.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    start = time.time()
    estimator.fit(X_train, y_train)
    end = time.time()
    t = end-start
    print('took', n_samples, t)
    return t

def main():
    model = SVC(kernel='poly', degree=3, C=1.0, gamma='auto')
    sizes = numpy.array((100, 1e3, 1e4, 2e4, 4e4, 6e4, 1e5, 1.1e5, 1.2e5)).astype(int)
    times = [ time_training(model, s) for s in sizes ]

    df = pandas.DataFrame({
        'samples': sizes,
        'time': times,
    })
    df.to_csv('temp/svmtrain.csv')

if __name__ == '__main__':
    main()

[jon@jon-thinkpad ~]$ python3 temp/svm-training-time.py
took 100 0.0006172657012939453
took 1000 0.00444340705871582
took 10000 0.26808977127075195
took 20000 1.1068146228790283
took 40000 3.8822362422943115
took 60000 8.051671743392944
took 100000 20.05191993713379
took 110000 36.83517003059387
took 120000 61.012284994125366
>>> 0.26/(10000**2)
2.6e-09
>>> 20/(100000**2)
2e-09
>>> 2e-9*(56e6**2)/(3600*24)
72.5925925925926