SVM training time scales at least quadratically with the number of samples. For O(n^2), the time is proportional to c * n^2.
Your model configuration takes about 20 seconds with 100k samples on my machine, giving a constant of around c = 2e-9. So the expected training time for 56 010 395 samples is about 72 days, and probably significantly more.
So either subsample your dataset or use another classifier. A small Multilayer Perceptron can give expressiveness similar to an SVM with a polynomial kernel, and it can be trained in mini-batches using SGD. Using hinge loss gives it the same kind of loss that an SVM uses.
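To illustrate the mini-batch idea, here is a minimal sketch using scikit-learn's `SGDClassifier` with hinge loss and `partial_fit`. Note that this is a linear SVM trained by SGD, not the MLP suggested above, so it will not match a polynomial kernel in expressiveness. The dataset size, batch size and number of epochs are arbitrary assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumption: stands in for the real data)
X, y = make_moons(n_samples=20000, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Hinge loss is the loss a linear SVM uses
clf = SGDClassifier(loss="hinge", random_state=0)
classes = np.unique(y_train)  # partial_fit needs the full set of labels up front

batch_size = 1000  # assumption: pick whatever fits in memory
for epoch in range(5):
    for start in range(0, len(X_train), batch_size):
        stop = start + batch_size
        # Each call updates the model using only this mini-batch
        clf.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)

accuracy = clf.score(X_test, y_test)
print("test accuracy:", accuracy)
```

Since each `partial_fit` call only sees one batch, the full dataset never has to be in memory at once, and the cost per epoch grows linearly with the number of samples.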
Btw, you basically always need to optimize the hyperparameter C for SVM. The best practice is to do 5-fold cross-validation in a grid search. So you should plan to train at least 50 models, e.g. 10 candidate values of C times 5 folds.
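A grid search like that could look roughly as follows with scikit-learn's `GridSearchCV`. This is only a sketch: the 10 log-spaced candidate values for C and the 1000-sample dataset are assumptions, picked so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small subsample so the search finishes quickly (assumption)
X, y = make_moons(n_samples=1000, noise=0.1, random_state=1)

# 10 candidate values for C, log-spaced (assumption)
param_grid = {"C": np.logspace(-2, 2, 10)}
search = GridSearchCV(SVC(kernel="poly", degree=3, gamma="auto"), param_grid, cv=5)
search.fit(X, y)

# Every candidate is trained once per fold; the final refit on all data comes on top
n_fits = len(search.cv_results_["params"]) * search.n_splits_
print("models trained during the search:", n_fits)
print("best C:", search.best_params_["C"])
```

Every one of those fits pays the full SVM training cost for its share of the data, which is why the search multiplies the time estimate above.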
import time

import numpy
import pandas
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def time_training(estimator, n_samples):
    # Synthetic two-class data; duplicate the 2 columns to get 4 features
    X, y = make_moons(n_samples=n_samples, noise=0.1, random_state=1)
    X = numpy.concatenate([X, X], axis=1)
    assert (X.shape[1] == 4), X.shape

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Time only the fit itself
    start = time.time()
    estimator.fit(X_train, y_train)
    end = time.time()
    t = end-start
    print('took', n_samples, t)
    return t


def main():
    model = SVC(kernel='poly', degree=3, C=1.0, gamma='auto')
    sizes = numpy.array((100, 1e3, 1e4, 2e4, 4e4, 6e4, 1e5, 1.1e5, 1.2e5)).astype(int)
    times = [time_training(model, s) for s in sizes]
    # Collect the measurements in a table
    df = pandas.DataFrame({
        'samples': sizes,
        'time': times,
    })


if __name__ == '__main__':
    main()
[jon@jon-thinkpad ~]$ python3 temp/
took 100 0.0006172657012939453
took 1000 0.00444340705871582
took 10000 0.26808977127075195
took 20000 1.1068146228790283
took 40000 3.8822362422943115
took 60000 8.051671743392944
took 100000 20.05191993713379
took 110000 36.83517003059387
took 120000 61.012284994125366
>>> 0.26/(10000**2)
>>> 20/(100000**2)
>>> 2e-9*(56e6**2)/(3600*24)
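As a rough cross-check of the quadratic assumption, the timings printed above can be fitted with a straight line on a log-log scale; the slope is then the empirical scaling exponent. This is just a sketch. Restricting the fit to runs with at least 10000 samples is my assumption, since the smallest runs are likely dominated by fixed overhead.

```python
import numpy as np

# Timings copied from the run above: (samples, seconds)
samples = np.array([100, 1000, 10000, 20000, 40000, 60000, 100000, 110000, 120000])
seconds = np.array([
    0.0006172657012939453, 0.00444340705871582, 0.26808977127075195,
    1.1068146228790283, 3.8822362422943115, 8.051671743392944,
    20.05191993713379, 36.83517003059387, 61.012284994125366,
])

# Fit only the larger runs (assumption: small runs are mostly overhead)
mask = samples >= 10000

# time = c * n^k  =>  log(time) = k * log(n) + log(c)
exponent, log_c = np.polyfit(np.log(samples[mask]), np.log(seconds[mask]), 1)
print("estimated exponent:", exponent)
```

An exponent at or above 2 supports the extrapolation, and if it is above 2 the 72 days is an underestimate.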