11
votes

What I am doing wrong here? I have a large data set that I want to perform a partial fit on using Scikit-learn's SGDClassifier

I do the following

from sklearn.linear_model import SGDClassifier
import pandas as pd

chunksize = 5
clf2 = SGDClassifier(loss='log', penalty="l2")

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
    X = train_df[features_columns]
    Y = train_df["clicked"]
    clf2.partial_fit(X, Y)

I'm getting the error

Traceback (most recent call last): File "/predict.py", line 48, in sys.exit(0 if main() else 1) File "/predict.py", line 44, in main predict() File "/predict.py", line 38, in predict clf2.partial_fit(X, Y) File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 512, in partial_fit coef_init=None, intercept_init=None) File "/Users/anaconda/lib/python3.5/site-packages/sklearn/linear_model/stochastic_gradient.py", line 349, in _partial_fit _check_partial_fit_first_call(self, classes) File "/Users/anaconda/lib/python3.5/site-packages/sklearn/utils/multiclass.py", line 297, in _check_partial_fit_first_call raise ValueError("classes must be passed on the first call " ValueError: classes must be passed on the first call to partial_fit.

2
"Classes across all calls to partial_fit. Can be obtained by via np.unique(y_all), where y_all is the target vector of the entire dataset. This argument is required for the first call to partial_fit and can be omitted in the subsequent calls. Note that y doesn’t need to contain all labels in classes." scikit-learn.org/stable/modules/generated/… - user554546
@JackManey Please post your comment as an answer, so that the asker can accept and/or close the question. - Vivek Kumar

2 Answers

15
votes

Please notice that the classifier does not know the number of classes at the beginning, therefore for the first pass, you need to tell the number of classes using np.unique(target), where target is the class column. Because you are reading the data in chunks, you need to make sure that your first chunk has all possible values for the class label, so it works! Therefore, your code would be:

for train_df in pd.read_csv("train.csv", chunksize=chunksize, iterator=True):
   X = train_df[features_columns]
   Y = train_df["clicked"]
   clf2.partial_fit(X, Y, classes=np.unique(Y))
1
votes

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit

clf2.partial_fit(X, Y, classes=np.unique(Y))

Suppose you don't have sufficient record of class and so classifier need values of a total number of classes that need to be classified.