
Update: some terminology
Sample: a row
Features: columns
'labels': the classes to be predicted (one column in the data).

Basically, my question is this: I have dataset1 and dataset2, identical in shape and size (the number of features is also the same). After training and testing on dataset1, I use the model to predict dataset2.

If I predict all the items in dataset2, accuracy is close to the dataset1 test results. But if I pick one item per class from dataset2, accuracy is around 30%. How can the accuracy on the full dataset2 be so drastically different from the accuracy on the subsampled dataset2?

I am using RandomForestClassifier.

I have a dataset with 200K samples (rows) and around 90 classes. After training and testing, accuracy is high (around 96%).

Now that I have a trained model, I am using a second, different dataset (again with 200K samples and 90 classes) to make predictions.

If I submit all samples from this second dataset, accuracy is close to the training accuracy (around 92%).

But if I select 90 samples (one from each class) from this second dataset, accuracy is not what I expected (around 30%).

.... data preprocessing is done.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = RandomForestClassifier(n_estimators=nestimators,  # nestimators is set during preprocessing
                             bootstrap=False, class_weight=None, criterion='entropy',
                             max_features='auto', max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_weight_fraction_leaf=0.0, n_jobs=6, oob_score=False,
                             random_state=1234,  # pass an int; np.random.seed(1234) returns None
                             verbose=0, warm_start=False)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

and accuracy is around 96%.

Now I am using this trained model on a new dataset (identical in shape):

df2 = pd.read_csv("newdata.csv", low_memory=False, skipinitialspace=True, na_filter=False)
features = ['col1', 'col2', 'col3', 'col4']
Xnew = df2[features].values
ynew = df2['labels'].values   # labels
y_prednew = clf.predict(Xnew)

Accuracy is above 90%, close to the first dataset's accuracy. But

if I filter this new dataset down to one sample per class with this:

df2 = pd.read_csv("newdata.csv", low_memory=False, skipinitialspace=True, na_filter=False)

samplesize = 1
df2 = df2.sample(frac=1)                            # shuffle the rows
df2 = df2.groupby('labels')
df2 = df2.head(samplesize).reset_index(drop=True)   # keep one row per class

features = ['col1', 'col2', 'col3', 'col4']
Xnew = df2[features].values
ynew = df2['labels'].values   # labels
y_prednew = clf.predict(Xnew)

... accuracy is around 35%. But if I do not filter this new data and submit all of it to the model, accuracy is above 90%.

The first and second datasets are identical in shape. If I give all samples from the second dataset to this trained model, accuracy is close to the first dataset's test results. But if I filter it to one sample per class, accuracy is ~30%.

I don't know where I made a mistake.

Are you sure there is a mistake in this? If I use sample size = 1 instead of sample size = 200K, I would not be surprised to obtain low scores. By "classes" you mean features, right? – offeltoffel
I say "classes" for the predictions; the features are my inputs: col1, col2, col3, etc. – ogursoy
Basically, if I use this trained model on another dataset with the same shape, accuracy is what I expected. But if I subsample this second dataset (one sample per class = 90 samples instead of 200K samples), accuracy is around 30%. – ogursoy
You make 90 different predictions with one model? How many features do you use for training then? I am still not sure if this is really a problem of the code. No one guarantees that the one sample you draw per "class" (whatever that is, I still don't quite relate to your terminology) can really be predicted by the model with the same accuracy as the whole dataset. – offeltoffel
Yes, the dataset has 90 different "classes". There are around 256 features. The part I don't get is: if I use the second dataset with all 200K samples for prediction, model accuracy is reasonable. But if I subsample it to one sample per class (90 samples in total), accuracy is around 35%. – ogursoy

1 Answer


Generally the code seems OK. It's hard to know for sure, but I would hazard a guess that the classes aren't equally represented in the dataset (at least in the second one, perhaps also in the first), and that the more dominant classes are identified more accurately.

The classic example is an extremely imbalanced binary classification task where 99% of the samples are positive. By always predicting positive you can get 99% accuracy, but a sample of one datapoint per class would have 50% accuracy (and while the 99% might seem good out of context, the model isn't very useful).
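To make that arithmetic concrete, here is a minimal sketch with a toy imbalanced dataset; the majority-class DummyClassifier stands in for a model that has effectively learned to favour the dominant class:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 4))
y = np.array([1] * 9900 + [0] * 100)                  # 99% positive, 1% negative

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

print(accuracy_score(y, baseline.predict(X)))          # ~0.99 on the full, imbalanced data

X_sub = np.vstack([X[y == 1][:1], X[y == 0][:1]])      # one sample per class
y_sub = np.array([1, 0])
print(accuracy_score(y_sub, baseline.predict(X_sub)))  # 0.5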

I would recommend examining the class frequencies, and also using other metrics (see precision, recall and f1) with the appropriate average parameter to more accurately assess your model's performance.
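As a sketch, assuming df2, ynew and y_prednew from the unfiltered prediction code in the question, that check could look like this:

from sklearn.metrics import classification_report, f1_score

# How often does each class occur? A few dominant classes can carry the
# overall accuracy even if the rarer classes are predicted poorly.
print(df2['labels'].value_counts())

# Per-class precision/recall/F1, plus macro averages that weight every
# class equally regardless of how frequent it is.
print(classification_report(ynew, y_prednew))
print("Macro F1:", f1_score(ynew, y_prednew, average='macro'))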

To summarise, a 90%+ accuracy on the entire dataset and 30% accuracy on a sample of 1 datapoint for each class aren't necessarily conflicting, e.g. if the classes aren't balanced in the dataset.

Edit: In short, what I'm trying to say is that you could be experiencing the Accuracy Paradox.
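As a side note, drawing one sample per class effectively estimates the mean per-class recall, which sklearn exposes as balanced accuracy; assuming ynew and y_prednew from the question, something like the following would be expected to land near the ~30% figure rather than the ~90% one:

from sklearn.metrics import balanced_accuracy_score
print("Balanced accuracy:", balanced_accuracy_score(ynew, y_prednew))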