Update: some terminology:
Sample: row
Features: columns
'labels': classes for the prediction (one column among the features).
Basically, my question is: I have dataset1 and dataset2, identical in shape and size (same number of features). After training and testing on dataset1, I use the model to predict dataset2.
If I predict all items in dataset2, accuracy is close to the dataset1 test results. But if I pick one item per class from dataset2, accuracy is around 30%. How can accuracy on the full dataset2 be so drastically different from accuracy on the "subsampled" dataset2?
I am using RandomForestClassifier.
I have a dataset with 200K samples (rows) and around 90 classes. After training and testing, accuracy is high (around ~96%).
Since I now have a trained model, I am using another, different dataset (again with 200K samples and 90 classes) to make predictions.
If I submit all samples from this second dataset, accuracy is close to the training accuracy (around ~92%).
But if I select 90 samples (one from each class) from this second dataset, accuracy is not what I expected (around ~30%).
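To make concrete what I mean by the two numbers diverging, here is a toy sketch (made-up counts, not my data) showing how overall accuracy, which is weighted by class frequency, can differ from a "one sample per class" style average when classes are imbalanced:

```python
# Toy illustration: overall accuracy is weighted by class frequency,
# while evaluating one sample per class treats every class equally.
y_true = [0] * 95 + [1] * 5   # 95 samples of class 0, 5 of class 1 (made-up)
y_pred = [0] * 100            # a model that always predicts the majority class

# overall (micro) accuracy
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# per-class (macro) accuracy: mean of each class's own accuracy
classes = sorted(set(y_true))
per_class = []
for c in classes:
    idx = [i for i, t in enumerate(y_true) if t == c]
    per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
macro = sum(per_class) / len(per_class)

print(overall)  # 0.95 -- looks great
print(macro)    # 0.5  -- the minority class is never predicted
```

So the same predictions can score very differently depending on how the evaluation set is weighted across classes.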
.... data preprocessing is done.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# random_state expects an int or RandomState; np.random.seed(1234)
# returns None, so the seed is passed directly
clf = RandomForestClassifier(n_estimators=nestimators, bootstrap=False,
                             class_weight=None, criterion='entropy',
                             max_features='auto', max_leaf_nodes=None,
                             min_impurity_decrease=0.0,
                             min_weight_fraction_leaf=0.0, n_jobs=6,
                             oob_score=False, random_state=1234,
                             verbose=0, warm_start=False)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
and accuracy is around ~96%.
Now I am using this trained model with a new dataset (identical in shape):
import pandas as pd

df2 = pd.read_csv("newdata.csv", low_memory=False, skipinitialspace=True, na_filter=False)
features = ['col1', 'col2', 'col3', 'col4']
Xnew = df2[features].values
ynew = df2['labels'].values  # labels
y_prednew = clf.predict(Xnew)
print("Accuracy:", metrics.accuracy_score(ynew, y_prednew))
Accuracy is above ~90%, close to the first dataset's accuracy. But
if I filter this new dataset down to one sample per class with this:
df2 = pd.read_csv("newdata.csv", low_memory=False, skipinitialspace=True, na_filter=False)
samplesize = 1
df2 = df2.sample(frac=1)                      # shuffle all rows
df2 = df2.groupby('labels')                   # group by class
df2 = df2.head(samplesize).reset_index(drop=True)  # keep 1 random row per class
features = ['col1', 'col2', 'col3', 'col4']
Xnew = df2[features].values
ynew = df2['labels'].values  # labels
y_prednew = clf.predict(Xnew)
print("Accuracy:", metrics.accuracy_score(ynew, y_prednew))
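To see whether some classes are simply mispredicted far more often than others, a per-class breakdown could help. Here is a self-contained sketch with synthetic data (the shapes and numbers are made up, not my real data) using sklearn's classification_report:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for the real dataset (made-up sizes)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-class precision/recall shows which classes the model gets wrong
report = classification_report(y, clf.predict(X))
print(report)
```

If the per-class recall varies a lot between classes, an evaluation set with one sample per class would score very differently from the full dataset.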
... accuracy is ~35%. But if I do not filter this new dataset and submit all of it to the model, accuracy is above ~90%.
The first and second datasets are identical in shape. With all samples from the second dataset, accuracy is close to the first dataset's test results; with one sample per class, it drops to ~30%.
I don't know where I made a mistake.
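One check that comes to mind: compare the class-frequency distributions of the two datasets, since the overall accuracy is weighted by those frequencies. A minimal sketch (toy label series, not my data):

```python
import pandas as pd

# Hypothetical stand-ins for the 'labels' columns of the two datasets
labels1 = pd.Series(['a'] * 8 + ['b'] * 2)   # first dataset's labels
labels2 = pd.Series(['a'] * 5 + ['b'] * 5)   # second dataset's labels

# Relative class frequencies; a skewed distribution means the overall
# accuracy is dominated by the frequent classes
print(labels1.value_counts(normalize=True))
print(labels2.value_counts(normalize=True))
```

If a few frequent classes are predicted well and many rare classes are not, the full-dataset accuracy stays high while a one-per-class sample exposes the weak classes.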