
I have spent 30 hours debugging this single problem and it makes absolutely no sense. Hopefully one of you can show me a different perspective.

The problem is that I train a random forest on my training dataframe and get very good accuracy (98%-99%), but when I load a new sample to predict on, the model ALWAYS guesses the same class.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    #  Shuffle the data frame's records. The labels are still attached
    df = df.sample(frac=1).reset_index(drop=True)

    #  Extract the labels and then remove them from the data
    y = list(df['label'])
    X = df.drop(['label'], axis='columns')

    #  Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)

    #  Construct the model
    model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH,
                                   random_state=RANDOM_STATE, oob_score=True)

    #  Calculate the training accuracy
    in_sample_accuracy = model.fit(X_train, y_train).score(X_train, y_train)
    #  Calculate the testing accuracy
    test_accuracy = model.score(X_test, y_test)

    #  Report the in-sample score computed above; the out-of-bag estimate is a separate number
    print('In Sample Accuracy: {:.2f}%'.format(in_sample_accuracy * 100))
    print('OOB Accuracy: {:.2f}%'.format(model.oob_score_ * 100))
    print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))

I process the new data the same way, yet when I predict on X_test or X_train I get my usual 98%, and when I predict on my new data the model always guesses the same class.

    #  The json file is not in the correct format, this function normalizes it
    normalized_json = json_normalizer(json_file, "", training=False)
    #  Turn the json into a list of dictionaries which contain the features
    features_dict = create_dict(normalized_json, label=None)

    #  Convert the dictionaries into pandas dataframes
    df = pd.DataFrame.from_records(features_dict)
    print('Total amount of email samples: ', len(df))

    df = df.fillna(-1)
    #  One hot encodes string values
    df = one_hot_encode(df, noOverride=True)
    if 'label' in df.columns:
        df = df.drop(['label'], axis='columns')

Above is my testing scenario. In the last two lines I predict on X_train, the data used to train the model, and on df, the out-of-sample data, for which the model always guesses class 0.

Some useful information:

  • The datasets are imbalanced; class 0 has about 150,000 samples while class 1 has about 600,000 samples
  • There are 141 features
  • Changing n_estimators and max_depth doesn't fix it

Any ideas would be helpful. If you need more information, let me know; my brain is fried right now and that's all I could think of.

A couple of questions that are not answered in the OP: 1. Have you applied any measure to treat the imbalanced data before training the model? 2. Has the data been randomly sampled before training the models? 3. Has cross-validation been applied before building the model? - mnm
@mnm The model originally worked a few days ago and predicted accurately even without balancing the data, so I made no attempt to balance it. The data was randomly sampled, and I've even tried re-processing and predicting on samples that were used in training, which ended up guessing the same class every time. - DL_Engineer
Could you please check if df is filled out correctly? Maybe it's all -1's after df = df.fillna(-1)? Just a guess. - Kate Melnykova
Accuracy is not so good for imbalanced data as it guides the model towards predicting correctly on the majority class. You need to either (1) resample the data so that the classes are more or less evenly represented, (2) weight the classes, or (3) choose more robust metrics like AUC or F1. - Sergey Bushmanov
@DL_Engineer As per the OP, class 1 is 4 times larger than class 0, so it's quite possible that initially the model was only working for class 1. Try building a new model using ROC AUC as the evaluation metric. - mnm
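
One way to act on the suggestion above to check whether df is filled out correctly is to compare the new frame against the training frame column by column. The sketch below is only an illustration: `compare_feature_frames` and `align_to_training` are hypothetical helper names, the toy frames stand in for X_train and df, and filling absent one-hot columns with 0 is an assumption that may not match what `one_hot_encode` does.

```python
import pandas as pd


def compare_feature_frames(train_X, new_X, na_marker=-1):
    """Summarize how a new feature frame differs from the training frame."""
    train_cols = list(train_X.columns)
    new_cols = list(new_X.columns)
    return {
        # Columns the model was trained on that the new frame lacks
        "missing": [c for c in train_cols if c not in new_X.columns],
        # Columns in the new frame that the model has never seen
        "extra": [c for c in new_cols if c not in train_X.columns],
        # Predicting on a plain array relies on identical column order
        "same_order": new_cols == train_cols,
        # Share of cells equal to the fillna marker (1.0 means "all -1's")
        "filled_fraction": float((new_X == na_marker).to_numpy().mean()),
    }


def align_to_training(train_X, new_X, fill_value=0):
    """Reorder new_X to the training columns, dropping extras and
    filling absent columns with fill_value (assumed 0 for one-hot columns)."""
    return new_X.reindex(columns=train_X.columns, fill_value=fill_value)


# Toy stand-ins for X_train and the out-of-sample df
train_X = pd.DataFrame({"size": [1, 2], "kind_a": [1, 0], "kind_b": [0, 1]})
new_X = pd.DataFrame({"kind_a": [1], "size": [-1], "other": [5]})

print(compare_feature_frames(train_X, new_X))
print(align_to_training(train_X, new_X))
```

If `missing` or `extra` is non-empty, or `filled_fraction` is close to 1, the new sample is probably not reaching the model in the same shape as the training data.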

1 Answer


Fixed. The issue was the class imbalance in the dataset. I also realized that changing the depth gave me different results.

For example:

  • 10 trees with depth 3 seemed to work fine
  • 10 trees with depth 6 went back to guessing only the same class
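
For anyone hitting the same thing, here is a minimal sketch of how the imbalance could be handled and how the two depths could be compared with per-class metrics instead of plain accuracy. It is not the code from the question: the data is synthetic (`make_classification` with a 20/80 split to mimic the 150,000 / 600,000 ratio), and using `class_weight="balanced"` is just one of the options mentioned in the comments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: roughly 20% class 0 and 80% class 1
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.2, 0.8], random_state=0)

# Stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for depth in (3, 6):
    # class_weight="balanced" reweights samples inversely to class frequency
    model = RandomForestClassifier(n_estimators=10, max_depth=depth,
                                   class_weight="balanced", random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # Recall per class exposes a model that only ever predicts one class
    per_class_recall = recall_score(y_test, pred, average=None)
    # ROC AUC uses the predicted probability of class 1
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    results[depth] = (per_class_recall, auc)
    print('max_depth={}: recall per class={}, ROC AUC={:.3f}'.format(
        depth, per_class_recall.round(3), auc))
```

A recall near 0 for one class at a given depth is the "always guesses the same class" symptom, even when overall accuracy still looks high.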