I am trying to validate my model using KFold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import KFold

def printing_kfold_score(X, y):
    fold = KFold(5, shuffle=False)
    recall_accs = []
    for train_index, test_index in fold.split(X):
        X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
        y_train, y_test = y.iloc[train_index, :], y.iloc[test_index, :]
        # Call the logistic regression model with a certain C parameter
        lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
        # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
        lr.fit(X_train, y_train.values.ravel())
        # Predict values using the test indices in the training data
        y_pred_undersample = lr.predict(X_test)
        # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
        recall_acc = recall_score(y_test, y_pred_undersample)
        recall_accs.append(recall_acc)
    print(np.mean(recall_accs))

printing_kfold_score(X_undersample, y_undersample)
X_undersample is a DataFrame of shape (984, 29).
y_undersample is a DataFrame of shape (984, 1).
I am getting the below Warning:
0.5349321454470113
C:\Users\sudha\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
C:\Users\sudha\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Why am I getting this warning? My data is perfectly balanced (50/50), so neither the warning nor the low recall score was expected. Can you tell me what I am doing wrong?
I tried printing the shapes and values of X_test and y_test:
x_train shape (788, 29)
x_test shape (196, 29)
y_train shape (788, 1)
y_test shape (196, 1)
x_test
            V1        V2        V3  ...       V27       V28  normAmount
541  -2.312227  1.951992 -1.609851  ...  0.261145 -0.143276   -0.353229
623  -3.043541 -3.157307  1.088463  ... -0.252773  0.035764    1.761758
4920 -2.303350  1.759247 -0.359745  ...  0.039566 -0.153029    0.606031
y_test
        Class
38042       0
170554      0
16019       0
Is it because of the first column, which represents the index?
Thanks.
In one of your folds, y_test has no positive cases, which is more likely with a sample of only 984 records. That said, if the dependent variable is truly balanced 50-50, that may be unlikely. – blacksite
The first column of y_test in your DataFrame seems to be an index, whereas Class seems (although we only see 0s here) to be binary. Can you provide the counts of each value by class for the predicted and actual class vectors? – blacksite
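
The per-fold class counts asked for above can be checked with something like the following. This is only a sketch on made-up data: X_demo and y_demo are hypothetical stand-ins for X_undersample and y_undersample, and the ordering (all Class 1 rows first, then all Class 0 rows) is an assumption, not something the question confirms. The helper positives_per_test_fold is likewise illustrative. StratifiedKFold is included only for comparison, since it keeps the class ratio roughly equal in every fold.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical stand-in for y_undersample: 984 rows, balanced 50/50,
# but ordered so that all positives come first (an assumption).
y_demo = pd.DataFrame({"Class": [1] * 492 + [0] * 492})
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})


def positives_per_test_fold(splitter, X, y):
    """Return the number of positive (Class == 1) rows in each test fold."""
    labels = y["Class"].to_numpy()
    return [
        int(labels[test_index].sum())
        for _, test_index in splitter.split(X, labels)
    ]


# Without shuffling, KFold cuts the rows into consecutive blocks,
# so a block can fall entirely inside one class.
unshuffled = positives_per_test_fold(KFold(5, shuffle=False), X_demo, y_demo)

# StratifiedKFold uses the labels to spread each class across the folds.
stratified = positives_per_test_fold(StratifiedKFold(5), X_demo, y_demo)

print("KFold(shuffle=False):", unshuffled)
print("StratifiedKFold:     ", stratified)
```

Under the assumed ordering, two of the five unshuffled test folds contain no positive rows at all, which would produce the "no true samples" warning once per such fold. If the real data turns out to be ordered by class, shuffling (KFold(5, shuffle=True)) or a stratified splitter would be the usual things to try.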