
I am trying to validate my data using KFold.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def printing_kfold_score(X, y):
    fold = KFold(5, shuffle=False)
    recall_accs = []

    for train_index, test_index in fold.split(X):
        X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
        y_train, y_test = y.iloc[train_index, :], y.iloc[test_index, :]

        # Call the logistic regression model with a certain C parameter
        lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
        # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
        lr.fit(X_train, y_train.values.ravel())

        # Predict values using the test indices in the training data
        y_pred_undersample = lr.predict(X_test)

        # Calculate the recall score and append it to the list of recall scores for the current C parameter
        recall_acc = recall_score(y_test, y_pred_undersample)
        recall_accs.append(recall_acc)

    print(np.mean(recall_accs))

printing_kfold_score(X_undersample, y_undersample)

X_undersample is a DataFrame of shape (984, 29).

y_undersample is a DataFrame of shape (984, 1).

I am getting the warning below:

0.5349321454470113
C:\Users\sudha\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
C:\Users\sudha\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1272: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Why am I getting this warning? My data is perfectly balanced (50/50), so this warning and the low recall score were not expected. Can you tell me what I am doing wrong?

I tried printing the shapes and values of X_test and y_test.

   X_train shape (788, 29)
   X_test shape (196, 29)
   y_train shape (788, 1)
   y_test shape (196, 1)

 X_test      V1        V2        V3  ...       V27       V28     normAmount
    541  -2.312227  1.951992 -1.609851  ...  0.261145 -0.143276   -0.353229
    623  -3.043541 -3.157307  1.088463  ... -0.252773  0.035764    1.761758
    4920 -2.303350  1.759247 -0.359745  ...  0.039566 -0.153029    0.606031

y_test         Class
38042       0
170554      0
16019       0

Is it because of the first column, which represents the index?

Thanks.

"I am unable to get the desired output" is not helpful; what exactly is your issue and your question?desertnaut
Where exactly (which command)? Please edit & update the question with the full error trace.desertnaut
It could be y_test, in one of your folds, has no positive cases – especially with a sample of only 984 records. Although if the dependent variable is truly balanced 50-50, that may be unlikely.blacksite
@blacksite, I have updated the question with my train and test shape. also I have printed the value of y_test and x_test. Is it because of the first column of my df which is index value?AMIT BISHT
@AMITBISHT, this is a binary classification model, right? Perhaps I'm misunderstanding, but y_test in your DataFrame seems to be an index, where Class seems (although we only see 0s here) binary. Can you provide the counts of each value by class for the predicted and actual class vectors?blacksite

1 Answer


You described the issue in your comment:

y_test changes – sometimes it is all 0, sometimes 1, etc.
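
One quick way to confirm this is to print the class counts of y_test in every fold. This is only a sketch, reusing X_undersample and y_undersample from your question and assuming Class is the single column of y_undersample:

from sklearn.model_selection import KFold

fold = KFold(5, shuffle=False)
for i, (train_index, test_index) in enumerate(fold.split(X_undersample)):
    # Count how many samples of each class end up in the test part of the fold
    print("fold", i, y_undersample.iloc[test_index, 0].value_counts().to_dict())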

This is effectively what's happening:

>>> from sklearn.metrics import recall_score
>>> recall_score([0, 0], [1, 0])
UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
0.0
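
As the message suggests, recall_score also accepts a zero_division argument (available in scikit-learn 0.22 and later). Note that it only silences the warning; it does not fix folds that contain no positive samples:

>>> recall_score([0, 0], [1, 0], zero_division=0)
0.0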

You should take steps to ensure y_test always has positive and negative samples available so you can more accurately assess the performance of your classifier.
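
One way to do that (a sketch, not your original code) is to let the splitter shuffle and stratify, so every fold keeps the 50/50 class ratio:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def printing_stratified_kfold_score(X, y):
    # StratifiedKFold shuffles and preserves the class ratio in every fold,
    # so y_test always contains true positives and recall stays defined
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    recall_accs = []
    for train_index, test_index in skf.split(X, y.values.ravel()):
        X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
        y_train, y_test = y.iloc[train_index, :], y.iloc[test_index, :]
        lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
        lr.fit(X_train, y_train.values.ravel())
        recall_accs.append(recall_score(y_test, lr.predict(X_test)))
    print(np.mean(recall_accs))

printing_stratified_kfold_score(X_undersample, y_undersample)

If X_undersample stacks all rows of one class on top of the other (which is common after undersampling), KFold(5, shuffle=False) hands whole single-class blocks to some folds, and that is exactly what triggers the warning. Even KFold(5, shuffle=True) would usually avoid it, but StratifiedKFold additionally guarantees the class ratio inside every fold.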