How to test on the test data after training on the train data with k-fold cross-validation?

Question

In the code below, I have:

Split the dataset into two parts: train set and test set (7:3). The dataset consists of 200 rows and 9394 columns.
Defined the model.
Used 10-fold cross-validation on the train set.
Obtained the accuracy for each fold.
Obtained a mean accuracy of 94.29%.

My doubts are:

Am I doing this the right way?
Is cross_val_predict() the right way to get predictions for the test data?

Tasks remaining:

To plot accuracy of model.
To plot loss of model.

Can anyone advise on this? Sorry for the long post!

The dataset looks like this (the values are the TF-IDF of each word in the title and body of the news items):

    Unnamed: 0  Unnamed: 0.1    Label   Cosine_Similarity   c0  c1  c2  c3  c4  c5  ... c9386   c9387   c9388   c9389   c9390   c9391   c9392   c9393   c9394   c9395
0   0   0   Real    0.180319    0.000000    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1   1   1   Real    0.224159    0.166667    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2   2   2   Real    0.233877    0.142857    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3   3   3   Real    0.155789    0.111111    0.0 0.000000    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4   4   4   Real    0.225480    0.000000    0.0 0.111111    0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

The code and output are:

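import pandas as pd
from keras.utils import np_utils
from sklearn.preprocessing import LabelEncoder

df_all = pd.read_csv("C:/Users/shiva/Desktop/allinone200.csv")

dataset = df_all.values
x = dataset[:, 3:]  # features: Cosine_Similarity and the TF-IDF columns
Y = dataset[:, 2]   # labels: the "Label" column
# Encode the string labels as integers, then one-hot encode them
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
y = np_utils.to_categorical(encoded_Y)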

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=15,shuffle=True)
x_train.shape,y_train.shape

from keras.models import Sequential
from keras.layers import Dense

def baseline_model():
    model = Sequential()
    model.add(Dense(512, activation='relu', input_dim=x_train.shape[1]))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2, activation='softmax'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Code for fitting the model:

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold

estimator = KerasClassifier(build_fn=baseline_model, epochs=5, batch_size=4, verbose=1)
kf = KFold(n_splits=10, shuffle=True, random_state=15)

for train_index, test_index in kf.split(x_train,y_train):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)

Code for getting the results:

from sklearn.model_selection import cross_val_score
results = cross_val_score(estimator, x_train, y_train, cv=kf)
print(results)

Output:

[0.9285714  1.         0.9285714  1.         0.78571427 0.85714287
 1.         1.         0.9285714  1.        ]
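
A quick sanity check on these numbers, as a minimal sketch: assuming the 7:3 split leaves 140 training rows (70% of 200), each of the 10 folds validates on 14 rows, so every fold score should be a multiple of 1/14. The variable names below are illustrative only.

```python
# Fold accuracies copied from the output above.
fold_scores = [0.9285714, 1.0, 0.9285714, 1.0, 0.78571427,
               0.85714287, 1.0, 1.0, 0.9285714, 1.0]

n_train = 140                    # assumption: 70% of the 200 rows
n_folds = 10
fold_size = n_train // n_folds   # validation rows per fold

# Convert each accuracy back into a count of correctly classified rows.
correct_per_fold = [round(score * fold_size) for score in fold_scores]
errors_per_fold = [fold_size - c for c in correct_per_fold]

print(correct_per_fold)
print(sum(errors_per_fold), "errors out of", n_train)
```

With only 14 rows per fold, a single misclassification moves the fold score by about 7 percentage points, so individual folds at 100% are not unusual at this sample size.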

Mean accuracy:

print("Accuracy: %0.2f (+/-%0.2f)" % (results.mean()*100, results.std()*2))

Output:

Accuracy: 94.29 (+/-0.14)
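
One thing to note about the print statement above: the mean is multiplied by 100 but the standard deviation is not, so the two numbers are in different units. A minimal sketch using only the standard library shows both on the same percentage scale. It uses statistics.pstdev, the population standard deviation, which is what NumPy's .std() computes by default.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross_val_score output in the question.
fold_scores = [0.9285714, 1.0, 0.9285714, 1.0, 0.78571427,
               0.85714287, 1.0, 1.0, 0.9285714, 1.0]

mean_acc = mean(fold_scores)
spread = 2 * pstdev(fold_scores)   # two standard deviations, as a fraction

# As printed in the question: mean in percent, spread as a fraction.
print("Accuracy: %0.2f (+/-%0.2f)" % (mean_acc * 100, spread))
# Both values expressed in percent.
print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (mean_acc * 100, spread * 100))
```

Read on the same scale, the spread is roughly 14 percentage points around the mean, which is considerably wider than "+/-0.14" suggests at first glance.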

Code for prediction:

from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator, x_test, y_test,cv=kf)
print(y_test[0])
print(y_pred[0])

Output (after processing):

[1. 0.]
0

Here the prediction seems okay, because 1 is REAL and 0 is FALSE: y_test[0] is [1. 0.], i.e. class 0, and y_pred[0] is also 0.
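
A note on the prediction step, offered as a hedged sketch rather than a definitive fix: cross_val_predict(estimator, x_test, y_test, cv=kf) clones and refits the estimator on folds of the test set itself, so it never uses a model trained on x_train. A common alternative is to use cross-validation on the training set only for the performance estimate, then fit once on the whole training set and call predict on the test set. The example below assumes synthetic data from make_classification and a LogisticRegression stand-in so that it runs without Keras; neither comes from the question, and the labels here are plain integers rather than one-hot vectors.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the 200-row dataset (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=15)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15, shuffle=True)

# Stand-in classifier; the question uses a KerasClassifier here.
clf = LogisticRegression(max_iter=1000)

# Step 1: cross-validate on the training set only, to estimate performance.
kf = KFold(n_splits=10, shuffle=True, random_state=15)
cv_scores = cross_val_score(clf, x_train, y_train, cv=kf)

# Step 2: refit once on the whole training set.
clf.fit(x_train, y_train)

# Step 3: predict on the untouched test set and evaluate.
y_pred = clf.predict(x_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("CV mean accuracy:", cv_scores.mean())
print("Test accuracy:", test_accuracy)
print(cm)
```

The same three steps should carry over to the KerasClassifier in the question, since it exposes the same fit and predict methods through the scikit-learn wrapper.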

Confusion matrix:

import numpy as np
y_test=np.argmax(y_test, axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

Output:

array([[32,  0],
       [ 1, 27]], dtype=int64)
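
For reference, the summary metrics implied by this matrix can be worked out by hand. This is a minimal sketch in plain Python; it relies on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above
# (rows: true labels, columns: predicted labels).
cm = [[32, 0],
      [1, 27]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall per class: correct predictions divided by the row total.
recall = [cm[i][i] / sum(cm[i]) for i in range(2)]
# Precision per class: correct predictions divided by the column total.
precision = [cm[i][i] / (cm[0][i] + cm[1][i]) for i in range(2)]

print(total, round(accuracy, 4))
print([round(r, 4) for r in recall])
print([round(p, 4) for p in precision])
```

The matrix covers 60 samples (the 30% test split of 200 rows) with a single misclassification, which corresponds to an accuracy of 59/60.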

I don't know your dataset, but 200 rows and 9394 columns with an accuracy of 94% sounds extremely suspicious. As a rule of thumb, you normally need at least twice as many data points (rows) as features (columns) to get even decent results. Your output also contains several 100% accuracies, which is again extremely suspicious. Also, you never defined kf in your example code, so there is no way of knowing whether cross_val_predict with the parameter cv=kf works properly. — Andreas
Thank you @Andreas Hofmann for your quick response. I have added the dataset, and kf is now defined in the code. Could you please review it again? Am I on the right track? — Shiva RD
Does the batch size affect the accuracy of the model? When I increase the batch size (say batch_size=50), the observed accuracy is 87%. Any suggestions, please? — Shiva RD
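
One possible explanation for the batch-size effect (not the only one), sketched as arithmetic: with the number of epochs fixed at 5, a larger batch means far fewer weight updates. The figure of 126 rows per fold fit is an assumption derived from 140 training rows with 14 held out in each of the 10 folds.

```python
import math

rows_per_fold_fit = 140 - 14   # assumption: 140 training rows, 14 held out per fold
epochs = 5                     # as passed to KerasClassifier in the question


def weight_updates(batch_size):
    """Number of gradient steps taken over all epochs."""
    steps_per_epoch = math.ceil(rows_per_fold_fit / batch_size)
    return steps_per_epoch * epochs


for batch_size in (4, 50):
    print(batch_size, weight_updates(batch_size))
```

Under these assumptions a batch size of 50 gives the optimizer roughly a tenth of the updates it gets with a batch size of 4, so raising epochs together with batch_size is one way to check whether this accounts for the lower accuracy.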

finlytics-hub · Accepted Answer · 2020-05-28T02:49:20

Subject to Andreas' comment related to the number of your observations, does this help you in any way: Keras - Plot training, validation and test set accuracy
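
As a rough illustration of the plotting idea in the linked question, here is a minimal sketch with matplotlib. It assumes a Keras-style history dictionary with the keys "accuracy", "val_accuracy", "loss" and "val_loss"; the helper name plot_history and the numbers in example_history are placeholders made up for the demonstration, not results from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_history(history, acc_key="accuracy"):
    """Plot training/validation accuracy and loss from a Keras-style history dict."""
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    epochs = range(1, len(history["loss"]) + 1)

    ax_acc.plot(epochs, history[acc_key], label="train")
    ax_acc.plot(epochs, history["val_" + acc_key], label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    ax_loss.plot(epochs, history["loss"], label="train")
    ax_loss.plot(epochs, history["val_loss"], label="validation")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    fig.tight_layout()
    return fig


# Placeholder numbers only, to show the expected shape of the dictionary.
example_history = {
    "accuracy": [0.62, 0.78, 0.88, 0.92, 0.95],
    "val_accuracy": [0.60, 0.72, 0.80, 0.84, 0.85],
    "loss": [0.68, 0.50, 0.35, 0.25, 0.18],
    "val_loss": [0.69, 0.55, 0.45, 0.41, 0.40],
}
fig = plot_history(example_history)
```

With a real Keras model, the dictionary would come from something like history = model.fit(x_train, y_train, validation_split=0.2, epochs=5, batch_size=4), followed by plot_history(history.history). Older Keras versions name the accuracy key "acc", in which case acc_key="acc" applies. Calling fig.savefig("history.png") writes the figure to disk.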

Bests
