
In the code below, I have:

  1. Split the dataset into two parts: a train set and a test set (7:3). The dataset consists of 200 rows and 9394 columns.
  2. Defined the model.
  3. Used 10-fold cross-validation on the train set.
  4. Obtained the accuracy for each fold.
  5. Obtained a mean accuracy of 94.29%.

My questions are:

  1. Is this the right way to do it?
  2. Is cross_val_predict() used in the right way to predict over the test data?

Tasks remaining:

  1. Plot the accuracy of the model.
  2. Plot the loss of the model.

Can anyone offer suggestions on this? Sorry for the long post!

The dataset looks like this (the values are the TF-IDF scores of each word in the title and body of the news articles):

       Unnamed: 0  Unnamed: 0.1  Label  Cosine_Similarity        c0   c1        c2   c3   c4   c5  ...  c9386  c9387  c9388  c9389  c9390  c9391  c9392  c9393  c9394  c9395
    0           0             0   Real           0.180319  0.000000  0.0  0.000000  0.0  0.0  0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
    1           1             1   Real           0.224159  0.166667  0.0  0.000000  0.0  0.0  0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
    2           2             2   Real           0.233877  0.142857  0.0  0.000000  0.0  0.0  0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
    3           3             3   Real           0.155789  0.111111  0.0  0.000000  0.0  0.0  0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
    4           4             4   Real           0.225480  0.000000  0.0  0.111111  0.0  0.0  0.0  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

The code and output are:

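# Imports for the older standalone Keras API used in this post
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils

df_all = pd.read_csv("C:/Users/shiva/Desktop/allinone200.csv")
dataset = df_all.values

# Assumption: the features are all columns after "Label" (column index 2)
X = dataset[:, 3:].astype(float)
Y = dataset[:, 2]

# The encoder has to be fitted before it can transform, hence fit_transform
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(Y)
y = np_utils.to_categorical(encoded_Y)

# 7:3 split into train and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

def baseline_model():
    model = Sequential()
    model.add(Dense(512, activation='relu', input_dim=x_train.shape[1]))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(2, activation='softmax'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model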

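Code for fitting the model:

# Older Keras wrapper API, as used in this post
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold

estimator = KerasClassifier(build_fn=baseline_model, epochs=5, batch_size=4, verbose=1)
kf = KFold(n_splits=10, shuffle=True, random_state=15)

# Show which rows of the train set fall into each fold
for train_index, test_index in kf.split(x_train, y_train):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)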

Code for getting the results:

from sklearn.model_selection import cross_val_score

results = cross_val_score(estimator, x_train, y_train, cv=kf)
print(results)


[0.9285714  1.         0.9285714  1.         0.78571427 0.85714287
 1.         1.         0.9285714  1.        ]

Mean accuracy:

print("Accuracy: %0.2f (+/-%0.2f)" % (results.mean()*100, results.std()*2))


Accuracy: 94.29 (+/-0.14)
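
A note on the units in that print statement: the mean is multiplied by 100 but the standard deviation is not, so the two numbers are on different scales. A minimal sketch that recomputes both from the fold scores listed above, using only the standard library (`statistics.pstdev` is the population standard deviation, which is also what NumPy's `std()` computes by default):

```python
import statistics

# Fold accuracies copied from the cross_val_score output above
fold_scores = [0.9285714, 1.0, 0.9285714, 1.0, 0.78571427,
               0.85714287, 1.0, 1.0, 0.9285714, 1.0]

mean_acc = statistics.mean(fold_scores)
std_acc = statistics.pstdev(fold_scores)  # population std, like numpy's default

# Report both values in percent so they are directly comparable
summary = "Accuracy: %0.2f%% (+/- %0.2f%%)" % (mean_acc * 100, std_acc * 2 * 100)
print(summary)
```

Expressed in percent, two standard deviations come to roughly 14 percentage points. That spread is wide because each fold holds only 14 rows, so a single misclassified row moves a fold's accuracy by about 7 points.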

Code for prediction:

from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator, x_test, y_test,cv=kf)

Output after processing:

[1. 0.]

Here the prediction seems okay, because 1 is REAL and 0 is FALSE: y_test is 0 and y_pred is also 0.

Confusion matrix:

import numpy as np
y_test=np.argmax(y_test, axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)


array([[32,  0],
       [ 1, 27]], dtype=int64)
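
The matrix above can be read off by hand. A small sketch, assuming scikit-learn's convention that rows are the true classes and columns are the predicted classes:

```python
# Confusion matrix copied from the output above
# (scikit-learn convention: rows = true class, columns = predicted class)
cm = [[32, 0],
      [1, 27]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall per class: correct predictions divided by the row total
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print("rows evaluated:", total)
print("accuracy: %.4f" % accuracy)
print("recall per class:", ["%.4f" % r for r in recall])
```

The 60 evaluated rows match a 30% test split of 200 rows, and the two classes are close to balanced in the test set (32 vs. 28 rows).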

I don't know your dataset, but 200 rows and 9394 columns with an accuracy of 94% sounds extremely suspicious. As a rule of thumb, you normally need at least 2x as many data points (rows) as features (columns) to get even decent results. Your results also contain multiple 100% accuracies, which again is extremely suspicious. Also, you never defined kf in your example code, so there is no way of knowing whether cross_val_predict with the parameter cv=kf works properly. - Andreas
Thank you @Andreas Hofmann for your quick response. I have added the dataset, and kf is now defined in the code. Could you please review it again? Am I on the right track? - Shiva RD
Does the batch size affect the accuracy of the model? When I increase the batch size (say batch_size=50), the observed accuracy is 87%. Any suggestions, please? - Shiva RD

2 Answers


Subject to Andreas' comment about the number of your observations, does this help you in any way: "Keras - Plot training, validation and test set accuracy"?
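
For the two plotting tasks, here is a minimal sketch. It assumes the model is fitted directly with `model.fit(...)`, whose returned `History` object exposes a `history` dict of per-epoch values; `cross_val_score` only returns the final scores. The helper names `pick_metric` and `plot_history`, the output filename, and `validation_split=0.2` are hypothetical choices. The accuracy key is `acc` in older Keras versions and `accuracy` in newer ones, so both are tried:

```python
def pick_metric(history_dict, *names):
    """Return the first per-epoch series stored under any of the given keys."""
    for name in names:
        if name in history_dict:
            return list(history_dict[name])
    return None


def plot_history(history_dict, path="training_curves.png"):
    """Plot accuracy and loss per epoch and save the figure to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    panels = [
        ("accuracy", pick_metric(history_dict, "accuracy", "acc"),
         pick_metric(history_dict, "val_accuracy", "val_acc")),
        ("loss", pick_metric(history_dict, "loss"),
         pick_metric(history_dict, "val_loss")),
    ]
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, (label, train, val) in zip(axes, panels):
        if train is not None:
            ax.plot(range(1, len(train) + 1), train, label="train")
        if val is not None:
            ax.plot(range(1, len(val) + 1), val, label="validation")
        ax.set_xlabel("epoch")
        ax.set_ylabel(label)
        ax.legend()
    fig.savefig(path)
    plt.close(fig)


# Intended use with the model from the question (not executed here;
# validation_split=0.2 is an assumed value):
#   model = baseline_model()
#   history = model.fit(x_train, y_train, validation_split=0.2,
#                       epochs=5, batch_size=4, verbose=0)
#   plot_history(history.history)
```

A widening gap between the train and validation curves would be one sign of the overfitting discussed in this thread.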



Unfortunately my comment became too long, so I am posting it here:

Please have a look at this: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e In short, larger batch sizes often give worse results but train faster, which in your case (200 rows) is probably irrelevant. Secondly, you do not have a (reusable) hold-out set, which can give you a false impression of your true accuracy. Getting an accuracy of over 90% on your first try can mean overfitting, leakage, imbalanced data (e.g. see: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html), or that you were really lucky. K-fold in combination with small sample sizes can lead to wrong conclusions: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0224365
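
One concrete way to see the batch-size effect asked about in the comments: with a fixed number of epochs, a larger batch means far fewer weight updates. A rough sketch of the arithmetic, assuming the 140-row train set from the 7:3 split, 10-fold cross-validation (so 126 rows are fitted per fold), and that a final partial batch still counts as one update:

```python
import math

train_rows = 140                      # 70% of 200 rows
rows_per_fit = train_rows * 9 // 10   # 9 of the 10 folds are used for fitting
epochs = 5


def updates(batch_size):
    """Number of gradient updates over all epochs for one fold."""
    return math.ceil(rows_per_fit / batch_size) * epochs


for batch_size in (4, 50):
    print("batch_size=%d -> %d updates" % (batch_size, updates(batch_size)))
```

With batch_size=50 the network gets roughly a tenth of the updates it gets with batch_size=4, so it may simply be undertrained after 5 epochs. Raising the number of epochs would be one way to check this.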

A few rules of thumb: 1. You want at least 2x as many data points (rows) as features (columns). 2. If you still get a good result without that, it can mean several things; most likely it is an error in the code or the methodology.
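
One methodological point worth checking in the question's code: `cross_val_predict(estimator, x_test, y_test, cv=kf)` does not reuse a model fitted on `x_train`. It refits the estimator on folds of whatever data it is given, which here is the test set. The pure-Python sketch below mimics unshuffled K-fold index splitting (the helper `kfold_indices` is hypothetical, not scikit-learn's implementation) to show how many rows each refit would see, assuming the 60-row test set implied by the 7:3 split:

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


n_test_rows = 60  # 30% of 200 rows
folds = list(kfold_indices(n_test_rows, n_splits=10))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print("fold %2d: refit on %d test-set rows, predict %d rows"
          % (fold_number, len(train_idx), len(test_idx)))
```

Every prediction would therefore come from a model trained on 54 rows of the test set, not on the 140 training rows. To score a hold-out set, the usual pattern is to fit once on the training data and call `predict` on the test data.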

Imagine you have to predict fraud risk for a bank. If the chance that a fraud happens is 1%, I can build you a model which is right 99% of the time by simply saying there is never any fraud...
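
The fraud example can be checked in a few lines. This is only an illustration: the 10,000 transactions are an assumed number, and the "model" is just a constant prediction:

```python
n_transactions = 10_000   # assumed size, for illustration only
fraud_rate = 0.01         # 1% fraud, as in the example above
n_fraud = int(n_transactions * fraud_rate)

y_true = [1] * n_fraud + [0] * (n_transactions - n_fraud)  # 1 = fraud
y_pred = [0] * n_transactions   # "model" that never predicts fraud

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_transactions
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraud

print("accuracy: %.2f" % accuracy)
print("fraud cases caught: %d of %d" % (caught, n_fraud))
```

The accuracy looks excellent while not a single fraud case is detected, which is why accuracy alone says little on imbalanced data.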

Neural nets are extremely powerful, which is both good and bad. The bad part is that they nearly always find some kind of pattern, even if there isn't one. If you give them 2000 columns, it gets a bit like the number pi: if you search long enough in the digits after the decimal point, you will find any number combination you want. It is explained in a bit more detail here: https://medium.com/@jennifer.zzz/more-features-than-data-points-in-linear-regression-5bcabba6883e
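
A toy illustration of that effect, using pure noise instead of a neural net (all sizes and the seed are arbitrary assumptions): with far more columns than rows, simply searching the columns for the one that best matches the training labels finds a "pattern" that does not carry over to new rows.

```python
import random

rng = random.Random(0)
n_train, n_test, n_features = 20, 500, 2000


def random_rows(n_rows):
    """Rows of random 0/1 features with random 0/1 labels (no real signal)."""
    X = [[rng.getrandbits(1) for _ in range(n_features)] for _ in range(n_rows)]
    y = [rng.getrandbits(1) for _ in range(n_rows)]
    return X, y


def feature_accuracy(X, y, j):
    """Accuracy of the rule 'predict the label to be equal to feature j'."""
    return sum(row[j] == label for row, label in zip(X, y)) / len(y)


X_train, y_train = random_rows(n_train)
X_test, y_test = random_rows(n_test)

# Search all features for the one that best matches the training labels
best = max(range(n_features), key=lambda j: feature_accuracy(X_train, y_train, j))

train_acc = feature_accuracy(X_train, y_train, best)
test_acc = feature_accuracy(X_test, y_test, best)
print("best feature on train: %.2f, same feature on test: %.2f" % (train_acc, test_acc))
```

Some column usually matches the training labels well purely by chance, while on fresh rows the same column should do no better than a coin flip.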