Here in the code, I have:
- Split the dataset into two part: Train set and Test set (7:3). The dataset consists of 200 rows and 9394 columns.
- Define the model
- cross validation used: 10 folds on train set
- accuracy obtained for each fold
- mean accuracy obtained: 94.29%
The confusion is:
- Is it the right way I am doing?
- Is cross_val_predict() used in the right way to predict the x over the test data?
Tasks remaining:
- To plot accuracy of model.
- To plot loss of model.
Can anyone suggest in this regards. Sorry for this long notes!!!
The dataset is as: (These are tfidf of each words in the title and body of news)
Unnamed: 0 Unnamed: 0.1 Label Cosine_Similarity c0 c1 c2 c3 c4 c5 ... c9386 c9387 c9388 c9389 c9390 c9391 c9392 c9393 c9394 c9395
0 0 0 Real 0.180319 0.000000 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 1 Real 0.224159 0.166667 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 2 Real 0.233877 0.142857 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 3 3 Real 0.155789 0.111111 0.0 0.000000 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4 4 Real 0.225480 0.000000 0.0 0.111111 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The code and output are:
df_all = pd.read_csv("C:/Users/shiva/Desktop/allinone200.csv")
dataset=df_all.values
x=dataset[0:,3:]
Y= dataset[0:,2]
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
y = np_utils.to_categorical(encoded_Y)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=15,shuffle=True)
x_train.shape,y_train.shape
def baseline_model():
model = Sequential()
model.add(Dense(512, activation='relu',input_dim=x_train.shape[1]))
model.add(Dense(64, activation='relu')))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
code for fitting the model:
estimator = KerasClassifier(build_fn=baseline_model, epochs=5, batch_size=4, verbose=1)
kf = KFold(n_splits=10, shuffle=True,random_state=15)
for train_index, test_index in kf.split(x_train,y_train):
print("Train Index: ", train_index, "\n")
print("Test Index: ", test_index)
code for taking out the results:
results = cross_val_score(estimator, x_train, y_train, cv=kf)
print results
Output:
[0.9285714 1. 0.9285714 1. 0.78571427 0.85714287
1. 1. 0.9285714 1. ]
Mean accuracy:`
print("Accuracy: %0.2f (+/-%0.2f)" % (results.mean()*100, results.std()*2))
Output:
Accuracy: 94.29 (+/-0.14)
code for prediction:
from sklearn.model_selection import cross_val_predict
y_pred = cross_val_predict(estimator, x_test, y_test,cv=kf)
print(y_test[0])
print(y_pred[0])
Output:after processing
[1. 0.]
0
Here prediction seems seems okay. Because 1 is REAL and O is FALSE. y_test is 0 and y_predict is also 0.
Confusion matrix:
import numpy as np
y_test=np.argmax(y_test, axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
Output:
array([[32, 0],
[ 1, 27]], dtype=int64)