I'm trying to compute a score between two files. Both files contain the same texts but not the same labels. The labels in the train data are correct, while the labels in the test data are not necessarily correct. I would like to know the accuracy, recall and F-score.

import pandas
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score

df_train = pd.read_csv('train.csv', sep = ',')
df_test = pd.read_csv('teste.csv', sep = ',')

vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(df_train['text'])
y_train = df_train['label']

vec_test = TfidfVectorizer()
X_test = vec_test.fit_transform(df_train['text'])
y_test = df_test['label']

clf = LogisticRegression(penalty='l2', multi_class = 'multinomial',solver ='newton-cg')

y_pred = clf.predict(X_test)

print ("Accuracy on training set:")
print (clf.score(X_train, y_train))
print ("Accuracy on testing set:")
print (clf.score(X_test, y_test))
print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred))

A toy example of the data:

TRAIN
text,label
dogs are cool,animal
flowers are beautifil,plants
pen is mine,objet
beyonce is an artist,person

TEST
text,label
dogs are cool,objet
flowers are beautifil,plants
pen is mine,person
beyonce is an artist,animal

Error:

Traceback (most recent call last):
  File "accuracy.py", line 30, in <module>
    y_pred = clf.predict(X_test)
  File "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py", line 324, in predict
    scores = self.decision_function(X)
  File "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py", line 298, in decision_function
    "yet" % {'name': type(self).__name__})
sklearn.exceptions.NotFittedError: This LogisticRegression instance is not fitted yet

I just want to calculate the accuracy on the test set.

You did not fit your model at all! First call fit(), then call predict(). After that you can use confusion_matrix to count the correct and incorrect predictions. - M. Doosti Lakhani
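To illustrate the confusion_matrix suggestion in the comment above, here is a minimal sketch. It assumes the toy labels from the question: `y_true` holds the train-file labels and `y_pred` holds the test-file labels, standing in for real model predictions. The names `class_names`, `n_correct` and `n_wrong` are made up for this example.

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels taken from the toy example in the question.
y_true = ["animal", "plants", "objet", "person"]   # labels assumed correct
y_pred = ["objet", "plants", "person", "animal"]   # labels being checked

# Fix the class order so rows and columns are easy to read.
class_names = sorted(set(y_true))
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Diagonal entries are matches, everything else is a mismatch.
n_correct = int(cm.trace())
n_wrong = int(cm.sum() - cm.trace())

print(class_names)
print(cm)
print("correct:", n_correct, "wrong:", n_wrong)
```

Rows correspond to the true labels and columns to the predicted labels, in the order given by `class_names`.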

2 Answers

1
votes

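You are fitting a new TfidfVectorizer for the test data. This will give wrong results, because a separately fitted vectorizer learns its own vocabulary and IDF weights, so its feature columns do not have to line up with the ones the model is trained on. You should reuse the same object that you fitted on the train data.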
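The mismatch is easiest to see when the two files contain different texts. The following is a small sketch with hypothetical texts (`train_texts`, `test_texts` and the variable names are made up for illustration, they are not from the question's files):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts, chosen so the two vocabularies differ.
train_texts = ["dogs are cool", "pen is mine"]
test_texts = ["cats are cool"]

# Vectorizer fitted on the training texts: this defines the feature space.
vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(train_texts)

# Wrong: a second vectorizer learns its own vocabulary from the test texts.
vec_test = TfidfVectorizer()
X_test_wrong = vec_test.fit_transform(test_texts)

# Right: reuse the fitted vectorizer and only call transform().
X_test_right = vec_train.transform(test_texts)

print(X_train.shape)       # (2, 6)
print(X_test_wrong.shape)  # (1, 3)
print(X_test_right.shape)  # (1, 6)
```

Only `X_test_right` has the same number of columns, in the same order, as `X_train`, so only it can be passed to a model trained on `X_train`.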

Do this:

vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(df_train['text'])

X_test = vec_train.transform(df_test['text'])

After that, as @MohammedKashif said, you need to train your LogisticRegression model first and then predict on the test set:

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

After that you can use the scoring code without any errors.
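Putting the steps of this answer together, a self-contained sketch could look like the following. It assumes the toy rows from the question in place of train.csv and teste.csv, uses LogisticRegression with default settings rather than the question's arguments, and picks macro averaging for precision, recall and F-score. All of those choices are assumptions, not something stated in the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy rows copied from the question, standing in for the two CSV files.
texts = ["dogs are cool", "flowers are beautifil",
         "pen is mine", "beyonce is an artist"]
df_train = pd.DataFrame({"text": texts,
                         "label": ["animal", "plants", "objet", "person"]})
df_test = pd.DataFrame({"text": texts,
                        "label": ["objet", "plants", "person", "animal"]})

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vec = TfidfVectorizer()
X_train = vec.fit_transform(df_train["text"])
X_test = vec.transform(df_test["text"])

# Train first, then predict.
clf = LogisticRegression()  # default settings, an assumption for this sketch
clf.fit(X_train, df_train["label"])
y_pred = clf.predict(X_test)

# Compare the predictions with the labels stored in the test file.
accuracy = accuracy_score(df_test["label"], y_pred)
precision, recall, fscore, _ = precision_recall_fscore_support(
    df_test["label"], y_pred, average="macro")

print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "f-score:", fscore)
```

With real data, the two DataFrames would come from pd.read_csv as in the question, and the rest of the code stays the same.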

1
votes

You have to train your classifier object on X_train before calling the predict function on X_test. Something like this:

clf = LogisticRegression(penalty='l2', multi_class='multinomial', solver='newton-cg')

# Then train the classifier on the training data
clf.fit(X_train, y_train)

# Then use the predict function to make predictions
y_pred = clf.predict(X_test)
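To see why the order matters, the sketch below reproduces the NotFittedError from the question on purpose and then shows that the same call works once fit() has run. It assumes the toy rows from the question and a default LogisticRegression. The flag `failed_before_fit` is a name made up for this example.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy rows copied from the question.
texts = ["dogs are cool", "flowers are beautifil",
         "pen is mine", "beyonce is an artist"]
labels = ["animal", "plants", "objet", "person"]

X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression()  # default settings, an assumption for this sketch

# Calling predict() on an unfitted estimator raises NotFittedError.
try:
    clf.predict(X)
    failed_before_fit = False
except NotFittedError as exc:
    failed_before_fit = True
    print("predict before fit failed:", exc)

# After fit(), the same call succeeds.
clf.fit(X, labels)
y_pred = clf.predict(X)
print(y_pred)
```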