
After I cross-validated my training dataset, I began to have trouble with the confusion matrix. My X_train shape is (835, 5) and my y_train shape is (835,). I cannot use this method because the error says my targets are a mix of types. The modules before it were working perfectly. My code is below. How do I set up the training data to work with the confusion_matrix method?

cross_validate/cross_val_score module

from sklearn import linear_model
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
lasso = linear_model.Lasso()
cross_validate_results = cross_validate(lasso, X_train, y_train, return_train_score=True)
sorted(cross_validate_results.keys())
cross_validate_results['test_score']
print(cross_val_score(lasso, X_train, y_train))

confusion_matrix module

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train, X_train)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-83-78f76b6bc798> in <module>()
      1 from sklearn.metrics import confusion_matrix
      2 
----> 3 confusion_matrix(y_test, X_test)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in confusion_matrix(y_true, y_pred, labels, sample_weight)
    248 
    249     """
--> 250     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    251     if y_type not in ("binary", "multiclass"):
    252         raise ValueError("%s is not supported" % y_type)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     79     if len(y_type) > 1:
     80         raise ValueError("Classification metrics can't handle a mix of {0} "
---> 81                          "and {1} targets".format(type_true, type_pred))
     82 
     83     # We can't have more than one value on y_type => The set is no more needed

ValueError: Classification metrics can't handle a mix of multiclass and multiclass-multioutput targets

print shape of arrays module

print(X_train.shape)
print(y_train.shape)
(835, 5)
(835,)

UPDATE: I am now receiving this error when I run confusion_matrix(y_train, y_pred): ValueError: Found input variables with inconsistent numbers of samples: [356, 209]

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train, y_pred)

Full error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-3caf00cb052f> in <module>()
      1 from sklearn.metrics import confusion_matrix
      2 
----> 3 confusion_matrix(y_train, y_pred)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in confusion_matrix(y_true, y_pred, labels, sample_weight)
    248 
    249     """
--> 250     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    251     if y_type not in ("binary", "multiclass"):
    252         raise ValueError("%s is not supported" % y_type)

~\Anaconda3\lib\site-packages\sklearn\metrics\classification.py in _check_targets(y_true, y_pred)
     69     y_pred : array or indicator matrix
     70     """
---> 71     check_consistent_length(y_true, y_pred)
     72     type_true = type_of_target(y_true)
     73     type_pred = type_of_target(y_pred)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    202     if len(uniques) > 1:
    203         raise ValueError("Found input variables with inconsistent numbers of"
--> 204                          " samples: %r" % [int(l) for l in lengths])
    205 
    206 

ValueError: Found input variables with inconsistent numbers of samples: [356, 209]

1 Answer


You need to pass target labels to confusion_matrix, not X: its two arguments are the true labels and the predicted labels (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html). Ideally, you would reserve a proportion of your data as a test set using sklearn's train_test_split (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and use your fitted model to predict y for the test set. Then you would use

confusion_matrix(y_test, y_pred)

to calculate the confusion matrix. If there is no test set, you would still call the predict method of your classifier on X_train to get y_pred. In that case, you pass y_train as the true labels and y_pred as the predicted labels to the confusion matrix, e.g.

confusion_matrix(y_train, y_pred)
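To make the training-set case concrete, here is a minimal sketch. The data is a synthetic stand-in generated with make_classification (an assumption, not your dataset), shaped (835, 5) like yours, and KNeighborsClassifier is just one possible classifier. Keep in mind that a confusion matrix computed on the data the model was fitted on tends to look better than it would on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption): 835 rows, 5 features, 3 classes
X_train, y_train = make_classification(
    n_samples=835, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0)

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

# Predictions are 1-D labels with the same length as y_train
y_pred = clf.predict(X_train)

# Both arguments are label vectors; X_train itself is never passed in
cm = confusion_matrix(y_train, y_pred)
print(cm)
```

Each row of cm corresponds to a true class and each column to a predicted class, so the entries sum to the number of training samples.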

Looking at your code again, your estimator is a regression model (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso), i.e. it predicts numerical values, and you are trying to use a confusion matrix with it, which is meant for assessing the performance of classification models, i.e. how well labels have been predicted. So you ought to consider metrics other than confusion_matrix for your problem.
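If you do stay with Lasso, regression metrics are the natural replacement. The sketch below is only an illustration: the data comes from make_regression (an assumption standing in for your dataset), and mean squared error, mean absolute error and R^2 are common choices rather than the only ones.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumption): 835 rows, 5 features, continuous target
X, y = make_regression(n_samples=835, n_features=5, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=23)

lasso = Lasso()
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

# Regression metrics compare continuous predictions with continuous targets
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("MAE:", mae)
print("R^2:", r2)
```

For a regressor, cross_val_score already reports R^2 by default, so the scores printed in your cross-validation cell are on the same scale as r2 here.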

Since you have now decided to use k-NN, try the following first before dealing with cross-validation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# df is assumed to be the pandas DataFrame that holds your data.
# The target column is assumed to be named 'y'; otherwise use the appropriate column name.
X = df.drop(['y'], axis=1).values.astype('float')
y = df['y'].values.astype('float') # assuming you have label encoded your target variable

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=23, stratify=y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
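Once the single split works, one way to combine cross-validation with a confusion matrix is cross_val_predict, which returns an out-of-fold prediction for every sample. This is a sketch under assumptions: the data is synthetic (make_classification), and cv=5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption): 835 rows, 5 features, 3 classes
X, y = make_classification(
    n_samples=835, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0)

knn = KNeighborsClassifier()

# Each sample is predicted by a model that did not see it during fitting
y_pred_cv = cross_val_predict(knn, X, y, cv=5)

cm_cv = confusion_matrix(y, y_pred_cv)
print(cm_cv)
```

Because every sample gets exactly one prediction, y_pred_cv has the same length as y, which avoids the "inconsistent numbers of samples" error that appears when the true labels and the predictions come from different splits.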