I am a Python beginner and recently learnt about scikit-learn. I am trying to import data from a CSV file into a NumPy array and then run clf.fit on it to fit the data. I am using np.genfromtxt to import the data. If my CSV files have no column names, everything works. However, if I include the column names and use names=True, clf.fit fails with the following error message: "ValueError: X and y have incompatible shapes. X has 1 samples, but y has 420." I am using two CSV files: one is the data and the other is the target. The data file contains 420 rows (excluding the column names) and about 56 columns. The target file contains 420 rows (again excluding the column names) and 1 column. All the data is int/float.
I have attached the outputs below. I am wondering why the behaviour of clf.fit changes depending on whether the NumPy arrays are loaded with the column names or not. Please let me know if you need more information. Note that MLB_data1 is the same as MLB_data but without the column names, and likewise MLB_target1 is MLB_target without the column names.
Code and Output with names=True
import numpy as np
from sklearn import svm
mlb_data = np.genfromtxt("MLB_data.csv", dtype=float, delimiter=',', names=True)
mlb_target = np.genfromtxt("MLB_target.csv", dtype=float, delimiter=',', names=True)
clf = svm.SVC()
clf.fit(mlb_data, mlb_target)
Output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
6
7 clf = svm.SVC()
----> 8 clf.fit(mlb_data, mlb_target)
C:\Users\Anand\Anaconda\lib\site-packages\sklearn\svm\base.pyc in fit(self, X, y, sample_weight)
149 raise ValueError("X and y have incompatible shapes.\n" +
150 "X has %s samples, but y has %s." %
--> 151 (X.shape[0], y.shape[0]))
152
153 if self.kernel == "precomputed" and X.shape[0] != X.shape[1]:
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 420.
Code and Output with names=None
import numpy as np
from sklearn import svm
mlb_data = np.genfromtxt("MLB_data1.csv", dtype=float, delimiter=',', names=None)
mlb_target = np.genfromtxt("MLB_target1.csv", dtype=float, delimiter=',', names=None)
clf = svm.SVC()
clf.fit(mlb_data, mlb_target)
Output:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
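To narrow this down without my actual files, here is a minimal sketch that compares the array shapes np.genfromtxt returns in the two cases. The in-memory CSV text and its column names a and b are made up for illustration and are not my real data. As far as I can tell, names=True gives back a 1-D structured array (one record per row), whereas skipping the header row with skip_header=1 gives an ordinary 2-D float array, which may be why the shapes seen by clf.fit differ:

```python
import io
import numpy as np

# Hypothetical 3-row, 2-column CSV standing in for MLB_data.csv
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# names=True: the header row becomes field names and the result is a
# 1-D structured array with one record per data row
structured = np.genfromtxt(io.StringIO(csv_text), dtype=float,
                           delimiter=',', names=True)
print(structured.shape)        # (3,)
print(structured.dtype.names)  # ('a', 'b')

# skip_header=1: the header row is discarded and the result is a
# plain 2-D float array of shape (rows, columns)
plain = np.genfromtxt(io.StringIO(csv_text), dtype=float,
                      delimiter=',', skip_header=1)
print(plain.shape)             # (3, 2)
```

If this carries over to my files, reading MLB_data.csv with skip_header=1 instead of names=True should produce the same kind of array as MLB_data1.csv does.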