Using clf.fit with numpy arrays from csv

Question

I am a Python beginner and recently learnt about scikit-learn. I am trying to import data from a CSV file into a numpy array and then run clf.fit on it to fit the data. I am using np.genfromtxt to import the data. If the CSV files have no column names, everything seems to work well. However, if I include the column names and use names=True, clf.fit fails with the following error message: "ValueError: X and y have incompatible shapes. X has 1 samples, but y has 420." I am using 2 CSV files: one is the data and the other is the target. The data file contains 420 rows (excluding the column names) and about 56 columns. The target file contains 420 rows (again excluding the column names) and 1 column. All the data is int/float.

I have attached the outputs below. I am wondering why the result of clf.fit changes depending on whether the CSV files contain the column names or not. Please let me know if you need more information. Please note that MLB_data1 is the same as MLB_data but without the column names, and likewise MLB_target1 is MLB_target without the column names.

Code and Output with names=True

import numpy as np
from sklearn import svm
mlb_data = np.genfromtxt("MLB_data.csv", dtype=float, delimiter=',', names=True)
mlb_target = np.genfromtxt("MLB_target.csv", dtype=float, delimiter=',', names=True)
clf = svm.SVC()
clf.fit(mlb_data, mlb_target)

Output:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
6
7 clf = svm.SVC()
----> 8 clf.fit(mlb_data, mlb_target)

C:\Users\Anand\Anaconda\lib\site-packages\sklearn\svm\base.pyc in fit(self, X, y, sample_weight)
    149             raise ValueError("X and y have incompatible shapes.\n" +
    150                              "X has %s samples, but y has %s." %
--> 151                              (X.shape[0], y.shape[0]))
    152
    153         if self.kernel == "precomputed" and X.shape[0] != X.shape[1]:

ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 420.

Code and Output with names=None

import numpy as np
from sklearn import svm
mlb_data = np.genfromtxt("MLB_data1.csv", dtype=float, delimiter=',', names=None)
mlb_target = np.genfromtxt("MLB_target1.csv", dtype=float, delimiter=',', names=None)
clf = svm.SVC()
clf.fit(mlb_data, mlb_target)

Output:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)

ogrisel · Accepted Answer · 2013-12-18T08:23:39

The effective shape of your data is not what you expect: add print statements such as print(mlb_data), print(mlb_data.dtype) and print(mlb_data.shape) to debug how the data was parsed by np.genfromtxt.
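As a rough sketch of that debugging step, here is what those prints might look like on a tiny inline CSV standing in for MLB_data.csv. The column names and values are invented for illustration and are not the asker's data:

```python
import io
import numpy as np

# Hypothetical stand-in for MLB_data.csv: a header row plus two data rows.
csv_text = "hits,runs\n10,3\n12,5\n"

with_names = np.genfromtxt(io.StringIO(csv_text), dtype=float,
                           delimiter=",", names=True)

print(with_names)        # one record per CSV row
print(with_names.dtype)  # a structured dtype with fields 'hits' and 'runs'
print(with_names.shape)  # one dimension only: the number of rows
```

If the shape has a single dimension and the dtype lists field names, the array is structured rather than a plain 2D float array.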

I suspect that when you pass names=True you get a 1D record array where each element is a structured record (one per CSV row) rather than a row of plain floats. This is not the kind of data scikit-learn expects. scikit-learn always wants a homogeneous 2D numpy array with a float dtype.
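Two possible ways to end up with a plain 2D float array, sketched on inline CSV text instead of the real files. The column names, values, and variable names are invented for illustration. The first skips the header row with skip_header=1. The second keeps names=True and converts afterwards with numpy.lib.recfunctions.structured_to_unstructured, which assumes NumPy 1.16 or later:

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-ins for MLB_data.csv and MLB_target.csv.
data_csv = "hits,runs\n10,3\n12,5\n9,1\n"
target_csv = "label\n0\n1\n0\n"

# Option 1: ignore the header row, so genfromtxt returns a plain 2D float array.
X_plain = np.genfromtxt(io.StringIO(data_csv), dtype=float,
                        delimiter=",", skip_header=1)

# Option 2: keep the column names, then convert the structured array
# into an ordinary 2D array.
X_named = np.genfromtxt(io.StringIO(data_csv), dtype=float,
                        delimiter=",", names=True)
X_conv = rfn.structured_to_unstructured(X_named)

# For a single-column target read with names=True, select the field by name
# to get a plain 1D float array.
y_named = np.genfromtxt(io.StringIO(target_csv), dtype=float,
                        delimiter=",", names=True)
y = y_named["label"]

print(X_plain.shape, X_conv.shape, y.shape)
```

Either X_plain or X_conv, together with y, has the layout that clf.fit expects. A 1D array of 420 records being treated as a single row would also be consistent with the "X has 1 samples" message in the question, though that depends on how that scikit-learn version validated its input.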
