I am trying to train (fit) a random forest classifier using Python and scikit-learn on a set of data stored as feature vectors. I can read the data, but training the classifier fails with a ValueError. The source code I am using is the following:

from sklearn.ensemble import RandomForestClassifier
from numpy import genfromtxt

my_training_data = genfromtxt('csv-data.txt', delimiter=',')

X_train = my_training_data[:,0]
Y_train = my_training_data[:,1:my_training_data.shape[1]]

clf = RandomForestClassifier(n_estimators=50)
clf = clf.fit(X_train.tolist(), Y_train.tolist())

The error returned to me is the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 260, in fit
    n_samples, self.n_features_ = X.shape
ValueError: need more than 1 value to unpack

The csv-data.txt file is a comma-separated values file containing 3996 vectors for training the classifier. The first value in each row is the label of the vector, and the remaining values are floats: the dimensions of the feature vector used by the classifier.

Did I miss some conversion here?

If the first number in every row of your text file of training examples is the label, shouldn't X_train and Y_train be swapped? - Matt Hancock

1 Answer

The training examples are stored by row in "csv-data.txt" with the first number of each row containing the class label. Therefore you should have:

X_train = my_training_data[:,1:]
Y_train = my_training_data[:,0]

Note that in the second index of X_train you can leave off the end index, and the slice will automatically run to the end (of course, you can be explicit for clarity; this is just FYI).
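To see why the original slicing triggers the error, it may help to look at the array shapes. With X_train = my_training_data[:,0], X is one-dimensional, so X.shape holds a single value and the two-name unpacking inside fit() fails. The snippet below is only a sketch: the three-row CSV text is a made-up stand-in for csv-data.txt, and the names csv_text, data and X_wrong are hypothetical.

```python
from io import StringIO
import numpy as np

# Hypothetical 3-row stand-in for csv-data.txt: label first, then features.
csv_text = "0,1.5,2.5,3.5\n1,4.0,5.0,6.0\n0,7.5,8.5,9.5\n"
data = np.genfromtxt(StringIO(csv_text), delimiter=',')

X_wrong = data[:, 0]    # label column only: 1-D array
X_train = data[:, 1:]   # feature columns: 2-D array (end index omitted)
Y_train = data[:, 0]    # labels: 1-D array

# Mimic the unpacking that fit() performs on X.shape.
try:
    n_samples, n_features = X_wrong.shape
    unpack_failed = False
except ValueError:
    unpack_failed = True

print(X_wrong.shape, X_train.shape, unpack_failed)
```

The omitted end index and the explicit one select the same columns, so data[:, 1:] and data[:, 1:data.shape[1]] are interchangeable here.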

Also, there is no need to call tolist() in your call to fit(), since X_train and Y_train are already numpy ndarrays, and fit() converts list arguments back to numpy ndarrays anyway.

clf.fit(X_train, Y_train)
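Putting the corrections together, a complete run might look roughly like the following. This is a sketch under assumptions, not your exact setup: the data is randomly generated instead of being read from csv-data.txt, the sizes (100 rows, 4 features, 2 classes) are arbitrary, and the astype(int) cast and random_state argument are my additions. The cast is there because genfromtxt returns floats and integer class labels are usually easier to work with.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the CSV contents (assumption): label in column 0,
# four float features in the remaining columns.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)
features = rng.normal(size=(100, 4))
my_training_data = np.column_stack([labels, features])

X_train = my_training_data[:, 1:]             # 2-D: (n_samples, n_features)
Y_train = my_training_data[:, 0].astype(int)  # 1-D: one label per sample

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, Y_train)                     # ndarrays passed directly

predictions = clf.predict(X_train[:5])
print(predictions)
```

With the real file, the first block would be replaced by the genfromtxt('csv-data.txt', delimiter=',') call from the question.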