I'm trying to build a random forest classifier on a liver disorder dataset, but the fit method raises the following error:


from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.read_csv("data.csv")
df['is_train'] = np.random.uniform(0,1,len(df)) <= 0.75
train, test = df[df['is_train'] == True], df[df['is_train'] == False]
features = df.columns[:10]
y = pd.factorize(train['Selector'])[0]
clf = RandomForestClassifier(n_jobs = 2, random_state = 0)
clf.fit(train[features],y)

ValueError                                Traceback (most recent call last)
in ()
----> 1 clf.fit(train[features],y)

C:\Users\abhir\Anaconda2\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    244         """
    245         # Validate or convert input data
--> 246         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    247         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    248         if sample_weight is not None:

C:\Users\abhir\Anaconda2\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    400                                       force_all_finite)
    401     else:
--> 402         array = np.array(array, dtype=dtype, order=order, copy=copy)
    403
    404     if ensure_2d:

ValueError: could not convert string to float: Male

Why is this happening, and how can I resolve it? (link to dataset)

Harald Gliebe · Accepted Answer · 2017-09-09T07:10:41

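scikit-learn's RandomForestClassifier does not accept string-valued categorical features, such as your 'gender' column with the values 'Male' and 'Female'. Every feature is converted to a float before fitting, which is why the string 'Male' triggers the ValueError. See this question for details.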

To solve that problem, you could encode the categorical variable with a label encoder:

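from sklearn import preprocessing

# Map the string categories to integer codes.
# Classes are sorted alphabetically: 'Female' -> 0, 'Male' -> 1.
le = preprocessing.LabelEncoder()
le.fit(['Male', 'Female'])
# Assign the column directly so it takes the integer dtype.
df['gender'] = le.transform(df['gender'])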
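
As a minimal sketch of what the encoding does, the snippet below applies it to a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the liver dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [65, 62, 58, 72],
})

le = LabelEncoder()
# fit_transform learns the classes and encodes the column in one step.
toy["gender"] = le.fit_transform(toy["gender"])

# LabelEncoder stores the classes in sorted order; the integer code of
# each class is its position in le.classes_.
print(list(le.classes_))
print(toy)
```

After this step the 'gender' column holds integers, so the frame can be converted to floats without error. Note that integer codes imply an ordering; with only two categories that is harmless, but for columns with more categories one-hot encoding may be a better fit.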

The dataset also contains some NaN values in the column Alkphos, which you need to handle before training the classifier. The easiest, though not necessarily the best, option is to drop the rows with missing values:

df = df[np.isfinite(df['Alkphos'])]
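
A quick way to see where the missing values are, and an equivalent way to drop them with pandas, might look like the sketch below. The `toy` frame and its numbers are made up for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Alkphos value.
toy = pd.DataFrame({
    "Alkphos": [187.0, np.nan, 490.0, 182.0],
    "Selector": [1, 1, 2, 1],
})

# Count the missing values per column to see which columns need handling.
print(toy.isna().sum())

# Drop the rows where Alkphos is missing.
cleaned = toy.dropna(subset=["Alkphos"])

# For a numeric column without infinities this matches the np.isfinite filter.
same = toy[np.isfinite(toy["Alkphos"])]
print(cleaned.equals(same))
```

Calling `dropna()` without `subset` would drop rows with a missing value in any column, which may be preferable if other columns also contain NaN.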

Do this preprocessing before splitting the data into training and test sets, so that both sets undergo the same transformation and filtering.
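
Putting the steps in that order could look roughly like the following sketch. It runs on randomly generated data rather than the real CSV, so the frame, the three-column feature list, and `n_estimators=10` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

rng = np.random.RandomState(0)
n = 200

# Synthetic stand-in for the dataset (values are random, not real data).
df = pd.DataFrame({
    "age": rng.randint(20, 80, size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "Alkphos": rng.uniform(100, 500, size=n),
    "Selector": rng.choice([1, 2], size=n),
})
df.loc[::25, "Alkphos"] = np.nan  # inject a few missing values

# 1. Preprocess the whole frame first: encode strings, drop missing rows.
df["gender"] = LabelEncoder().fit_transform(df["gender"])
df = df.dropna(subset=["Alkphos"])

# 2. Only then split into training and test sets, as in the question.
is_train = rng.uniform(0, 1, len(df)) <= 0.75
train, test = df[is_train], df[~is_train]

# 3. Fit on numeric features only and predict on the held-out rows.
features = ["age", "gender", "Alkphos"]
y_train = pd.factorize(train["Selector"])[0]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(train[features], y_train)
predictions = clf.predict(test[features])
print(predictions[:10])
```

Because the encoding and the filtering happen before the split, `train` and `test` share the same columns, dtypes, and category codes.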
