Scikit-learn DictVectorizer for categorical variables

Question

I have a .csv file whose entries look like this:

b0002 ,0,>0.00 ,3,<=0.644 ,<=0.472 ,<=0.690 ,<=0.069672 ,>15.00 ,>21.00 ,>16.00 ,>6.00 ,>16.00 ,>21.00 ,>9.00 ,>11.00 ,>20.00 ,>7.00 ,>4.00 ,>9.00 ,>9.00 ,>13.00 ,>8.00 ,>14.00 ,>3.00 ,"(1.00, 8.00] ",>10.00 ,>9.00 ,>183.00 ,1

I want to use GaussianNB() to classify this. So far I have managed to do that with another CSV containing numerical data; now I want to use this one, but I'm stuck.

What's the best way to transform categorical data for a classifier?

This:

from pandas import read_csv
from sklearn.feature_extraction import DictVectorizer

p = read_csv("C:path to\\file.csv")

trainSet = p.iloc[1:20,2:5]  # first 20 rows and just 3 attributes
dic = trainSet.transpose().to_dict()

vec = DictVectorizer()
vec.fit_transform(dic)

gives this error:

Traceback (most recent call last):
  File "\prova.py", line 23, in <module>
    vec.fit_transform(dic)
  File "\dict_vectorizer.py", line 142, in fit_transform
    return self.transform(X)
  File "\\dict_vectorizer.py", line 230, in transform
    values.append(dtype(v))
TypeError: float() argument must be a string or a number


JAB · Accepted Answer · 2015-02-08T18:59:46

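The issue is that calling .to_dict() on the transposed DataFrame returns a nested dict (each row label maps to a dict of that row's column values), whereas DictVectorizer expects a list of flat dicts, one per row.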

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# create a dummy frame
df = pd.DataFrame({'factor': ['a','a','a','b','c','c','c'], 'factor1': ['d','a','d','b','c','d','c'], 'num': range(1,8)})

# transpose the dataframe (.T is a property, not a method) and collect
# the inner dicts, one per row, from to_dict()
feats = list(df.T.to_dict().values())

# string columns are one-hot encoded; numeric columns pass through unchanged
Dvec = DictVectorizer()
Dvec.fit_transform(feats).toarray()

The solution is to call .values() on the outer dict to get the inner dicts, one per row.
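
To see why .values() helps, it may be useful to compare the two shapes side by side. The sketch below uses a small made-up frame (not the asker's CSV) and also shows to_dict(orient="records"), a pandas option that should give the same list of row dicts without the transpose.

```python
import pandas as pd

# Small illustrative frame (made-up data, not the asker's CSV)
df = pd.DataFrame({"factor": ["a", "b"], "num": [1, 2]})

# Transposing and calling to_dict() nests one dict per row under its row label:
# {0: {"factor": "a", "num": 1}, 1: {"factor": "b", "num": 2}}
nested = df.T.to_dict()

# DictVectorizer wants only the inner dicts, one per sample
rows = list(nested.values())

# orient="records" builds the same list directly, without the transpose
records = df.to_dict(orient="records")

print(rows == records)
```

Either rows or records can be passed to DictVectorizer.fit_transform; the records form skips the transpose and keeps each column's original dtype.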

Get new feature names from Dvec:

# scikit-learn >= 1.0; older releases use Dvec.get_feature_names()
Dvec.get_feature_names_out()
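
Since the question is about feeding the result to GaussianNB, here is a minimal end-to-end sketch. The data and the "label" target column are made up for illustration, and sparse=False is used on the assumption that a dense array is wanted for GaussianNB. It reads the fitted names from the feature_names_ attribute.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import GaussianNB

# Made-up data for illustration; "label" is a hypothetical target column
df = pd.DataFrame({
    "factor": ["a", "a", "b", "c", "c", "b"],
    "num": [1, 2, 3, 4, 5, 6],
    "label": [0, 0, 1, 1, 1, 0],
})

# One flat dict per row, with the target column excluded
feats = df.drop(columns="label").to_dict(orient="records")

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(feats)

# String values become one-hot "name=value" columns; numbers pass through
print(vec.feature_names_)

clf = GaussianNB().fit(X, df["label"])

# New rows go through the already fitted vectorizer (transform, not fit_transform)
new_rows = [{"factor": "b", "num": 3}]
print(clf.predict(vec.transform(new_rows)))
```

Reusing the fitted vectorizer for new data keeps the column order identical between training and prediction.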
