2
votes

I have a dataframe with int and categorical features. The categorical features are of two types: numbers and strings.

I was able to one-hot encode the int columns and the categorical columns that hold numbers. I get an error when I try to one-hot encode the categorical columns that hold strings:

ValueError: could not convert string to float: '13367cc6'

Since the dataframe is huge and has high cardinality, I only want to convert it to a sparse form. I would prefer a solution that uses from sklearn.preprocessing import OneHotEncoder, since I am familiar with it.

I checked other questions too, but none of them addresses what I am asking.

data = [[623, 'dog', 4], [123, 'cat', 2],[623, 'cat', 1], [111, 'lion', 6]]

The above dataframe contains 4 rows and 3 columns

Column names - ['animal_id', 'animal_name', 'number']

Assume that animal_id and animal_name are stored in pandas with the category dtype, and number with the int64 dtype.
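For reproducibility, here is a minimal sketch of how this frame could be built with the dtypes described above. The variable name df is just an illustrative choice, not something from my real code.

```python
import pandas as pd

data = [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]]
df = pd.DataFrame(data, columns=['animal_id', 'animal_name', 'number'])

# Store the two categorical features with the pandas "category" dtype;
# "number" stays a plain integer column.
df['animal_id'] = df['animal_id'].astype('category')
df['animal_name'] = df['animal_name'].astype('category')

print(df.dtypes)
```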

2
Can you provide a small reproducible sample data set? - MaxU
Added an example. Let me know if you need any other details. - Aman

2 Answers

1
votes

Assuming you have the following DF:

In [124]: df
Out[124]:
   animal_id animal_name  number
0        623         dog       4
1        123         cat       2
2        623         cat       1
3        111        lion       6

In [125]: df.dtypes
Out[125]:
animal_id         int64
animal_name    category
number            int64
dtype: object

First, save the animal_name column (in case you need it later):

In [126]: animal_name = df['animal_name']

Convert the animal_name column to a numeric categorical column (this also saves memory):

In [127]: df['animal_name'] = df['animal_name'].cat.codes.astype('category')

In [128]: df
Out[128]:
   animal_id animal_name  number
0        623           1       4
1        123           0       2
2        623           0       1
3        111           2       6

In [129]: df.dtypes
Out[129]:
animal_id         int64
animal_name    category
number            int64
dtype: object

Now OneHotEncoder should work:

In [130]: enc = OneHotEncoder()

In [131]: enc.fit(df)
Out[131]:
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

In [132]: X = enc.fit(df)

In [134]: X.n_values_
Out[134]: array([624,   3,   7])

In [135]: enc.feature_indices_
Out[135]: array([  0, 624, 627, 634], dtype=int32)
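As a side note, the session above reflects an older scikit-learn API. The following is a minimal sketch, assuming a more recent scikit-learn release (0.20 or later, where OneHotEncoder accepts string categories directly), that skips the cat.codes step and keeps the result sparse. The names cat_cols, X_cat, X_num and X are illustrative, and treating number as a plain numeric feature is an assumption about the intended use.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]],
    columns=['animal_id', 'animal_name', 'number'],
)

# Columns to one-hot encode (assumed here: the two categorical features).
cat_cols = ['animal_id', 'animal_name']

# The encoder returns a SciPy sparse matrix by default.
enc = OneHotEncoder(handle_unknown='ignore')
X_cat = enc.fit_transform(df[cat_cols])

# Keep the numeric column as-is and stack it next to the encoded block,
# staying in sparse form throughout.
X_num = sparse.csr_matrix(df[['number']].to_numpy(dtype='float64'))
X = sparse.hstack([X_cat, X_num], format='csr')

print(enc.categories_)
print(X.shape)
```

Encoding only the categorical columns also avoids the very wide output seen above, where the integer ids were expanded to 624 columns.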
1
votes

FYI, there are other powerful encoding schemes that do not add as many columns as one-hot encoding (in fact, they do not add any columns at all). Two of them are count encoding and target encoding. For more details, see my answer here and my ipynb here.
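To illustrate the idea, here is a minimal sketch of count encoding with plain pandas, using the animal_name values from the question. It is one simple way to do it, not necessarily the exact approach in the linked notebook; the names counts and encoded are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'animal_name': ['dog', 'cat', 'cat', 'lion']})

# Count how often each category occurs in the column.
counts = df['animal_name'].value_counts()

# Replace each category with its count; the number of columns is unchanged.
encoded = df.copy()
encoded['animal_name'] = df['animal_name'].map(counts)

print(encoded)
```

Note that categories with the same frequency (dog and lion here) become indistinguishable after count encoding.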