One-hot encoding using scikit-learn

Question

I am working on a machine learning project, and one feature of my dataset consists of categorical data. This data is first stored in a pandas Series (<class 'pandas.core.series.Series'>) called mesh with dimensions (2000,). The number of rows corresponds to the total number of data instances. Each row contains a string listing the categories to which that data instance belongs, with the categories separated by commas. There are several hundred different categories. For example,

0       Aged, Angiotensin-Converting Enzyme Inhibitors...
1       Aged, Angiotensin-Converting Enzyme Inhibitors...
2       Adult, Aged, Aged, 80 and over, Angiotensin-Co...
....

In this example, Aged is one category and Angiotensin-Converting Enzyme Inhibitors is another. As the example shows, the same category may occur multiple times in a string, but this should be encoded no differently than if the category appeared only once.

I wish to use one-hot encoding to represent them. To try to do this I use this code:

mlb = MultiLabelBinarizer(sparse_output=True)
for s in data:
   pre_data = mlb.fit_transform(str(s).split(', '))
return pre_data, len(mlb.classes_)

However, this produces a numpy array of dimensions (19, 37). Why is this the case?

In response to MaxU's answer:

When replacing str(s).split(', ') with s.str.split(',\s*'), I get this error:

Traceback (most recent call last):
  File ".../guidedLearning.py", line 166, in <module>
    X, y = processTrainingData(directory, filename)
  File ".../guidedLearning.py", line 130, in processTrainingData
    pre_mesh, meshN = oneHot(mesh)
  File ".../guidedLearning.py", line 73, in oneHot
    pre_data = mlb.fit_transform(data.str.split(',\s*'))
  File ".../sklearn/preprocessing/label.py", line 723, in fit_transform
    yt = self._transform(y, class_mapping)
  File ".../sklearn/preprocessing/label.py", line 781, in _transform
    indices.extend(set(class_mapping[label] for label in labels))
TypeError: 'float' object is not iterable

MaxU · Accepted Answer · 2017-04-14T10:43:30

str(s) converts a pandas.Series of strings into a single string, delimited by '\n', so use the pandas.Series.str.split() method on the whole Series instead.

replace

str(s).split(', ')

with

s.str.split(r',\s*')
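
A related detail that may explain the unexpected shape in the question: MultiLabelBinarizer expects an iterable of label collections, one collection per row. If it is given a flat list of strings, it treats each string as a row and each character as a label. The sketch below uses one row from the sample above to show the difference; the variable names are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

row = "Aged, Angiotensin-Converting Enzyme Inhibitors"
labels = row.split(", ")  # ['Aged', 'Angiotensin-Converting Enzyme Inhibitors']

# Flat list of strings: each string becomes a row, each character a label.
mlb_chars = MultiLabelBinarizer()
per_char = mlb_chars.fit_transform(labels)
print(per_char.shape)  # (number of strings, number of distinct characters)

# List of lists: one row whose labels are the two category names.
mlb_labels = MultiLabelBinarizer()
per_label = mlb_labels.fit_transform([labels])
print(per_label.shape, list(mlb_labels.classes_))
```

If the loop in the question runs over the individual strings, the result would come from the last row only, with one output row per comma-separated entry and one column per distinct character, which would be consistent with a shape like (19, 37).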

Demo:

In [88]: s
Out[88]:
0       Aged, Angiotensin-Converting Enzyme Inhibitors
1       Aged, Angiotensin-Converting Enzyme Inhibitors
2    Adult, Aged, Aged, 80 and over, Angiotensin-Co...
Name: s, dtype: object

In [89]: mlb = MultiLabelBinarizer(sparse_output=True)

In [90]: pre_data = mlb.fit_transform(s.str.split(r',\s*'))

In [91]: mlb.classes_
Out[91]: array(['80 and over', 'Adult', 'Aged', 'Angiotensin-Converting Enzyme Inhibitors'], dtype=object)

In [92]: pre_data.toarray()
Out[92]:
array([[0, 0, 1, 1],
       [0, 0, 1, 1],
       [1, 1, 1, 1]])
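
Regarding the TypeError: 'float' object is not iterable in the follow-up: a likely cause (an assumption, since the data is not shown) is that some rows are missing (NaN). .str.split() leaves NaN, which is a float, in place for those rows, and MultiLabelBinarizer then tries to iterate over it. A minimal sketch of one way to handle this, where one_hot is a hypothetical helper and missing rows are treated as having no categories:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def one_hot(series):
    """Hypothetical helper: encode a Series of comma-separated category strings."""
    # Assumption: a missing row means "no categories", so NaN becomes "".
    cleaned = series.fillna("").str.strip()
    # Split on commas with optional surrounding whitespace, then drop empty labels.
    label_lists = cleaned.str.split(r"\s*,\s*").apply(
        lambda labels: [label for label in labels if label]
    )
    mlb = MultiLabelBinarizer(sparse_output=True)
    encoded = mlb.fit_transform(label_lists)
    return encoded, mlb


example = pd.Series([
    "Aged, Angiotensin-Converting Enzyme Inhibitors",
    np.nan,  # missing row, encoded as all zeros
    "Adult, Aged, Aged, 80 and over",
])
encoded, mlb = one_hot(example)
print(mlb.classes_)
print(encoded.toarray())
```

Whether an all-zero row is the right treatment for missing data depends on the project; dropping those rows with dropna() before encoding is another option.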
