8
votes

I'm new to data analytics. I'm trying some models in Python with sklearn. I have a dataset in which some of the columns contain text values, as shown below:

Dataset

Is there a way to convert these column values into numbers in pandas or sklearn? Is it right to assign numbers to these values? And what if a new string shows up in the test data?

Please advise.

2
Consider using the get_dummies function available in pandas. Ignore any new values encountered in the test data; you cannot use values that were not seen during training. - shanmuga
I was thinking of using it, but some of the columns have many unique values (400+). - Selva Saravana Er

2 Answers

3
votes

Consider using Label Encoding - it transforms the categorical data by assigning each category an integer between 0 and the num_of_categories-1:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(['a','b','c','d','a','c','a','d'], columns=['letter'])

  letter
0      a
1      b
2      c
3      d
4      a
5      c
6      a
7      d

Applying:

le = LabelEncoder()
# fit_transform is applied to each column in turn, so the encoder is refit per column
encoded_series = df.apply(le.fit_transform)

encoded_series:

   letter
0       0
1       1
2       2
3       3
4       0
5       2
6       0
7       3
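
On the question of a new string appearing in the test data: one option, sketched here as an assumption rather than as part of the answer above, is scikit-learn's OrdinalEncoder, which can map unseen categories to a sentinel value. The train/test frames below are made up for illustration, and the handle_unknown='use_encoded_value' option needs scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test frames; 'e' appears only in the test data.
train = pd.DataFrame({'letter': ['a', 'b', 'c', 'd', 'a']})
test = pd.DataFrame({'letter': ['b', 'e', 'd']})

# Categories not seen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = enc.fit_transform(train[['letter']])
test_codes = enc.transform(test[['letter']])

print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

The encoder is fit on the training data only, so the same string always maps to the same integer in both sets, and anything new ends up in a single "unknown" bucket.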
0
votes

You can convert them into integer codes by using the categorical datatype.

column = column.astype('category')
column_encoded = column.cat.codes

As long as you use a tree-based model with deep enough trees, e.g. GradientBoostingClassifier(max_depth=10), your model should be able to split out the categories again.
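
If the codes need to line up between training and test data, a minimal sketch (assuming made-up train/test Series, not the asker's dataset) is to fix the category list from the training data with pd.CategoricalDtype and reuse it for the test data. Values outside that list are masked to missing first, and missing values get the code -1.

```python
import pandas as pd

# Hypothetical data; 'e' appears only in the test set.
train = pd.Series(['a', 'b', 'c', 'd', 'a'])
test = pd.Series(['b', 'e', 'd'])

# Fix the categories from the training data so each code means the same thing in both sets.
cat_type = pd.CategoricalDtype(categories=sorted(train.unique()))

train_codes = train.astype(cat_type).cat.codes

# Mask values unseen in training to missing; missing values are encoded as -1.
test_known = test.where(test.isin(cat_type.categories))
test_codes = test_known.astype(cat_type).cat.codes

print(train_codes.tolist())
print(test_codes.tolist())
```

Calling astype('category') separately on train and test would instead build two independent category lists, so the same string could receive different codes in each.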