Factorizing text features for classification

Question

I have a dataframe, df, containing both text and numerical features, similar to the one shown below.

Feature 1     Feature 2         Feature 3           Feature 4         Label
 10            20                keyword             Human             1
  2             3                Keywords            Dog               0
  8             2                Stackoverflow       cat               0

Currently I convert the text features into numerical features with the factorize function and then use the new dataframe for classification.

df[' Feature 3'] = df[' Feature 3'].factorize()[0]
df[' Feature 4'] = df[' Feature 4'].factorize()[0]

After running the above code, my dataframe looks like this:

 Feature 1     Feature 2         Feature 3           Feature 4         Label
 10            20                0                    0                 1
  2             3                1                    1                 0
  8             2                2                    2                 0

The factorize function treats 'Keywords' and 'keyword' as different words. Is there a function that will treat words like 'keywords' and 'keyword' as the same word?

The output dataframe should actually look like this

 Feature 1     Feature 2         Feature 3           Feature 4         Label
 10            20                0                    0                 1
  2             3                0                    1                 0
  8             2                1                    2                 0

JimmyA · Accepted Answer · 2019-03-04T14:50:34

You might want to look at stemmers.

The NLTK documentation gives examples of how to use them, but in short, stemmers cut words down to their stem. For example:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

words = ['jog', 'jogging', 'jogged']

[stemmer.stem(word) for word in words]

returns:

['jog', 'jog', 'jog']

or, in your case:

words = ['keyword', 'keywords']

[stemmer.stem(word) for word in words]

returns:

['keyword', 'keyword']
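To tie this back to the dataframe in the question, here is a minimal sketch of stemming a column before factorizing it. It assumes pandas and NLTK are installed; the toy frame and its column names (written without a leading space) are stand-ins for your real data, and lowercasing first is my own addition so that 'Keywords' and 'keyword' are guaranteed to match.

```python
import pandas as pd
from nltk.stem.porter import PorterStemmer

# Toy frame mirroring the text columns of the question's table
# (assumed column names, without the leading space).
df = pd.DataFrame({
    "Feature 3": ["keyword", "Keywords", "Stackoverflow"],
    "Feature 4": ["Human", "Dog", "cat"],
})

stemmer = PorterStemmer()

for col in ["Feature 3", "Feature 4"]:
    # Lowercase and stem each value so variants collapse to one token,
    # then factorize the stemmed values into integer codes.
    stemmed = df[col].str.lower().map(stemmer.stem)
    df[col] = pd.factorize(stemmed)[0]

print(df)
```

With this toy data, Feature 3 should come out as 0, 0, 1 and Feature 4 as 0, 1, 2, which matches the output you were after.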

Edit:

I should point out that the words in the list don't need to share a stem for this to work; each word is stemmed on its own:

words = ['drinking', 'running', 'walking', 'walked']

[stemmer.stem(word) for word in words]

outputs:

['drink', 'run', 'walk', 'walk']
