NLTK-based stemming and lemmatization

Question

I am trying to preprocess a string with a lemmatizer and then remove the punctuation and digits, using the code below. I get no error, but the text is not preprocessed correctly: only the stop words are removed, the lemmatizing has no effect, and the punctuation and digits remain.

from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)

The final output I get is:

This beautiful day16~ . I ; working exercise45.^^^45 text34 .

The expected output is:

This beautiful day I work exercise text

I'd use a regex to get rid of the noise before calling the lemmatizer. - cs95
Thank you for the suggestion. But shouldn't the code above work as I expect? I used the same code before and it worked, so I'm not sure why it isn't working this time. - Alex

cs95 · Accepted Answer · 2017-10-16T21:43:59

No, your current approach does not work, because the lemmatizer and stemmer expect one word per call. Given a whole sentence, the lemmatizer looks the entire string up as if it were a single word, finds nothing in WordNet, and returns it unchanged. Clean and split the text first, then lemmatize each word:

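import re
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    # Lowercase, then strip punctuation and digits in one pass.
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize each remaining word as a verb ('v'), skipping stop words.
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                for i in cleaned_tweet.split() if i not in __stop_words])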
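
There is a second reason the punctuation and digits survive in the question's code: str.translate returns a new string, and the result is never assigned back to corpus. The following is a minimal standard-library sketch using the sample string from the question. The helper name strip_noise is hypothetical, and the translation table here deletes characters rather than replacing them with spaces as the question's table does.

```python
import re
import string

sample = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."

# Build one table that deletes every punctuation character and digit.
table = str.maketrans("", "", string.punctuation + string.digits)

# Strings are immutable: translate() returns a new string and leaves
# `sample` alone, so calling it without assignment has no visible effect.
sample.translate(table)
unchanged = sample

# Assigning the result is what actually keeps the cleaned text.
translated = sample.translate(table)

def strip_noise(text):
    """Hypothetical helper: drop punctuation and digits, then lowercase."""
    return re.sub(r"[^\w\s]|\d", "", text).lower()

print(translated)           # This is a beautiful day I am working on an exercise text
print(strip_noise(sample))  # this is a beautiful day i am working on an exercise text
```

Either form works as the cleaning step; the regex version is the one used in clean() above.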

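Alternatively, you can use a PorterStemmer, which does a similar job to lemmatisation but without context: it strips suffixes by rule instead of looking words up in a dictionary, so the result is not always a real word.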

from nltk.stem.porter import PorterStemmer  
stemmer = PorterStemmer()

Then call the stemmer on each word like this:

stemmer.stem(i)
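
To make the difference concrete, here is a small sketch of what the Porter stemmer returns for a few words from the question's sample sentence. It assumes only that NLTK is installed (the stemmer itself needs no corpus download). The word list is illustrative.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words taken from the question's sample sentence.
words = ["working", "beautiful", "exercise", "text"]

# Stem one word at a time; suffixes are stripped by rule,
# so some stems ("beauti", "exercis") are not dictionary words.
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['work', 'beauti', 'exercis', 'text']
```

If the output has to consist of real words, as in the expected output in the question, the lemmatizer is the better fit.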
