As far as I can tell, there is no existing question like this one. I'm working on an NLP and sentiment analysis project on Kaggle, and first of all I'm preparing my data. The dataframe has a text column followed by a number from 0 to 9 that indicates which cluster the row (the document) belongs to. I'm using the TF-IDF Vectorizer from sklearn. I want to get rid of anything that is not an English word, so I'm using the following:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

s_words = list(nltk.corpus.stopwords.words("english"))

c = TfidfVectorizer(sublinear_tf=False,
                    stop_words=s_words,
                    token_pattern=r"(?ui)\b\w*[a-z]+\w*\b",
                    tokenizer=LemmaTokenizer(),
                    analyzer="word",
                    strip_accents="unicode")

# a_df is the original dataframe
X = a_df['Text']
X_text = c.fit_transform(X)
which, as far as I know, means that calling c.get_feature_names()
should return only tokens that are proper words, with no numbers or punctuation symbols.
I found the regex in another StackOverflow post, but a simpler one like [a-zA-Z]+
does exactly the same thing (that is, nothing).
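To show what I expect the pattern to keep, here is a quick standalone check (the sample sentence is made up just for this post):

import re

pattern = r"(?ui)\b\w*[a-z]+\w*\b"
sample = "It costs $5, 'accidentally' #tagged abalone!"

# On its own, the pattern keeps only tokens that contain at least one letter
print(re.findall(pattern, sample))
# ['It', 'costs', 'accidentally', 'tagged', 'abalone']

That is the behaviour I assumed I would get from the vectorizer as well.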
When I call the feature names, I get stuff like
["''abalone",
"#",
"?",
"$",
"'",
"'0",
"'01",
"'accidentally",
...]
Those are just a few examples, but they are representative of the output I get instead of just words.
I've been stuck on this for days, trying different regular expressions and different methods to call. I even hardcoded some of those outputs into the stop words.
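Roughly, that attempt looked like this (the extra tokens below are just a handful of the feature names I listed above):

# Brute-force attempt: append the unwanted feature names to the stop word list
extra_junk = ["#", "?", "$", "'", "'0", "'01"]
s_words = list(nltk.corpus.stopwords.words("english")) + extra_junk

c = TfidfVectorizer(stop_words=s_words,
                    tokenizer=LemmaTokenizer(),
                    analyzer="word",
                    strip_accents="unicode")
X_text = c.fit_transform(a_df['Text'])

That obviously does not scale, since I would have to keep adding every new piece of junk by hand.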
I'm asking this because later I use LDA
to get the topics of each cluster, and I end up with punctuation symbols as the "topics".
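That step looks roughly like this (a simplified sketch; the real number of topics and terms differs):

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(X_text)

terms = c.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    # Print the highest-weighted terms for each topic
    print(topic_idx, [terms[i] for i in topic.argsort()[::-1][:10]])

and the printed terms include things like "#" and "'" instead of actual words.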
I hope I'm not duplicating another post. I will gladly provide any additional information that's needed. Thank you in advance!