
I'm using scikit-learn to cluster a corpus of tweets (text only) about the #oscars.

It would be really useful if a username such as @LeonardoDiCaprio or a hashtag like #redcarpet could be treated as more significant during preprocessing.

I would like to know whether it is possible to give those common usernames and hashtags more weight so that they become more important features.

When you say adding weight, what do you mean? k-means uses a distance calculation to figure out how "similar" all of the features of two instances are. Are you saying you want the "closeness" to overweight the appearance of certain words? Conversely, does that mean you want them to overweight the "distance" between tweets if one has those words and the other doesn't? Also, is your data normalized in some manner? – flyingmeatball
@flyingmeatball For example, I want to keep "@sasha" as a feature because it is a common word in the corpus (it appears in more than 30% of the tweets) and because it is a username (a token that starts with "@"). But besides that I don't want to miss features that are not usernames. The pipeline for my data is CountVectorizer -> K-Means. – fuxes
I happened to run into something like this with SGDClassifier. Its fit method takes a sample_weight parameter. Perhaps K-means has something like that. Take a look: scikit-learn.org/stable/auto_examples/linear_model/… – Shovalt

1 Answer


K-means is well defined only for Euclidean spaces, where the distance between vectors A and B is expressed as

|| A - B || = sqrt( SUM_i (A_i - B_i)^2 )

thus if you want to "weight" a particular feature, you would want something like

|| A - B ||_W = sqrt( SUM_i w_i(A_i - B_i)^2 )

which would result in feature i being much more important (if w_i > 1) - thus you would get a bigger penalty for having a different value. In terms of bag of words / set of words, it simply means that if two documents have a different count of this particular word, they are assumed to be much more different than documents that differ on another set of words.
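A quick numerical sketch of this effect (the toy count vectors and the 10x weight below are assumed purely for illustration):

```python
import numpy as np

# two toy bag-of-words count vectors that disagree only on feature 0
A = np.array([3.0, 1.0, 0.0])
B = np.array([0.0, 1.0, 0.0])

# weight vector: feature 0 assumed to be 10x more important
w = np.array([10.0, 1.0, 1.0])

plain    = np.sqrt(np.sum((A - B) ** 2))       # || A - B ||     -> 3.0
weighted = np.sqrt(np.sum(w * (A - B) ** 2))   # || A - B ||_W   -> ~9.49

print(plain, weighted)
```

The disagreement on the up-weighted feature now contributes far more to the distance than it would under the plain Euclidean norm.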

So how can you enforce it? Well, basic maths is all you need! You can easily see that

|| A - B ||_W = || sqrt(W)*A - sqrt(W)*B ||
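As a quick sanity check, this identity can be verified numerically (the example vectors and weights are assumed for illustration):

```python
import numpy as np

A = np.array([3.0, 1.0, 0.0])
B = np.array([0.0, 1.0, 2.0])
w = np.array([10.0, 1.0, 1.0])   # assumed feature weights

lhs = np.sqrt(np.sum(w * (A - B) ** 2))                 # || A - B ||_W
rhs = np.linalg.norm(np.sqrt(w) * A - np.sqrt(w) * B)   # || sqrt(W)*A - sqrt(W)*B ||

print(np.isclose(lhs, rhs))   # True
```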

In other words: take the output of your tfidf transformer (or whatever you use to map your text to a constant-size vector), check which features correspond to the words you are interested in, create a vector of ones (with size equal to the number of dimensions), increase the values for the words you care about (for example 10x), and take the square root of this vector. Then simply preprocess all your data by multiplying it point-wise, with broadcasting (np.multiply), by this weight vector. That's all you need; now your words will be more important in this well-defined manner.

From a mathematical perspective this amounts to using a Mahalanobis distance instead of the Euclidean one, with the inverse covariance matrix equal to diag(w) (thus a diagonal Gaussian is used as the generator of your norm).
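A minimal end-to-end sketch of this recipe, assuming the CountVectorizer -> K-Means pipeline mentioned in the comments (the token pattern, the 10x factor, and the "starts with @ or #" rule are illustrative assumptions, not fixed choices):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

tweets = [
    "@LeonardoDiCaprio finally wins #oscars",
    "loved the #redcarpet looks tonight",
    "@LeonardoDiCaprio on the #redcarpet",
]

# keep @usernames and #hashtags as single tokens (assumed token pattern)
vectorizer = CountVectorizer(token_pattern=r"[@#]?\w+")
X = vectorizer.fit_transform(tweets).toarray().astype(float)

# weight vector: 1.0 everywhere, 10x (assumed factor) for @/# features
features = vectorizer.get_feature_names_out()   # sklearn >= 1.0
mask = np.array([f.startswith(("@", "#")) for f in features])
w = np.ones(len(features))
w[mask] = 10.0

# multiply every row point-wise by sqrt(w): the plain Euclidean distance on
# X_weighted equals the weighted distance || A - B ||_W on the original X
X_weighted = np.multiply(X, np.sqrt(w))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_weighted)
print(kmeans.labels_)
```

Since the weighting is done on the data itself, the standard KMeans estimator can be used unchanged afterwards.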