Weighting specific features in TF-IDF feature vectors for k-means clustering and cosine similarity

Question

I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:

Cosine similarity
k-means clustering

Using scikit-learn, this process is pretty simple.

Now I'd like to weight certain features so that they will influence the results more than the other features. For example, I might like to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest of the features.

How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity algorithms I listed above?

Andreas Buehlmeier · Accepted Answer · 2018-01-22T16:40:01

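As I understand it, low values in the TF-IDF matrix mean that the corresponding words are less significant. So one approach is to lower the values in the columns of the features you want to count less, which leaves the remaining features with relatively more influence on the result.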
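A minimal sketch of that idea, assuming a made-up four-document corpus and an arbitrary weight of 0.1 for the down-weighted terms (both are illustrative choices, not from the question). It scales columns by right-multiplying the sparse TF-IDF matrix with a diagonal weight matrix, then feeds the same weighted matrix to both cosine similarity and k-means. Weights above 1.0 would emphasize features instead.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, invented for illustration only.
docs = [
    "apple banana fruit salad",
    "apple orange fruit juice",
    "car engine wheel road",
    "car road trip map",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_features)

# Hypothetical per-feature weights: 1.0 leaves a column unchanged,
# values below 1.0 make that feature count less.
weights = np.ones(X.shape[1])
downweighted = [vectorizer.vocabulary_[term] for term in ("fruit", "road")]
weights[downweighted] = 0.1

# Right-multiplying by a diagonal matrix scales each column by its weight
# while keeping the result sparse.
X_weighted = (X @ sparse.diags(weights)).tocsr()

# Cosine similarity re-normalizes each row, so only the relative
# weights between features matter here.
sim_before = cosine_similarity(X)
sim_after = cosine_similarity(X_weighted)

# k-means uses Euclidean distance, so scaled columns contribute
# less to the distances between documents.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_weighted)

print("similarity of docs 0 and 1 before:", sim_before[0, 1])
print("similarity of docs 0 and 1 after: ", sim_after[0, 1])
print("cluster labels:", labels)
```

In this sketch the same column scaling serves both methods, but the effect differs: cosine similarity ignores the overall length of each vector, while k-means is sensitive to it.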

scikit-learn returns the TF-IDF matrix as a sparse array, so for testing and debugging you might want to convert it to a regular (dense) matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).
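For the debugging step, a small sketch of the sparse-to-dense conversion, assuming a tiny invented corpus. It prints the non-zero TF-IDF values per document next to their terms; writing the same array to a spreadsheet is left out here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, small enough that a dense copy is harmless.
docs = ["apple banana fruit", "apple car road"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix

# Dense NumPy array; only sensible for small matrices, since every
# zero entry is now stored explicitly.
dense = X.toarray()

# Terms in column order, recovered from the fitted vocabulary mapping.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for i, row in enumerate(dense):
    nonzero = {t: round(float(v), 3) for t, v in zip(terms, row) if v > 0}
    print(i, nonzero)
```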
