9 votes

Suppose I have some text sentences that I want to cluster using kmeans.

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author"
]

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)

Now I could predict which cluster a new text would fall into:

new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])

However, say I apply PCA to reduce 10,000 features to 50.

from sklearn.decomposition import RandomizedPCA

pca = RandomizedPCA(n_components=50, whiten=True)
X2 = pca.fit_transform(X)
km.fit(X2)

However, I can no longer predict the cluster for a new text the same way, because the vectorizer's output no longer matches the 50-dimensional feature space that KMeans was fitted on:

new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50

So how do I transform my new text into the lower dimensional feature space?


2 Answers

8 votes

You want to use pca.transform on your new data before feeding it to the model. This will perform dimensionality reduction using the same PCA model that was fitted when you ran pca.fit_transform on your original data. You can then use your fitted model to predict on this reduced data.

Basically, think of it as fitting one large model made by stacking three smaller models. First, a CountVectorizer model turns raw text into token-count vectors. Then a RandomizedPCA model performs dimensionality reduction on those vectors. Finally, a KMeans model clusters the reduced data. When you fit the stack, you go down it and fit each model on the output of the previous one. And when you want to predict, you go down the stack the same way and apply each fitted model in turn.

# Initialize models
vectorizer = CountVectorizer(min_df=1)
pca = RandomizedPCA(n_components=50, whiten=True)
km = KMeans(n_clusters=2, init='random', n_init=1, verbose=1)

# Fit models
X = vectorizer.fit_transform(sentences)
X2 = pca.fit_transform(X)
km.fit(X2)

# Predict with models
X_new = vectorizer.transform(["hello world"])
X2_new = pca.transform(X_new)
km.predict(X2_new)
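If you do this for many new documents, you can wrap the prediction stack in a small helper (a minimal sketch; predict_cluster is just an illustrative name, and it assumes the three fitted objects above are in scope):

def predict_cluster(text):
    # Run one new document down the fitted stack:
    # vectorize -> reduce dimensionality -> assign a cluster
    X_new = vectorizer.transform([text])
    X2_new = pca.transform(X_new)
    return km.predict(X2_new)[0]

print(predict_cluster("hello world"))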

4 votes

Use a Pipeline:

>>> from sklearn.cluster import KMeans
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import make_pipeline
>>> sentences = [
...     "fix grammatical or spelling errors",
...     "clarify meaning without changing it",
...     "correct minor mistakes",
...     "add related resources or links",
...     "always respect the original author"
... ]
>>> vectorizer = CountVectorizer(min_df=1)
>>> svd = TruncatedSVD(n_components=5)
>>> km = KMeans(n_clusters=2, init='random', n_init=1)
>>> pipe = make_pipeline(vectorizer, svd, km)
>>> pipe.fit(sentences)
Pipeline(steps=[('countvectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,...n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0))])
>>> pipe.predict(["hello, world"])
array([0], dtype=int32)

(TruncatedSVD is shown here because RandomizedPCA will stop working on term-frequency matrices in an upcoming release; it actually performed an SVD rather than a full PCA anyway.)
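As a quick check (a minimal sketch, reusing the sentences list from the question), TruncatedSVD operates directly on the sparse matrix that CountVectorizer produces, with no densifying step, which is the classic latent semantic analysis setup:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Sparse document-term count matrix: 5 documents x vocabulary size
X = CountVectorizer(min_df=1).fit_transform(sentences)
# Dense reduced matrix with 5 components per document
X_reduced = TruncatedSVD(n_components=5).fit_transform(X)
print(X_reduced.shape)  # (5, 5)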