Suppose I have some text sentences that I want to cluster using KMeans.
sentences = [
"fix grammatical or spelling errors",
"clarify meaning without changing it",
"correct minor mistakes",
"add related resources or links",
"always respect the original author"
]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X)
Now I can predict which cluster a new text falls into:
new_text = "hello world"
vec = vectorizer.transform([new_text])
print(km.predict(vec)[0])
However, say I apply PCA to reduce 10,000 features to 50.
from sklearn.decomposition import PCA  # RandomizedPCA was removed; use svd_solver='randomized'
pca = PCA(n_components=50, whiten=True, svd_solver='randomized')
X2 = pca.fit_transform(X.toarray())  # PCA requires a dense array
km.fit(X2)
However, I can no longer predict the cluster for a new text the same way, because the vectorizer's output is 10,000-dimensional while KMeans was fitted on the 50 PCA components:
new_text = "hello world"
vec = vectorizer.transform([new_text])  # still 10,000 features
print(km.predict(vec)[0])
ValueError: Incorrect number of features. Got 10000 features, expected 50
So how do I transform my new text into the lower dimensional feature space?
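For reference, the usual pattern is to push the new text through the same fitted pca object before calling predict: pca.transform maps a high-dimensional count vector into the reduced space KMeans was trained on. A minimal runnable sketch, shrunk to 2 components because this toy corpus has only 5 sentences (PCA cannot keep more components than min(n_samples, n_features)):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

sentences = [
    "fix grammatical or spelling errors",
    "clarify meaning without changing it",
    "correct minor mistakes",
    "add related resources or links",
    "always respect the original author",
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(sentences)

# fit PCA on the dense term-count matrix (2 components for this toy corpus)
pca = PCA(n_components=2, whiten=True)
X2 = pca.fit_transform(X.toarray())

km = KMeans(n_clusters=2, init='random', n_init=1, random_state=0)
km.fit(X2)

new_text = "hello world"
vec = vectorizer.transform([new_text])   # 1 x vocab_size sparse counts
vec2 = pca.transform(vec.toarray())      # 1 x 2 -- the space KMeans saw
print(km.predict(vec2)[0])
```

One could also chain the steps with an sklearn Pipeline so that fit and predict apply every transform consistently, which avoids forgetting the PCA step at prediction time.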