I'm trying to do clustering with word2vec and k-means, but it's not working.

Here is part of my data:

demain fera chaud à paris pas marseille
mauvais exemple ce n est pas un cliché mais il faut comprendre pourquoi aussi
il y a plus de travail à Paris c est d ailleurs pour cette raison qu autant de gens",
mais s il y a plus de travail, il y a aussi plus de concurrence
s agglutinent autour de la capitale

Script:

import nltk
import pandas
import pprint
import numpy as np
import pandas as pd
from sklearn import cluster
from sklearn import metrics
from gensim.models import Word2Vec
from nltk.cluster import KMeansClusterer
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import NMF

dataset = pandas.read_csv('text.csv', encoding = 'utf-8')

comments = dataset['comments']

verbatim_list = no_duplicate.values.tolist()

min_count = 2
size = 50
window = 4

model = Word2Vec(verbatim_list, min_count=min_count, size=size, window=window)

X = model[model.vocab]

clusters_number = 28
kclusterer = KMeansClusterer(clusters_number,  distance=nltk.cluster.util.cosine_distance, repeats=25)

assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

words = list(model.vocab)
for i, word in enumerate(words):  
    print (word + ":" + str(assigned_clusters[i]))

kmeans = cluster.KMeans(n_clusters = clusters_number)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

clusters = {}
for verbatim, label in zip(verbatim_list, labels):
    try:
        clusters[str(label)].append(verbatim)
    except KeyError:
        clusters[str(label)] = [verbatim]
pprint.pprint(clusters)

Output:

Traceback (most recent call last):
  File "kmwv.py", line 37, in <module>
    X = model[model.vocab]
AttributeError: 'Word2Vec' object has no attribute 'vocab'

I need clustering that works with word2vec, but everything I try ends in this error. Is there a way to do clustering with word2vec?

1 Answer

As Davide said, try this (in gensim 1.0 to 3.x the vocabulary lives on the model's wv attribute, not on the model itself):

X = model[model.wv.vocab]
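
Note that gensim 4.x removed wv.vocab as well (it was replaced by wv.index_to_key and wv.key_to_index), and the size argument of Word2Vec was renamed vector_size. If you are not sure which version you are running, something like the sketch below might help. The helper names words_and_vectors and group_by_label are made up for illustration; they are not part of gensim.

```python
from collections import defaultdict


def words_and_vectors(model):
    """Return (words, vectors) from a trained Word2Vec model or its KeyedVectors."""
    # Word2Vec keeps its vectors on `.wv`; also accept a bare KeyedVectors object.
    wv = getattr(model, "wv", model)
    if hasattr(wv, "index_to_key"):
        # gensim 4.x: `vocab` is gone, the ordered word list is `index_to_key`.
        words = list(wv.index_to_key)
    else:
        # gensim 1.x to 3.x: the vocabulary is the dict `wv.vocab`.
        words = list(wv.vocab)
    # One vector per word, in the same order as `words`.
    vectors = [wv[word] for word in words]
    return words, vectors


def group_by_label(words, labels):
    """Map each cluster label to the list of words assigned to it."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[int(label)].append(word)
    return dict(clusters)
```

With the model from the question, this would be used as words, X = words_and_vectors(model), then kmeans.fit(X), then group_by_label(words, kmeans.labels_). The labels belong to words, not to comments, so they should be paired with the word list rather than with verbatim_list. Also keep in mind that Word2Vec expects each sentence as a list of tokens; if verbatim_list holds raw strings, each character ends up being treated as a word.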