I have trained my word2vec model from gensim, and I am getting the nearest neighbors for some words in the corpus. Here are the similarity scores:
top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777
It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because it contains many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix this issue?
Note: I am using model.similarity(t1, t2).
This is how I trained my Word2Vec Model:
import os
import time

import gensim

documents = list()
tokenize = lambda x: gensim.utils.simple_preprocess(x)
t1 = time.time()
docs = read_files(TEXT_DIRS, nb_docs=5000)
t2 = time.time()
print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
print('Number of documents: %i' % len(docs))
# Training the model
model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)
if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)
model.save(os.path.join(MODEL_DIR, 'word2vec'))
weights = model.wv.vectors
index_words = model.wv.index2word
vocab_size = weights.shape[0]
embedding_dim = weights.shape[1]
print('Shape of weights:', weights.shape)
print('Vocabulary size: %i' % vocab_size)
print('Embedding size: %i' % embedding_dim)
Below is the read_files function I defined:
def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                        if count % 100 == 0:
                            print('processed {} files so far from {}'.format(count, path))
                if count >= nb_docs and count <= nb_docs + 200:
                    print('REACHED END')
                    break
        if count >= nb_docs and count <= nb_docs:
            print('REACHED END')
            break
    return documents
I tried this thread, but it does not help me because my text is Arabic and contains many misspellings.
Update: I tried the following (computing the similarity between a word and itself):
print(model.similarity('الاحتلال','الاحتلال'))
and it gave me the following result:
1.0000001
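For reference, here is a minimal sketch in plain NumPy (not gensim's own code) suggesting that the value above is what single-precision rounding would produce: 1.0000001192092896 is exactly the next float32 value after 1.0. The helper name cosine_similarity is hypothetical, and the sketch assumes the word vectors are stored as float32. Computing in float64 and clipping to [-1, 1] is one possible way to keep the reported score in range.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity computed in float64 and clipped to [-1, 1].

    Hypothetical helper for illustration; it is not a gensim function.
    """
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Rounding can push the ratio a hair outside the valid range, so clip it.
    return float(np.clip(sim, -1.0, 1.0))


# The smallest float32 value strictly greater than 1.0.
one_ulp_above = np.float32(1.0) + np.finfo(np.float32).eps
print(float(one_ulp_above))  # 1.0000001192092896

# Stand-in vector (random values, assumed float32 like the trained weights).
rng = np.random.default_rng(0)
vec = rng.standard_normal(100).astype(np.float32)
self_sim = cosine_similarity(vec, vec)
print(self_sim <= 1.0)  # True
```

With a trained model, the same helper could be applied to two vectors looked up through model.wv, assuming both words are in the vocabulary.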
[...] 1.0? For example, does model.similarity('الاحتلال', 'الاحتلال') show that value? (That is, the exact same string compared with itself? I may have mis-pasted, as I'm not a reader of Arabic nor fully aware of R-to-L language issues.) – gojomo
[...] gensim's side? I edited my question above. – Perl