2
votes

I want to use fastText pre-trained models to compute the similarity between one sentence and a set of sentences. Can anyone help me? What is the best approach?

So far I have computed the similarity between sentences by training a TF-IDF model, with the code below. Is it possible to change it to use fastText pre-trained models? For example, could the vectors be used to train a TF-IDF model?

import gensim

def generate_tfidf_model(sentences):
    print("generating TfIdf model")
    # Tokenize each sentence on whitespace.
    texts = [doc.split() for doc in sentences]
    dictionary = gensim.corpora.Dictionary(texts)
    feature_cnt = len(dictionary.token2id)
    # Convert each tokenized sentence to a bag-of-words vector.
    mycorpus = [dictionary.doc2bow(doc) for doc in texts]
    tfidf_model = gensim.models.TfidfModel(mycorpus)
    index = gensim.similarities.SparseMatrixSimilarity(tfidf_model[mycorpus],
                                                       num_features=feature_cnt)
    return tfidf_model, index, dictionary

def query_search(query, tfidf_model, index, dictionary):
    # normal_stemmer_sentence is a preprocessing helper defined elsewhere (not shown).
    query = normal_stemmer_sentence(query)
    query_vector = dictionary.doc2bow(query.split())
    # One similarity score per indexed sentence.
    similarity = index[tfidf_model[query_vector]]
    return similarity

1 Answer

4
votes

I think that computing TF-IDF may not be necessary if you can use word embeddings.

A simple but effective method consists of:

  1. Compute two vectors that represent your two strings, using pretrained word embeddings for your language (e.g. fastText's get_sentence_vector, see https://fasttext.cc/docs/en/python-module.html#model-object)

  2. Compute the cosine similarity between the two vectors (close to 1: nearly identical strings; close to 0: very different strings; see https://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html).
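
The two steps above can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the helper names `cosine_similarity`, `rank_by_similarity`, and `load_fasttext_embedder` are made up for this example, and the model path is whatever pre-trained `.bin` file you downloaded from fasttext.cc. The only fastText calls assumed are `fasttext.load_model` and `get_sentence_vector`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "no similarity".
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, sentences, embed):
    """Score every sentence against the query, most similar first.

    `embed` is any callable that maps a string to a vector
    (hypothetical parameter, so the ranking logic is model-agnostic).
    """
    query_vec = embed(query)
    scored = [(s, cosine_similarity(query_vec, embed(s))) for s in sentences]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def load_fasttext_embedder(model_path):
    """Build an `embed` callable from a pre-trained fastText .bin model."""
    import fasttext  # imported lazily; requires `pip install fasttext`
    model = fasttext.load_model(model_path)
    # get_sentence_vector works on a single line, so strip newlines first.
    return lambda text: model.get_sentence_vector(text.replace("\n", " "))
```

Usage would look something like `embed = load_fasttext_embedder("cc.en.300.bin")` followed by `rank_by_similarity("my query", sentences, embed)`, assuming that model file is present locally. If you still want the stemming from your `query_search`, you could apply it inside `embed`, although pre-trained vectors are usually trained on unstemmed text.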