Python Gensim FastText Saving and Loading Model

Question

I am working with Gensim FASTText modeling and have the following questions.

The output of "ft_model.save(BASE_PATH + MODEL_PATH + fname)" saves the following 3 files. Is this correct? is there a way to combine all three files?

ft_gensim-v3
ft_gensim-v3.trainables.vectors_ngrams_lockf.npy
ft_gensim-v3.wv.vectors_ngrams.npy

When I attempt to load the training file and then use it, I get the following error from if model.wv.similarity(real_data, labelled['QueryText'][i]) > maxSimilaity:

'function' object has no attribute 'wv'

Finally, both models, is there a way not to have to store the output of def read_train(path,label_path) and def lemmetize(df_col)so I do not have to run this part of the code every time I want to train the model or compare?

Thanks for the assistance.

Here is my FastText Train Model

import os
import logging
from config import BASE_PATH, DATA_PATH, MODEL_PATH
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from pprint import pprint as print
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

#Read Training data
import pandas as pd
def read_train(path,label_path):
    d = []
    #e = []
    df = pd.read_excel(path)
    labelled = pd.read_csv(label_path)
    updated_col1 = lemmetize(df['query_text'])
    updated_col2 = lemmetize(labelled['QueryText'])
    for i in range(len(updated_col1)):
        d.append(updated_col1[i])
        #print(d)
    for i in range(len(updated_col2)):
        d.append(updated_col2[i])
    return d


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import string
from nltk.stem import PorterStemmer

def lemmetize(df_col):
    df_updated_col = pd.Series(0, index = df_col.index)
    stop_words = set(stopwords.words('english'))
    lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    ps = PorterStemmer()
    for i, j in zip(df_col, range(len(df_col))):
        lem = []
        t = str(i).lower()
        t = t.replace("'s","")
        t = t.replace("'","")
        translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
        t = t.translate(translator)
        word_tokens = word_tokenize(t)
        for i in range(len(word_tokens)):
            l1 = lemmatizer.lemmatize(word_tokens[i])
            s1 = ps.stem(word_tokens[i])
            if list(l1) != [''] and list(l1) != [' '] and l1 != '' and l1 != ' ':
                lem.append(l1)
        filtered_sentence = [w for w in lem if not w in stop_words]
        df_updated_col[j] = filtered_sentence
    return df_updated_col

#read test data
def read_test(path):
    return pd.read_excel(path)


#Read labelled data
def read_labelled(path):
    return pd.read_csv(path)


word_tokenized_corpus = read_train('Train Data.xlsx','SMEQueryText.csv')


#Train fasttext model
import tempfile
import os

from gensim.models import FastText
from gensim.test.utils import get_tmpfile
fname = get_tmpfile("ft_gensime-v3")

def train_fastText(data, embedding_size = 60, window_size = 40, min_word = 5, down_sampling = 1e-2, iter=100):
    ft_model = FastText(word_tokenized_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)

    #with tempfile.NamedTemporaryFile(prefix=BASE_PATH + MODEL_PATH + 'ft_gensim_v2-', delete=False) as tmp:
    #    ft_model.save(tmp.name, separately=[])
    ft_model.save(BASE_PATH + MODEL_PATH + fname)
    return ft_model


# main function to output
def main(test_path, train_path, labelled):
    test_data = read_test(test_path)
    train_data = read_train(train_path,labelled)
    labelled = read_labelled(labelled)
    output_df = pd.DataFrame(index = range(len(test_data)))
    output_df['test_query'] = str()
    output_df['Similar word'] = str()
    output_df['category'] = str()
    output_df['similarity'] = float()
    model = train_fastText(train_data)

# run main
if __name__ == "__main__":
    output = main('Test Data.xlsx','Train Data.xlsx','QueryText.csv')

Here is my Usage Model

import pandas as pd
from gensim.models import FastText
import gensim
from config import BASE_PATH, DATA_PATH, MODEL_PATH

#Read Training data
def read_train(path,label_path):
    d = []
    #e = []
    df = pd.read_excel(path)
    labelled = pd.read_csv(label_path)
    updated_col1 = lemmetize(df['query_text'])
    updated_col2 = lemmetize(labelled['QueryText'])
    for i in range(len(updated_col1)):
        d.append(updated_col1[i])
    for i in range(len(updated_col2)):
        d.append(updated_col2[i])
    return d

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import string
from nltk.stem import PorterStemmer

def lemmetize(df_col):
    df_updated_col = pd.Series(0, index = df_col.index)
    stop_words = set(stopwords.words('english'))
    lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
    ps = PorterStemmer()
    for i, j in zip(df_col, range(len(df_col))):
        lem = []
        t = str(i).lower()
        t = t.replace("'s","")
        t = t.replace("'","")
        translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
        t = t.translate(translator)
        word_tokens = word_tokenize(t)
        for i in range(len(word_tokens)):
            l1 = lemmatizer.lemmatize(word_tokens[i])
            s1 = ps.stem(word_tokens[i])
            if list(l1) != [''] and list(l1) != [' '] and l1 != '' and l1 != ' ':
                lem.append(l1)
        filtered_sentence = [w for w in lem if not w in stop_words]
        df_updated_col[j] = filtered_sentence
    return df_updated_col

#read test data
def read_test(path):
    return pd.read_excel(path)

#Read labelled data
def read_labelled(path):
    return pd.read_csv(path)

def load_training():
    return FT_gensim.load(BASE_PATH + MODEL_PATH +'ft_gensim-v3')

#compare similarity
def compare_similarity(model, real_data, labelled):
    maxWord = ''
    category = ''
    maxSimilaity = 0
    #print("train data",labelled[1])
    for i in range(len(labelled)):
        if model.similarity(real_data, labelled['QueryText'][i]) > maxSimilaity:
            #print('labelled',labelled['QueryText'][i], 'i', i)
            maxWord = labelled['QueryText'][i]
            category = labelled['Subjectmatter'][i]
            maxSimilaity = model.similarity(real_data, labelled['QueryText'][i])

    return maxWord, category, maxSimilaity

# Output from Main to excel
from pandas import ExcelWriter
def export_Excel(data, aFile = 'FASTTEXTOutput.xlsx'):
    df = pd.DataFrame(data)
    writer = ExcelWriter(aFile)
    df.to_excel(writer,'Sheet1')
    writer.save()

# main function to output
def main(test_path, train_path, labelled):
    test_data = read_test(test_path)
    train_data = read_train(train_path,labelled)
    labelled = read_labelled(labelled)
    output_df = pd.DataFrame(index = range(len(test_data)))
    output_df['test_query'] = str()
    output_df['Similar word'] = str()
    output_df['category'] = str()
    output_df['similarity'] = float()
    model = load_training
    for i in range(len(test_data)):
        output_df['test_query'][i] = test_data['query_text'][i]
        #<first change>
        maxWord, category, maxSimilaity = compare_similarity(model, str(test_data['query_text'][i]), labelled)
        output_df['Similar word'][i] = maxWord
        output_df['category'][i] = category
        output_df['similarity'][i] = maxSimilaity
    #<second change>    
    return output_df

# run main
if __name__ == "__main__":
    output = main('Test Data.xlsx','Train Data.xlsx','SMEQueryText.csv')
    export_Excel(output)

Here is the full tracible error message

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-57803b59c0b9> in <module>
      1 # run main
      2 if __name__ == "__main__":
----> 3     output = main('Test Data.xlsx','Train Data.xlsx','SMEQueryText.csv')
      4     export_Excel(output)

<ipython-input-21-17cb88ee0f79> in main(test_path, train_path, labelled)
     13         output_df['test_query'][i] = test_data['query_text'][i]
     14         #<first change>
---> 15         maxWord, category, maxSimilaity = compare_similarity(model, str(test_data['query_text'][i]), labelled)
     16         output_df['Similar word'][i] = maxWord
     17         output_df['category'][i] = category

<ipython-input-19-84d7f268d669> in compare_similarity(model, real_data, labelled)
      6     #print("train data",labelled[1])
      7     for i in range(len(labelled)):
----> 8         if model.wv.similarity(real_data, labelled['QueryText'][i]) > maxSimilaity:
      9             #print('labelled',labelled['QueryText'][i], 'i', i)
     10             maxWord = labelled['QueryText'][i]

AttributeError: 'function' object has no attribute 'wv'

It might be easier to debug if you posted the full traceback of your error message as well — bug_spray
I have added the full error message to the original question. — emie

gojomo gojomo · Accepted Answer · 2020-05-12T20:51:25

You've got three separate, only-vaguely-related questions here. Taking each in order:

Why are there 3 files, and can they be combined?

It's more efficient to store the big raw arrays separately from the main 'pickled' model – and for models above a few gigabytes in size, necessary to work-around 'pickle' implementation limits. So I'd recommend just keeping the default behavior, and keeping the habit of managing/moving/copying the sets of files together.

If your model is small enough, there is something you can try, though. The .save() method has an optional parameter sep_limit which controls the threshold array size, over which arrays are stored as separate files. By setting that much larger, say sep_limit=2*1024*1024*1024 (2GiB), smaller models should save a single file. (But, loading will be slower, you won't have the sometimes-useful option of memory-map loading, and saving may break on oversized models.)

Why is there a AttributeError: 'function' object has no attribute 'wv' error?

Your line of code model = load_training assigns an actual function to the model variable, rather than what you probably intended, the return-value of calling that function with some arguments. That function has no .wv attribute, hence the error. If model were an actual instance of FastText, you'd not get that error.

Can the corpus text be stored to avoid repeat preprocessing and conversion from pandas formats?

Sure, you can just write the text to a file. Roughly:

with open('mycorpus.txt', mode='w') as corpusfile:
    for text in word_tokenized_corpus:
        corpusfile.write(' '.join(text))
        corpusfile.write('\n')

Though in fact, gensim offers a utility function, utils.save_as_line_sentence(), that can do this (& explicitly handles some extra encoding concerns). See:

https://radimrehurek.com/gensim/utils.html#gensim.utils.save_as_line_sentence

The LineSentence utility class in gensim.models.word2vec can stream texts from such a file back for future re-use:

https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence

Python Gensim FastText Saving and Loading Model

1 Answers