Train fastText on a non-English data set

Question

I'm starting a new project in which I want to represent words as vectors. I read about the fastText library and saw that it has pre-trained models for languages other than English. The goal is to predict the closeness between different words.

https://fasttext.cc/docs/en/crawl-vectors.html

What I want to know is whether I can train a fastText model on non-English data, such as articles from news sites, to get better results for specific genres like politics and current affairs.

Can I train it on non-English data sets?
How long does it take to train a model on 10 GB of text? Is that big enough?
Are there any better solutions?

Thanks in advance!

Amir Amir · Accepted Answer · 2019-01-25T19:13:41

Can I train it on non-English data sets?

Of course you can. fastText provides pre-trained models for 157 languages on its website, and you can download those as well.

How long does it take to train a model for 10 GB of text?

It depends on your system and on the implementation. For example, on a Mac Pro with 16 GB of RAM, using Facebook's implementation, it takes about 8 to 10 hours.
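For reference, training with the official Python bindings could look roughly like the sketch below. It assumes the `fasttext` package is installed (`pip install fasttext`) and that the corpus is a single UTF-8 text file. The hyperparameter values and the helper names `build_params` and `train_vectors` are illustrative assumptions, not recommendations from the fastText authors.

```python
# Illustrative defaults only; tune them for your corpus and hardware.
DEFAULT_PARAMS = {
    "model": "skipgram",  # "cbow" is the other unsupervised option
    "dim": 300,           # vector size
    "epoch": 5,           # passes over the corpus
    "minCount": 5,        # ignore words seen fewer than 5 times
    "thread": 8,          # more threads shortens training on multi-core machines
}


def build_params(**overrides):
    """Merge caller overrides into the illustrative defaults."""
    params = dict(DEFAULT_PARAMS)
    params.update(overrides)
    return params


def train_vectors(corpus_path, output_path, **overrides):
    """Train word vectors on a plain-text corpus and save the model."""
    import fasttext  # imported here so the helpers above work without it

    model = fasttext.train_unsupervised(str(corpus_path), **build_params(**overrides))
    model.save_model(str(output_path))
    return model
```

Calling `train_vectors("corpus.txt", "news.bin", epoch=3)` (both file names are placeholders) would train on `corpus.txt` and write the model to `news.bin`. Training time grows with corpus size and the number of epochs, and it drops as you add threads.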

Is it big enough?

If 10 GB is the file size after cleaning and preprocessing, then yes, that is enough.
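What "cleaning and preprocessing" means depends on your data. As a minimal sketch, assuming the articles are scraped HTML and you want lowercase text without punctuation, something like this would do. The exact steps and the function name `clean_line` are assumptions, and lowercasing has no effect in scripts without letter case.

```python
import re
import unicodedata

_TAG = re.compile(r"<[^>]+>")  # crude HTML tag matcher, fine for simple markup


def clean_line(text):
    """Strip tags and punctuation, lowercase, and collapse whitespace."""
    text = _TAG.sub(" ", text)
    # Replace every Unicode punctuation character (category P*) with a space,
    # so the rule works for non-English scripts as well.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(text.lower().split())
```

Write one cleaned sentence or article per line to the training file, since fastText reads plain text line by line.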

Are there any better solutions?

What do you mean by better solutions? If I were in your shoes, I would try the pre-trained models first.
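To try the pre-trained vectors for word closeness without extra dependencies, you can read the text `.vec` file directly and compare words by cosine similarity. This is a rough sketch: it relies on the `.vec` layout (a header line with the word count and dimension, then one word and its values per line), and the function names `load_vec` and `cosine` are my own.

```python
import math


def load_vec(stream, limit=None):
    """Read vectors from an open .vec text stream into a dict."""
    header = stream.readline().split()
    dim = int(header[1])  # header is "<word count> <dimension>"
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break  # the files are large, so allow loading only the first rows
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Open the decompressed file with `open(path, encoding="utf-8")` and pass it to `load_vec`, for example with `limit=200000` to keep memory use down. If the closeness scores for your domain words look poor, that is the point at which training on your own news corpus becomes worth the time.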
