Spacy train ner using multiprocessing

Question

I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training; each text consists of more than 100 words and contains more than 2 entities. I am running it for 50 iterations, and it takes more than 2 hours to train completely.

Is there any way to train using multiprocessing? Will it improve the training time?

I'm not sure if this can be done or not, but I want to mention that (it's somehow related to your question): spacy.io/usage/training#tips — colidyre

Jon Betts · Accepted Answer · 2020-02-28T14:24:23

Short answer... probably not

It's very unlikely that you will be able to get this to work for a few reasons:

The network being trained is performing iterative optimization
- Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
- Any parallel training would be creating divergent networks...
- ...which you would then somehow have to merge

Long answer... there's plenty you can do!

There are a few different things you can try however:

Get GPU training working if you haven't
- It's a pain, but can speed up training time a bit
- It will dramatically lower CPU usage however
Try to use spaCy command line tools
- The JSON format is a pain to produce but...
- The benefit is you get a well optimised algorithm written by the experts
- It can have dramatically faster / better results than hand crafted methods
If you have different entities, you can train multiple specialised networks
- Each of these may train faster
- These networks could be done in parallel to each other (CPU permitting)
Optimise your python and experiment with parameters
- Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
- Make sure the Python code that provides the batches is as efficient as possible
Pre-process your examples
- spaCy NER extraction requires a surprisingly small amount of context to work
- You could try pre-processing your snippets to contain only 10 or 15 surrounding words and see how your time and accuracy fare
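
The pre-processing idea above can be sketched in plain Python. This is only an illustration under assumptions: it assumes the spaCy v2-style training format `(text, {"entities": [(start, end, label), ...]})`, splits words on whitespace rather than with spaCy's tokenizer, and `trim_context` is a hypothetical helper name, not part of spaCy. Overlapping windows are merged so that no entity is cut in half.

```python
import re


def trim_context(text, entities, window=10):
    """Cut a record down to the words near its entities.

    `entities` is a list of (start_char, end_char, label) tuples.
    Returns a list of (snippet, {"entities": [...]}) records whose
    character offsets are re-based onto each snippet.
    """
    # Character span of every whitespace-separated word.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    keep = [False] * len(words)

    for ent_start, ent_end, _label in entities:
        # Indices of the words that overlap this entity.
        hit = [i for i, (ws, we) in enumerate(words)
               if ws < ent_end and we > ent_start]
        if not hit:
            continue
        lo = max(0, hit[0] - window)
        hi = min(len(words), hit[-1] + window + 1)
        for i in range(lo, hi):
            keep[i] = True

    snippets = []
    i = 0
    while i < len(words):
        if not keep[i]:
            i += 1
            continue
        # Extend to the end of this run of kept words.
        j = i
        while j + 1 < len(words) and keep[j + 1]:
            j += 1
        start, end = words[i][0], words[j][1]
        # Keep entities that lie fully inside the snippet, shifted to
        # the snippet's own coordinate system.
        shifted = [(s - start, e - start, label)
                   for s, e, label in entities
                   if start <= s and e <= end]
        snippets.append((text[start:end], {"entities": shifted}))
        i = j + 1
    return snippets


if __name__ == "__main__":
    text = "Yesterday the board met and then Acme Corp announced a new plant in Paris next year"
    ents = [(33, 42, "ORG"), (68, 73, "GPE")]
    for snippet, ann in trim_context(text, ents, window=2):
        print(snippet, ann)
```

Whether shorter snippets keep accuracy acceptable is something to measure on held-out data for your own entities; the window size here is just a starting point.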

Final thoughts... when is your network "done"?

I have trained networks with many entities on thousands of examples longer than the ones you describe, and the long and short of it is that sometimes it just takes time.

However, most of the increase in performance (roughly 90%) is typically captured in the first 10% of training.

Do you need to wait for 50 batches?
... or are you looking for a specific level of performance?

If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
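
A minimal sketch of that bail-out loop, kept independent of spaCy so the control flow is easy to see. The names `train_until_good_enough`, `train_one_pass` and `evaluate` are hypothetical: in a real script `train_one_pass` would presumably wrap your `nlp.update` calls over one round of batches, and `evaluate` would score the model on held-out examples and return a number such as F-score.

```python
def train_until_good_enough(train_one_pass, evaluate, target_score,
                            max_passes=50, check_every=5):
    """Train in passes and stop early once quality is good enough.

    train_one_pass: callable that performs one round of training.
    evaluate:       callable that returns the current quality score.
    Returns (passes_run, history), where history is a list of
    (pass_number, score) pairs for every evaluation that was made.
    """
    history = []
    passes_run = 0
    for n in range(1, max_passes + 1):
        train_one_pass()
        passes_run = n
        # Evaluation costs time too, so only do it every few passes
        # (and always after the final pass).
        if n % check_every == 0 or n == max_passes:
            score = evaluate()
            history.append((n, score))
            if score >= target_score:
                break
    return passes_run, history
```

The `history` list also shows how quickly the score flattens out, which helps when deciding how many passes are worth paying for next time.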

You can also keep old networks you have trained on previous batches and then "top them up" with new training to get to a level of performance you couldn't by starting from scratch in the same time.

Good luck!
