I'm using spaCy to recognize street addresses on web pages.
My model is initialized essentially following spaCy's new-entity-type example code, found here: https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py
My training data consists of plain text webpages with their corresponding Street Address entities and character positions.
I was able to quickly build a model in spaCy and start making predictions, but I found its prediction speed to be very slow.
My code iterates through several raw HTML pages and feeds each page's plain-text version into spaCy as it goes. For reasons I can't get into, I need to make predictions with spaCy page by page, inside the iteration loop.
After the model is loaded, I'm using the standard way of making predictions, which I'm referring to as the prediction/evaluation phase:
doc = nlp(plain_text_webpage)
if len(doc.ents) > 0:
    print("found entity")
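For reference, the batched alternative I've considered: `nlp.pipe` accepts any iterable (including a generator), so pages can still be streamed one at a time while spaCy processes them in batches internally. A minimal sketch, using `spacy.blank("en")` as a stand-in for my trained model:

```python
import spacy

# Blank pipeline stands in for the trained model here;
# in real code this would be spacy.load('test_model/', ...)
nlp = spacy.blank("en")

pages = ["first page plain text", "second page plain text"]

# nlp.pipe streams texts through the pipeline in batches,
# which is usually much faster than calling nlp() once per page
for doc in nlp.pipe(pages, batch_size=50):
    if len(doc.ents) > 0:
        print("found entity")
```

The blank model has no NER component, so this prints nothing; with the trained model loaded, the loop body is the same as my per-page code above.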
Questions:
How can I speed up the entity prediction/recognition phase? I'm using a c4.8xlarge instance on AWS, and all 36 cores are constantly maxed out while spaCy evaluates the data. spaCy is turning the processing of a few million webpages from a 1-minute job into a 1-hour-plus job.
Will the speed of entity recognition improve as my model becomes more accurate?
Is there a way to remove pipeline components like the tagger during this phase? Can entity recognition be decoupled like that and still be accurate? Will removing other components affect the model itself, or is it only temporary?
I saw that a GPU can be used during the entity recognition training phase; can it also be used in this evaluation phase in my code for faster predictions?
Update:
I managed to significantly cut down the processing time by:
Using a custom tokenizer (used the one in the docs)
Disabling other pipelines that aren't for Named Entity Recognition
Instead of feeding each webpage's whole body of text into spaCy, I'm only sending over a maximum of 5,000 characters
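The truncation step looks roughly like this (`truncate_page` is my own helper name; the back-off to whitespace is an assumption to avoid cutting a token in half):

```python
MAX_CHARS = 5000

def truncate_page(text, limit=MAX_CHARS):
    """Cap page text at `limit` characters, backing off to the
    last whitespace so a token isn't split mid-word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_ws = cut.rfind(" ")
    return cut[:last_ws] if last_ws > 0 else cut
```

Since runtime scales with document length, this alone cut a large share of the per-page cost.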
My updated code to load the model:
nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(text)
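The custom tokenizer is, as mentioned, the whitespace example from the spaCy docs; a self-contained sketch of it, with a blank pipeline standing in for my trained model:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    """Minimal whitespace-only tokenizer, per the spaCy docs example."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")  # stand-in for spacy.load('test_model/', ...)
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("123 Main St Springfield")
```

Skipping the default rule-based tokenizer trades tokenization quality for speed, which seems acceptable for address spotting.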
However, it is still too slow (about 20x slower than I need).
Questions:
Are there any other improvements I can make to speed up Named Entity Recognition? Any fat I can cut from spaCy?
I'm still looking into whether a GPU-based solution would help. I saw that GPU use is supported during the Named Entity Recognition training phase; can it also be used in this evaluation phase in my code for faster predictions?
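For the GPU question, what I'd try (assuming my spaCy version exposes `spacy.prefer_gpu`, which requires a CUDA-enabled install with cupy) is activating the GPU before loading the model:

```python
import spacy

# prefer_gpu() returns True if a GPU was found and activated,
# False otherwise; it must be called *before* spacy.load()
# for the loaded model to use it
gpu_active = spacy.prefer_gpu()
print("GPU active:", gpu_active)
```

On a CPU-only box like the c4.8xlarge this simply returns False and the pipeline keeps running on CPU, so it's safe to leave in.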