Python NER: add custom text and labels to update the NER model

Question

I'm using NER to essentially scrub text so that each named entity is replaced with its label (PERSON, ORG, etc.). So "John works at Apple" would become "PERSON works at ORG."

clause_text is my list of sentences. I used the ner-d package to build my NER model and scrub text as follows:

from nerd import ner

scrubbed_text = []
for text in clause_text:
    doc = ner.name(text, language='en_core_web_sm')
    text_label = [(X.text, X.label_) for X in doc]

    # replace all named entities with their label (PERSON, ORG, etc)
    scrubbed = text
    for ent_text, label in text_label:
        scrubbed = scrubbed.replace(ent_text, label)
    scrubbed_text.append(scrubbed)

Now, I am trying to add custom training data. Basically I want to be able to add a sentence with labels and update the NER model to make it more accurate/specific to what I need it to do. Right now I have this:

nlp = spacy.load('en_core_web_sm')

if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe('ner')

from spacy.gold import GoldParse
from spacy.pipeline import EntityRecognizer

doc_list = [] 
doc = nlp('This EULA stipulates a contract for Hamilton Enterprises.') 
doc_list.append(doc) 
gold_list = [] 
gold_list.append(GoldParse(doc, [u'O', u'O', u'O', u'O', u'O', u'O', u'ORG'])) 
  
ner = EntityRecognizer(nlp.vocab, entity_types = ['ORG']) 
ner.update(doc_list, gold_list)

But when I run this, I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-92c53f5c90b1> in <module>
      9 
     10 ner = EntityRecognizer(nlp.vocab, entity_types = ['ORG'])
---> 11 ner.update(doc_list, gold_list)

nn_parser.pyx in spacy.syntax.nn_parser.Parser.update()

nn_parser.pyx in spacy.syntax.nn_parser.Parser.require_model()

ValueError: [E109] Model for component 'ner' not initialized. Did you forget to load a model, or forget to call begin_training()?

Does anyone have any insight on how to best fix this code, or if there's a better way to add custom entries to update the NER model? Thanks so much!

Alex L · Accepted Answer · 2020-07-21T17:29:49

You're definitely on the right track. The spaCy documentation covers exactly this in a very clear training guide: https://spacy.io/usage/training.

I recommend reading the entire page to really understand the API, but the section you'll be most interested in is "Training the named entity recognizer". It spells out how to add new training data to fine-tune an existing (or blank) spaCy NER model. With code examples too!
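
For context, the E109 error in the question comes from constructing a brand-new EntityRecognizer, which has no initialized model; the usual route is to update the ner component already inside the loaded pipeline through nlp.update. Here is a minimal sketch of that loop, assuming spaCy 2.x (which the question's spacy.gold import implies; spaCy 3 replaced this API with Example objects). The names collect_labels and fine_tune_ner, the iteration count, and the dropout value are my own illustrative choices, not something taken from the guide:

```python
import random

# Training data in the spaCy 2.x "simple" format:
# (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    (
        "This EULA stipulates a contract for Hamilton Enterprises.",
        {"entities": [(36, 56, "ORG")]},
    ),
]


def collect_labels(train_data):
    """Return the set of entity labels used in the training data."""
    return {
        label
        for _, annotations in train_data
        for _, _, label in annotations["entities"]
    }


def fine_tune_ner(nlp, train_data, n_iter=20, drop=0.35):
    """Update the NER component of an already loaded spaCy 2.x pipeline."""
    # Imported lazily so collect_labels stays usable without spaCy installed.
    from spacy.util import minibatch, compounding

    # Reuse the pipeline's existing component instead of building a new,
    # uninitialized EntityRecognizer (the cause of E109).
    ner = nlp.get_pipe("ner")
    for label in collect_labels(train_data):
        ner.add_label(label)

    data = list(train_data)  # copy so the caller's list is not shuffled
    # Only the NER weights should change, so disable the other components.
    other_pipes = [name for name in nlp.pipe_names if name != "ner"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.resume_training()
        for _ in range(n_iter):
            random.shuffle(data)
            losses = {}
            for batch in minibatch(data, size=compounding(4.0, 32.0, 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer,
                           drop=drop, losses=losses)
    return nlp


# Usage (assumes spaCy 2.x and the en_core_web_sm model are installed):
#   import spacy
#   nlp = fine_tune_ner(spacy.load("en_core_web_sm"), TRAIN_DATA)
```

Note that the entities are given as character offsets into the text rather than one tag per token, which also sidesteps having to match the tokenizer's output by hand.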

Note that they hard-code their training data, but if I were you I'd pull that out into its own pipeline. They also recommend a few hundred examples to get the most out of fine-tuning.
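
One possible way to pull the training data out of the script is to keep it in a JSON Lines file and convert it when loading. The sketch below assumes a hypothetical file layout where each line holds a "text" string and a "phrases" list of [phrase, label] pairs; it turns each phrase into the character offsets used by the spaCy 2.x training format. The names to_example and load_training_data are made up for illustration:

```python
import json


def to_example(text, phrases):
    """Convert (phrase, label) pairs into spaCy 2.x style character offsets."""
    entities = []
    for phrase, label in phrases:
        # Only the first occurrence of each phrase is annotated.
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


def load_training_data(path):
    """Read one JSON object per line:
    {"text": "...", "phrases": [["Hamilton Enterprises", "ORG"], ...]}
    """
    examples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                examples.append(to_example(record["text"], record["phrases"]))
    return examples
```

Writing phrases instead of raw offsets keeps the file easy to edit by hand, at the cost of only handling the first occurrence of each phrase in a sentence.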
