
This is the second part of another question I posted. The two are different enough to be separate questions, but they could be related.

Previous question: Building a Custom Named Entity Recognition with spaCy, using random text as a sample

I have built a custom Named Entity Recognizer (NER) using the method described in the previous question, which I copied from the "Named Entity Recognizer" example in the spaCy training documentation (https://spacy.io/usage/training#ner).

The custom NER sort of works. If I sentence-tokenize the text and lemmatize the words (so "strawberries" becomes "strawberry"), it can pick up an entity. However, it stops there; it sometimes picks up two entities, but very rarely.

Is there anything I can do to improve its accuracy?

Here is the code. I have TRAIN_DATA in this format, but for food items:

    TRAIN_DATA = [
        ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
        ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
    ]
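For reference, a food version of that training data might look like the following. This is a hypothetical sketch: the FOOD label matches the output shown later, but the example sentences and offsets are my own. Including both singular and plural surface forms can reduce the model's dependence on lemmatizing the input first:

```python
# Hypothetical FOOD training examples. Each entity is a
# (start, end, label) character span into the sentence; covering both
# singular and plural forms means the model sees the raw surface text.
train_food = [
    ("i bought a strawberry", {"entities": [(11, 21, "FOOD")]}),
    ("i bought strawberries", {"entities": [(9, 21, "FOOD")]}),
    ("he cooked garlic and tuna",
     {"entities": [(10, 16, "FOOD"), (21, 25, "FOOD")]}),
]
```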

The data is stored in the object train_food.

import random
import warnings

import spacy
import nltk
from spacy.util import minibatch, compounding

nlp = spacy.blank("en")

# Create the built-in NER pipeline component and add it to the pipeline
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")


# Add the entity labels found in the food training data
for _, annotations in train_food:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])



# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

model = "en"
n_iter = 20

# only train NER
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
    # show warnings for misaligned entity spans once
    warnings.filterwarnings("once", category=UserWarning, module='spacy')

    # reset and initialize the weights randomly – but only if we're
    # training a new model
    
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_food)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_food, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)

text = "mike went to the supermarket today. he went and bought a potatoes, carrots, towels, garlic, soap, perfume, a fridge, a tomato, tomatoes and tuna."

After this, using text as a sample, I ran this code:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def text_processor(text):
    # Lower-case and lemmatize every word so plurals match the training data
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmas)

# Note: this function shadows the `ner` pipe variable defined earlier
def ner(text):
    new_text = text_processor(text)
    tokenizer = nltk.PunktSentenceTokenizer()
    sentences = tokenizer.tokenize(new_text)
    for sentence in sentences:
        doc = nlp(sentence)
        for ent in doc.ents:
            print(ent.text, ent.label_)

ner(text)

This results in:

potato FOOD
carrot FOOD

Running the following code:

ner("mike went to the supermarket today. he went and bought garlic and tuna")

results in:

garlic FOOD

Ideally, I want the NER to pick up potato, carrot and garlic. Is there anything I can do?

Thank you

Kah


1 Answer


While you are training your model, you can try some preprocessing techniques from information retrieval, such as:

1. Lower-casing all of the words

2. Replacing words with their synonyms

3. Removing stop words

4. Rewriting sentences (this can be done automatically using back-translation, i.e. translating into another language such as Arabic, then translating back into English)
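The first and third suggestions can be sketched in plain Python. The stop-word set below is a tiny hand-picked list for illustration only; NLTK's `nltk.corpus.stopwords` provides a fuller one. Note that removing words shifts character positions, so entity offsets in the training data would need to be recomputed after this step:

```python
# Sketch of lower-casing plus stop-word removal before training.
# STOP_WORDS is a small illustrative set, not a real stop-word corpus.
STOP_WORDS = {"the", "a", "an", "and", "to", "he", "she", "went"}

def normalize(sentence):
    # Lower-case, split on whitespace, and drop stop words
    tokens = sentence.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(normalize("He went to the supermarket and bought garlic"))
# prints: supermarket bought garlic
```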

Also, consider trying other models, such as:

http://nlp.stanford.edu:8080/corenlp

https://huggingface.co/models