
I am trying to train a new entity type, 'HE INST', to recognize colleges. That is the only new label. I have a long document as raw text. I ran NER on it and saved the predicted entities to TRAIN_DATA, then added the new entity label to TRAIN_DATA (where a new annotation overlapped an existing one, I replaced the existing one).

The training loss stays constant (~4000 for all 15 texts, ~300 for a single text). Why does this happen, and how do I train the model properly? I have around 18 texts with 40 annotated new entities. Even after all iterations, the model still doesn't predict the output correctly.

I haven't changed the script much; I just added en_core_web_lg, the new label, and my TRAIN_DATA.

I am trying to tag institutes in resume (CV) data.

This is one of the texts in my TRAIN_DATA (sorry for the long text). I have around 18 such texts concatenated to form TRAIN_DATA:

[("To perform better in my work each day. To increase my knowledge. To bring out my best by hardworking and improving my skills. To serve my parents and my family. To contribute my skills to my country. Marital ; Single Status Nationality \xe2\x80\x94: Indian Known . Parr . English, Malayalam, Hindi, Tamil Languages Hobby Playing cricket and football, Listening to music, Movies, Games. Father's ; V.N. Balappan Nair Name Mother's ; Saraswathy B Nair Name Believers Church Caarmel Engineering College R-Perunad Btech Electronics and communication engineering 6.09(Upto S6) 2015 - 2019 Marthoma Senior Secondary School Kozhencherry All India Senior School Certificate Examination 75% 2014 - 2015 Marthoma Senior Secondary School Kozhencherry Secondary School Examination 8.2 2012 - 2013 s@ INTERESTS Electronics, Sports s@ PERSONAL STRENGTHS Hardworking Loyal Good Team Spirit Good in mathematics ees IAA eM LANL NUL e (2 Problem Solving Skills rg DUS \\ TRAININGS completed the Vocational Industrial Training on Long Distance Communication Systems conducted by Southern Telecom Region, Bharat Sanchar Nigam Limited. Completed the internship training in Power Electronics Group(PEG), Tool Room, Fabrication Shop, Transform Winding, Electro Plating, Security And Surveillance Group(SSG), Special Products Group(SPG), Search And Rescue Beacon(SRB), Intelligent Tracking and Communication Project and Technology Development Center of Keltron Equipment Complex, Thiruvananthapuram. PROJECTS Final Year Project: Life Detection Using Quadcopter This project is useful at the time of natural calamities like flood earthquake etc... And can also be used in military applications as this device detects life signals using a PIR sensor and a thermal sensor. The components used in this are: PIR sensor, Thermal sensor, Arduino Nano, BEC, ESC, Quadcopter. Design project: Wireless Power Bank Wireless Power Bank enables us to charge our phone wordlessly. 
It can charge a device which is kept 10m(maximum) away from the adaptor without any obstacles in between. It uses the IR technology for power transmission. ACHIEVEMENTS & AWARDS Participated in Pecardio Debugging Conducted as a part of NAKSHATRA 2019, The Annual National Level Techno Cultural Fest held at Saingits College of Engineering, kottayam. Volunteered in Alexa One day workshop on Artificial intelligence. Completed a period of two year tenue with a total of 240 hours in the National Service Scheme activities and has attended NSS Annual Special Camp. Participant in Cricket and football at the Annual Sports Meets. DECLARATION do here by confirm that the information given in this form is true to the best of my knowledge and belief.", {'entities': [(29, 37, 'DATE'), (210, 223, 'ORG'), (241, 247, 'NORP'), (256, 260, 'PERSON'), (263, 270, 'LANGUAGE'), (272, 281, 'PERSON'), (283, 288, 'PERSON'), (290, 295, 'NORP'), (362, 375, 'EVENT'), (388, 401, 'PERSON'), (402, 420, 'PERSON'), (423, 445, 'PERSON'), (446, 490, 'HE INST'), (563, 574, 'DATE'), (575, 620, 'ORG'), (625, 668, 'ORG'), (669, 672, 'PERCENT'), (673, 684, 'DATE'), (685, 717, 'ORG'), (764, 775, 'DATE'), (779, 800, 'ORG'), (890, 893, 'ORG'), (909, 910, 'CARDINAL'), (963, 997, 'ORG'), (1001, 1036, 'ORG'), (1050, 1073, 'ORG'), (1075, 1103, 'ORG'), (1142, 1169, 'ORG'), (1172, 1181, 'ORG'), (1183, 1199, 'ORG'), (1201, 1218, 'ORG'), (1220, 1235, 'ORG'), (1275, 1301, 'ORG'), (1304, 1332, 'ORG'), (1335, 1355, 'ORG'), (1360, 1415, 'ORG'), (1419, 1444, 'ORG'), (1446, 1464, 'LOC'), (1475, 1494, 'EVENT'), (1797, 1809, 'GPE'), (1811, 1814, 'GPE'), (1816, 1819, 'ORG'), (1821, 1831, 'ORG'), (1849, 1888, 'ORG'), (1969, 1980, 'CARDINAL'), (2050, 2052, 'ORG'), (2088, 2122, 'ORG'), (2126, 2154, 'ORG'), (2168, 2182, 'EVENT'), (2188, 2194, 'DATE'), (2239, 2270, 'HE INST'), (2297, 2302, 'GPE'), (2303, 2310, 'DATE'), (2358, 2369, 'DATE'), (2370, 2378, 'DATE'), (2401, 2410, 'TIME'), (2414, 2441, 'ORG'), (2470, 2493, 'ORG'), (2534, 
2557, 'EVENT')]})]

The script is given below. (Note: eval is used to parse TRAIN_DATA into a list after reading it as a string from a text file.)

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
import en_core_web_lg
from spacy.util import minibatch, compounding


# new entity label
LABEL = "HE INST"

with open('train_dump-backup.txt', 'r') as i_file:
    t_data = i_file.read()
TRAIN_DATA=eval(t_data)

@plac.annotations(
    model=("en_core_web_lg", "option", "m", str),
    new_model_name=("NLP_INST", "option", "nm", str),
    output_dir=("/home/drbinu/Downloads/NLP_INST", "option", "o", Path),
    n_iter=("30", "option", "n", int),
)

def main(model=None, new_model_name="animal", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "B.Tech from Believers Church Caarmel Engineering College CGPA of 8.9"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)
This data is not enough at all; you need thousands of annotated training texts. - Kitwradr
Yes, that is true, but it should at least overfit, and the loss should decrease with a small number of texts. - Abin K Paul
I think this issue can be solved by ensuring there are no double spaces, newline characters, etc. in the text. But even then the training loss is stuck at 15-30. - Abin K Paul

1 Answer


The losses appear to keep increasing because each pipeline component adds its loss to the running total in the losses dict as part of the update step:

https://github.com/explosion/spaCy/blob/ae4af52ce7dd9dda0eb0f1b8eeb0cba7d20facdf/spacy/pipeline/pipes.pyx#L989

At the start of each epoch, you may want to snapshot the total cumulative loss; at the end of the epoch, you could compute the average loss over the data observed.
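
As a rough illustration of that snapshot-and-average idea, here is a minimal sketch that does not depend on spaCy. The names `fake_update` and `run_epoch` are hypothetical; `fake_update` is only a stand-in that mimics how an update call adds to the `losses` dict, and the fixed loss of 2.0 per example is made up for the demo. In the training script you would pass something that wraps `nlp.update` instead.

```python
def fake_update(batch, losses):
    """Hypothetical stand-in for an update call such as nlp.update.

    It adds this batch's loss to the running total in `losses`,
    using a made-up loss of 2.0 per example.
    """
    losses["ner"] = losses.get("ner", 0.0) + 2.0 * len(batch)


def run_epoch(batches, update, losses):
    """Run one epoch and return the average loss per example for each component.

    `losses` is never reset here, so it keeps accumulating across epochs;
    the snapshot taken at the start isolates this epoch's contribution.
    """
    start = dict(losses)  # snapshot of the cumulative totals before this epoch
    n_examples = 0
    for batch in batches:
        update(batch, losses)
        n_examples += len(batch)
    if n_examples == 0:
        return {}
    return {
        name: (total - start.get(name, 0.0)) / n_examples
        for name, total in losses.items()
    }


if __name__ == "__main__":
    # Three toy (text, annotations) examples split into two batches.
    batches = [[("a", {}), ("b", {})], [("c", {})]]
    losses = {}
    for epoch in range(2):
        avg = run_epoch(batches, fake_update, losses)
        print("epoch", epoch, "cumulative", losses, "average per example", avg)
```

The per-example average is comparable across runs with different numbers of texts, which a raw summed loss (such as ~4000 for all texts versus ~300 for one) is not.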