Large Training Set For NER

Question

I have a project that involves taking property descriptions and labeling key data elements. I decided to use spaCy to train my own NER pipe, since these descriptions are not written like conventional sentences. However, when I start training, it gets to about 20% and then crashes, and I am unable to find an explanation.

How it's set up

Below is a sample of my JSON. The full JSON is 2.6 MB and contains more than 1,000 descriptions ranging from 40 to roughly 500 tokens each. The file contains ~54000 tokens in total. (Note that the data below has been altered so that it does not reflect an actual property.)

[{
    "id": 1, "paragraphs": 
    [{
    "raw": "Lots 1 and 2 of Block 1, in the City of Santa Clarita, County of Los Angeles, State of California, as per Map recorded in Book 1, Page 1 of Miscellaneous Maps, in the Office of the County Recorder of said County "
        , "sentences": 
        [{
            "tokens": 
            [
                        {"id": 1, "orth": "Lots", "ner": "B-LOT"}
                        , {"id": 2, "orth": "1", "ner": "I-LOT"}
                        , {"id": 3, "orth": "and", "ner": "I-LOT"}
                        , {"id": 4, "orth": "2", "ner": "L-LOT"}
                        , {"id": 5, "orth": "of", "ner": "O"}
                        , {"id": 6, "orth": "Block", "ner": "B-BLOCK"}
                        , {"id": 7, "orth": "1,", "ner": "L-BLOCK"}
                        , {"id": 8, "orth": "in", "ner": "O"}
                        , {"id": 9, "orth": "the", "ner": "O"}
                        , {"id": 10, "orth": "City", "ner": "O"}
                        , {"id": 11, "orth": "of", "ner": "O"}
                        , {"id": 12, "orth": "Santa", "ner": "O"}
                        , {"id": 13, "orth": "Clarita,", "ner": "O"}
                        , {"id": 14, "orth": "County", "ner": "O"}
                        , {"id": 15, "orth": "of", "ner": "O"}
                        , {"id": 16, "orth": "Los", "ner": "O"}
                        , {"id": 17, "orth": "Angeles,", "ner": "O"}
                        , {"id": 18, "orth": "State", "ner": "O"}
                        , {"id": 19, "orth": "of", "ner": "O"}
                        , {"id": 20, "orth": "California,", "ner": "O"}
                        , {"id": 21, "orth": "as", "ner": "O"}
                        , {"id": 22, "orth": "per", "ner": "O"}
                        , {"id": 23, "orth": "Map", "ner": "O"}
                        , {"id": 24, "orth": "recorded", "ner": "O"}
                        , {"id": 25, "orth": "in", "ner": "O"}
                        , {"id": 26, "orth": "Book", "ner": "B-BOOK"}
                        , {"id": 27, "orth": "1,", "ner": "L-BOOK"}
                        , {"id": 28, "orth": "Page", "ner": "B-PAGE"}
                        , {"id": 29, "orth": "1", "ner": "L-PAGE"}
                        , {"id": 30, "orth": "of", "ner": "O"}
                        , {"id": 31, "orth": "Miscellaneous", "ner": "B-MAPTYPE"}
                        , {"id": 32, "orth": "Maps,", "ner": "L-MAPTYPE"}
                        , {"id": 33, "orth": "in", "ner": "O"}
                        , {"id": 34, "orth": "the", "ner": "O"}
                        , {"id": 35, "orth": "Office", "ner": "O"}
                        , {"id": 36, "orth": "of", "ner": "O"}
                        , {"id": 37, "orth": "the", "ner": "O"}
                        , {"id": 38, "orth": "County", "ner": "O"}
                        , {"id": 39, "orth": "Recorder", "ner": "O"}
                        , {"id": 40, "orth": "of", "ner": "O"}
                        , {"id": 41, "orth": "said", "ner": "O"}
                        , {"id": 42, "orth": "County", "ner": "O"}
            ]
        }]
    }]
}]
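Since the "ner" values above use the BILUO scheme (B- begins an entity, I- continues it, L- ends it, U- marks a single-token entity, O is outside), one way to rule out the data as a cause is to check every sentence for malformed tag sequences before training. The following is a minimal sketch under that assumption; check_biluo and check_file are hypothetical helper names, not part of spaCy, and the file layout is taken from the sample above.

```python
import json


def check_biluo(tags):
    """Return a list of (index, message) problems found in one BILUO tag sequence."""
    problems = []
    open_label = None  # label of the entity currently being built, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix in ("O", "B", "U"):
            # These prefixes may only appear when no entity is still open.
            if open_label is not None:
                problems.append((i, "entity %s not closed with L- before %s" % (open_label, tag)))
            open_label = label if prefix == "B" else None
        elif prefix in ("I", "L"):
            # These prefixes must continue an open entity with the same label.
            if open_label != label:
                problems.append((i, "%s does not continue an open %s entity" % (tag, label)))
            open_label = label if prefix == "I" else None
        else:
            problems.append((i, "unknown tag %s" % tag))
            open_label = None
    if open_label is not None:
        problems.append((len(tags), "entity %s not closed at end of sentence" % open_label))
    return problems


def check_file(path):
    """Print every BILUO problem in a training file shaped like the sample above."""
    with open(path, encoding="utf8") as f:
        docs = json.load(f)
    for doc in docs:
        for para in doc["paragraphs"]:
            for sent in para["sentences"]:
                tags = [tok["ner"] for tok in sent["tokens"]]
                for index, message in check_biluo(tags):
                    print("doc %s, token %d: %s" % (doc["id"], index, message))
```

A clean file prints nothing. Any reported document is worth inspecting by hand, although a clean result does not by itself show that the data is unrelated to the crash.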

I took the train.py file that comes with spaCy in the cli folder and created my own version for this process. I left the core functionality of the file intact and just added a few things, such as some new labels for my data set and a custom tokenizer that splits on white space instead of using the conventional tokenizer. The function is below:

def NERTrain(lang
          , output_dir
          , train_data
          , dev_data
          , n_iter=30
          , n_sents=0
          , parser_multitasks=''
          , entity_multitasks=''
          , use_gpu=-1
          , vectors=None
          , gold_preproc=False
          , version="0.0.0"
          , meta_path=None
          , verbose=False
          , newLabels = None):
    """
    Train a model. Expects data in spaCy's JSON format.
    """
    util.fix_random_seed()
    util.set_env_log(True)
    n_sents = n_sents or None
    output_path = util.ensure_path(output_dir)
    train_path = util.ensure_path(train_data)
    dev_path = util.ensure_path(dev_data)
    meta_path = util.ensure_path(meta_path)
    if not output_path.exists():
        output_path.mkdir()
    if not train_path.exists():
        prints(train_path, title=Messages.M050, exits=1)
    if dev_path and not dev_path.exists():
        prints(dev_path, title=Messages.M051, exits=1)
    if meta_path is not None and not meta_path.exists():
        prints(meta_path, title=Messages.M020, exits=1)
    meta = util.read_json(meta_path) if meta_path else {}
    if not isinstance(meta, dict):
        prints(Messages.M053.format(meta_type=type(meta)),
               title=Messages.M052, exits=1)
    meta.setdefault('lang', lang)
    meta.setdefault('name', 'unnamed')

    pipeline = ['ner']

    # Take dropout and batch size as generators of values -- dropout
    # starts high and decays sharply, to force the optimizer to explore.
    # Batch size starts at 1 and grows, so that we make updates quickly
    # at the beginning of training.
    dropout_rates = util.decaying(util.env_opt('dropout_from', 0.2),
                                  util.env_opt('dropout_to', 0.2),
                                  util.env_opt('dropout_decay', 0.0))
    batch_sizes = util.compounding(util.env_opt('batch_from', 1),
                                   util.env_opt('batch_to', 16),
                                   util.env_opt('batch_compound', 1.001))
    max_doc_len = util.env_opt('max_doc_len', 5000)
    corpus = GoldCorpus(train_path, dev_path, limit=n_sents)
    n_train_words = corpus.count_train()

    lang_class = util.get_lang_class(lang)
    nlp = lang_class()

    if "ner" in nlp.pipe_names:
        nlp.remove_pipe("ner")

    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, first=True)

    meta['pipeline'] = pipeline
    nlp.meta.update(meta)
    if vectors:
        print("Load vectors model", vectors)
        util.load_model(vectors, vocab=nlp.vocab)
        for lex in nlp.vocab:
            values = {}
            for attr, func in nlp.vocab.lex_attr_getters.items():
                # These attrs are expected to be set by data. Others should
                # be set by calling the language functions.
                if attr not in (CLUSTER, PROB, IS_OOV, LANG):
                    values[lex.vocab.strings[attr]] = func(lex.orth_)
            lex.set_attrs(**values)
            lex.is_oov = False
#    for name in pipeline:
#        nlp.add_pipe(nlp.create_pipe(name), name=name)
    if parser_multitasks:
        for objective in parser_multitasks.split(','):
            nlp.parser.add_multitask_objective(objective)
    if entity_multitasks:
        for objective in entity_multitasks.split(','):
            nlp.entity.add_multitask_objective(objective)
    optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
    nlp._optimizer = None
    nlp.tokenizer = WTok(nlp)  # custom whitespace tokenizer
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

    if newLabels is not None:
        for l in newLabels:
            ner.add_label(l)

    print("Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS")
    try:
        train_docs = corpus.train_docs(nlp, projectivize=True, noise_level=0.0,
                                       gold_preproc=gold_preproc, max_length=0)
        train_docs = list(train_docs)
        with nlp.disable_pipes(*other_pipes):
            for i in range(n_iter):
                with tqdm.tqdm(total=n_train_words, leave=False) as pbar:
                    losses = {}
                    for batch in minibatch(train_docs, size=batch_sizes):
                        batch = [(d, g) for (d, g) in batch if len(d) < max_doc_len]
                        if not batch:
                            continue
                        docs, golds = zip(*batch)
                        nlp.update(docs, golds, sgd=optimizer,
                                   drop=next(dropout_rates), losses=losses)
                        pbar.update(sum(len(doc) for doc in docs))

                with nlp.use_params(optimizer.averages):
                    util.set_env_log(False)
                    epoch_model_path = output_path / ('model%d' % i)
                    nlp.to_disk(epoch_model_path)
                    nlp_loaded = util.load_model_from_path(epoch_model_path)
                    dev_docs = list(corpus.dev_docs(
                                    nlp_loaded,
                                    gold_preproc=gold_preproc))
                    nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs)
                    start_time = timer()
                    scorer = nlp_loaded.evaluate(dev_docs, verbose)
                    end_time = timer()
                    if use_gpu < 0:
                        gpu_wps = None
                        cpu_wps = nwords/(end_time-start_time)
                    else:
                        gpu_wps = nwords/(end_time-start_time)
                        with Model.use_device('cpu'):
                            nlp_loaded = util.load_model_from_path(epoch_model_path)
                            dev_docs = list(corpus.dev_docs(
                                            nlp_loaded, gold_preproc=gold_preproc))
                            start_time = timer()
                            scorer = nlp_loaded.evaluate(dev_docs)
                            end_time = timer()
                            cpu_wps = nwords/(end_time-start_time)
                    acc_loc = (output_path / ('model%d' % i) / 'accuracy.json')
                    with acc_loc.open('w') as file_:
                        file_.write(json_dumps(scorer.scores))
                    meta_loc = output_path / ('model%d' % i) / 'meta.json'
                    meta['accuracy'] = scorer.scores
                    meta['speed'] = {'nwords': nwords, 'cpu': cpu_wps,
                                     'gpu': gpu_wps}
                    meta['vectors'] = {'width': nlp.vocab.vectors_length,
                                       'vectors': len(nlp.vocab.vectors),
                                       'keys': nlp.vocab.vectors.n_keys}
                    meta['lang'] = nlp.lang
                    meta['pipeline'] = pipeline
                    meta['spacy_version'] = '>=%s' % about.__version__
                    meta.setdefault('name', 'model%d' % i)
                    meta.setdefault('version', version)

                    with meta_loc.open('w') as file_:
                        file_.write(json_dumps(meta))
                    util.set_env_log(True)
                print_progress(i, losses, scorer.scores, cpu_wps=cpu_wps,
                               gpu_wps=gpu_wps)
    finally:
        print("Saving model...")
        with nlp.use_params(optimizer.averages):
            final_model_path = output_path / 'model-final'
            nlp.to_disk(final_model_path)

What I've Done

When running the full JSON file failed, I tried the same with a smaller sample of 100 descriptions. That run completed with no issues. Before I go chopping my dataset into bite-sized chunks of 100 (which I really don't want to do, and shouldn't have to), I wanted to see if anyone could take a look and tell me whether this is (1) some limit in spaCy that I have somehow hit, (2) a memory issue, or (3) a code issue I overlooked.

Please be advised that this process is being run on my local machine, which is specced as follows:

PC specs

Windows 10
Intel(R) Core(TM) i7-6600U CPU @ 2.6GHz 2.81 GHz
16.0 GB Ram
Python 3.7.4
spaCy 2.0.16

Any help would be greatly appreciated. Thank you.

EDIT 1:

After I asked this question, I figured that in the meantime I would try processing my files in small batches of 100. Interestingly enough, one of the files caused the process to crash. I immediately thought it was a data issue, so I added a "print" to the training function to see which text was causing it. But after I added the "print", the file completed without error. I am not sure what to make of this, but it is some added information.
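For reference, splitting the training file into batches of 100 can be sketched as below. This is only an illustration: split_training_file and the chunk_NNNN.json naming are hypothetical, and the code assumes the top level of the JSON is a list of documents, as in the sample above.

```python
import json
from pathlib import Path


def split_training_file(src, out_dir, chunk_size=100):
    """Write the documents in `src` to numbered JSON files of at most `chunk_size` documents each."""
    with open(src, encoding="utf8") as f:
        docs = json.load(f)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(docs), chunk_size):
        # Chunk files are numbered in order: chunk_0000.json, chunk_0001.json, ...
        path = out_dir / ("chunk_%04d.json" % (start // chunk_size))
        with path.open("w", encoding="utf8") as f:
            json.dump(docs[start:start + chunk_size], f)
        paths.append(path)
    return paths
```

Each returned path can then be passed as train_data to a separate training run, which narrows down which batch, if any, triggers the crash.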

EDIT 2:

I was finally able to get an error message related to the crash:

Unhandled exception at 0x00007FF8EB9E2BE2 (ner.cp37-win_amd64.pyd) in python.exe: 0xC0000005: Access violation reading location 0x000001C4213D1FE4. occurred

The error is marked on invoke_main() within exe_common.inl. I've attempted to find more information about this error but found very little. It appears to be some kind of Windows error? Any help is appreciated.
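An access violation inside a compiled extension ends the process before Python can raise an exception, which is why no traceback appears. The standard-library faulthandler module can dump the Python-level stack at the moment of such a fatal error. A minimal sketch follows; enable_crash_traceback is a hypothetical helper name, and it would need to be called before training starts.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Ask Python to dump the active Python stack if the process hits a fatal error.

    Returns the open log file (or None when writing to stderr). The caller must keep
    the returned file open for as long as the handler should stay active.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same handler can be switched on without code changes by starting the interpreter with python -X faulthandler. The dump shows which Python call was active at the time of the crash; it does not explain the fault inside the extension itself.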

No sir. When running from an IDE it simply stops. When I run from the command line it says "Python has stopped working" and gives me an option to debug or close. When I choose debug, nothing loads. — Brendt McKnight

Brendt McKnight · Accepted Answer · 2019-10-24T14:57:50

In the end, this turned out to be version incompatibilities between some of spaCy's dependencies. It appears to have been caused by several uninstalls and reinstalls of older and newer versions of spaCy. I created a fresh environment, installed only the most current version of spaCy, and everything works great. If you are using Anaconda Navigator, I would not trust the package installer in the UI. It appears to be linked to older versions, and you are much better off using pip from the terminal.
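To see what is actually installed in an environment, the versions of spaCy and its compiled dependencies can be listed from Python. This is a minimal sketch with some assumptions: installed_versions is a hypothetical helper, importlib.metadata requires Python 3.8 or later (newer than the 3.7.4 used in the question), and the package list is an illustrative guess that should be adjusted to the environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed list of packages to inspect; edit as needed.
    names = ["spacy", "thinc", "cymem", "preshed", "murmurhash", "numpy"]
    for name, ver in installed_versions(names).items():
        print("%-12s %s" % (name, ver or "not installed"))
```

Running pip check from the terminal in the same environment also reports installed packages whose declared requirements conflict.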
