Tokenizer Training with StanfordNLP

Question

So my requirement is verbally simple. I need StanfordCoreNLP default models along with my custom trained model, based on custom entities. In a final run, I need to be able to isolate specific phrases from a given sentence (RegexNER will be used)

Following are my efforts :-

EFFORT I :- So I wanted to use the StanfordCoreNLP CRF files, tagger files and ner model files, along with my custom trained ner models. I tried to find if there is any official way of doing this, but didnt get anything. There is a property "ner.model" for StanfordCoreNLP pipeline, but it will skip the default ones if used.

EFFORT II :- Next (might not be the smartest thing ever. Sorry! Just a guy trying to make ends meet!) , I extracted the model

stanford-corenlp-models-3.7.0.jar

, and copied all :-


*.ser.gz (Parser Models)
*.tagger (POS Tagger)
*.crf.ser.gz (NER CRF Files)

and tried to put Comma Separated Values with properties "parser.model", "pos.model" and "ner.model" respectively, as follows :-


parser.model=models/ner/default/anaphoricity_model.ser.gz,models/ner/default/anaphoricity_model_conll.ser.gz,models/ner/default/classification_model.ser.gz,models/ner/default/classification_model_conll.ser.gz,models/ner/default/clauseSearcherModel.ser.gz,models/ner/default/clustering_model.ser.gz,models/ner/default/clustering_model_conll.ser.gz,models/ner/default/english-embeddings.ser.gz,models/ner/default/english-model-conll.ser.gz,models/ner/default/english-model-default.ser.gz,models/ner/default/englishFactored.ser.gz,models/ner/default/englishPCFG.caseless.ser.gz,models/ner/default/englishPCFG.ser.gz,models/ner/default/englishRNN.ser.gz,models/ner/default/englishSR.beam.ser.gz,models/ner/default/englishSR.ser.gz,models/ner/default/gender.map.ser.gz,models/ner/default/md-model-dep.ser.gz,models/ner/default/ranking_model.ser.gz,models/ner/default/ranking_model_conll.ser.gz,models/ner/default/sentiment.binary.ser.gz,models/ner/default/sentiment.ser.gz,models/ner/default/truecasing.fast.caseless.qn.ser.gz,models/ner/default/truecasing.fast.qn.ser.gz,models/ner/default/word_counts.ser.gz,models/ner/default/wsjFactored.ser.gz,models/ner/default/wsjPCFG.ser.gz,models/ner/default/wsjRNN.ser.gz
ner.model=models/ner/default/english.all.3class.caseless.distsim.crf.ser.gz,models/ner/default/english.all.3class.distsim.crf.ser.gz,models/ner/default/english.all.3class.nodistsim.crf.ser.gz,models/ner/default/english.conll.4class.caseless.distsim.crf.ser.gz,models/ner/default/english.conll.4class.distsim.crf.ser.gz,models/ner/default/english.conll.4class.nodistsim.crf.ser.gz,models/ner/default/english.muc.7class.caseless.distsim.crf.ser.gz,models/ner/default/english.muc.7class.distsim.crf.ser.gz,models/ner/default/english.muc.7class.nodistsim.crf.ser.gz,models/ner/default/english.nowiki.3class.caseless.distsim.crf.ser.gz,models/ner/default/english.nowiki.3class.nodistsim.crf.ser.gz
pos.model=models/tagger/default/english-left3words-distsim.tagger

But, I get the following exception :-


Caused by: edu.stanford.nlp.io.RuntimeIOException: Error while loading a tagger model (probably missing model file)
Caused by: java.io.StreamCorruptedException: invalid stream header: EFBFBDEF

EFFORT III :- I thought I will be able to handle with RegexNER, and I was successful to some extent. Just that the entities that it learns through RegexNER, it doesn't apply to forthcoming expressions. Eg: It will find the entity "CUSTOM_ENTITY" inside a text, but if i put a RegexNER like

 ( [ {ner:CUSTOM_ENTITY} ] /with/ [ {ner:CUSTOM_ENTITY} ] )

it never succeeds in finding the right phrase.

Really need help here!!! I don't wanna train the complete model again, Stanford guys got over a GB of model information which is useful to me. Just that I want to add custom entities too.

Sorry for the wrong heading! How do I change it ? This is for StanfordNLP, not OpenNLP — user3279692
Not really an answer, so I'm just putting it as a comment, but: it's in practice significantly easier to just include the corenlp-models.jar file in your classpath, and then you don't have to specify custom paths for the models. The problem you're encountering seems to be a corrupted POS model file. It's unclear to me exactly what's wrong though (perhaps it's expecting a gzip file?). — Gabor Angeli

StanfordNLPHelp StanfordNLPHelp · Accepted Answer · 2017-05-05T04:52:13

First of all make sure your CLASSPATH has the proper jars in it.

Here is how you should include your custom trained NER model:

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model <csv-of-model-paths> -file example.txt

-ner.model should be set to a comma separated list of all models you want to use.

Here is an example of what you could put:

edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz,/path/to/custom_model.ser.gz

Note in my example that all of the standard models will be run, and then finally your custom model will be run. Make sure your custom model is in the CLASSPATH.

You also probably need to add this to your command: -ner.combinationMode HIGH_RECALL. By default the NER combination will only use the tags for a particular class from the first model. So if you have model1,model2,model3 only model1's LOCATION will be used. If you set things to HIGH_RECALL then model2 and model3's LOCATION tags will be used as well.

Another thing to keep in mind, model2 can't overwrite decisions by model1. It can only overwrite "O". So if model1 says that a particular token is a LOCATION, model2 can't say it's an ORGANIZATION or a PERSON or anything. So the order of the models in your list matters.

If you want to write rules that use entities found by previous rules, you should look at my answer to this question:

TokensRegex rules to get correct output for Named Entities

Tokenizer Training with StanfordNLP

2 Answers