2 votes

I have previously trained a German classifier using the Stanford NER and a training file with 450,000 tokens. Because I had almost 20 classes, training took about 8 hours, and I had to drop a lot of features in the .prop file.

I now have a gazetteer file with 16,000,000 unique tagged tokens. I want to retrain my classifier using those tokens, but I keep running into memory issues. The gazetteer text file is 386 MB and mostly contains two-token entries (first name + last name), all unique.
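Each line of the gazetteer is the class label followed by the entry tokens, as in Stanford's usual gazette format, e.g. (names made up):

PER Anna Schmidt
PER Max Müller
LOC Berlin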

I have reduced the number of classes to 5, cut the gazetteer by 4 million tokens, and removed all the features listed on the Stanford NER FAQ page from the .prop file, but I still run into an "OutOfMemoryError: Java heap space". I have 16 GB of RAM and start the JVM with -mx15g -Xmx14g.
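For reference, I start training from the command line roughly like this (the .prop file name here is a placeholder; as far as I know, -mx is just an old synonym for -Xmx, so with both flags the heap should effectively be capped at 14g by the later one):

java -Xmx14g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop german.prop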

The error occurs about 5 hours into the process.

My problem is that I don't know how to further reduce memory usage without arbitrarily deleting entries from the gazetteer. Does anyone have further suggestions on how I could reduce my memory usage?

My .prop file looks like this:

trainFile = ....tsv
serializeTo = ...ser.gz
map = word=0,answer=1

useWordPairs=false
useNGrams=false
useClassFeature=true
useWord=true
noMidNGrams=true
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
saveFeatureIndexToDisk=true
qnSize=2
printFeatures=true
useObservedSequencesOnly=true

cleanGazette=true
gazette=....txt

Hopefully this isn't too much trouble. Thank you in advance!

I would try other ways to train the classifier; chokkan.org/software/crfsuite might be faster than Stanford's CRF implementation (disclaimer: I haven't actually tested the speed to compare them). - alvas
Thank you for your reply. I cut the size of the gazetteer in half and training completed in about an hour. I had to disable almost all properties, though, so the results were worse than without the gazetteer. So speed isn't the main issue; the issue is that the process runs out of memory. I will have a look at CRFsuite though, thank you! - user2299050
Did you try OpenNLP? My experience with Stanford NLP hasn't been great: performance was poor, and results with OpenNLP had a higher F-score (of course, the F-score depends on your domain). - schrieveslaach
I haven't tried OpenNLP yet. I came across it when deciding which NER tagger to use, but I read that it has no German model yet, which is why I am using the Stanford NER. I was offered the use of a PC at my university to train the classifier, so I will try that with the full gazetteer and without disabling as many features, and I will report back afterwards. - user2299050

2 Answers

1 vote

RegexNER could help you with this:

http://nlp.stanford.edu/static/software/regexner/

Some thoughts:

  1. Start with 1,000,000 entries and see how big a gazetteer you can handle; if 1,000,000 is still too large, shrink it further.

  2. Sort the entries by how frequently they occur in a large corpus and eliminate the infrequent ones.

  3. Hopefully many of the rarer entries in your gazetteer aren't ambiguous, so you can use RegexNER as a rule-based layer in your system that automatically tags them as PERSON (see the mapping-file sketch below).
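A RegexNER mapping file is just one rule per line with tab-separated columns: the token pattern, the tag to assign, and optionally the tags it may overwrite and a priority. A minimal sketch with made-up names:

Anna Schmidt	PERSON
Max Müller	PERSON	MISC	1.0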

0 votes

Here's an update on what I've been doing: first, I tried to train the classifier using all available data on our university's server, which has 128 GB of RAM. But since progress was incredibly slow (~120 iterations of optimization after 5 days), I decided to filter the gazetteer.

I checked the German Wikipedia for all n-grams in my gazetteer and kept only those that occurred more than once. This reduced the number of PER entries from ~12 million to 260,000. At first I did this only for my PER list and retrained my classifier, which resulted in an F-score increase of 3 percentage points (from ~70.5% to 73.5%). I have since filtered the ORG and LOC lists as well, but I am uncertain whether I should use them.
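The filtering step itself is simple. Below is a minimal sketch of it in Java, assuming the gazetteer uses the "CLASS entry tokens" line format and that the Wikipedia n-gram counts were already collected into a tab-separated "ngram<TAB>count" file; all file names are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FilterGazetteer {
    public static void main(String[] args) throws IOException {
        // Load pre-computed n-gram counts from the Wikipedia dump (placeholder file name).
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("wiki-ngram-counts.tsv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                counts.put(parts[0], Integer.parseInt(parts[1]));
            }
        }

        // Keep only gazetteer entries whose text occurs more than once in the corpus.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("gazette-per.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("gazette-per-filtered.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line is e.g. "PER Anna Schmidt": the entry text follows the class label.
                int space = line.indexOf(' ');
                String entry = (space >= 0) ? line.substring(space + 1) : line;
                if (counts.getOrDefault(entry, 0) > 1) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}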

The ORG list contains a lot of acronyms. These are all written in capital letters, but I don't know whether the training process takes capitalization into account; if it doesn't, this would lead to a lot of unwanted ambiguity between the acronyms and actual German words.

I also noticed that whenever I used either the unfiltered ORG or the unfiltered LOC list, the F-score of that one class might rise a bit, but the F-scores of the other classes dropped significantly. This is why, for now, I am only using the PER list.

This is my progress so far. Thanks again to everyone who helped.