I have previously trained a German classifier using Stanford NER and a training file with 450,000 tokens. Because I had almost 20 classes, training took about 8 hours and I had to drop a lot of features from the prop file.
I now have a gazette file with 16,000,000 unique tagged tokens. I want to retrain my classifier using those tokens, but I keep running into memory issues. The gazette txt is 386 MB and mostly contains unique two-token entries (first name + last name).
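The gazette lines look roughly like this (invented names; "PERSON" stands in for my actual class labels):

PERSON Max Mustermann
PERSON Erika Musterfrau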
I have reduced the number of classes to 5, cut the gazette down by 4 million tokens, and removed all the features listed on the Stanford NER FAQ page from the prop file, but I still run into a java.lang.OutOfMemoryError: Java heap space. I have 16 GB of RAM and start the JVM with -mx15g -Xmx14g.
The error occurs about 5 hours into the process.
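For completeness, this is roughly how I launch the training (the jar and prop file names are shortened placeholders; the heap flags are exactly what I use):

java -mx15g -Xmx14g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop german-ner.prop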
My problem is that I don't know how to further reduce memory usage without arbitrarily deleting entries from the gazette. Does anyone have further suggestions on how I could reduce my memory usage?
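For reference, the gazette reduction so far was essentially a class filter, roughly like this (simplified sketch; the file names and class set are placeholders for my real ones):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class PruneGazette {
    public static void main(String[] args) throws IOException {
        // Placeholder labels: keep only entries whose class survived the cut from ~20 to 5.
        Set<String> keptClasses = new HashSet<>(
                Arrays.asList("PERSON", "LOCATION", "ORGANIZATION", "MISC", "OTHER"));
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("gazette-full.txt"), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                     new FileOutputStream("gazette-pruned.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Gazette format: CLASS token1 token2 ...
                int space = line.indexOf(' ');
                if (space > 0 && keptClasses.contains(line.substring(0, space))) {
                    out.println(line);
                }
            }
        }
    }
}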
My prop file looks like this:
trainFile = ....tsv
serializeTo = ...ser.gz
map = word=0,answer=1
useWordPairs=false
useNGrams=false
useClassFeature=true
useWord=true
noMidNGrams=true
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
saveFeatureIndexToDisk=true
qnSize=2
printFeatures=true
useObservedSequencesOnly=true
cleanGazette=true
gazette=....txt
Hopefully this isn't too much trouble. Thank you in advance!