I am retraining the Stanford NER model on my own training data for extracting organizations. But, whether I use a 4GB RAM machine or an 8GB RAM machine, I get the same Java heap space error.
Could anyone tell what is the general configuration of machines on which we can retrain the models without getting these memory issues?
I used the following command :
java -mx4g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop newdata_retrain.prop
I am working with training data (multiple files - each file has about 15000 lines in the following format) - one word and its category on each line
She O is O working O at O Microsoft ORGANIZATION
Is there anything else we could do to make these models run reliably ? I did try with reducing the number of classes in my training data. But that is impacting the accuracy of extraction. For example, some locations or other entities are getting classified as organization names. Can we reduce specific number of classes without impact on accuracy ?
One data I am using is the Alan Ritter twitter nlp data : https://github.com/aritter/twitter_nlp/tree/master/data/annotated/ner.txt
The properties file looks like this:
#location of the training file
trainFile = ner.txt
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model-twitter.ser.gz
#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1
#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
The error I am getting : the stacktrace is this :
CRFClassifier invoked on Mon Dec 01 02:55:22 UTC 2014 with arguments:
-prop twitter_retrain.prop
usePrevSequences=true
useClassFeature=true
useTypeSeqs2=true
useSequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk=true
useTypeySequences=true
useDisjunctive=true
noMidNGrams=true
serializeTo=ner-model-twitter.ser.gz
maxNGramLeng=6
useNGrams=true
usePrev=true
useNext=true
maxLeft=1
trainFile=ner.txt
map=word=0,answer=1
useWord=true
useTypeSeqs=true
[1000][2000]numFeatures = 215032
setting nodeFeatureIndicesMap, size=149877
setting edgeFeatureIndicesMap, size=65155
Time to convert docs to feature indices: 4.4 seconds
numClasses: 21 [0=O,1=B-facility,2=I-facility,3=B-other,4=I-other,5=B-company,6=B-person,7=B-tvshow,8=B-product,9=B-sportsteam,10=I-person,11=B-geo-loc,12=B-movie,13=I-movie,14=I-tvshow,15=I-company,16=B-musicartist,17=I-musicartist,18=I-geo-loc,19=I-product,20=I-sportsteam]
numDocuments: 2394
numDatums: 46469
numFeatures: 215032
Time to convert docs to data/labels: 2.5 seconds
Writing feature index to temporary file.
numWeights: 31880772
QNMinimizer called on double function of 31880772 variables, using M = 25.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:923)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:885)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:879)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:91)
at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1911)
at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1718)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:759)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:747)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2937)