2 votes

I have previously trained a German classifier using the Stanford NER and a training file with 450,000 tokens. Because I had almost 20 classes, training took about 8 hours, and I had to drop a lot of features in the .prop file.

I now have a gazetteer file with 16,000,000 unique tagged tokens. I want to retrain my classifier using those tokens, but I keep running into memory issues. The gazetteer text file is 386 MB and mostly contains two-token entries (first name + last name), all unique.
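Each line of the gazetteer is the class label followed by the entry tokens, as in Stanford's usual gazette format, e.g. (names made up):

PER Anna Schmidt
PER Max Müller
LOC Berlin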

I have reduced the number of classes to 5, cut the gazetteer by 4 million tokens, and removed all the features listed on the Stanford NER FAQ page from the .prop file, but I still run into an "OutOfMemoryError: Java heap space". I have 16 GB of RAM and start the JVM with -mx15g -Xmx14g.
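For reference, I start training from the command line roughly like this (the .prop file name here is a placeholder; as far as I know, -mx is just an old synonym for -Xmx, so with both flags the heap should effectively be capped at 14g by the later one):

java -Xmx14g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop german.prop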

The error occurs about 5 hours into the process.

My problem is that I don't know how to further reduce memory usage without arbitrarily deleting entries from the gazetteer. Does anyone have further suggestions on how I could reduce my memory usage?

My .prop file looks like this:

trainFile = ....tsv
serializeTo = ...ser.gz
map = word=0,answer=1

useWordPairs=false
useNGrams=false
useClassFeature=true
useWord=true
noMidNGrams=true
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
saveFeatureIndexToDisk=true
qnSize=2
printFeatures=true
useObservedSequencesOnly=true

cleanGazette=true
gazette=....txt

Hopefully this isn't too much trouble. Thank you in advance!

I would try other ways to train the classifier; chokkan.org/software/crfsuite might be faster than Stanford's CRF implementation (disclaimer: I haven't actually tested the speed to compare them). - alvas
Thank you for your reply. I cut the size of the gazetteer in half and training completed in about an hour. I had to disable almost all properties, though, so the results were worse than without the gazetteer. So speed isn't the main issue; the issue is that the process runs out of memory. I will have a look at CRFsuite though, thank you! - user2299050
Did you try OpenNLP? My experience with Stanford NLP hasn't been great: performance was poor, and results with OpenNLP had a higher F-score (of course, the F-score depends on your domain). - schrieveslaach
I haven't tried OpenNLP yet. I came across it when deciding which NER tagger to use, but I read that it has no German model yet, which is why I am using the Stanford NER. I was offered the use of a PC at my university to train the classifier, so I will try that with the full gazetteer and without disabling as many features, and I will report back afterwards. - user2299050

2 Answers

1 vote

RegexNER could help you with this:

http://nlp.stanford.edu/static/software/regexner/

Some thoughts:

  1. Start with 1,000,000 entries and see how big a gazetteer you can handle; if 1,000,000 is still too large, shrink it further.

  2. Sort the entries by how frequently they occur in a large corpus and eliminate the infrequent ones.

  3. Hopefully many of the rarer entries in your gazetteer aren't ambiguous, so you can use RegexNER as a rule-based layer in your system that automatically tags them as PERSON (see the mapping-file sketch below).
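A RegexNER mapping file is just one rule per line with tab-separated columns: the token pattern, the tag to assign, and optionally the tags it may overwrite and a priority. A minimal sketch with made-up names:

Anna Schmidt	PERSON
Max Müller	PERSON	MISC	1.0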

0 votes

Here's an update on what I've been doing: first, I tried to train the classifier using all available data on our university's server, which has 128 GB of RAM. But since progress was incredibly slow (~120 iterations of optimization after 5 days), I decided to filter the gazetteer.

I checked the German Wikipedia for all n-grams in my gazetteer and kept only those that occurred more than once. This reduced the number of PER entries from ~12 million to 260,000. At first I did this only for my PER list and retrained my classifier, which resulted in an F-score increase of 3 percentage points (from ~70.5% to 73.5%). I have since filtered the ORG and LOC lists as well, but I am uncertain whether I should use them.
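The filtering step itself is simple. Below is a minimal sketch of it in Java, assuming the gazetteer uses the "CLASS entry tokens" line format and that the Wikipedia n-gram counts were already collected into a tab-separated "ngram<TAB>count" file; all file names are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FilterGazetteer {
    public static void main(String[] args) throws IOException {
        // Load pre-computed n-gram counts from the Wikipedia dump (placeholder file name).
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("wiki-ngram-counts.tsv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                counts.put(parts[0], Integer.parseInt(parts[1]));
            }
        }

        // Keep only gazetteer entries whose text occurs more than once in the corpus.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("gazette-per.txt"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("gazette-per-filtered.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each line is e.g. "PER Anna Schmidt": the entry text follows the class label.
                int space = line.indexOf(' ');
                String entry = (space >= 0) ? line.substring(space + 1) : line;
                if (counts.getOrDefault(entry, 0) > 1) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}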

The ORG list contains a lot of acronyms. These are all written in capital letters, but I don't know whether the training process takes capitalization into account; if it doesn't, this would lead to a lot of unwanted ambiguity between the acronyms and actual German words.

I also noticed that whenever I used either the unfiltered ORG or the unfiltered LOC list, the F-score of that one class might rise a bit, but the F-scores of the other classes dropped significantly. This is why, for now, I am only using the PER list.

This is my progress so far. Thanks again to everyone who helped.