I'm trying to train the Stanford NER classifier to identify specific things in text databases. I have made a new .prop file and a training file, and I get results, but they are the default results that I would get if I just ran the classifier without training. Is there anything I can do to fix this?

This is my code:

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.StringUtils;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

public class NLP_train {


   public static void main(String[] args) throws IOException {

       Properties props = StringUtils.propFileToProperties("C:/Users/Admin/Desktop/trainingfile.prop");

       StanfordCoreNLP pipeline = new StanfordCoreNLP(props);


       // read some text in the text variable
       File inputFile = new File("C:/Users/Admin/Desktop/target.txt");
       // create an empty Annotation just with the given text
       Annotation document = new Annotation(IOUtils.slurpFileNoExceptions(inputFile));

       // run all Annotators on this text
       pipeline.annotate(document);

       List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

       for (CoreMap sentence : sentences) {
           // traversing the words in the current sentence
           // a CoreLabel is a CoreMap with additional token-specific methods
           for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
               // this is the text of the token
               String word = token.get(CoreAnnotations.TextAnnotation.class);
               // this is the POS tag of the token
               String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
               // this is the NER label of the token
               String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);

               System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
           }
       }
   }
}

Here is my .prop file:

trainFile = C:/Users/Admin/Desktop/trainingfile.tsv

serializeTo = C:/Users/Admin/Desktop/ner-model.ser.gz

map = word=0,answer=1

useClassFeature=true

useWord=true

useNGrams=true

noMidNGrams=true

useDisjunctive=true

maxNGramLeng=6

usePrev=true

useNext=true

useSequences=true

usePrevSequences=true

maxLeft=1

# the next 4 deal with word shape features

useTypeSeqs=true

useTypeSeqs2=true

useTypeySequences=true

wordShape=chris2useLC

And an excerpt of my training file:

The 0

Type Radar

347G Radar

`` 0

Rice 0

Bowl 0

'' 0

1 Answer

To train a new NER model, you need to train it directly with the edu.stanford.nlp.ie.crf.CRFClassifier class; you cannot train new models within CoreNLP. Also, while both use properties files, the files are different: a properties file for an NER run gives properties directly to the CRFClassifier class, while a CoreNLP properties file can give properties to many different annotators. As a result, CoreNLP property names are placed in per-annotator namespaces, so a property for use by NER has a name like ner.model.
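As a rough sketch of what training outside CoreNLP can look like, the class below launches CRFClassifier in a separate JVM with the -prop flag. The class name TrainNerModel, the helper trainingCommand, and the jar location stanford-ner.jar are assumptions made for illustration; only the CRFClassifier class name and the properties file path come from this thread.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class TrainNerModel {

    // Builds the command line that runs CRFClassifier as a standalone trainer.
    // nerJar:   path to the Stanford NER jar (assumed; adjust to your install).
    // propFile: the training properties file (trainFile, serializeTo, map, features).
    static List<String> trainingCommand(String nerJar, String propFile) {
        return Arrays.asList(
                "java", "-cp", nerJar,
                "edu.stanford.nlp.ie.crf.CRFClassifier",
                "-prop", propFile);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String nerJar = "stanford-ner.jar"; // assumption: jar sits in the working directory
        String propFile = "C:/Users/Admin/Desktop/trainingfile.prop";

        List<String> command = trainingCommand(nerJar, propFile);
        System.out.println(String.join(" ", command));

        // Only start the trainer when the jar is actually present.
        if (new File(nerJar).isFile()) {
            Process trainer = new ProcessBuilder(command).inheritIO().start();
            System.out.println("Trainer exited with code " + trainer.waitFor());
        }
    }
}
```

If training succeeds, the model should be written to the path given by serializeTo in the properties file. The printed command can also be pasted into a terminal directly.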

So, what you have to do is first train a new NER model with CRFClassifier, using roughly the data and properties file you show. That will give you a serialized NER model file, written to the path you set in serializeTo. The CRF FAQ has some instructions. Then you need to make a separate properties file for CoreNLP that tells NER to run the new model. For example, if your new model is /Users/manning/ner/brands.crf.ser.gz, then you might use the property:

ner.model = edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,/Users/manning/ner/brands.crf.ser.gz
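For the code in the question, the CoreNLP side might be set up along these lines instead of loading the training .prop file. This is only a sketch: the class name CoreNlpNerConfig and the helper pipelineProperties are made up for illustration, the annotator list is an assumption chosen to match the POS and NER tags the question prints, and the custom model path is the serializeTo value from the question.

```java
import java.util.Properties;

class CoreNlpNerConfig {

    // Builds CoreNLP pipeline properties that run NER with the default
    // 3-class model followed by a custom-trained model.
    static Properties pipelineProperties(String customModel) {
        Properties props = new Properties();
        // Assumed annotator list: enough to produce POS and NER tags per token.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        // NER-specific settings live under the "ner." namespace in CoreNLP.
        props.setProperty("ner.model",
                "edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,"
                        + customModel);
        return props;
    }

    public static void main(String[] args) {
        Properties props = pipelineProperties("C:/Users/Admin/Desktop/ner-model.ser.gz");
        props.forEach((key, value) -> System.out.println(key + " = " + value));

        // With CoreNLP and its models on the classpath, these properties would
        // replace the ones loaded from trainingfile.prop in the question:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

To run only the newly trained model, pass just its path as the ner.model value.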