I learned how to customize Stanford NER (Named Entity Recognizer) in Java from here:
http://nlp.stanford.edu/software/crf-faq.shtml#a
But I am developing my project in Python, and I need to train my classifier with some custom entities.
I searched a lot for a solution but could not find one. Any ideas? If it is not possible, is there another way to train a classifier with custom entities in Python, e.g., with NLTK or another library?
EDIT (code added): This is what I did to set up and test Stanford NER, which worked nicely:
from nltk.tag.stanford import StanfordNERTagger
# Raw strings keep the Windows backslashes from being read as escape sequences
path_to_model = r"C:\..\stanford-ner-2016-10-31\classifiers\english.all.3class.distsim.crf.ser"
path_to_jar = r"C:\..\stanford-ner-2016-10-31\stanford-ner.jar"
nertagger = StanfordNERTagger(path_to_model, path_to_jar)
query = "Show me the best eye doctor in Munich"
# Tag the whitespace-split tokens of the query
print(nertagger.tag(query.split()))
This code worked successfully. Then I downloaded the sample austen.prop file along with jane-austen-emma-ch1.tsv and jane-austen-emma-ch2.tsv and put them in a custom folder inside the NER tagger library folder. I modified jane-austen-emma-ch1.tsv with my custom entity tags. The austen.prop file points to jane-austen-emma-ch1.tsv. I then modified the code above as follows, but it does not work:
from nltk.tag.stanford import StanfordNERTagger
# Same setup as before, but the model path now points at austen.prop
path_to_model = r"C:\..\stanford-ner-2016-10-31\custom\austen.prop"
path_to_jar = r"C:\..\stanford-ner-2016-10-31\stanford-ner.jar"
nertagger = StanfordNERTagger(path_to_model, path_to_jar)
query = "Show me the best eye doctor in Munich"
print(nertagger.tag(query.split()))
This code produces the following error:
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 236C6F63
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1507)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3017)
Caused by: java.io.StreamCorruptedException: invalid stream header: 236C6F63
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:808)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1462)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1494)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1505)
... 1 more
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : ['C:\\Program Files\\Java\\jdk1.8.0_111\\bin\\java.exe', '-mx1000m', '-cp', 'C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0-javadoc.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0-sources.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner-3.7.0.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\stanford-ner.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\joda-time.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\jollyday-0.4.9.jar;C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31\\lib\\stanford-ner-resources.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', 'C:/Users/HP/Desktop/Downloads1/Compressed/stanford-ner-2016-10-31/stanford-ner-2016-10-31/custom/austen.prop', '-textFile', 'C:\\Users\\HP\\AppData\\Local\\Temp\\tmppk8_741f', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
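In case it helps narrow this down, here is a small sketch that decodes the header bytes from the exception and builds the training command described in the FAQ linked above. It suggests, though I have not confirmed it, that -loadClassifier was handed the plain-text austen.prop file rather than a trained model. The helper name build_train_command and the bare file names are placeholders of mine, not part of NLTK or Stanford NER, and I am assuming the FAQ's "-prop" training invocation applies unchanged to this version.

```python
# Decode the four header bytes from "invalid stream header: 236C6F63".
header = bytes.fromhex("236C6F63")
header_text = header.decode("ascii")
print(header_text)

# A Java-serialized object stream begins with the magic bytes AC ED.
# This header does not, so the file was probably not a serialized model.
is_java_serialized = header.startswith(bytes.fromhex("ACED"))
print(is_java_serialized)


def build_train_command(jar_path, prop_path):
    """Hypothetical helper: build the FAQ-style training command.

    Training is a separate Java run driven by the .prop file; the model
    file it writes (named by serializeTo in the .prop file) is what would
    then be passed to StanfordNERTagger as the model path.
    """
    return [
        "java", "-cp", jar_path,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop_path,
    ]


train_cmd = build_train_command("stanford-ner.jar", "austen.prop")
print(train_cmd)
# To actually train (requires Java on PATH and real file paths):
# import subprocess; subprocess.run(train_cmd, check=True)
```

If this reading is right, the second snippet above would need the path of the model file produced by that training run, not the path of austen.prop itself.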