I have been trying to train a Named Entity Recognition model for a specific domain, with new entity types. There does not seem to be a single complete pipeline for this, so several packages have to be combined.
I would like to give NLTK a chance. My question is: how can I train the NLTK NER to classify and match new entities using the ieer corpus?
I will of course provide training data in IOB format, like:
We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP
I guess I will have to tag the tokens myself.
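To make the format concrete, here is a minimal sketch of how I imagine reading such a file into per-sentence lists of (word, POS, IOB) triples, which is the shape that nltk.chunk.conlltags2tree accepts. The function name read_iob is my own, and the blank-line sentence separator is an assumption, since my sample only shows one sentence.

```python
def read_iob(text):
    """Parse 'word POS IOB' lines into sentences of (word, pos, iob) triples.

    Assumption: sentences are separated by blank lines.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences


sample = """We PRP B-NP
saw VBD O
the DT B-NP
yellow JJ I-NP
dog NN I-NP"""

sentences = read_iob(sample)
print(sentences[0])
```

For my own entities I would presumably replace the NP labels with entity labels, keeping the same B-/I-/O scheme.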
Once I have a text file in this format, what do I do next? What are the steps to train on my data together with the ieer corpus, or with a better one such as conll2000?
I know there is some documentation out there, but it is not clear to me what to do once I have a tagged training corpus.
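From what I have pieced together so far, the training step might look roughly like the sketch below: a unigram tagger that learns the most likely IOB tag for each POS tag, wrapped as a chunk parser. The class name UnigramChunker and the tiny inline training set are only illustrative assumptions; real training data would come from my tagged file or from a corpus, and I do not know whether this is the recommended approach for new entity types.

```python
import nltk


class UnigramChunker(nltk.ChunkParserI):
    """Illustrative chunker: predicts an IOB tag from each POS tag alone."""

    def __init__(self, train_sents):
        # train_sents: list of sentences, each a list of (word, pos, iob) triples.
        # The tagger is trained on (pos, iob) pairs, so POS tags act as "words".
        train_data = [[(pos, iob) for _, pos, iob in sent] for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, tagged_sentence):
        # tagged_sentence: list of (word, pos) pairs.
        pos_tags = [pos for _, pos in tagged_sentence]
        predicted = self.tagger.tag(pos_tags)  # list of (pos, iob) pairs
        conlltags = [
            (word, pos, iob)
            for (word, pos), (_, iob) in zip(tagged_sentence, predicted)
        ]
        # Convert the IOB triples back into an nltk Tree of chunks.
        return nltk.chunk.conlltags2tree(conlltags)


# Toy training set (hypothetical); one sentence in (word, pos, iob) form.
train_sents = [[
    ("We", "PRP", "B-NP"),
    ("saw", "VBD", "O"),
    ("the", "DT", "B-NP"),
    ("yellow", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
]]

chunker = UnigramChunker(train_sents)
tree = chunker.parse([("the", "DT"), ("yellow", "JJ"), ("cat", "NN")])
print(tree)
```

A tagger that only looks at POS tags is probably too weak for domain-specific entities, so I assume a classifier using word-level features would be needed in practice.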
I want to go with NLTK because I then want to use its relextract module (nltk.sem.relextract) for relation extraction.
Any advice would be appreciated.
Thanks.