5
votes

I am building a Named Entity Recognizer with a Conditional Random Field and am looking for two things:

A) An open source, English NER dataset for Person, Location, and Organization entities

B) A list of English NER features

I have already looked at the CoNLL-2003 corpus, which is exactly what I want, but it is not readily available. I have also been unable to find a list of NER features, and I would like to avoid designing them by hand.

Thanks

2
So I take it you're looking for something free, right? :) I think there might be a few on this list that could help: cs.technion.ac.il/~gabr/resources/data/ne_datasets.html - dmn

2 Answers

2
votes

You'll find a concise and very informative study of what NER requires in this paper from Ratinov & Roth. In addition, their system is fully open source and includes lists of named entities gathered from Wikipedia.

1
votes

A) Besides the MUC corpora, you should check out the manually annotated sub-corpus here: http://www.americannationalcorpus.org/MASC/About.html (it's free and covers various document genres). It comes with tools for parsing the format in NLTK, GATE and UIMA: http://www.anc.org/MASC/Download

B) This is a very general question. You can try n-grams, word capitalization, the word strings themselves, parts of speech, etc. A good starting point is to read about the Stanford NER approach with a CRF: http://nlp.stanford.edu/software/CRF-NER.shtml
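
To make the feature types above concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions, not the feature set used by Stanford NER: the function names, feature names, and the example sentence are made up, and the POS tags are assumed to come from an external tagger. Each token is mapped to a dict of features built from the word string, its capitalization, short character prefixes and suffixes, the POS tag, and the neighbouring tokens.

```python
def token_features(sentence, i):
    """Build a feature dict for the token at position i.

    `sentence` is a list of (word, pos_tag) pairs. The POS tags are
    assumed to be supplied by an external tagger (hypothetical input).
    """
    word, pos = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),      # the word string itself
        "word.istitle": word.istitle(),  # capitalization, e.g. "London"
        "word.isupper": word.isupper(),  # capitalization, e.g. "IBM"
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],             # character n-grams
        "suffix3": word[-3:],
        "pos": pos,                      # part of speech
    }
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        features["prev.word.lower"] = prev_word.lower()
        features["prev.pos"] = prev_pos
        # word bigram made of the previous and current token
        features["bigram"] = prev_word.lower() + " " + word.lower()
    else:
        features["BOS"] = True           # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        features["next.word.lower"] = next_word.lower()
        features["next.pos"] = next_pos
    else:
        features["EOS"] = True           # end of sentence
    return features


def sentence_features(sentence):
    """Return one feature dict per token in the sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Made-up example sentence with hand-written POS tags.
example = [("John", "NNP"), ("lives", "VBZ"), ("in", "IN"), ("London", "NNP")]
feats = sentence_features(example)
```

A list of such dicts per sentence is the kind of input that dict-based CRF toolkits typically accept (sklearn-crfsuite, as far as I know), so this is a reasonable baseline to extend with gazetteer lookups or wider context windows.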