I'm currently investigating options to extract person names, locations, technical terms, and categories from text (a large number of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The extracted information is added as metadata and should increase the precision of search.
E.g. when someone queries 'wicket', they should be able to decide whether they mean the sport of cricket or the Apache project. I tried to implement this on my own with only minor success so far. I have since found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether their entity-extraction precision is high enough.
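To make the 'wicket' example concrete, here is a minimal sketch of the query side I have in mind. The field names (`body`, `category`) and the category value `apache-project` are my own assumptions, not the output of any particular tool:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class WicketQueries {
    // Plain full-text query: matches cricket articles and Apache Wicket alike.
    Query ambiguous() {
        return new TermQuery(new Term("body", "wicket"));
    }

    // With entity metadata in the index, the same query can be narrowed
    // to one sense of the word by additionally requiring a category field.
    Query apacheOnly() {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("body", "wicket")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("category", "apache-project")), BooleanClause.Occur.MUST)
                .build();
    }
}
```

The open question is how to get the `category` (and person/location) fields populated reliably, which is where the tools below come in: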
- DBpedia Spotlight; the demo looks very promising
- OpenNLP; requires training. Which training data should I use? (See the extraction sketch after this list, which uses its pretrained models.)
- OpenNLP tools
- Stanbol
- NLTK
- balie
- UIMA
- GATE -> example code
- Apache Mahout
- Stanford CRF-NER
- maui-indexer
- Mallet
- Illinois Named Entity Tagger (not open source, but free)
- wikipedianer data
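For reference, the extraction step with OpenNLP looks roughly like the following. This is only a sketch: it assumes the pretrained `en-ner-person.bin` model has been downloaded from the OpenNLP site, and error handling is omitted:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class NerSketch {
    public static void main(String[] args) throws Exception {
        // Pretrained person-name model; the path is an assumption,
        // adjust to wherever the downloaded model actually lives.
        try (InputStream in = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(in);
            NameFinderME finder = new NameFinderME(model);

            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                    "Shane Warne took his 700th wicket in Melbourne.");
            for (Span span : finder.find(tokens)) {
                // Each span covers the token range of one detected entity.
                StringBuilder name = new StringBuilder();
                for (int i = span.getStart(); i < span.getEnd(); i++) {
                    name.append(tokens[i]).append(' ');
                }
                System.out.println(span.getType() + ": " + name.toString().trim());
            }
            // Reset document-level context before processing the next article.
            finder.clearAdaptiveData();
        }
    }
}
```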
My questions:
- Does anyone have experience with any of the tools listed above and their precision/recall? Or do you know whether training data is required and, if so, available?
- Are there articles or tutorials that can get me started with entity extraction (NER) for each tool?
- How can they be integrated with Lucene? (A sketch of what I have in mind follows this list.)
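Regarding the Lucene integration, my rough plan is to run the extractor over each article before indexing and store every entity as an additional, non-analyzed field. A sketch, assuming `extractPersons` stands in for whichever NER tool ends up being used and `writer` is an open `IndexWriter`:

```java
import java.util.Collections;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

class EntityIndexer {
    // Placeholder for whichever NER tool ends up being used
    // (OpenNLP, Stanford CRF-NER, DBpedia Spotlight, ...).
    List<String> extractPersons(String text) {
        return Collections.emptyList();
    }

    void indexArticle(IndexWriter writer, String articleText) throws Exception {
        Document doc = new Document();
        // The article body stays analyzed for ordinary full-text search.
        doc.add(new TextField("body", articleText, Field.Store.YES));

        // Each extracted entity becomes a repeated, non-analyzed metadata
        // field, so it can be matched exactly and used for filtering.
        for (String person : extractPersons(articleText)) {
            doc.add(new StringField("person", person, Field.Store.YES));
        }
        writer.addDocument(doc);
    }
}
```

Whether storing entities as repeated `StringField`s is the right design, or whether a payload/facet-based approach would work better, is part of what I'm asking.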
Here are some questions related to that subject: