Unconventional named-entity recognition

Question

I'm trying to design a somewhat unconventional NER system that marks certain multiword strings as single units/tokens.

There are a lot of cool NER tools out there, but I have a few special needs that make it pretty much impossible to use something straight out of the box:

First, the entities can't just be extracted and printed out in a list--they need to be marked in some way and consolidated into tokens.

Second, categorization is not important--Person/Organization/Location doesn't matter (at least in the output).

Third, these aren't just your typical ENAMEX named entities we're looking for. We want companies and organizations, but also concepts like 'climate change' and 'gay marriage.' I've seen tags like these on some tools out there, but all of them were 'extraction-style'.

How would I go about getting this type of functionality? Would training the Stanford tagger on my own hand-annotated dataset do the job (where 'climate change'-esque phrases are labeled MISC or something)? Or am I better off just making a shortlist of the 'weird' entities and checking the text against that after it's been run through a regular NER system?

Thanks so much!

"climate change" and "gay marriage" aren't named entities, in the sense of conventional NER. They're more like collocations or fixed expressions. Some algorithm based on mutual information might be able to pick them up. — Fred Foo
@larsmans Yes I've dabbled with something similar. Chunk first, find the Noun Phrases, then run collocation statistics to find the 'interesting' (unlikely) phrases. This latter step takes fine tuning, and I'm not there yet. Better stats might be the answer. — winwaed
@winwaed: an alternative would be string matching with the Wikipedia to find the articles that are used as anchor text; that also gives you the "meaning" of the phrase. I've been doing that with Meij's algorithm lately and it works quite well. — Fred Foo
@all Ahh thanks! And people are doing some amazing stuff with Wikipedia and NLP. There's a research group at University of Sydney that's using it to extract really big, automatically-annotated training corpora. — jjdubs

Christopher Manning · Accepted Answer · 2012-06-27T21:29:31

The underlying CRF model of a named entity tagger such as Stanford NER can actually be used to recognize anything, not just named entities. There are certainly people who have used them quite successfully to pick out various kinds of terminological phrases. The software can certainly give you marked up token sequences in context.
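As a rough illustration of the "consolidated into tokens" step the question asks for, here is a minimal sketch that assumes the tagger's output has already been read into a list of (token, label) pairs, with "O" marking tokens outside any phrase. The function name `merge_entities`, the underscore joiner, and the sample labels are all assumptions made for this example; they are not part of Stanford NER.

```python
def merge_entities(tagged, outside="O", joiner="_"):
    """Collapse each run of consecutive tokens sharing a non-outside label
    into a single token; the label itself is discarded in the output."""
    merged = []
    run = []             # tokens of the phrase currently being built
    run_label = outside  # label of that phrase
    for token, label in tagged:
        if label != outside and label == run_label:
            # Same label as the previous token: extend the current phrase.
            run.append(token)
            continue
        if run:
            # The phrase just ended: emit it as one token.
            merged.append(joiner.join(run))
        if label == outside:
            merged.append(token)
            run, run_label = [], outside
        else:
            # A new phrase starts here.
            run, run_label = [token], label
    if run:
        merged.append(joiner.join(run))
    return merged


# Hypothetical tagger output for one sentence.
tagged = [
    ("Talks", "O"), ("on", "O"),
    ("climate", "MISC"), ("change", "MISC"),
    ("at", "O"),
    ("Stanford", "ORGANIZATION"), ("University", "ORGANIZATION"),
]
print(merge_entities(tagged))
```

Because only label boundaries matter here, the categories can stay coarse (or even be a single label) if they are not needed in the output. One limitation of this simple scheme: two distinct phrases with the same label that sit directly next to each other would be merged into one token.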

There is, however, a choice as to whether to approach this in a "more unsupervised" way, where something like NP chunking and collocation statistics are used, or the fully supervised way of a straightforward CRF, where you're providing lots of annotated data of the kind of phrases you'd like to get out.
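For the "more unsupervised" route, the collocation-statistics part might look something like the sketch below, which scores adjacent word pairs by pointwise mutual information (PMI), one of several possible association measures. It assumes the text is already tokenized into lists of words and skips the NP-chunking filter entirely; the function name `pmi_bigrams` and the `min_count` threshold are choices made for this example.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    `sentences` is an iterable of token lists. Pairs seen fewer than
    `min_count` times are skipped, since PMI is unreliable for rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi          # probability of the pair
        p_x = unigrams[w1] / n_uni   # probability of the first word
        p_y = unigrams[w2] / n_uni   # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus, far too small for real use.
corpus = [
    "climate change is real".split(),
    "the climate change debate is old".split(),
    "the debate on climate change is the news".split(),
    "the news is the news".split(),
]
scores = pmi_bigrams(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

High-scoring pairs are candidates for fixed expressions; in practice the threshold and the choice of statistic need tuning on a much larger corpus, and restricting candidates to noun phrases first would cut down on noise.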
