I'm trying to design a somewhat unconventional NER system that marks certain multiword strings as single units/tokens.
There are a lot of cool NER tools out there, but I have a few special needs that make it pretty much impossible to use something straight out of the box:
First, the entities can't just be extracted and printed out in a list--they need to be marked in some way and consolidated into tokens.
Second, categorization is not important--Person/Organization/Location doesn't matter (at least in the output).
Third, these aren't just your typical ENAMEX named entities we're looking for. We want companies and organizations, but also concepts like 'climate change' and 'gay marriage.' I've seen tags like these on some tools out there, but all of them were 'extraction-style'.
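To make the first requirement concrete, here is a rough sketch of the kind of consolidation I have in mind. It assumes some upstream step has already produced entity spans as (start, end) token offsets with an exclusive end and no overlaps. The function name `merge_spans` and the underscore joiner are just placeholders I made up for illustration, not part of any existing tool.

```python
def merge_spans(tokens, spans, joiner="_"):
    """Collapse each (start, end) token span into a single token.

    Assumes spans use an exclusive end index and do not overlap.
    """
    # Map each span's start index to its end index for quick lookup.
    span_ends = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        end = span_ends.get(i)
        if end is not None and end > i:
            # Join the whole span into one token and skip past it.
            merged.append(joiner.join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical example sentence and hand-written spans.
tokens = "Acme Corp released a statement on climate change".split()
spans = [(0, 2), (6, 8)]
print(merge_spans(tokens, spans))
```

The point is that the entities stay in place in the token stream instead of being pulled out into a separate list, and no category label is attached to them.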
How would I go about getting this type of functionality? Would training the Stanford tagger on my own, hand-annotated dataset do the job (where 'climate change'-esque phrases are labeled MISC or something)? Or am I better off just making a shortlist of the 'weird' entities and checking the text against that after it's been run through a regular NER system?
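For the shortlist option, this is roughly what I imagine the lookup step would be: a naive longest-match scan over the tokens against a hand-made phrase list. It is only a sketch under my own assumptions (whitespace tokenization, case-insensitive matching, phrases of at most `max_len` tokens), and `find_shortlist_spans` is a made-up name, not something from the Stanford tools.

```python
def find_shortlist_spans(tokens, shortlist, max_len=4):
    """Return (start, end) spans of multiword shortlist phrases found in tokens.

    Matching is case-insensitive and prefers the longest phrase at each
    position. The end index is exclusive.
    """
    # Store each phrase as a tuple of lowercased tokens for fast lookup.
    phrases = {tuple(p.lower().split()) for p in shortlist}
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate first, down to two-token phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(lowered[i:i + n]) in phrases:
                match_end = i + n
                break
        if match_end is not None:
            spans.append((i, match_end))
            i = match_end
        else:
            i += 1
    return spans


# Hypothetical shortlist and sentence.
shortlist = ["climate change", "gay marriage"]
tokens = "Opinions on gay marriage and climate change differ".split()
print(find_shortlist_spans(tokens, shortlist))
```

Spans like these could then be combined with whatever the regular NER system finds before the tokens are consolidated. Real text would need proper tokenization so that punctuation attached to a word does not block a match.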
Thanks so much!