I am trying to extract named entities from text using Stanford-NER. I have read all related threads regarding chunking and did not find anything to solve the problem I am having.
Input:
The united nations is holding a meeting in the united states of America.
Expected Output:
united nations/organization
united states of America/location
I was able to get this output, but it doesn't combine tokens for multi-word named entities:
[('The', 'O'), ('united', 'ORGANIZATION'), ('nations', 'ORGANIZATION'), ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'), ('in', 'O'), ('the', 'O'), ('united', 'LOCATION'), ('states', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'), ('.', 'O')]
or in a tree format:
(S
The/O
united/ORGANIZATION
nations/ORGANIZATION
is/O
holding/O
a/O
meeting/O
in/O
the/O
united/LOCATION
states/LOCATION
of/LOCATION
America/LOCATION
./O)
I am looking for this output:
[('The', 'O'), ('united nations', 'ORGANIZATION'), ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'), ('in', 'O'), ('the', 'O'), ('united states of America', 'LOCATION'), ('.', 'O')]
When I tried some of the code I found in other threads to join named entities in the tree format, it returned an empty list.
import nltk
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os
java_path = r"C:\Program Files (x86)\Java\jre1.8.0_251/java.exe"  # raw string so backslashes are not treated as escapes
os.environ['JAVAHOME'] = java_path
st = StanfordNERTagger(r'stanford-ner-4.0.0/stanford-ner-4.0.0/classifiers/english.all.3class.distsim.crf.ser.gz',
r'stanford-ner-4.0.0/stanford-ner-4.0.0/stanford-ner.jar',
encoding='utf-8')
text = 'The united nations is holding a meeting in the united states of America.'
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
namedEnt = nltk.ne_chunk(classified_text, binary = True)
# This list comprehension returns an empty list
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.label() == "NE"]
print(np)
print(classified_text)
namedEnt
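For reference, the grouping I am after can be sketched without the tree at all. As far as I understand, nltk.ne_chunk is designed for part-of-speech-tagged tokens rather than NER-tagged ones, which may be why no "NE" subtrees are found. The sketch below is only an illustration, not a Stanford-NER or NLTK feature: merge_entities is a hypothetical helper name, and it assumes that adjacent tokens with the same non-'O' tag belong to one entity. The tagged list is hard-coded from the output above so the snippet runs without Java or the Stanford jar.

```python
from itertools import groupby

def merge_entities(tagged_tokens):
    """Join runs of adjacent tokens that share the same non-'O' tag.

    Hypothetical helper: assumes consecutive tokens with an identical
    entity tag form a single entity.
    """
    merged = []
    # groupby collects consecutive (word, tag) pairs that have the same tag
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == 'O':
            # Non-entity tokens stay as individual items
            merged.extend((word, tag) for word in words)
        else:
            # Entity tokens are joined into one multi-word item
            merged.append((' '.join(words), tag))
    return merged

# Tagger output copied from above
classified_text = [('The', 'O'), ('united', 'ORGANIZATION'), ('nations', 'ORGANIZATION'),
                   ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'),
                   ('in', 'O'), ('the', 'O'), ('united', 'LOCATION'),
                   ('states', 'LOCATION'), ('of', 'LOCATION'), ('America', 'LOCATION'),
                   ('.', 'O')]

print(merge_entities(classified_text))
```

One limitation of this approach: the 3-class model emits no begin/inside markers, so two different entities of the same type that sit directly next to each other would be merged into one.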