0
votes

I am using the OpenNLP Java API to tokenize sentences. It tokenizes on the space character, so every word is split into its own token.

Is there any way I can skip the splitting (tokenization) for some specific words?

For example, given the sentence "A quick brown fox jumping over the lazy dog", OpenNLP splits/tokenizes it as:

a
quick
brown
fox
jumping
over
the
lazy
dog

I want to skip tokenization for the phrases "quick brown fox" and "lazy dog", so the expected output would be:

a
quick brown fox
jumping
over
the
lazy dog

1
By what criteria do you want to determine whether to split or not? - qqilihq
I can get a list of words to skip from another method. Another way is to add tags around the words, like #skip# quick brown fox #skip#, #skip# lazy dog #skip#. - Abbas
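
The list-based idea from the comment above could be sketched as a post-processing step over the token array. This is only a minimal sketch, not part of OpenNLP: the class `PhraseMerger` and the method `mergePhrases` are hypothetical names, and a plain `split(" ")` stands in for the `String[]` that the OpenNLP tokenizer returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhraseMerger {

    /**
     * Re-joins consecutive tokens that together match one of the protected phrases.
     * Matching is case-insensitive and prefers the longest phrase at each position.
     */
    public static List<String> mergePhrases(String[] tokens, List<String> phrases) {
        // Pre-split each phrase so it can be compared token by token.
        List<String[]> phraseTokens = new ArrayList<>();
        for (String phrase : phrases) {
            phraseTokens.add(phrase.trim().split("\\s+"));
        }

        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            // Find the longest protected phrase that starts at position i.
            int bestLength = 0;
            for (String[] candidate : phraseTokens) {
                if (candidate.length > bestLength && matchesAt(tokens, i, candidate)) {
                    bestLength = candidate.length;
                }
            }
            if (bestLength > 0) {
                // Glue the matched tokens back together into a single token.
                merged.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + bestLength)));
                i += bestLength;
            } else {
                merged.add(tokens[i]);
                i++;
            }
        }
        return merged;
    }

    /** Returns true if candidate occurs in tokens starting at index start. */
    private static boolean matchesAt(String[] tokens, int start, String[] candidate) {
        if (start + candidate.length > tokens.length) {
            return false;
        }
        for (int k = 0; k < candidate.length; k++) {
            if (!tokens[start + k].equalsIgnoreCase(candidate[k])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the String[] returned by the OpenNLP tokenizer.
        String[] tokens = "A quick brown fox jumping over the lazy dog".split(" ");
        List<String> result = mergePhrases(tokens, Arrays.asList("quick brown fox", "lazy dog"));
        result.forEach(System.out::println);
    }
}
```

With the sample sentence this prints A, quick brown fox, jumping, over, the, lazy dog, one per line. Merging after tokenization keeps the tokenizer itself untouched, which is probably simpler than pre-marking the text with #skip# tags.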

1 Answer

0
votes

One thought, since it appears you want to skip noun phrases, is to use the sentence chunker (ChunkerME in the OpenNLP API) to identify noun phrases. You can pass the chunker the same tokens that you get back from the tokenizer, together with their POS tags, and then adjust your array of tokens based on the chunk type. Take a look at this:

How to identify PP-tags/NP-tags/VP-tags in openNLP chunker?
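
As a rough sketch of the "adjust your array of tokens" step: the OpenNLP chunker returns one tag per token in B-/I-/O form (for example B-NP, I-NP), so consecutive noun-phrase tokens can be glued back together. The class `ChunkMerger` and the method `mergeChunks` are hypothetical names, and the tag array below is written by hand for illustration rather than produced by a real model.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkMerger {

    /**
     * Joins each run of tokens tagged B-type, I-type, ... into a single token.
     * Tokens belonging to any other chunk type are kept as separate tokens.
     */
    public static List<String> mergeChunks(String[] tokens, String[] chunkTags, String type) {
        if (tokens.length != chunkTags.length) {
            throw new IllegalArgumentException("tokens and chunkTags must have the same length");
        }
        List<String> merged = new ArrayList<>();
        StringBuilder current = null; // chunk of the requested type being built, if any
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            if (tag.equals("B-" + type)) {
                // A new chunk begins: flush the previous one first.
                if (current != null) {
                    merged.add(current.toString());
                }
                current = new StringBuilder(tokens[i]);
            } else if (tag.equals("I-" + type) && current != null) {
                // Continuation of the chunk that is currently open.
                current.append(' ').append(tokens[i]);
            } else {
                // Any other tag closes the open chunk and is emitted on its own.
                if (current != null) {
                    merged.add(current.toString());
                    current = null;
                }
                merged.add(tokens[i]);
            }
        }
        if (current != null) {
            merged.add(current.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        String[] tokens = {"A", "quick", "brown", "fox", "jumping", "over", "the", "lazy", "dog"};
        // Hand-written tags for illustration; a real chunker model may tag differently.
        String[] tags = {"B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "I-NP"};
        mergeChunks(tokens, tags, "NP").forEach(System.out::println);
    }
}
```

Note that a noun-phrase chunk normally includes the determiner, so with these tags the result is "A quick brown fox" and "the lazy dog" rather than "quick brown fox" and "lazy dog". If the article has to stay a separate token, it would need to be split off again, for example by checking the POS tag of the first token in the chunk.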