4
votes

I want to use the synonym token filter in Elasticsearch for an index. I downloaded the Prolog version of WordNet 3.0 and found the wn_s.pl file, which Elasticsearch can understand. However, the file contains synonyms for all sorts of words and phrases, while I am only interested in supporting synonyms for nouns. Is there a way to extract just those entries?

1
Are you asking if there is a computer that can tell if a word is a noun or not? Could you post some examples... - ramseykhalaf
No, I am asking if there is a way to reduce the size of the file such that only nouns remain. For example, if I search universe (noun), results related to cosmos will be a part of the hits, but if I search study (verb), results that only have the word learn will not be a part of the hits. - flamecto
a sample of the code you're using would help! - arturomp

1 Answer

9
votes

Given that the format of wn_s.pl is as follows, with the fourth field giving the part of speech (n for noun, v for verb):

s(112947045,1,'usance',n,1,0).
s(200001742,1,'breathe',v,1,25).

A quick-and-dirty way of doing that is to run the following in your terminal, which keeps only the lines of that file that contain the string ',n,':

grep ",n," wn_s.pl > wn_s_nouns_only.pl

The file wn_s_nouns_only.pl will only have the entries that are marked as nouns.
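
If you want something a bit stricter than a substring match, here is a rough Python sketch that checks the part-of-speech field itself rather than looking for ',n,' anywhere on the line. It assumes each line has the shape s(synset_id,w_num,'word',ss_type,sense_number,tag_count). and that an apostrophe inside the quoted word is written as two apostrophes; both are my reading of the Prolog files, so check them against your copy. The names S_CLAUSE, is_noun_clause and filter_nouns are made up for this example.

```python
import re

# Assumed shape of one clause in wn_s.pl:
#   s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
# Inside the quoted word, a literal apostrophe is assumed to be doubled ('').
S_CLAUSE = re.compile(
    r"^s\((\d+),(\d+),'((?:[^']|'')*)',([nvasr]),(\d+),(\d+)\)\.\s*$"
)


def is_noun_clause(line):
    """Return True only if the line parses as an s/6 clause whose type field is 'n'."""
    match = S_CLAUSE.match(line)
    return match is not None and match.group(4) == "n"


def filter_nouns(src_path, dst_path):
    """Copy the noun clauses from src_path to dst_path and return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if is_noun_clause(line):
                dst.write(line)
                kept += 1
    return kept


# Example call (file names taken from the grep command above):
# filter_nouns("wn_s.pl", "wn_s_nouns_only.pl")
```

For the stock file the result should match what grep gives you; the difference only matters if a word or phrase itself happens to contain ',n,'.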