5 votes

I'm using Lucene to get the frequency of terms in documents, i.e. the number of occurrences of some term in each document. I use IndexReader.termDocs() for this purpose, and it works fine for single-word terms, but since all words are stored in the index separately, it doesn't work for multi-word terms.
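For reference, this is roughly what I do now for single-word terms with the pre-4.0 TermDocs API (the field name "content" and the helper method are just for illustration):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Single-word case: freq() gives the number of occurrences
    // of the term in each matching document.
    public class SingleTermFreq {
        public static void printFrequencies(IndexReader reader, String word) throws IOException {
            TermDocs termDocs = reader.termDocs(new Term("content", word));
            try {
                while (termDocs.next()) {
                    System.out.println("doc " + termDocs.doc() + ": " + termDocs.freq());
                }
            } finally {
                termDocs.close();
            }
        }
    }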

Example (taken from this question): I'm interested in the frequency of the term "basket-ball" (or even "basket ball"), but after tokenizing there will be two words, and I'll be able to get the frequency of the term "basket" and the term "ball", but not of the term "basket-ball".

I know all the multi-word terms I want to get the frequency for, and I'm not interested in storing the original text - only in getting statistics. So my first approach was to just concatenate the words of a term in the text before indexing. E.g. "I played basket ball yesterday" becomes "I played basketball yesterday" and "My favorite writer is Kurt Vonnegut" becomes "My favorite writer is KurtVonnegut". This works: the concatenated terms are treated like any other single word, so I can easily get their frequency. But this method is ugly and, more importantly, very slow, so I came up with another one.
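A minimal sketch of that pre-processing, assuming plain string replacement on the raw text before it reaches the analyzer (the phrase patterns here are only examples):

    // Hypothetical pre-processing step: rewrite known phrases in the raw text
    // before indexing. Slow because the whole text is scanned for every phrase.
    public class PhrasePreprocessor {
        public static String mergePhrases(String text) {
            return text
                    .replaceAll("(?i)basket[ -]ball", "basketball")
                    .replaceAll("(?i)kurt\\s+vonnegut", "KurtVonnegut");
        }

        public static void main(String[] args) {
            System.out.println(mergePhrases("I played basket ball yesterday"));
            // -> "I played basketball yesterday"
        }
    }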

My second approach is to write a special token filter that captures tokens and checks whether they are part of the terms to be replaced (something like the SynonymFilter from Lucene in Action). In our case, when the filter sees the word "basket", it reads one more token, and if it is "ball", the filter places one term ("basketball") instead of two ("basket" and "ball") in the output token stream. The advantage of this method over the previous one is that it matches complete words and doesn't scan the full text for substrings. In fact, most tokens have different lengths, so they will be discarded without even comparing a single letter. But such a filter isn't easy to write; moreover, I'm not sure it will be fast enough for my needs.
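A rough sketch of such a filter against the Lucene 3.x attribute-based API; the class name, the pair map, and the restriction to two-word phrases are my own simplifications:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.AttributeSource;

    // Merges known two-word phrases into one token. "pairs" maps a first word
    // to the second word that must follow it, e.g. {"basket" -> "ball"};
    // the merged token is simply the concatenation of the two words.
    public final class PhraseMergingFilter extends TokenFilter {

        private final Map<String, String> pairs;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private AttributeSource.State pendingState; // buffered look-ahead token, if any

        public PhraseMergingFilter(TokenStream input, Map<String, String> pairs) {
            super(input);
            this.pairs = pairs;
        }

        @Override
        public boolean incrementToken() throws IOException {
            // Emit the token buffered by a failed look-ahead on the previous call.
            // (This sketch does not re-check it as a possible phrase start.)
            if (pendingState != null) {
                restoreState(pendingState);
                pendingState = null;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String current = termAtt.toString();
            String expectedNext = pairs.get(current);
            if (expectedNext == null) {
                return true; // not the start of a known phrase
            }
            AttributeSource.State firstState = captureState();
            if (!input.incrementToken()) {
                restoreState(firstState);
                return true; // stream ended; emit the first word as-is
            }
            if (expectedNext.equals(termAtt.toString())) {
                // Match: keep the first token's attributes but replace its text
                // with the concatenated term (offsets of the second word are lost).
                restoreState(firstState);
                termAtt.setEmpty().append(current).append(expectedNext);
                return true;
            }
            // No match: buffer the look-ahead token and emit the first word now.
            pendingState = captureState();
            restoreState(firstState);
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pendingState = null;
        }
    }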

The third approach I can think of is to play around with the positions of the two words in the same documents. But most probably that would mean iterating through TermDocs (or rather TermPositions) at the time the frequency is requested, which costs much more than doing the work at indexing time.
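For completeness, here is roughly what that would look like with the pre-4.0 TermPositions API, counting how often "ball" appears directly after "basket" in a single document (field name and method name are assumptions):

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    // Counts occurrences of the phrase "basket ball" in one document
    // by intersecting the position lists of the two words.
    public class PhraseFreq {
        public static int phraseFreq(IndexReader reader, int docId) throws IOException {
            TermPositions first = reader.termPositions(new Term("content", "basket"));
            TermPositions second = reader.termPositions(new Term("content", "ball"));
            try {
                if (!first.skipTo(docId) || first.doc() != docId) return 0;
                if (!second.skipTo(docId) || second.doc() != docId) return 0;

                // Collect the positions of "ball" in this document.
                Set<Integer> ballPositions = new HashSet<Integer>();
                for (int i = 0; i < second.freq(); i++) {
                    ballPositions.add(second.nextPosition());
                }
                // Count the positions where "ball" directly follows "basket".
                int count = 0;
                for (int i = 0; i < first.freq(); i++) {
                    if (ballPositions.contains(first.nextPosition() + 1)) {
                        count++;
                    }
                }
                return count;
            } finally {
                first.close();
                second.close();
            }
        }
    }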

So, finally, my question is: is there a way to efficiently index multi-word terms and get their frequency in Lucene?

1 Answer

6 votes

Look up shingling. This indexes groups of terms. It's covered in the Solr 1.4 book, and here.

So if you have the string: "Basket ball started in the early 1900's."

You would get back all the individual terms indexed, but then also

"basket ball" "ball started" "started in" early 1900's" etc...

and through configuration, also

"basket ball started" "ball started in" "the early 1900's" etc...