I have a dump of university webpages (documents), and my goal is to use Wikipedia's term dictionary to find those terms in the documents. Ultimately, I need to calculate the document frequency of each Wikipedia term (term frequency within each document is not required).
The Wikipedia (multi-word) dictionary entries look like this:
<t id="34780065">Years of the 20th century in Mauritania</t>
<t id="34780066">1960 International Gold Cup</t>
<t id="34780067">Roman Lob songs</t>
I'm trying to use Lucene to achieve this.
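For reference, this is roughly how I load the dictionary entries into a normalised lookup set that both approaches below could share. It is only a sketch: the file name and the regex are placeholders, and the imports assume a fairly recent Lucene where `CharArraySet` lives in `org.apache.lucene.analysis` (the package differs in older versions).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.CharArraySet;

final class WikiDictionaryLoader {

    // Matches one <t id="...">term</t> entry per line.
    private static final Pattern ENTRY = Pattern.compile("<t id=\"(\\d+)\">(.+?)</t>");

    // Load every dictionary term, lower-cased, into a CharArraySet
    // so later lookups are cheap and case-insensitive.
    static CharArraySet load(String path) throws IOException {
        CharArraySet terms = new CharArraySet(1 << 20, true); // ignoreCase = true
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                terms.add(m.group(2).toLowerCase());
            }
        }
        return terms;
    }
}
```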
Approach 1: Use a ShingleAnalyzerWrapper to index word n-gram tokens from the documents (n-grams because the dictionary contains multi-word terms), then loop over the dictionary terms and look up each one's document frequency in the index.
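This is roughly what I mean, as an untested sketch (Lucene 5+-style API; the field name, index path, and the two `load...()` helpers are placeholders). The dictionary phrase has to be normalised the same way the shingles are built, otherwise `docFreq()` will not find it:

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ShingleApproach {

    public static void main(String[] args) throws IOException {
        // Keep stop words, otherwise dictionary phrases and shingles drift apart.
        Analyzer base = new StandardAnalyzer(CharArraySet.EMPTY_SET);
        // Index 1- to 5-word shingles; dictionary terms longer than 5 words will be missed.
        Analyzer shingles = new ShingleAnalyzerWrapper(base, 2, 5);

        try (Directory dir = FSDirectory.open(Paths.get("df-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(shingles))) {
            for (String pageText : loadUniversityPages()) {        // placeholder
                Document doc = new Document();
                doc.add(new TextField("contents", pageText, Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        try (Directory dir = FSDirectory.open(Paths.get("df-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            for (String wikiTerm : loadWikipediaTerms()) {         // placeholder
                String normalised = normalise(base, "contents", wikiTerm);
                int df = reader.docFreq(new Term("contents", normalised));
                System.out.println(wikiTerm + "\t" + df);
            }
        }
    }

    // Run the phrase through the same base analyzer and re-join the tokens
    // with a single space, which is ShingleFilter's default token separator.
    static String normalise(Analyzer base, String field, String phrase) throws IOException {
        List<String> parts = new ArrayList<>();
        try (TokenStream ts = base.tokenStream(field, phrase)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                parts.add(term.toString());
            }
            ts.end();
        }
        return String.join(" ", parts);
    }

    static List<String> loadUniversityPages() { /* placeholder */ return new ArrayList<>(); }
    static List<String> loadWikipediaTerms()  { /* placeholder */ return new ArrayList<>(); }
}
```

My worry here is that the index ends up holding every n-gram in every document, not just the ones that are dictionary terms.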
Approach 2: Following the technique suggested here, implement a custom Analyzer that consults the Wikipedia dictionary during tokenization, and then index the documents' token streams with that analyzer.
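This is roughly what I imagine the custom Analyzer would look like (again an untested sketch with Lucene 7+-style imports, reusing the `CharArraySet` built by the loader above): produce the shingles at analysis time, but keep only those that appear in the Wikipedia dictionary, so the index contains nothing but dictionary terms.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Analyzer that emits only those word n-grams (up to 5 words here)
// that occur in the Wikipedia dictionary.
final class WikiDictionaryAnalyzer extends Analyzer {

    private final CharArraySet dictionary; // normalised (lower-cased) Wikipedia terms

    WikiDictionaryAnalyzer(CharArraySet dictionary) {
        this.dictionary = dictionary;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        ShingleFilter shingles = new ShingleFilter(stream, 2, 5);
        shingles.setOutputUnigrams(true); // also match single-word dictionary terms
        return new TokenStreamComponents(source, new DictionaryFilter(shingles, dictionary));
    }

    // Drops every token that is not a dictionary entry.
    private static final class DictionaryFilter extends FilteringTokenFilter {
        private final CharArraySet dictionary;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        DictionaryFilter(TokenStream in, CharArraySet dictionary) {
            super(in);
            this.dictionary = dictionary;
        }

        @Override
        protected boolean accept() throws IOException {
            return dictionary.contains(termAtt.buffer(), 0, termAtt.length());
        }
    }
}
```

If that works, the index would contain only dictionary terms, so I could presumably just enumerate the indexed terms of the field and read the document frequency off each one, instead of issuing one lookup per dictionary entry.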
Question: Which of the two approaches is more efficient? And if I go with the second approach, how do I implement such a custom Analyzer correctly? I haven't found any good resource that explains this kind of implementation, so the sketch above is only my guess.