I have a dump of university webpages (documents), and my goal is to use Wikipedia's term dictionary to find those terms in the documents. Ultimately, I need to calculate the document frequency of each Wikipedia term (term frequency within each document is not required).
The Wikipedia (multi-word) dictionary entries look like this:
<t id="34780065">Years of the 20th century in Mauritania</t>
<t id="34780066">1960 International Gold Cup</t>
<t id="34780067">Roman Lob songs</t>
I'm trying to use Lucene to achieve this.
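For reference, this is roughly how I load the dictionary entries into a normalised lookup set that both approaches below could share. It is only a sketch: the file name and the regex are placeholders, and the imports assume a fairly recent Lucene where `CharArraySet` lives in `org.apache.lucene.analysis` (the package differs in older versions).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.CharArraySet;

final class WikiDictionaryLoader {

    // Matches one <t id="...">term</t> entry per line.
    private static final Pattern ENTRY = Pattern.compile("<t id=\"(\\d+)\">(.+?)</t>");

    // Load every dictionary term, lower-cased, into a CharArraySet
    // so later lookups are cheap and case-insensitive.
    static CharArraySet load(String path) throws IOException {
        CharArraySet terms = new CharArraySet(1 << 20, true); // ignoreCase = true
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                terms.add(m.group(2).toLowerCase());
            }
        }
        return terms;
    }
}
```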
Approach 1: Use a ShingleAnalyzerWrapper to index word n-gram tokens from the documents (n-grams because the dictionary contains multi-word terms), then loop over the dictionary terms and look up each one's document frequency in the index.
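This is roughly what I mean, as an untested sketch (Lucene 5+-style API; the field name, index path, and the two `load...()` helpers are placeholders). The dictionary phrase has to be normalised the same way the shingles are built, otherwise `docFreq()` will not find it:

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ShingleApproach {

    public static void main(String[] args) throws IOException {
        // Keep stop words, otherwise dictionary phrases and shingles drift apart.
        Analyzer base = new StandardAnalyzer(CharArraySet.EMPTY_SET);
        // Index 1- to 5-word shingles; dictionary terms longer than 5 words will be missed.
        Analyzer shingles = new ShingleAnalyzerWrapper(base, 2, 5);

        try (Directory dir = FSDirectory.open(Paths.get("df-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(shingles))) {
            for (String pageText : loadUniversityPages()) {        // placeholder
                Document doc = new Document();
                doc.add(new TextField("contents", pageText, Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        try (Directory dir = FSDirectory.open(Paths.get("df-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            for (String wikiTerm : loadWikipediaTerms()) {         // placeholder
                String normalised = normalise(base, "contents", wikiTerm);
                int df = reader.docFreq(new Term("contents", normalised));
                System.out.println(wikiTerm + "\t" + df);
            }
        }
    }

    // Run the phrase through the same base analyzer and re-join the tokens
    // with a single space, which is ShingleFilter's default token separator.
    static String normalise(Analyzer base, String field, String phrase) throws IOException {
        List<String> parts = new ArrayList<>();
        try (TokenStream ts = base.tokenStream(field, phrase)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                parts.add(term.toString());
            }
            ts.end();
        }
        return String.join(" ", parts);
    }

    static List<String> loadUniversityPages() { /* placeholder */ return new ArrayList<>(); }
    static List<String> loadWikipediaTerms()  { /* placeholder */ return new ArrayList<>(); }
}
```

My worry here is that the index ends up holding every n-gram in every document, not just the ones that are dictionary terms.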
Approach 2: Following the technique suggested here, implement a custom Analyzer that consults the Wikipedia dictionary during tokenization, and then index the documents' token streams with that analyzer.
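This is roughly what I imagine the custom Analyzer would look like (again an untested sketch with Lucene 7+-style imports, reusing the `CharArraySet` built by the loader above): produce the shingles at analysis time, but keep only those that appear in the Wikipedia dictionary, so the index contains nothing but dictionary terms.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Analyzer that emits only those word n-grams (up to 5 words here)
// that occur in the Wikipedia dictionary.
final class WikiDictionaryAnalyzer extends Analyzer {

    private final CharArraySet dictionary; // normalised (lower-cased) Wikipedia terms

    WikiDictionaryAnalyzer(CharArraySet dictionary) {
        this.dictionary = dictionary;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        ShingleFilter shingles = new ShingleFilter(stream, 2, 5);
        shingles.setOutputUnigrams(true); // also match single-word dictionary terms
        return new TokenStreamComponents(source, new DictionaryFilter(shingles, dictionary));
    }

    // Drops every token that is not a dictionary entry.
    private static final class DictionaryFilter extends FilteringTokenFilter {
        private final CharArraySet dictionary;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        DictionaryFilter(TokenStream in, CharArraySet dictionary) {
            super(in);
            this.dictionary = dictionary;
        }

        @Override
        protected boolean accept() throws IOException {
            return dictionary.contains(termAtt.buffer(), 0, termAtt.length());
        }
    }
}
```

If that works, the index would contain only dictionary terms, so I could presumably just enumerate the indexed terms of the field and read the document frequency off each one, instead of issuing one lookup per dictionary entry.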
Question: Which of the two approaches is more efficient? And if I go with the second approach, how do I implement such a custom Analyzer correctly? I haven't found any good resource that explains this kind of implementation, so the sketch above is only my guess.