1
votes

I'm a Lucene newbie and am thinking of using it to index the words in the title and description elements of RSS feeds so that I can record counts of the most popular words in the feeds.

Various search options are needed, some will have keywords entered manually by users, whereas in other cases popular terms would be generated automatically by the system. So I could have Lucene use query strings to return the counts of hits for manually entered keywords and TermEnums in automated cases?

The system also needs to be able to handle new data from the feeds as they are polled at regular intervals.

Now, I could do much / all of this using hashmaps in Java to work out counts, but if I use Lucene, my question concerns the best way to store the words for counting. To take a single RSS feed, is it wise to have Lucene create a temporary index in memory, and pass the words and hit counts out so other programs can write them to database?

Or is it better to create a Lucene document per feed and add new feed data to it at polling time? So that if a keyword count is required between dates x and y, Lucene can return the values? This implies I can datestamp Lucene entries which I'm not sure of yet.

Hope this makes sense.

Mr Morgan.

3

3 Answers

2
votes

From the description you have given in the question, I think Lucene alone will be sufficient. (No need of MySQL or Solr). Lucene API is also easy to use and you won't need to change your frontend code.

From every RSS feed, you can create a Document having three fields; namely title, description and date. The date must preferably be a NumericField. You can then append every document to the lucene index as the feeds arrive.

How frequently do you want the system to automatically generate the popular terms? For eg. Do you want to show the users, "most popular terms last week", etc.? If so, then you can use the NumericRangeFilter to efficiently search the date field you have stored. Once you get the documents satisfying a date range, you can then find the document frequency of each term in the retrieved documents to find the most popular terms. (Do not forget to remove the stopwords from your documents (say by using the StopAnalyzer) or else the most popular terms will be the stopwords)

0
votes

I can recommend you check out Apache Solr. In a nutshell, Solr is a web enabled front end to Lucene that simplifies integration and also provides value added features. Specifically, the Data Import Handlers make updating/adding new content to your Lucene index really simple.

Further, for the word counting feature you are asking about, Solr has a concept of "faceting" which will exactly fit the problem you're are describing.

If you're already familiar with web applications, I would definitely consider it: http://lucene.apache.org/solr/

0
votes

Solr is definitely the way to go although I would caution against using it with Apache Tomcat on Windows as the install process is a bloody nightmare. More than happy to guide you through it if you like as I have it working perfectly now.

You might also consider the full text indexing capabilities of MySQL, far easier the Lucene.

Regards