1
votes

I am using Lucene to index my documents. In my case, each document is fairly small, but there are a lot of them (~2GB in total). Each document also contains many words or terms that repeat across documents. I am wondering whether indexing them directly with Lucene is the right approach, or whether I should preprocess the documents before indexing.

Here are a couple of examples of my documents (each column is a field; the first row holds the field names, and each row from the second onward is one document):

ID     category     track     keywords
id1    cat1         track1    mode=heat treatment;repeat=true;Note=This is an apple
id2    cat1         track2    mode=cold treatment;repeat=true;Note=This is an orange

I want to index all documents, search on the 3 fields (category, track and keywords), and return the unique ID of each match (for example, id1).

If I index this directly, will the repeating terms affect search performance? Do you have any suggestions on how I should do the indexing and searching? Thanks a lot in advance.


1 Answer

3
votes

Repeated terms may affect search performance: a term that occurs in many documents forces the scorer to consider all of those documents. If you have terms that do not discriminate well between documents, I suggest preprocessing the documents to remove them. However, you may want to start by indexing everything for a sample of 10,000-20,000 documents and see how you fare with regard to relevance and performance.
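One way to find such non-discriminating terms on a sample is to count, for each term, how many documents it appears in, and flag the terms that appear in more than some share of them. The following is a minimal sketch using only the Java standard library; the class name `CommonTermFilter`, the method `commonTerms`, the `maxDocRatio` threshold and the simple `\W+` tokenization are all assumptions made for illustration, not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: finds terms that occur in a large share of the documents.
final class CommonTermFilter {

    /** Returns the terms that occur in more than maxDocRatio of the given documents. */
    static Set<String> commonTerms(List<String> docs, double maxDocRatio) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Assumed tokenization: lowercase, split on non-word characters.
            String[] tokens = doc.toLowerCase(Locale.ROOT).split("\\W+");
            // A set ensures each term is counted once per document.
            Set<String> seen = new HashSet<>(Arrays.asList(tokens));
            for (String term : seen) {
                if (!term.isEmpty()) {
                    docFreq.merge(term, 1, Integer::sum);
                }
            }
        }
        Set<String> common = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > maxDocRatio * docs.size()) {
                common.add(e.getKey());
            }
        }
        return common;
    }
}
```

With the two sample rows from the question and a ratio of 0.5, terms such as "treatment" and "repeat" would be flagged because they occur in both documents, while "apple" and "orange" would not. The resulting set could then be dropped during preprocessing or used as a stop-word list, depending on how the sample index performs.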

From the way you describe this, you will need to index the category, track and keywords fields, maybe using a KeywordAnalyzer for the category and track fields so that each value is indexed as a single token. Since you only return the ID, the id field is the only one you need to store. You may want a custom analyzer for the keywords field, or alternatively you can preprocess it before the actual indexing.
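As an illustration of the preprocessing option, here is a minimal sketch that splits a keywords value such as `mode=heat treatment;repeat=true;Note=This is an apple` into key/value pairs before indexing. It deliberately uses no Lucene API; the class name `KeywordFieldParser`, the lowercasing of keys and the decision to skip malformed entries are assumptions, and the `;` and `=` delimiters are taken from the sample rows in the question.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns "k1=v1;k2=v2" into an ordered map of key -> value.
final class KeywordFieldParser {

    /** Parses a keywords string; entries without '=' or with an empty key are skipped. */
    static Map<String, String> parse(String raw) {
        Map<String, String> pairs = new LinkedHashMap<>();
        if (raw == null) {
            return pairs;
        }
        for (String part : raw.split(";")) {
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue; // no '=' in this entry
            }
            // Keys are normalized to lowercase (an assumption, so "Note" becomes "note").
            String key = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = part.substring(eq + 1).trim();
            if (!key.isEmpty()) {
                pairs.put(key, value);
            }
        }
        return pairs;
    }
}
```

Each pair could then be indexed as its own field (for instance, an exact-match field for `mode` and `repeat`, and an analyzed text field for `note`), which keeps the repeated key names out of the searchable text. Whether that split is worthwhile depends on the kind of queries you need to run.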