I am using Lucene to index my documents. In my case, each document is fairly small, but there are a great many of them (~2GB in total), and each document contains many repeated words or terms. I am wondering whether indexing them directly with Lucene is the right approach, or whether I should preprocess the documents before indexing.
Here are a couple of example documents (each column is a field; the first row gives the field names, and each subsequent row is one document):
ID   category  track   keywords
id1  cat1      track1  mode=heat treatment;repeat=true;Note=This is an apple
id2  cat1      track2  mode=cold treatment;repeat=true;Note=This is an orange
I want to index all documents, search across the 3 fields (category, track and keywords), and return the unique ID of each matching document (e.g. id1).
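To make the preprocessing question concrete, here is a minimal sketch of one option I am considering: splitting the keywords value on ';' and '=' so that each key (mode, repeat, Note) could be indexed as its own field. This is only an assumption on my part, and I do not know whether it is actually needed. The class and method names (KeywordSplitter, parse) are hypothetical, and the code uses only the Java standard library, with no Lucene calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: splits a raw "keywords" value
// such as "mode=heat treatment;repeat=true;Note=This is an apple"
// into key/value pairs that could later be indexed as separate fields.
class KeywordSplitter {

    static Map<String, String> parse(String keywords) {
        Map<String, String> pairs = new LinkedHashMap<>(); // keeps the original order
        for (String part : keywords.split(";")) {
            int eq = part.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty entries and entries without a key
            }
            String key = part.substring(0, eq).trim();
            String value = part.substring(eq + 1).trim();
            if (!key.isEmpty()) {
                pairs.put(key, value);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> pairs =
            parse("mode=heat treatment;repeat=true;Note=This is an apple");
        // Print each key with its value, one pair per line.
        pairs.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Each resulting pair could then be added to the document under its own field name, instead of indexing the whole keywords string as one field.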
If I index this data directly, will the repeated terms affect search performance? Do you have any suggestions on how I should do the indexing and searching? Thanks a lot in advance.