Indexing texts with many numbers in Lucene

Question

Is it OK to create a term for each number in a text? Example text:

I got 2295910 unique terms.

The numbers can be timestamps, port numbers, anything. The unique numbers lead to a very large number of unique terms. It does not feel right to have the same number of unique terms as documents. Lucene memory usage grows with the number of unique terms.

Is there a special analyzer or a trick for texts with numbers? The StandardAnalyzer creates a term for each unique number.

The needs:

The numbers should remain searchable.
There could be multiple numbers in a document.
Memory usage is the issue: I have 800M documents in multiple index directories, and the memory pressure forces me to close the least recently used IndexSearchers.

Untested ideas:

Use a special analyzer. It would split the numbers into chunks. 123456 would become "123 456". The query parser would use a phrase search to find a number.
Change Lucene code to use a bigger termInfosIndexDivisor when seeing numeric terms.
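
To make the chunking idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name NumberChunker, the method names, and the chunk size of 3 are assumptions for illustration only. In a real index this logic would presumably sit inside a custom analyzer or token filter.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: splits digit runs into fixed-size chunks.
final class NumberChunker {
    private static final int CHUNK_SIZE = 3; // assumed chunk width
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // "123456" -> "123 456"; a trailing short chunk is kept ("1234567" -> "123 456 7").
    static String chunkDigits(String digits) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < digits.length(); i += CHUNK_SIZE) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(digits, i, Math.min(i + CHUNK_SIZE, digits.length()));
        }
        return out.toString();
    }

    // Rewrites every digit run in the text; non-numeric text is left untouched.
    // Uses Matcher.replaceAll(Function), which requires Java 9 or later.
    static String chunkNumbersInText(String text) {
        return DIGITS.matcher(text).replaceAll(m -> chunkDigits(m.group()));
    }
}
```

With a chunk size of 3 there can be at most 1110 distinct numeric terms (1000 three-digit chunks, 100 two-digit, 10 one-digit), however many distinct numbers the documents contain. The trade-off is precision: a phrase query for "123 456" would also match a document containing 1234567 or 999123456, so some extra marker for the start and end of a number would probably be needed.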

Maybe I'm reinventing the wheel. Was it solved by somebody already?

bajafresh4life · Accepted Answer · 2011-01-19T14:33:27

Are you currently having a memory problem? It is true that Lucene memory usage grows with the number of unique terms, but it's still a relatively minuscule amount of memory even for indices that have a lot of terms.

If memory is an issue and you've profiled your code to ensure that it is indeed Lucene that is the problem, you can create another Analyzer that throws away numeric terms. If you do that, obviously, you won't be able to search for documents using numbers.
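
As a rough sketch of the filtering logic, assuming the text has already been tokenized: the class and method names below are hypothetical, and the code is plain Java rather than the Lucene API. In an actual Analyzer the same check would presumably go into a custom TokenFilter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: drops tokens that consist only of digits.
final class NumericTokenDropper {

    // True when the token is non-empty and every character is a digit.
    static boolean isNumeric(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns a new list containing only the non-numeric tokens, in order.
    static List<String> dropNumeric(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isNumeric(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Mixed tokens such as "eth0" are kept, since only all-digit tokens are discarded here.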
