Is it OK to create a term for each number in a text? With my data I ended up with 2295910 unique terms.
The numbers can be timestamps, port numbers, anything, so the unique numbers lead to a very large number of unique terms. It does not feel right to have roughly as many unique terms as documents, and Lucene's memory usage grows with the number of unique terms.
Is there a special analyzer or a trick for texts with numbers? The StandardAnalyzer creates a term for each unique number.
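For illustration, here is a minimal way to dump the terms that StandardAnalyzer produces, assuming a Lucene 3.x-style API; the field name "body" and the sample text are made up:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class DumpTokens {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        String text = "connection from port 52431 at 1367507912";  // made-up sample text
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // every distinct number in the corpus becomes its own term like this
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}
```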
The requirements:
- The numbers must remain searchable, and a document can contain multiple numbers.
- Memory usage is the real problem: I have 800M documents spread across multiple index directories, and the memory pressure forces me to close the least recently used IndexSearchers.
Untested ideas:
- Use a special analyzer that splits numbers into chunks, so 123456 would be indexed as "123 456"; the query parser would then use a phrase search to find the full number (a rough sketch follows after this list).
- Change the Lucene code to use a bigger termInfosIndexDivisor when it encounters numeric terms.
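For the first idea, here is a minimal sketch of a chunking filter, assuming the Lucene 3.x/4.x attribute API; the class name NumberChunkFilter and the chunk size of 3 digits are made up:

```java
import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Splits purely numeric tokens into fixed-size chunks, e.g. "123456" -> "123", "456",
 * so that a phrase query over the chunks matches the original number.
 * Hypothetical sketch, not part of Lucene itself.
 */
public final class NumberChunkFilter extends TokenFilter {
    private final int chunkSize;
    private final Queue<String> pending = new LinkedList<String>();
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

    public NumberChunkFilter(TokenStream input, int chunkSize) {
        super(input);
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Emit any chunks left over from the previously split number first.
        // Offsets are not adjusted, so all chunks keep the offsets of the original token.
        if (!pending.isEmpty()) {
            termAtt.setEmpty().append(pending.poll());
            posIncrAtt.setPositionIncrement(1); // consecutive positions so phrase queries work
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.length() > chunkSize && isAllDigits(term)) {
            // Split the number, emit the first chunk now, queue the rest.
            for (int i = 0; i < term.length(); i += chunkSize) {
                pending.add(term.substring(i, Math.min(i + chunkSize, term.length())));
            }
            termAtt.setEmpty().append(pending.poll());
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }

    private static boolean isAllDigits(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) return false;
        }
        return true;
    }
}
```

The same analyzer would have to be applied at query time; whether the query parser automatically turns the chunks into a phrase query depends on the Lucene version, so setAutoGeneratePhraseQueries(true) on the QueryParser may be needed.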
Maybe I'm reinventing the wheel. Has somebody solved this already?