1
votes

I am trying to build a little file and email search engine. I'd also like to use more advanced search queries for the full-text search, hence I am looking at Lucene indexes. From what I have seen, there are two approaches: node_auto_index and apoc.index.addNode.

Setting the index up works fine, and indexing nodes with small properties works. When trying to index nodes with properties that are larger than 32k, Neo4j fails (and gets into an unusable state).

The error message boils down to:

WARNING: Failed to invoke procedure apoc.index.addNode: Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text_e" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101, 111, 32, 110, 101]...', original message: bytes can be at most 32766 in length; got 40000
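
To make the message easier to read, the byte values in the "prefix of the first immense term" can be decoded back into text. The sketch below is only an illustration: the byte list is copied from the start of the prefix above, and `exceeds_term_limit` is a hypothetical helper name, not part of Neo4j or APOC. The decoded prefix contains spaces, which suggests the whole property value reached Lucene as a single term rather than being tokenized into words.

```python
# Byte values copied from the start of the "immense term" prefix in the error.
prefix_bytes = [110, 101, 111, 32, 110, 101, 111, 32, 110, 101]

# Decode the raw UTF-8 bytes back into readable text.
prefix_text = bytes(prefix_bytes).decode("utf-8")
print(repr(prefix_text))

# Per-term limit quoted in the error message (measured in UTF-8 bytes,
# not in characters).
MAX_TERM_BYTES = 32766


def exceeds_term_limit(value: str) -> bool:
    """Hypothetical helper: True if value is too long to be one Lucene term."""
    return len(value.encode("utf-8")) > MAX_TERM_BYTES
```

Because the limit is counted in bytes, a value with non-ASCII characters can exceed it with fewer than 32766 characters.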

I have checked this on Neo4j 3.1.2 and 3.1.0, with APOC 3.1.0.3.

A much longer description of the problem can be found at https://baach.de/Members/jhb/neo4j-full-text-indexing.

Is there any way to fix this? E.g. have I done anything wrong, or is there something to configure?

Thx a lot!

1
Just a quick update: it's not Bolt, Python, or Cypher. It breaks the same way when using the REST API as well :-( baach.de/Members/jhb/neo4j-full-text-indexing#section-5 - Joerg Baach
If you detect such fields, could you just split them? - Michael Hunger
@Michael Hunger: thanks for your suggestion. I am afraid this wouldn't work for a number of queries, e.g. a proximity search "jakarta apache"~10, phrase search etc. It would also mess up relevance (because this depends on document frequency) etc. But I take your question as confirmation that it really breaks at 32k? - Joerg Baach

1 Answer

3
votes

Neo4j does not support index values that are longer than ~32k because of a limitation in the underlying Lucene library. For some details around that area you can look at: https://github.com/neo4j/neo4j/pull/6213 and https://github.com/neo4j/neo4j/pull/8404. You need to split such longer values into multiple terms.
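
One possible way to do the splitting on the client side, before the value is sent to Neo4j, is sketched below. This is a minimal sketch under stated assumptions, not an official Neo4j or APOC recipe: `split_for_index` is a hypothetical helper name, the limit of 32766 bytes is taken from the error message in the question, and breaking at whitespace is just one reasonable choice.

```python
from typing import List

# Per-term limit in UTF-8 bytes, as quoted in the error message (assumption:
# the same limit applies to every indexed value).
MAX_TERM_BYTES = 32766


def split_for_index(text: str, max_bytes: int = MAX_TERM_BYTES) -> List[str]:
    """Hypothetical helper: split text at whitespace into chunks whose
    UTF-8 encoding is at most max_bytes long."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        if word_size > max_bytes:
            raise ValueError("a single word exceeds the term limit")
        # Account for the joining space when the chunk already has words.
        extra = word_size + (1 if current else 0)
        if current_size + extra > max_bytes:
            # Close the current chunk and start a new one with this word.
            chunks.append(" ".join(current))
            current = [word]
            current_size = word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be stored and indexed as its own property or node. As noted in the comments on the question, phrase and proximity queries will not match across chunk boundaries, and relevance scoring may change, so this is a workaround rather than a full fix.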