I have worked on a project where full-text search speed was boosted by replacing ElasticSearch with Lucene + Hazelcast.
What could be the reasons for ElasticSearch's overhead compared to Lucene + Hazelcast? Which ElasticSearch settings can cause a significant slowdown on the same resources?
Arguments given for Lucene + Hazelcast:
- ElasticSearch has a big overhead over Lucene
- Lucene is more flexible in indexing than ElasticSearch
My considerations:
- Which overheads? As far as I know, you can make ElasticSearch communicate through its internal TCP API instead of REST. Are there any other overheads? Are they only about replication (you can turn off replication for the initial load)? Or about automatic index merging? Maybe ElasticSearch merged the indexes automatically and made them so big that they no longer fit in the FS cache?
- Why is the Lucene API more flexible? AFAIK, ElasticSearch has all the same indexes plus additional features like parent-child or nested objects, and those are not even needed for this project. (See the indexing/querying schema.)
Lucene + Hazelcast indexing/querying schema:
- You have 100 to 10,000 huge string files compressed as AVRO in HDFS (gigabytes or even terabytes of data in total). You need to index them so that you can find all files containing a specific string.
- Submit index task with Hazelcast to each cluster node
- Each index task uses IndexWriter to write a separate index for each node, working only with the local file system. This means each AVRO file forms one index per node. Each file row is a separate StringField
- After indexing has finished on all nodes, the indexes never change, so there is no write load anymore. The number of indexes equals the number of files. The files are pretty big and there are not that many of them, so no index merging happens.
- Search with a simple Term query, specifying the paths to all indexes where the data may be present.
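
To make the schema above concrete, here is a minimal sketch of its shape using only the JDK. It is not the project's code and uses neither Lucene nor Hazelcast: a local thread pool stands in for submitting index tasks to cluster nodes, and a plain set of whole rows stands in for a per-file index of StringField values (which are matched by exact value, not tokenized). The class name PerFileIndexSketch, the method names, and the file names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the described schema; not Lucene or Hazelcast code.
final class PerFileIndexSketch {

    // Builds one read-only "index" per file. Each row is stored whole, which
    // mimics exact-value matching. The thread pool is an assumption standing in
    // for index tasks submitted to each cluster node.
    static Map<String, Set<String>> buildIndexes(Map<String, List<String>> files, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            Map<String, Future<Set<String>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> file : files.entrySet()) {
                final List<String> rows = file.getValue();
                Callable<Set<String>> task =
                        () -> Collections.unmodifiableSet(new HashSet<>(rows));
                pending.put(file.getKey(), pool.submit(task));
            }
            // Wait for every task; after this point the indexes never change.
            Map<String, Set<String>> indexes = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Set<String>>> entry : pending.entrySet()) {
                indexes.put(entry.getKey(), entry.getValue().get());
            }
            return indexes;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("indexing interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("indexing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Exact-match lookup across the candidate indexes only; returns the names
    // of the files whose index contains the term, in candidate order.
    static List<String> search(Map<String, Set<String>> indexes,
                               Collection<String> candidates, String term) {
        List<String> hits = new ArrayList<>();
        for (String name : candidates) {
            Set<String> index = indexes.get(name);
            if (index != null && index.contains(term)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("part-0001.avro", Arrays.asList("alpha", "beta"));
        files.put("part-0002.avro", Arrays.asList("beta", "gamma"));

        Map<String, Set<String>> indexes = buildIndexes(files, 2);
        // prints [part-0001.avro, part-0002.avro]
        System.out.println(search(indexes, indexes.keySet(), "beta"));
    }
}
```

The point of the sketch is that the search path touches only read-only, per-file structures and does nothing but an exact lookup in each candidate index, so there is no coordination, replication, or merging work at query time.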