3
votes

I have experience working on a project where full-text search speed was boosted by replacing ElasticSearch with Lucene + Hazelcast.

What could be the reasons for ElasticSearch's overhead compared with Lucene + Hazelcast? Which ElasticSearch settings could cause a significant slowdown on the same resources?

Arguments given for Lucene + Hazelcast

  1. ElasticSearch has a big overhead over Lucene
  2. Lucene is more flexible in indexing than ElasticSearch

My considerations

  1. Which overheads? As far as I know, you can hack ElasticSearch to communicate with it through the internal TCP API instead of REST. Any other overheads? Are they only about replication (you can turn off replication for the initial load)? Or about index auto-merging? Maybe ElasticSearch tried to merge indexes automatically and made them so big that they don't fit in the FS cache?
  2. Why is the Lucene API more flexible? AFAIK, ElasticSearch has all the same indexes plus additional features like parent-child or nested objects, although those are not needed for this project. (See the indexing/querying schema.)
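
To make the replication point concrete, here is a minimal sketch of the index settings that are commonly changed around a one-off bulk load. It only builds the JSON bodies one would send to the update index settings endpoint (`PUT /<index>/_settings`); the class and method names are hypothetical, and whether these settings explain the slowdown in this project is an assumption that would need measuring.

```java
// Hypothetical helper: builds request bodies only, it does not talk to a cluster.
final class BulkLoadSettingsSketch {

    // Before the initial load: no replicas and no periodic refresh,
    // so the cluster does not copy or reopen segments while indexing.
    static String beforeLoad() {
        return "{\"index\":{\"number_of_replicas\":0,\"refresh_interval\":\"-1\"}}";
    }

    // After the load: restore the desired replica count and a 1s refresh.
    static String afterLoad(int replicas) {
        return "{\"index\":{\"number_of_replicas\":" + replicas
                + ",\"refresh_interval\":\"1s\"}}";
    }

    public static void main(String[] args) {
        System.out.println(beforeLoad());
        System.out.println(afterLoad(1));
    }
}
```

Since the indexes here never change after loading, a single `POST /<index>/_forcemerge?max_num_segments=1` once indexing is finished may also be worth testing, so that merges do not run in the background while searching.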

Lucene + Hazelcast indexing/querying schema:

  1. You have 100-10,000 huge string files compressed as AVRO in HDFS (gigabytes or even terabytes of data in total). You need to index them so that you can find all files containing a specific string.
  2. Submit an index task with Hazelcast to each cluster node.
  3. Each index task uses IndexWriter to write a separate index on each node, working only with the local file system. This means each AVRO file forms one index per node. Each file row is a separate StringField.
  4. After indexing has finished on all nodes, the indexes are never changed, so there is no write load anymore. The number of indexes equals the number of files. The files are pretty big and there are not that many of them, so the indexes are not merged.
  5. Search with a simple Term query, specifying the paths to all indexes where the data may be present.
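
The schema above can be sketched as a toy model. This is not the project's code and does not use Lucene or Hazelcast: plain Java sets stand in for the per-file indexes, a local thread pool stands in for Hazelcast task distribution, and all class, method, and file names are hypothetical. It does mirror one property of StringField: the value is indexed as a single untokenized term, so a Term query only matches a whole row exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model: one exact-match "index" per source file, built in parallel tasks.
final class PerFileIndexSketch {

    // File name -> set of exact row values (stand-in for one Lucene index).
    private final Map<String, Set<String>> indexByFile = new TreeMap<>();

    // Steps 2-3: one indexing task per file; the pool plays the role of the nodes.
    void indexAll(Map<String, List<String>> rowsByFile) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // assumed node count
        try {
            Map<String, Future<Set<String>>> pending = new TreeMap<>();
            for (Map.Entry<String, List<String>> e : rowsByFile.entrySet()) {
                final List<String> rows = e.getValue();
                Callable<Set<String>> task = () -> new HashSet<>(rows);
                pending.put(e.getKey(), pool.submit(task));
            }
            // Step 4: once every task is done, the indexes are never written again.
            for (Map.Entry<String, Future<Set<String>>> e : pending.entrySet()) {
                indexByFile.put(e.getKey(), e.getValue().get());
            }
        } catch (Exception ex) {
            throw new IllegalStateException("indexing failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    // Step 5: exact-match lookup, restricted to the indexes the caller names.
    List<String> filesContaining(String row, List<String> candidateFiles) {
        List<String> hits = new ArrayList<>();
        for (String file : candidateFiles) {
            Set<String> index = indexByFile.get(file);
            if (index != null && index.contains(row)) {
                hits.add(file);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rowsByFile = new TreeMap<>();
        rowsByFile.put("a.avro", Arrays.asList("alpha", "beta"));
        rowsByFile.put("b.avro", Arrays.asList("beta", "gamma"));

        PerFileIndexSketch sketch = new PerFileIndexSketch();
        sketch.indexAll(rowsByFile);
        System.out.println(sketch.filesContaining("beta", Arrays.asList("a.avro", "b.avro")));
    }
}
```

Note that a lookup for `"bet"` returns nothing in this model, because only whole rows are matched, just as with an untokenized field.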
1
I'm voting to close this question as off-topic because it's not a question about programming at all. - P.J.Meisch
@P.J.Meisch I disagree with you. This question is about understanding the internals of Lucene and ElasticSearch, the differences between them, and ElasticSearch server configuration. This is part of common programming tasks that almost everyone has faced. - VB_
There are so many knobs you can tune in ES that it is hard to compare one solution with the other without some cold hard numbers. You're talking about "overhead", but we have no idea of the magnitude. Also, we have no idea how much effort you put into tuning the performance of ES. In my opinion, this question is way too open-ended and lacks the concrete numbers needed to get any meaningful answer. - Val
You write: "Each file row is a separate StringField". So what is your UnitOfRetrieval? Do you update/overwrite the UoR? Do you use replication, and do you ensure consistency in your application? - Karsten R.

1 Answer

1
votes

My reasons for using ES in this case would be

  • Future needs of the project to explore the data in more ways

  • Feature-rich Aggregations API

  • Support for indexing using Spark / Hive etc. This is very easy to do, and the data can be pre-processed efficiently.

  • Auto scaling / adjusting the number of replicas based on demand

and, of course, not having to maintain a codebase that does all of this. This thread would make a good discussion if you could add what flexibility you expect on your end.