I know Elasticsearch is built on Apache Lucene, but I want to know the significant differences between the two.
4 Answers
Lucene is a Java library. You can include it in your project and call its functions directly from your code.
Elasticsearch is a JSON-based, distributed web server built on top of Lucene. Although Lucene does the actual work underneath, Elasticsearch gives us a convenient layer over it. Each shard created in Elasticsearch is a separate Lucene instance. To summarize:
- Elasticsearch is built on top of Lucene and provides a JSON-based REST API for accessing Lucene features.
- Elasticsearch provides a distributed system on top of Lucene. A distributed system is not something Lucene is aware of or built for; Elasticsearch provides that abstraction.
- Elasticsearch provides other supporting features such as thread pools, queues, node/cluster monitoring APIs, data monitoring APIs, cluster management, etc.
In addition to @Vineeth Mohan's answer:
High Availability: Elasticsearch is distributed and manages data replication, keeping multiple copies of your data in the cluster. This enables high availability.
Powerful Query DSL: Elasticsearch offers a JSON interface for reading and writing queries on top of Lucene, so you can write complex queries without knowing Lucene syntax.
Schemaless (Schema-Free): Fields (name/value pairs) do not have to be defined in a schema beforehand. When you index data, Elasticsearch can create the schema automatically at runtime, like magic.
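To make the Query DSL point concrete, here is a minimal sketch of what such a JSON query body can look like, next to roughly the same intent written as a Lucene query string. The field name `title` and the search terms are made up for illustration, and the snippet only builds and prints the JSON; it does not contact a cluster.

```python
import json

# Hypothetical field name ("title") and terms, chosen for illustration only.
# A bool query: documents must match "cake" and must not match "spaghetti".
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "cake"}}],
            "must_not": [{"match": {"title": "spaghetti"}}],
        }
    }
}

# Roughly the same intent in Lucene's query-string syntax.
lucene_query_string = "+title:cake -title:spaghetti"

# This JSON is what would go in the body of a search request.
body = json.dumps(query, indent=2)
print(body)
print(lucene_query_string)
```

In practice a body like this would be sent to an index's `_search` endpoint over HTTP; the nested JSON tends to stay readable as queries grow, which is harder with a single query string.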
I'll answer from a usage perspective.
Lucene is a search engine library. You'd want to use it to build your own search engine: either a new Elasticsearch or Solr competitor or something narrow for your use-case (e.g. text analysis).
Elasticsearch is a search engine. Most people use it for log aggregation, product search, or a variant of these two (e.g. social media analysis or finding relevant people for some search criteria). It's built on top of Lucene, so it exposes most (though not all) of its features. It also adds a lot on top, most significantly:
- REST API
- query DSL
- distributed system (sharding, replication, cluster management)
- facets/aggregations
- additional features for common usage (e.g. ingest processing) and management (APIs for monitoring its relevant metrics, backup and restore, etc)
I'll add another angle to the discussion.
Elasticsearch index vs. Lucene index
An Elasticsearch index is a collection of documents, much as a database in the relational world is a collection of tables.
To scale, we spread Elasticsearch indices across multiple physical nodes (servers).
To do that, we break each Elasticsearch index into smaller units called shards.
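As a rough illustration of splitting an index into shards, the sketch below assigns each document ID to one of a fixed number of shards by hashing it. This is a simplification under stated assumptions: `pick_shard` and `distribute` are hypothetical helpers, and CRC32 is only a stand-in for the hash Elasticsearch actually applies to the routing value.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_primary_shards)."""
    # CRC32 is a stand-in hash; Elasticsearch uses its own hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they hash to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


shards = distribute([f"doc_id_{i}" for i in range(1, 13)], 3)
for number, ids in shards.items():
    print(number, ids)
```

Because the shard is derived from the ID alone, the same document always lands on the same shard, so a lookup by ID only needs to visit one shard.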
Question: How is this related to a Lucene index?
If we want to search for a specific term (for example: "Cake" or "Cookie"), we'll have to go over each shard and look for it (let's put aside how shards are located and replicated across nodes).
Scanning every document this way would take a long time, so we need an efficient data structure for the search. This is where Lucene's index comes into play.
Each Elasticsearch shard is based on the Lucene index structure and stores statistics about terms in order to make term-based search more efficient.
(!) This is quite confusing because of the word "index": an Elasticsearch shard is a portion of an Elasticsearch index, BUT it is based on the data structure of a Lucene index.
Bonus - Lucene's index as an inverted index
As can be seen in the example below, Lucene's index stores the original document’s content plus additional information, such as a term dictionary and term frequencies, which make searching more efficient:
| Term | Documents | Frequency |
|---|---|---|
| Cake | doc_id_1, doc_id_8 | 4 (2 in doc_id_1, 2 in doc_id_8) |
| Cookie | doc_id_1, doc_id_6 | 3 (2 in doc_id_1, 1 in doc_id_6) |
| Spaghetti | doc_id_12 | 1 (1 in doc_id_12) |
Lucene's index belongs to the family of indexes known as inverted indexes. This is because it can list, for a term, the documents that contain it.
This is the inverse of the natural relationship, in which documents list terms.
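A toy version of the idea, not how Lucene stores things on disk: the sketch below builds a term-to-documents map with per-document counts. The document texts are invented so that the counts match the Cake/Cookie/Spaghetti example above, and `build_inverted_index` and `search` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Toy documents invented to reproduce the counts in the example above.
documents = {
    "doc_id_1": "cake cookie cake cookie",
    "doc_id_6": "cookie",
    "doc_id_8": "cake cake",
    "doc_id_12": "spaghetti",
}


def build_inverted_index(docs):
    """Return {term: {doc_id: occurrences of the term in that document}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return dict(index)


index = build_inverted_index(documents)


def search(term):
    """Look the term up directly instead of scanning every document."""
    return sorted(index.get(term.lower(), {}))


for term, postings in sorted(index.items()):
    print(term, postings, "total:", sum(postings.values()))
```

Here a search for a term is a single dictionary lookup, whereas without the index every document would have to be read and tokenized on each query.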
(Reminder) How did we get from a shard to a term?
(1) A shard is a directory of files that contains documents.
(2) A document is a sequence of fields.
(3) A field is a named sequence of terms.
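The three steps above can be sketched as a small in-memory model. This only mirrors the shard → document → field → term nesting; the class names are hypothetical and a real shard is a set of files on disk, not Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Field:
    name: str
    terms: List[str]  # a field is a named sequence of terms


@dataclass
class Document:
    fields: List[Field]  # a document is a sequence of fields


@dataclass
class Shard:
    # A shard contains documents, keyed here by a document ID.
    documents: Dict[str, Document] = field(default_factory=dict)

    def terms(self) -> Iterator[Tuple[str, str, str]]:
        """Yield (doc_id, field_name, term) for every term in the shard."""
        for doc_id, doc in self.documents.items():
            for f in doc.fields:
                for term in f.terms:
                    yield doc_id, f.name, term


shard = Shard({
    "doc_id_1": Document([
        Field("title", ["cake"]),
        Field("body", ["cake", "cookie"]),
    ]),
})

for row in shard.terms():
    print(row)
```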