Using ElasticSearch Bulk to update and create documents dynamically?

Question

I'm currently using elasticsearch and running a cron job every 10 minutes that will find newly created/updated data from my DB and sync it with elasticsearch. However, I want to use bulk to sync instead of making and arbitrary amount of requests to update/create documents in an index. I'm using the elasticsearch.js library created by elasticsearch.

I face 2 challenges that I'm uncertain about how to handle:

How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.

It depends on how you identify a document for the update/insert... Also, for bulk updates or inserts if it doesn't exist there is the upsert option: elastic.co/guide/en/elasticsearch/reference/current/…. All depends on how you associate a DB record with an ES documents... — Andrei Stefan

pickypg pickypg · Accepted Answer · 2016-06-06T19:37:04

The best option when trying to stream in data from an SQL database is to use Logstash's JDBC Input to do it for you (the documentation). This can hopefully just do it all for you.

Not all SQL schemes make this easy, so for your specific questions:

How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.

Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:

index
create
update
delete

The first, index, is the most commonly used option. It means that you want to index (the verb) something to the Elasticsearch index (the noun). However, if it already exists in the index given the same _id, then it will replace it. The rest are probably a bit more obvious.

Each one of the sub-requests behaves like the individual option that they're associated with (so update is an UpdateRequest under the hood, delete is a DeleteRequest, and index is an IndexRequest). In the case of create, it is a specialization of index, which effectively says "add this if it doesn't exist, but fail it if is does exist".

How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.

You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.

Using ElasticSearch Bulk to update and create documents dynamically?

1 Answers