The best option when trying to stream in data from an SQL database is to use Logstash's JDBC Input to do it for you (the documentation). This can hopefully just do it all for you.
Not all SQL schemes make this easy, so for your specific questions:
How to use bulk to update a document if it exists and create a document if it doesn't within bulk without knowing if it exists in the index.
Bulk currently accepts four different types of sub-requests, which behave differently than you probably expect coming from an SQL world:
index
create
update
delete
The first, index
, is the most commonly used option. It means that you want to index
(the verb) something to the Elasticsearch index (the noun). However, if it already exists in the index given the same _id
, then it will replace it. The rest are probably a bit more obvious.
Each one of the sub-requests behaves like the individual option that they're associated with (so update
is an UpdateRequest
under the hood, delete
is a DeleteRequest
, and index
is an IndexRequest
). In the case of create
, it is a specialization of index
, which effectively says "add this if it doesn't exist, but fail it if is does exist".
How to format a large amount of JSON to run through bulk to update/create the document because bulk api expects the body to be formatted a certain way.
You should look into using either the Logstash approach or any of the existing client language libraries, such as the Python client, which should work well from cron. The clients will take care of the formatting for you. One for your preferred language most likely already exists.