0
votes

We have a Spark cluster running under MemSQL, with several pipelines running on it. The ETL setup is as follows:

  1. Extract: Spark reads messages from the Kafka cluster (using MemSQL Kafka-Zookeeper)
  2. Transform: We have a custom jar deployed for this step
  3. Load: Data from the Transform stage is loaded into a columnstore table

I have the following questions:

What happens to a message polled from Kafka if the job fails in the Transform stage? Does MemSQL take care of loading that message again, or is the data lost?

If the data is lost, how can I solve this problem? Are there any configuration changes that need to be made?

2 Answers

0
votes

As it stands, at-least-once semantics are not available in MemSQL Ops. They are on the roadmap and will be present in a future release of Ops.

0
votes

If you haven't yet, you should check out MemSQL 5.5 Pipelines: http://blog.memsql.com/pipelines/

This one isn't based on Spark (and transforms are done a bit differently, so you might have to rewrite your code), but we have native Kafka streams now.

The way we get exactly-once with the native version is simple: we store the offsets in the database in the same atomic transaction as the actual data. If something fails and the transaction isn't committed, the offsets aren't committed either, so we naturally and automatically retry that partition-offset range.