I am currently deploying a Spark / Kafka / Cassandra application and I'm torn between different solutions, so I'm here to ask for your advice.
I have a long-running Spark Streaming application that consumes Avro messages from Kafka. Depending on the nature of the message, I apply different processing and finally save a record in Cassandra — so, a fairly basic use case for these technologies.
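Roughly, the streaming job looks like this (a simplified sketch: I'm assuming the spark-streaming-kafka-0-10 integration and the DataStax spark-cassandra-connector; `Event`, `decodeAvro`, and all topic/keyspace/table names are placeholders):

```scala
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import com.datastax.spark.connector.streaming._

// Placeholder record type; the real job decodes Avro into something like this.
case class Event(kind: String, payload: String)

object StreamingJob {
  // Placeholder Avro decoding; the real code would use a SpecificDatumReader.
  def decodeAvro(bytes: Array[Byte]): Event = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("avro-kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption

    val ssc = new StreamingContext(conf, Seconds(10)) // assumption: 10 s micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092", // assumption
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[ByteArrayDeserializer],
      "group.id"           -> "streaming-job"
    )

    KafkaUtils
      .createDirectStream[String, Array[Byte]](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, Array[Byte]](Seq("events"), kafkaParams))
      .map(record => decodeAvro(record.value()))
      .map {
        // Branch on the nature of the message (hypothetical cases).
        case Event("typeA", p) => Event("typeA", p.toUpperCase)
        case other             => other
      }
      .saveToCassandra("my_keyspace", "events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```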
I have a second job, a Spark batch job: it reads some data from Cassandra, applies some transformations, and so on. I haven't yet settled on its frequency, but it will run somewhere between once per hour and once per day — so, typically, a big batch job.
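Schematically, the batch job is something like this (again a sketch assuming the spark-cassandra-connector; the keyspace, tables, and the aggregation itself are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object BatchJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hourly-batch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption
    val sc = new SparkContext(conf)

    // Read what the streaming job wrote, aggregate, write the result back.
    sc.cassandraTable("my_keyspace", "events")           // placeholder table
      .map(row => (row.getString("kind"), 1L))
      .reduceByKey(_ + _)
      .saveToCassandra("my_keyspace", "kind_totals",     // placeholder table
        SomeColumns("kind", "total"))

    sc.stop()
  }
}
```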
So I'm looking for the best practice for running this batch job. Since the Spark Streaming job takes all the resources in the cluster while running, I see two solutions:
Solution 1: include the Spark batch logic in a Spark Streaming "micro" batch with an interval of one hour, for example (see the first sketch after this list for what I mean).
Pro: easy to do, and it optimizes resource allocation.
Cons: not very clean, and the micro-batch interval becomes huge (what is Spark's behaviour in this case?).

Solution 2: keep some resources reserved in the cluster for the batch Spark job (second sketch below).
Pro: clean separation.
Cons: resource allocation is not optimized, since some cores will sit idle most of the time.
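To make solution 1 concrete, here is one way I imagine it: run the batch logic as its own streaming application whose only DStream is a dummy trigger, with a one-hour batch interval (a sketch; `runBatchLogic` and the empty trigger RDD are placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object HourlyTrigger {
  // Sketch of solution 1: the dummy stream fires once per hour and only
  // serves as a trigger for the batch logic.
  def scheduleHourlyBatch(sc: SparkContext): Unit = {
    val ssc = new StreamingContext(sc, Minutes(60))
    val trigger = new ConstantInputDStream(ssc, sc.emptyRDD[Unit])
    trigger.foreachRDD { _ =>
      runBatchLogic(sc) // placeholder for the Cassandra batch job above
    }
    ssc.start()
    ssc.awaitTermination()
  }

  def runBatchLogic(sc: SparkContext): Unit = ??? // placeholder
}
```

For solution 2, I would simply cap what each application may take from the cluster, e.g. in standalone mode (the numbers are made up):

```scala
import org.apache.spark.SparkConf

// Sketch of solution 2: cap each application's cores so the streaming job
// and the batch job can run side by side.
val streamingConf = new SparkConf()
  .setAppName("streaming-job")
  .set("spark.cores.max", "12") // assumption: leave 4 cores free for the batch

val batchConf = new SparkConf()
  .setAppName("hourly-batch")
  .set("spark.cores.max", "4")
```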
So I'm really interested in your advice and in any experience you have had with similar cases.