So I started learning Spark and Cassandra a month ago. I have this problem where I need to pre-aggregate my sensor data using Spark and then sink it to Cassandra tables.
Here's my application flow:
Sensor Data -> Kafka -> Spark Structured Streaming -> Sink to Cassandra
The thing is, I need to aggregate the data per second, minute, hour, day, and month, all the way up to per year. That leads me to create more than 90 aggregation tables in Cassandra.
As I progressed, I discovered that I have to sink each aggregate to its own Cassandra table using one writeStream query per aggregate, which leads me to a bulky Spark job with 90+ writeStream queries in it. Is that normal, or at least 'okay' for Spark?
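To make the pattern concrete, here is a minimal sketch of what one of those per-aggregate queries looks like, assuming the DataStax Spark Cassandra Connector and Spark 2.4+'s foreachBatch; the host, topic, keyspace, and table names are just placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder
  .appName("sensor-aggregates")
  .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
  .getOrCreate()
import spark.implicits._

// Schema of one sensor event coming in from Kafka (simplified).
val eventSchema = StructType(Seq(
  StructField("sensor", StringType),
  StructField("event_time", TimestampType)
))

// Raw events: one row per hit, keyed by sensor and event time.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092") // placeholder
  .option("subscribe", "sensor-events")            // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select($"e.sensor", $"e.event_time")

// One aggregate = one streaming query. E.g. the per-second hit count:
val perSecond = events
  .withWatermark("event_time", "10 minutes")
  .groupBy($"sensor", window($"event_time", "1 second"))
  .agg(count("*").as("hit"))
  .select(
    $"sensor",
    year($"window.start").as("year"),
    month($"window.start").as("month"),
    dayofmonth($"window.start").as("day"),
    hour($"window.start").as("hour"),
    minute($"window.start").as("minute"),
    second($"window.start").as("second"),
    $"hit")

val perSecondQuery = perSecond.writeStream
  .outputMode("update")
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // Cassandra writes are upserts by primary key, so re-emitted
    // aggregates simply overwrite the previous value for that key.
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "metrics", "table" -> "event_count_per_second")) // placeholders
      .mode("append")
      .save()
  }
  .start()

// ...and this whole block gets repeated ~90 times, once per metric x granularity.
```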
Any help appreciated, thanks!!
Edit. Example:
I have sensors that detect attacks on a network. I have this kind of aggregation:
- Event count for each sensor per second/minute/hour/day/month/year
Example per-second aggregate:

Sensor  year  month  day  hour  minute  second  hit
S1      2018  8      12   3     22      45      98182
S1      2018  8      12   3     22      46      992814
...
Example per-minute aggregate:

Sensor  year  month  day  hour  minute  hit
S1      2018  8      12   3     22      212458182
S1      2018  8      12   3     23      5523192814
And this applies to the rest of the metrics (9 metrics in total), with each metric having roughly 12 aggregate tables ...
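For reference, assuming the same `events` stream as in the sketch above, the per-minute aggregate would only differ by the window size and the dropped second column, which is why every extra granularity turns into yet another query and yet another table (keyspace/table names are still placeholders):

```scala
// Per-minute hit count: same pattern as per-second, just a wider window.
val perMinute = events
  .withWatermark("event_time", "10 minutes")
  .groupBy($"sensor", window($"event_time", "1 minute"))
  .agg(count("*").as("hit"))
  .select(
    $"sensor",
    year($"window.start").as("year"),
    month($"window.start").as("month"),
    dayofmonth($"window.start").as("day"),
    hour($"window.start").as("hour"),
    minute($"window.start").as("minute"),
    $"hit")

perMinute.writeStream
  .outputMode("update")
  .foreachBatch { (batch: org.apache.spark.sql.DataFrame, _: Long) =>
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "metrics", "table" -> "event_count_per_minute")) // placeholder
      .mode("append")
      .save()
  }
  .start()
```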