1 vote

So I started learning Spark and Cassandra a month ago. I have a problem where I need to pre-aggregate my sensor data using Spark and then sink it to Cassandra tables.

Here's my application flow:

Sensor Data -> Kafka -> Spark Structured Streaming -> Sink to Cassandra

The thing is, I need to aggregate the data per second, minute, hour, day, and month, all the way up to per year. That leads me to create more than 90 aggregation tables in Cassandra.

As I progressed, I discovered that I have to sink each aggregate to its own Cassandra table using one writeStream query per aggregate, which leads to a bulky Spark job with 90+ writeStream queries in it. Is that normal, or at least 'okay' for Spark?
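
To make the pattern concrete, here is a minimal sketch of one aggregate and its writeStream query, assuming a Kafka topic with JSON payloads, the Spark Cassandra Connector, and placeholder names for the broker, topic, keyspace, table, and checkpoint path:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("sensor-aggregates").getOrCreate()
import spark.implicits._

// Assumed JSON payload shape: {"sensor": "S1", "event_time": "2018-08-12T03:22:45Z"}
val eventSchema = new StructType()
  .add("sensor", StringType)
  .add("event_time", TimestampType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "sensor-events")                // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select("e.*")

// One aggregate (per-second hit count per sensor, windowed) ...
val perSecondHits = events
  .withWatermark("event_time", "1 minute")
  .groupBy($"sensor", window($"event_time", "1 second"))
  .agg(count("*").as("hit"))

// ... and one writeStream query that sinks it into its own Cassandra table.
val perSecondQuery = perSecondHits.writeStream
  .outputMode("update")
  .option("checkpointLocation", "/checkpoints/event_count_per_second")   // placeholder path
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // In practice the window column would be expanded to match the target table's columns.
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "metrics", "table" -> "event_count_per_second"))   // placeholders
      .mode("append")
      .save()
  }
  .start()

Every additional aggregation table means another block like the last one, which is how the job grows to 90+ queries.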

Any help is appreciated, thanks!

Edit. Example:

I have a sensor that detects attacks on a network. I have this kind of aggregation: event count for each sensor per second/minute/hour/day/month/year.

Example per-second aggregate:

Sensor  year  month    day  hour   minute   second  hit
S1      2018  8        12   3      22       45      98182
S1      2018  8        12   3      22       46      992814
...

Example per-minute aggregate:

Sensor  year  month    day  hour   minute    hit
S1      2018  8        12   3      22        212458182
S1      2018  8        12   3      23        5523192814

And this applies to the rest of the metrics (9 metrics total), with each metric having roughly 12 aggregate tables ...
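
For reference, here is a sketch of how the first two aggregates could be computed from the parsed events stream in the earlier sketch, with the grouping columns matching the tables above:

import org.apache.spark.sql.functions._

// `events` is the parsed streaming DataFrame from the sketch above,
// with columns (sensor: String, event_time: Timestamp).

// Per-second hit count, keyed exactly like the per-second table above.
val perSecond = events
  .groupBy(
    col("sensor"),
    year(col("event_time")).as("year"),
    month(col("event_time")).as("month"),
    dayofmonth(col("event_time")).as("day"),
    hour(col("event_time")).as("hour"),
    minute(col("event_time")).as("minute"),
    second(col("event_time")).as("second"))
  .agg(count("*").as("hit"))

// Per-minute hit count: the same grouping minus the second column.
val perMinute = events
  .groupBy(
    col("sensor"),
    year(col("event_time")).as("year"),
    month(col("event_time")).as("month"),
    dayofmonth(col("event_time")).as("day"),
    hour(col("event_time")).as("hour"),
    minute(col("event_time")).as("minute"))
  .agg(count("*").as("hit"))

// Note: grouping on derived calendar columns (instead of window()) keeps the
// output schema aligned with the Cassandra tables, but streaming state for
// these groups is never evicted by a watermark, so state grows over time.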


2 Answers

0 votes

That's a very general question, and it really depends on how you implement it. But generally, if you need to write to 90 tables, you can't really avoid 90 writeStreams, and that should be okay. It does depend on OOO issues, though.

Anyway, if it works, it works.
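
As a rough sketch of what that can look like (assuming a spark session and a hypothetical aggregatesByTable map from Cassandra table name to its aggregated streaming DataFrame): start every query, then wait on all of them.

import org.apache.spark.sql.DataFrame

// aggregatesByTable: Map[String, DataFrame], e.g. "event_count_per_second" -> perSecond
val queries = aggregatesByTable.map { case (table, df) =>
  df.writeStream
    .outputMode("update")
    .option("checkpointLocation", s"/checkpoints/$table")   // one checkpoint dir per query
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "metrics", "table" -> table))   // placeholder keyspace
        .mode("append")
        .save()
    }
    .start()
}

// Block the driver until any query stops or fails.
spark.streams.awaitAnyTermination()

Each query keeps its own state and checkpoint, so the main practical concern as the query count grows is memory and checkpoint overhead.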

0 votes

That depends on what type of aggregations you are doing. If you can give us an example or two, it would be helpful.