
I am an experienced RDBMS developer and admin, but I am new to Apache Cassandra and Spark. I learned Cassandra's CQL, and the documentation says that CQL does not support joins and sub-queries because they would be too inefficient given Cassandra's distributed data model.

So I concluded that in a distributed data environment, joins and sub-queries are not supported because they would hurt performance badly.

But then I learned Spark, which also works with distributed data, yet Spark supports all SQL features, including joins and sub-queries, even though Spark is not a database system and does not even have indexes. So my question is: how does Spark support joins and sub-queries on distributed data, and does it do so efficiently?

Thanks in advance.


2 Answers


Spark does the "hard work" required to join distributed data. It performs large shuffles to align data on keys before actually performing joins. This means that any join requires a large amount of data movement over the network, unless the original data sources are already partitioned on the keys used for joining.
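For illustration, here is a minimal sketch of such a shuffle join using the Spark RDD API. The data sets, app name, and the local[*] master are made up for the example:

    import org.apache.spark.sql.SparkSession

    object JoinExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-join-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Two small key-value RDDs; on a real cluster each would be
        // spread across many partitions on many machines.
        val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
        val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (2, "lamp")))

        // join() hash-partitions both RDDs on the key, shuffling records
        // over the network so matching keys land in the same partition.
        val joined = users.join(orders) // RDD[(Int, (String, String))]
        joined.collect().foreach(println)

        spark.stop()
      }
    }

Note that if users and orders had been partitioned with the same partitioner beforehand, the join could avoid most of this data movement.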

C* does not allow generic joins like this because of the cost involved; it is geared towards OLTP workloads, and requiring a full data shuffle is inherently OLAP.


Apache Spark has a concept of an RDD (Resilient Distributed Dataset), which is created in memory.

It is basically the fundamental data structure in Spark.

Joins and queries are performed on these RDDs, and because Spark operates on them in memory, it is very efficient.
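As a minimal sketch of that idea (the example data is made up): an RDD persisted in memory is reused by later operations instead of being recomputed from the source each time.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("rdd-cache-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Build an RDD and keep its partitions in memory so later
    // operations reuse them rather than recomputing from scratch.
    val events = sc.parallelize(Seq((1, "login"), (2, "click"), (1, "logout")))
      .persist(StorageLevel.MEMORY_ONLY)

    // Both of these operations reuse the cached partitions.
    println(events.count())
    events.groupByKey().collect().foreach(println)

    spark.stop()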

Please go through the docs below to get some background on Resilient Distributed Datasets:

http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds