
I am an experienced RDBMS developer and admin, but I am new to Apache Cassandra and Spark. I learned Cassandra's CQL, and the documentation says that CQL does not support joins and sub-queries because they would be too inefficient given Cassandra's distributed data model.

So I concluded that in a distributed data environment, joins and sub-queries are not supported because they would hurt performance badly.

But then I learned Spark, which also works with distributed data, yet Spark supports all SQL features, including joins and sub-queries, even though Spark is not a database system and does not even have indexes. So my question is: how does Spark support joins and sub-queries on distributed data, and does it do so efficiently?

Thanks in advance.


2 Answers


Spark does the "hard work" required to join distributed data. It performs large shuffles to align data on keys before actually performing joins. This means that any join requires a large amount of data movement over the network, unless the original data sources are already partitioned on the keys used for joining.
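For illustration, here is a minimal sketch of such a shuffle join using the Spark RDD API. The data sets, app name, and the local[*] master are made up for the example:

    import org.apache.spark.sql.SparkSession

    object JoinExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-join-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Two small key-value RDDs; on a real cluster each would be
        // spread across many partitions on many machines.
        val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
        val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (2, "lamp")))

        // join() hash-partitions both RDDs on the key, shuffling records
        // over the network so matching keys land in the same partition.
        val joined = users.join(orders) // RDD[(Int, (String, String))]
        joined.collect().foreach(println)

        spark.stop()
      }
    }

Note that if users and orders had been partitioned with the same partitioner beforehand, the join could avoid most of this data movement.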

C* does not allow generic joins like this because of the cost involved; it is geared towards OLTP workloads, and requiring a full data shuffle is inherently OLAP.


Apache Spark has a concept of an RDD (Resilient Distributed Dataset), which is created in memory.

It is basically the fundamental data structure in Spark.

Joins and queries are performed on these RDDs, and because Spark operates on them in memory, it is very efficient.
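As a minimal sketch of that idea (the example data is made up): an RDD persisted in memory is reused by later operations instead of being recomputed from the source each time.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("rdd-cache-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Build an RDD and keep its partitions in memory so later
    // operations reuse them rather than recomputing from scratch.
    val events = sc.parallelize(Seq((1, "login"), (2, "click"), (1, "logout")))
      .persist(StorageLevel.MEMORY_ONLY)

    // Both of these operations reuse the cached partitions.
    println(events.count())
    events.groupByKey().collect().foreach(println)

    spark.stop()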

Please go through the docs below to get some background on Resilient Distributed Datasets:

http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds