We'd like to query data from Cassandra via Spark SQL. The problem is that the data is stored in Cassandra as a UDT. The UDT's structure is deeply nested and contains variable-length arrays, so it would be very difficult to decompose the data into a flat structure. I couldn't find any working example of how to query such UDTs via Spark SQL, especially how to filter the results based on UDT values.
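To make the question concrete, here is the kind of query we need to express, as a minimal sketch. The keyspace `ks`, table `events`, and all field names inside the UDT are invented for illustration. As far as we can tell, the spark-cassandra-connector maps a UDT column to a Spark `StructType`, so nested scalar fields can be addressed with dot notation, while arrays need an `explode`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object UdtQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udt-query-sketch")
      .config("spark.cassandra.connection.host", "127.0.0.1") // host is an assumption
      .getOrCreate()

    // Read the Cassandra table; the UDT column arrives as a StructType.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()

    // Filter on a scalar field nested inside the UDT via a dot path.
    val filtered = events.filter(col("payload.header.source") === "sensor-42")

    // Variable-length arrays inside the UDT must be exploded into one row
    // per element before their fields can be filtered.
    val flattened = filtered
      .withColumn("m", explode(col("payload.measurements")))
      .filter(col("m.value") > 100)

    flattened.show()
  }
}
```

Even if this works, our understanding is that such nested predicates are evaluated in Spark rather than pushed down to Cassandra, so every query ends up scanning the whole table.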
Alternatively, could you suggest a different ETL pipeline (query engine, storage engine, ...) that would be more suitable for our use case?
Our ETL pipeline:
Kafka (duplicated events) -> Spark streaming -> Cassandra (deduplication to store only latest event) <- Spark SQL <- analytics platform (UI)
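The deduplication step relies on Cassandra's upsert semantics: with the event ID as the table's primary key, a later write for the same ID simply overwrites the earlier row. A rough sketch of the streaming write, with the keyspace, table, and key column (`event_id`) invented for illustration:

```scala
import org.apache.spark.sql.DataFrame

// Writes each micro-batch to Cassandra. Because event_id is the table's
// primary key, Cassandra upserts: a later event with the same ID
// overwrites the earlier row (last write wins).
object DedupWriteSketch {
  def writeStream(events: DataFrame): Unit = {
    events.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "ks", "table" -> "events"))
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()
  }
}
```

Strictly speaking, last-write-wins is decided by write timestamp, so out-of-order events would additionally need the write timestamp derived from the event time; the sketch leaves that out.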
Solutions we've tried so far:
1) Kafka -> Spark -> Parquet <- Apache Drill
Everything worked well; we could query and filter arrays and nested data structures.
Problem: we couldn't deduplicate the data, i.e., rewrite the Parquet files with the latest events.
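The deduplication itself is easy to express as a dedup-on-read query (keep the newest row per ID with a window function), but since Parquet files are immutable, it would have to be re-run over the whole dataset on every query instead of being resolved at write time. A sketch, assuming hypothetical `event_id` and `event_time` columns:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Dedup-on-read: rank rows within each event_id by recency and keep rank 1.
// This works, but every query pays for a full scan and shuffle, which is
// why we want the store itself to keep only the latest event.
def latestPerId(events: DataFrame): DataFrame = {
  val byRecency = Window.partitionBy("event_id").orderBy(col("event_time").desc)
  events
    .withColumn("rn", row_number().over(byRecency))
    .filter(col("rn") === 1)
    .drop("rn")
}
```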
2) Kafka -> Spark -> Cassandra <- Presto
This solved the deduplication problem from 1).
Problem: Presto doesn't support UDTs (presto doc, presto issue)
Our main requirements are:
- support for data deduplication. We may receive many events with the same ID and need to store only the latest one.
- storing deeply nested data structures with arrays
- distributed storage, scalable for future expansion
- a distributed query engine with SQL-like query support (for connecting Zeppelin, Tableau, Qlik, ...). Queries don't have to run in real time; a delay of a few minutes is acceptable.
- support for schema evolution (Avro-style)
Thank you for any suggestions.