We'd like to query data from Cassandra via Spark SQL. The problem is that the data is stored in Cassandra as a UDT. The UDT's structure is deeply nested and contains variable-length arrays, so it would be very difficult to decompose the data into a flat structure. I couldn't find any working example of how to query such UDTs via Spark SQL, especially how to filter the results based on UDT values.
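To make the question concrete, here is the kind of query we need to express, as a minimal sketch. The keyspace `ks`, table `events`, and all field names inside the UDT are invented for illustration. As far as we can tell, the spark-cassandra-connector maps a UDT column to a Spark `StructType`, so nested scalar fields can be addressed with dot notation, while arrays need an `explode`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object UdtQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udt-query-sketch")
      .config("spark.cassandra.connection.host", "127.0.0.1") // host is an assumption
      .getOrCreate()

    // Read the Cassandra table; the UDT column arrives as a StructType.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()

    // Filter on a scalar field nested inside the UDT via a dot path.
    val filtered = events.filter(col("payload.header.source") === "sensor-42")

    // Variable-length arrays inside the UDT must be exploded into one row
    // per element before their fields can be filtered.
    val flattened = filtered
      .withColumn("m", explode(col("payload.measurements")))
      .filter(col("m.value") > 100)

    flattened.show()
  }
}
```

Even if this works, our understanding is that such nested predicates are evaluated in Spark rather than pushed down to Cassandra, so every query ends up scanning the whole table.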
Alternatively, could you suggest a different ETL pipeline (query engine, storage engine, ...) that would be more suitable for our use case?
Our ETL pipeline:
Kafka (duplicated events) -> Spark streaming -> Cassandra (deduplication to store only latest event) <- Spark SQL <- analytics platform (UI)
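The deduplication step relies on Cassandra's upsert semantics: with the event ID as the table's primary key, a later write for the same ID simply overwrites the earlier row. A rough sketch of the streaming write, with the keyspace, table, and key column (`event_id`) invented for illustration:

```scala
import org.apache.spark.sql.DataFrame

// Writes each micro-batch to Cassandra. Because event_id is the table's
// primary key, Cassandra upserts: a later event with the same ID
// overwrites the earlier row (last write wins).
object DedupWriteSketch {
  def writeStream(events: DataFrame): Unit = {
    events.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "ks", "table" -> "events"))
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()
  }
}
```

Strictly speaking, last-write-wins is decided by write timestamp, so out-of-order events would additionally need the write timestamp derived from the event time; the sketch leaves that out.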
Solutions we've tried so far:
1) Kafka -> Spark -> Parquet <- Apache Drill
Everything worked well; we could query and filter arrays and nested data structures.
Problem: we couldn't deduplicate the data, i.e., rewrite the Parquet files with the latest events.
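The deduplication itself is easy to express as a dedup-on-read query (keep the newest row per ID with a window function), but since Parquet files are immutable, it would have to be re-run over the whole dataset on every query instead of being resolved at write time. A sketch, assuming hypothetical `event_id` and `event_time` columns:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Dedup-on-read: rank rows within each event_id by recency and keep rank 1.
// This works, but every query pays for a full scan and shuffle, which is
// why we want the store itself to keep only the latest event.
def latestPerId(events: DataFrame): DataFrame = {
  val byRecency = Window.partitionBy("event_id").orderBy(col("event_time").desc)
  events
    .withColumn("rn", row_number().over(byRecency))
    .filter(col("rn") === 1)
    .drop("rn")
}
```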
2) Kafka -> Spark -> Cassandra <- Presto
This solved the deduplication problem from 1).
Problem: Presto doesn't support UDTs (presto doc, presto issue)
Our main requirements are:
- support for data deduplication. We may receive many events with the same ID and need to store only the latest one.
- storing deeply nested data structures with arrays
- distributed storage, scalable for future expansion
- a distributed query engine with SQL-like query support (for connecting Zeppelin, Tableau, Qlik, ...). Queries don't have to run in real time; a delay of a few minutes is acceptable.
- support for schema evolution (Avro-style)
Thank you for any suggestions.