
I am new to Cassandra and I am struggling with some of the concepts. I see the advantage in having the same data duplicated across multiple tables (with different partition keys) to support queries, but how are ETL jobs typically set up?

Consider a scenario where the data from a single CSV file has to be loaded into multiple tables. Would we run the COPY command, sstableloader, or the cassandra-loader utility with the CSV file multiple times, once for each table?
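For example, with cqlsh's COPY command the load would indeed be run once per target table (the table and column names below are made up for illustration):

```sql
-- Hypothetical schema: the same user data partitioned two ways.
-- The same CSV is loaded once per table, mapping columns as needed.

-- users_by_id(user_id, email, name) with PRIMARY KEY (user_id)
COPY users_by_id (user_id, email, name) FROM 'users.csv' WITH HEADER = true;

-- users_by_email(email, user_id, name) with PRIMARY KEY (email)
COPY users_by_email (email, user_id, name) FROM 'users.csv' WITH HEADER = true;
```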

How is read consistency maintained when the data has been loaded to only some of the tables while the load script is still running? Clients querying two different tables could potentially read two different values. Some online forums recommend using materialized views. Is that the only alternative?

Thanks!


1 Answer


I'm also fairly new to Cassandra, but based on what I've done so far, materialized views seem to be your best bet. If you don't go that route, every CRUD statement you issue would have to manage the data in all of the tables. Materialized views get you out of the business of writing statements for each table: you manage the base table, and the views maintain themselves. You can find a good overview here.
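As a sketch (the schema here is hypothetical), a materialized view lets Cassandra keep the second query table in sync with the base table automatically:

```sql
-- Base table, partitioned by user_id.
CREATE TABLE users_by_id (
    user_id uuid,
    email   text,
    name    text,
    PRIMARY KEY (user_id)
);

-- The view re-partitions the same rows by email. Writes to
-- users_by_id are propagated to the view by Cassandra itself;
-- you never write to the view directly.
CREATE MATERIALIZED VIEW users_by_email AS
    SELECT email, user_id, name
    FROM users_by_id
    WHERE email IS NOT NULL AND user_id IS NOT NULL
    PRIMARY KEY (email, user_id);
```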

BATCH is your best option for inserting related data together. However, it does not prevent dirty reads, such as a user seeing only some of the rows while you are still in the process of inserting them. I have not seen anything in Cassandra that would prevent that, and given its distributed nature I'm not sure how it could lock reads until the whole batch has finished across the cluster.
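A logged batch (again with hypothetical table names) ensures that both denormalized copies of a row eventually land together, though note it provides atomicity, not isolation, so readers may still observe one copy before the other:

```sql
-- Logged batch: either both INSERTs are eventually applied, or
-- neither is. It does NOT make them visible at the same instant.
BEGIN BATCH
    INSERT INTO users_by_id (user_id, email, name)
        VALUES (123e4567-e89b-12d3-a456-426614174000, 'a@example.com', 'Alice');
    INSERT INTO users_by_email (email, user_id, name)
        VALUES ('a@example.com', 123e4567-e89b-12d3-a456-426614174000, 'Alice');
APPLY BATCH;
```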