
I am starting with an initial idea of rewriting a mammoth Spark-Kafka-HBase application as Spark-Kafka-Cassandra (on Kubernetes).

I have the following two data models: one supports append-only inserts and the other supports upserts.

Approach 1:

create table test.inv_positions(
location_id int,
item bigint,
time_id timestamp,
sales_floor_qty int,
backroom_qty int,
in_backroom boolean,
transit_qty int,
primary key ((location_id), item, time_id)) with clustering order by (item asc, time_id desc);

Because time_id is part of the clustering key, every write to this table creates a new row. My plan is to read the latest record (time_id is sorted descending) with LIMIT 1, and to get rid of old records either by setting a TTL on the writes or by deleting them overnight.
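For example, the read and the TTL'd write might look like the sketch below (the literal key values and the one-day TTL are placeholders, not values from the actual application):

-- Clustering order is (item asc, time_id desc), so for a given
-- location/item the first row returned is the newest
select * from test.inv_positions
where location_id = 100 and item = 12345
limit 1;

-- Write with a row-level TTL so old records expire on their own
insert into test.inv_positions
(location_id, item, time_id, sales_floor_qty, backroom_qty, in_backroom, transit_qty)
values (100, 12345, toTimestamp(now()), 10, 5, false, 2)
using TTL 86400;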

Concerns: both TTL and delete create tombstones.

Approach 2:

create table test.inv_positions(
location_id int,
item bigint,
time_id timestamp,
sales_floor_qty int,
backroom_qty int,
in_backroom boolean,
transit_qty int,
primary key ((location_id), item)) with clustering order by (item asc);

With this table, if a new record arrives for the same location and item, it is upserted. It is easy to read, and there is no need to worry about purging old records.
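For instance, a second insert with the same (location_id, item) key simply overwrites the earlier row (the values below are placeholders):

-- Running this again later with new quantities replaces the row
-- in place; no separate update statement or purge job is needed
insert into test.inv_positions
(location_id, item, time_id, sales_floor_qty, backroom_qty, in_backroom, transit_qty)
values (100, 12345, toTimestamp(now()), 10, 5, false, 2);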

Concerns: I have another application on Cassandra that updates different columns at different times, and we still have read issues there. That said, upserts also create tombstones, but how much worse is that compared to approach 1? Or is there a better way to model this?


1 Answer


The first approach seems good. TTL and delete both create tombstones. You can look into compaction strategies for TTL-based deletes: TWCS (TimeWindowCompactionStrategy) is better suited to TTL-based expiry, while STCS (SizeTieredCompactionStrategy) can be used for plain deletes. Also, configure gc_grace_seconds appropriately so tombstones are purged smoothly, because heavy tombstone buildup leads to read latency.
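For example, switching the table to TWCS and tuning gc_grace_seconds could look like the following sketch (the one-day window and grace period are placeholders; tune them to your data retention and repair schedule):

alter table test.inv_positions
with compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': 1}
and gc_grace_seconds = 86400;
-- Note: keep gc_grace_seconds longer than your repair interval,
-- otherwise deleted data can resurface if a node misses the tombstone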