3 votes

Here's my scenario: I've got a table with 5 million+ rows. One particular map column has two keys (some rows may be missing one or both keys, but any row has at most two keys in that column).

I'm looking to "clear" the values of that column across all rows. I don't want to get rid of the column, as I'll run something afterwards that sets new values. I'd imagine that simply doing `update table set column.key=null ...` would fail due to timeout.
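For context, an unrestricted `UPDATE` like that isn't even valid CQL: Cassandra requires the full primary key in the `WHERE` clause of an `UPDATE`, so clearing the column has to be done row by row. Per-row statements would look something like this (the table name `tbl`, key column `id`, and map column `m` are placeholders, not from the actual schema):

```
-- clear the whole map for one row
UPDATE tbl SET m = null WHERE id = ?;

-- or remove just one key from the map
UPDATE tbl SET m['a'] = null WHERE id = ?;
```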

What would be the most Cassandra-friendly way of achieving this? I have access to Spark. Should I use Spark to read in RDDs and issue update queries per row, partition by partition?

Thanks, Ashic.

PS: Apache Cassandra 2.1.2, Spark 1.1.1.

========================

Edit: I can tolerate downtime.

1
Did you try to issue an update? – maasg
`update tbl set col=null;` fails without primary key. – ashic
What do you mean by "One particular map column has two keys"? Do you have a column of type map<text,text> or something like that? – G Quintana
Yup. The column is map<text, double> and any row will have at most two keys - 'a' and 'f'. Some will have one, the majority will have null. – ashic

1 Answer

1 vote

Ended up simply creating a Spark app, getting an RDD for the table, and issuing async updates for each row per partition, waiting for that partition's queries to finish before moving on. It took 8 minutes 52 seconds to update the 5 million+ rows. Although not strictly necessary, I ran a repair on the keyspace afterwards.
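The per-partition pattern described above (fire off async updates, then block until that partition's futures complete before moving on) is driver-agnostic. A minimal sketch of the control flow, using a thread pool as a stand-in for a real Cassandra session's `execute_async` (all names here are illustrative; a real app would bind a prepared `UPDATE` per primary key via the DataStax driver):

```python
from concurrent.futures import ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=8)
cleared = []  # records which "rows" were updated, for illustration only

def execute_async(row_id):
    # Stand-in for session.execute_async(prepared_update.bind(row_id)).
    return pool.submit(cleared.append, row_id)

def clear_partition(row_ids, window=1000):
    # Issue updates in bounded windows so unfinished futures don't pile up,
    # then block until every query in the window has completed.
    ids = list(row_ids)
    for start in range(0, len(ids), window):
        futures = [execute_async(i) for i in ids[start:start + window]]
        wait(futures)

# Simulate one Spark partition holding 2500 primary keys.
clear_partition(range(2500))
print(len(cleared))  # → 2500
```

Windowing the in-flight futures is the detail that keeps this from overwhelming the coordinator: each partition only ever has a bounded number of outstanding queries, rather than millions.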