5 votes

I have a table/column family in Cassandra 3.7 with sensor data.

CREATE TABLE test.sensor_data (
    house_id int,
    sensor_id int,
    time_bucket int,
    sensor_time timestamp,
    sensor_reading map<int, float>,
    PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)
);

Now when I select from this table I find duplicates for the same primary key, something I thought was impossible.

cqlsh:test> select * from sensor_data;

 house_id | sensor_id | time_bucket | sensor_time                     | sensor_reading
----------+-----------+-------------+---------------------------------+----------------
        1 |         2 |           3 | 2016-01-02 03:04:05.000000+0000 |       {1: 101}
        1 |         2 |           3 | 2016-01-02 03:04:05.000000+0000 |       {1: 101}

I think part of the problem is that this data has been written both "live" using Java and the DataStax Java driver, and loaded together with historic data from another source using sstableloader.

Regardless, this shouldn't be possible. I have no way of connecting to this cluster with the legacy cassandra-cli; perhaps that would have told me something that I can't see using cqlsh.

So, the questions are:
* Is there any way this could happen under known circumstances?
* Can I read more raw data using cqlsh? Specifically, the write time of these two rows. The writetime() function can't operate on primary keys or collections, and that is all I have (see the note just below).
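
For what it's worth, writetime() works fine on regular columns; the problem is that every column in this table is either part of the primary key or a non-frozen collection. A sketch of what would work if the schema had a plain scalar column (reading_count is hypothetical, it is not in the real schema):

-- reading_count is a hypothetical scalar column; writetime() is rejected
-- for primary key columns and for non-frozen collections like sensor_reading
select sensor_time, writetime(reading_count) from test.sensor_data;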

Thanks.

Update:

This is what I've tried, from comments, answers and other sources:
* selecting using blobAsBigint (see the sketch below) gives the same big integer for all identical rows
* connecting using cassandra-cli, after enabling Thrift, is possible, but reading the table isn't; it's no longer supported in 3.x
* dumping out using sstabledump is ongoing but expected to take another week or two ;)
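
For reference, the blob query looked roughly like this (a sketch from memory; timestampAsBlob and blobAsBigint are the standard CQL conversion functions for getting at the raw epoch milliseconds):

-- shows the raw millisecond value stored in the clustering column;
-- genuinely identical rows will show identical big integers
select house_id, sensor_id, time_bucket, blobAsBigint(timestampAsBlob(sensor_time)) from test.sensor_data;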

Can you check the data with cassandra-cli? – Nick
No, cassandra-cli is not supported in 3.x. I could do sstabledump, but the data files are huge and that tool offers no filtering. – Andreas Wederbrand
I believe sensor_time is different for the two rows, but the values are truncated and shown as if the time were the same. You could also ask on the Cassandra mailing list. – Nick
Did you ever figure out what happened? – datta
No. I changed the code to remove duplicates, if any are ever found. I guess the "bug", or whatever this is, is still there. – Andreas Wederbrand

2 Answers

1 vote

I don't expect to see nanoseconds in a timestamp field, and additionally I'm under the impression they're not fully supported. Try this:

SELECT house_id, sensor_id, time_bucket, blobAsBigint(timestampAsBlob(sensor_time)) FROM test.sensor_data;

I WAS able to replicate it by inserting the rows via an integer:

INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1, 2, 4, 1451692800000);
INSERT INTO sensor_data(house_id, sensor_id, time_bucket, sensor_time) VALUES (1, 2, 4, 1451692800001);

This makes sense, because I would suspect one of your writers is inserting the timestamp as a bigint, and the other is likely actually using a datetime.
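
If that is what happened, the blob trick above should reveal it: for the two inserts above, the rows display identically in cqlsh (both as 2016-01-02 00:00:00+0000) but the raw values are one millisecond apart, 1451692800000 vs 1451692800001. Something like:

-- restrict to the partition holding the "duplicates"
select sensor_time, blobAsBigint(timestampAsBlob(sensor_time)) as raw_millis
from sensor_data
where house_id = 1 and sensor_id = 2 and time_bucket = 4;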

I tried playing with both timezones and bigints to reproduce this... it seems like only the bigint path is reproducible:

 house_id | sensor_id | time_bucket | sensor_time              | sensor_reading
----------+-----------+-------------+--------------------------+----------------
        1 |         2 |           3 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-01 23:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 00:00:00+0000 |           null
        1 |         2 |           4 | 2016-01-02 01:01:00+0000 |           null

edit: I tried some shenanigans using a bigint in place of the datetime insert, and managed to reproduce it...

0 votes

"sensor_time" is part of the primary key. It is not in "Partition Key", but is "Clustering Column". this is why you get two "rows".

However, in the disk table, both "visual rows" are stored on single Cassandra row. In reality, they are just different columns and CQL just pretend they are two "visual rows".

Clarification - I did not worked with Cassandra for a while so I might not use correct terms. When i say "visual rows", I mean what CQL result shows.

Update

You can run the following experiment (please ignore and fix any syntax errors I make).

This is supposed to create a table with a composite primary key:

  • "state" is "Partition Key" and
  • "city" is "Clustering Column".

    create table cities( state int, city int, name text, primary key((state), city) );

    insert into cities(state, city, name) values(1, 1, 'New York');
    insert into cities(state, city, name) values(1, 2, 'Corona');

    select * from cities where state = 1;

This will return something like:

1, 1, New York
1, 2, Corona

But on disk this will be stored in a single row, like this:

+-------+-----------------+-----------------+
| state | city = 1        | city = 2        |
|       +-----------------+-----------------+
|       | city | name     | city | name     |
+-------+------+----------+------+----------+
| 1     | 1    | New York | 2    | Corona   |
+-------+------+----------+------+----------+

When you have such a composite primary key you can select or delete on it, e.g.

select * from cities where state = 1;
delete from cities where state = 1;
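
You can also address a single "visual row" by adding the clustering column to the predicate; against the cities table above, this should return only Corona:

-- partition key plus clustering column selects one "visual row"
select * from cities where state = 1 and city = 2;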

In the question, the primary key is defined as:

PRIMARY KEY ((house_id, sensor_id, time_bucket), sensor_time)

This means:

  • "house_id", "sensor_id", and "time_bucket" are the "Partition Key" and
  • "sensor_time" is the "Clustering Column".

So when you select, the real row is split up and shown as if there were several rows.
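
Applied to the question's table, a single "visual row" is addressed with the full partition key plus the clustering column. A sketch (the timestamp literal is just the one from the question's output; if the two duplicates really store the same millisecond value, this will still match both):

-- full partition key + clustering column
select * from test.sensor_data
where house_id = 1 and sensor_id = 2 and time_bucket = 3
  and sensor_time = '2016-01-02 03:04:05+0000';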

Update

http://www.planetcassandra.org/blog/primary-keys-in-cql/

The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row. In the storage engine the columns are grouped by prefixing their name with the value of the clustering columns. This is a standard design pattern when using the Thrift API. But now CQL takes care of transposing the clustering column values to and from the non key fields in the table.

Then read the explanations in "The Composite Enchilada".