
So I want to write data, which is encoded as a JSON string, into a Cassandra table. I took the following steps:

  • Create a Cassandra table containing columns for all the attributes of my JSON string. Here is the CQL for that:
CREATE TABLE on_equipment (
  ChnID varchar,
  StgID varchar,
  EquipID varchar,
  SenID varchar,
  value1 float,
  value2 float,
  value3 float,
  electric_consumption float,
  timestamp float,
  measurement_location varchar,
  PRIMARY KEY ((timestamp))
) WITH comment = 'A table for the on equipment readings';
  • Write a Python Cassandra client to write the data into Cassandra from a JSON payload. Here is the code snippet that makes the INSERT query (msg.value is the JSON string):
session.execute('INSERT INTO ' + table_name + ' JSON ' + "'" + msg.value + "';")
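For reference, a minimal, self-contained sketch of such a client using the DataStax cassandra-driver might look like the following; the contact point and keyspace name are placeholders, and the prepared statement is just an alternative to the string concatenation above:

from cassandra.cluster import Cluster

# The contact point and keyspace name below are placeholders.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot_keyspace')

# A prepared statement with a bind marker for the JSON payload avoids
# hand-quoting msg.value into the CQL string.
insert_json = session.prepare('INSERT INTO on_equipment JSON ?')

def write_record(msg_value):
    # msg_value is the JSON string, e.g. the body of an incoming IoT message
    session.execute(insert_json, [msg_value])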

I get no write errors when doing this.

However, I ran into a problem:

The JSON data I have comes from IoT sources, and one of the attributes is a unix timestamp. An example of a JSON record is as follows (notice the timestamp attribute):

{'timestamp': 1598279069.441547, 'value1': 0.36809349674042857, 'value2': 18.284579388599308, 'value3': 39.95615809003724, 'electric_consumption': 1.2468644044844224, 'SenID': '1', 'EquipID': 'MID-1', 'StgID': '1', 'ChnID': '1', 'measurement_location': 'OnEquipment'}

In order to insert many records, I defined the timestamp value as the primary key of the data in the Cassandra table. The problem is that not all records are being written into Cassandra, only records whose timestamps fall into a certain group. I know this because I have produced around 100 messages and received zero write errors, yet the table contains only 4 rows:

 timestamp  | chnid | electric_consumption | equipid | measurement_location | senid | stgid | value1   | value2   | value3
------------+-------+----------------------+---------+----------------------+-------+-------+----------+----------+----------
 1.5983e+09 |     1 |             0.149826 |   MID-1 |          OnEquipment |     1 |     1 | 0.702309 | 19.92813 | 21.47207
 1.5983e+09 |     1 |              1.10219 |   MID-1 |          OnEquipment |     1 |     1 | 0.141921 |  5.11319 | 78.17094
 1.5983e+09 |     1 |              1.24686 |   MID-1 |          OnEquipment |     1 |     1 | 0.368093 | 18.28458 | 39.95616
 1.5983e+09 |     1 |              1.22841 |   MID-1 |          OnEquipment |     1 |     1 | 0.318357 |  16.9013 |  71.5506

In other words, Cassandra is updating the values of these four rows, when it should be writing all 100 messages.

My guess is that I am using the Cassandra primary key incorrectly. The timestamp column is of type float.

My questions: Does this behaviour make sense? Can you explain it? What can I use as the primary key to solve this? Is there a way to make the primary key the Cassandra write or arrival time?

Thank you in advance for your help!


1 Answer


You have defined the primary key as just the timestamp - if you insert data into a Cassandra table and the data you are writing has the same primary key as data already in the table, you will overwrite it. All inserts are in effect insert/update (upserts), so when you use the same primary key value a second time, it updates the existing row rather than adding a new one.

As to the solution - this is trickier. The primary key has to hold true to its name - it is primary, i.e. unique. Even if it were a timestamp instead of a float, you should also have at least one other field (such as the IoT unique identifier) within the primary key, so that two readings from two different devices made at the exact same time do not clash.
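For example, here is one possible key layout, assuming the common access pattern is "readings for a given device/sensor over time" - the column names come from the question, but the exact grouping is an assumption:

CREATE TABLE on_equipment (
  ChnID varchar,
  StgID varchar,
  EquipID varchar,
  SenID varchar,
  value1 float,
  value2 float,
  value3 float,
  electric_consumption float,
  timestamp double,
  measurement_location varchar,
  PRIMARY KEY ((EquipID, SenID), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

Note the timestamp is a double here (for the precision reason covered below), and it is a clustering column rather than the partition key, so readings from the same sensor are stored together in time order.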

In Cassandra you model the data and the keys based on how you intend to access the data - without knowing that, it is not possible to say what the primary key (partition key + clustering key) should be. Ideally you also need to know something about the data's cardinality and selectivity.

Identify and define the queries you intend to run against the data; those should guide your partition key and clustering key choices, which together make up the primary key.
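With the key layout sketched above, for instance, a query like "the most recent readings for a given sensor" is served directly (hypothetical identifier values):

SELECT timestamp, value1, value2, value3
FROM on_equipment
WHERE EquipID = 'MID-1' AND SenID = '1'
LIMIT 10;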

The specific issue here, to add to the above, is that the data exceeds the precision at which a float can be stored - in effect capping the values and making them all identical. If you change the float to a double, the values are stored without being capped to the same number, so new rows are inserted instead of the same row being upserted. (The JSON insert part is not relevant to the issue, as it happens.)
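You can see the precision loss outside Cassandra too - a quick Python check, where the struct round-trip mimics storing the value as a 32-bit float:

import struct

ts = 1598279069.441547  # unix timestamp from the question

# Round-trip through a 32-bit float (Cassandra's 'float' type):
# at this magnitude a float can only represent multiples of 128,
# so nearby timestamps collapse to the same value.
print(struct.unpack('f', struct.pack('f', ts))[0])   # 1598279040.0

# Round-trip through a 64-bit double (Cassandra's 'double' type):
print(struct.unpack('d', struct.pack('d', ts))[0])   # 1598279069.441547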

Recreating the issue as follows:

CREATE TABLE on_equipment (
  ChnID varchar,
  timestamp float,
  PRIMARY KEY ((timestamp))
) ;

insert into on_equipment(timestamp, chnid) values (1598279061,'1');
insert into on_equipment(timestamp, chnid) values (1598279062,'2');
insert into on_equipment(timestamp, chnid) values (1598279063,'3');
insert into on_equipment(timestamp, chnid) values (1598279064,'4');

select count(*) from on_equipment;

1

select timestamp from on_equipment;

1.59827904E9

You can see the value has been rounded and capped, with all 4 values capped to the same number. If you use smaller numbers for the timestamps it works, but that isn't very useful to do.

Changing it to a double:

CREATE TABLE on_equipment (
  ChnID varchar,
  timestamp double,
  PRIMARY KEY ((timestamp))
) ;

insert into on_equipment(timestamp, chnid) values (1598279061,'1');
insert into on_equipment(timestamp, chnid) values (1598279062,'2');
insert into on_equipment(timestamp, chnid) values (1598279063,'3');
insert into on_equipment(timestamp, chnid) values (1598279064,'4');

select count(*) from on_equipment;

4