
I am trying to understand Cassandra by playing with a public dataset. I inserted 1.5M rows from a CSV into a table on my local Cassandra instance, in a keyspace created WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }.
The table was created with one field as the partition key and one more as a clustering column, the two together forming the primary key.
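A simplified sketch of what I ran (table and column names and the CSV path are illustrative placeholders):

```
-- Keyspace with the replication settings above
CREATE KEYSPACE IF NOT EXISTS accidents
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE accidents;

-- Table with one partition key column and one clustering column
CREATE TABLE IF NOT EXISTS us_accidents (
    state text,
    severity int,
    description text,
    PRIMARY KEY (state, severity)
);

-- Bulk load from CSV (path is a placeholder)
COPY us_accidents (state, severity, description) FROM 'us_accidents.csv' WITH HEADER = true;
```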

I got confirmation that all 1.5M rows were processed and the COPY completed successfully.

But when I run SELECT or SELECT COUNT(*) on the table, I always get a maximum of 182 rows. Secondly, the number of records returned when I query on the clustering columns seems to be higher than when I query on a single column, which does not make sense to me. What am I missing about Cassandra's architecture and querying model?

Lastly, I have also tried reading the same Cassandra table from the pyspark shell, and it also reads only 182 rows.

You may try nodetool tablestats us_accidents to get info about the total size of your keyspace and tables. Maybe some of your primary keys exist on multiple rows and they keep overwriting each other. – Ersoy

1 Answer


Your primary key is PRIMARY KEY (state, severity). With this primary key definition, all rows for accidents in the same state with the same severity will overwrite each other. You probably have only 182 distinct (state, severity) combinations in your dataset.
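As a rough illustration with the us_accidents table sketched in the question (column names are assumptions), two inserts that share the same (state, severity) collapse into a single row:

```
-- Two different accidents in the same state with the same severity...
INSERT INTO us_accidents (state, severity, description)
VALUES ('CA', 2, 'Accident on I-5');

INSERT INTO us_accidents (state, severity, description)
VALUES ('CA', 2, 'Accident on US-101');

-- ...leave only one row behind: Cassandra writes are upserts,
-- so the second INSERT silently overwrites the first
SELECT * FROM us_accidents WHERE state = 'CA' AND severity = 2;
```

The same thing happens for every row loaded by COPY, so only the last CSV row per (state, severity) combination survives.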

You could add another clustering column that uniquely identifies each accident, such as an accident_id:
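A sketch of what that could look like (accident_id as a uuid is just one option; any column that is unique per accident works):

```
CREATE TABLE us_accidents (
    state text,
    severity int,
    accident_id uuid,
    description text,
    -- state is still the partition key; severity and accident_id are clustering columns,
    -- so rows with the same state and severity no longer overwrite each other
    PRIMARY KEY (state, severity, accident_id)
);
```

With a key like this, each accident is stored as its own row, and SELECT COUNT(*) should reflect all 1.5M loaded rows (assuming the identifier is unique in the CSV).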

This blog highlights the importance of the primary key, and has some examples: https://www.datastax.com/blog/2016/02/most-important-thing-know-cassandra-data-modeling-primary-key