0
votes

I have to create a table which stores a big amount of data (like 400 columns and 5.000.000 to 40.000.000 rows). There is a counter "counter" which counts from 1 upwards. Right now this is my primary key. The other variables are int, float, and varchar type and repeating.

I need to do this for a database-comparison, so I have to use Cassandra, even if there could be other databases, that can do better in this specific problem.

On this table I want to execute some range queries. The queries should be like:

SELECT counter, val1, val2, val3 FROM table WHERE counter > 1000 AND counter < 5000;

Also there will be other filter-parameters:

... AND val54 = 'OK';

I think this is a problem in Cassandra, because "counter" is the PK. I will try running the token() function, but I guess this will be slow.

Right now I am learning about the data modelling in Cassandra but I hope somebody with experience in Cassandra got some hints for me, like how to organize the table and make the queries possible and fast? Perhaps just some topics I should learn about or links that will help me.

Have a nice day, Friedrich

1
You're basically stating how a problem was solved using a relational database without stating what the original problem was. If you try to take a relational solution and directly port it to a distributed model, then it's not going to work. For example in Cassandra, you would typically store items using a time series rather than an incrementing index. Cassandra can do range queries within a partition, but it isn't clear what your actual requirements are.Jim Meyer
Thank you for your answer. Would the query be possible or easier if I would use time series? I don't see a difference but this would be possible.Friedrich
Would it be possible to put the data in just one Partition and do the queries? My requirement is: I want to create a database in Cassandra and MSSQL and then compare some queries, even if its problematic in Cassandra. I would like to hear your answer!Friedrich
You could put all the data into one partition, but this would not be fair to Cassandra since the queries would then only hit one node. The power of Cassandra is to spread the workload across many nodes, so typically you want to structure your partition key to spread your data across the whole cluster. Partitions are buckets of related data grouped on one node. It's likely your use case does not really need all the data grouped into one bucket. It was just done that way because it was written for a relational database.Jim Meyer
@JimMeyer I really would like to use Cassandras benefits. But how can I organize my database/partitions/nodes that way? You're right, I already did this in a relational database, but how do I model my data buckets now? I need the range queries, because that's the purpose of the database. There is a technical device, that measures some values every 0,5 sec. So the PK is really a DateTime. These data needs to be stored in a database. MSSQL does a good job, but my supervisor also wants me to evaluate Cassandra. I hope you can help and if you need more information just tell me.Friedrich

1 Answers

1
votes

This sounds like a bad use case for Cassandra.

First, range queries are discouraged in Cassandra. This is because the range can't be resolved with out visiting every node in the cluster.

Second, you can't mix a counter type column with other column types. For a given table it can either have (and only have) counter columns or it can have all non-counter columns.

As far as Cassandra data modeling goes, if you want to create a successful data model, create your partitions around the exact thing you're going to query.