A query like `SELECT * FROM table WHERE col = something` is usually a very bad fit for Cassandra unless `col` is the partition key. Because Cassandra is a distributed system, it has to fetch data from all nodes to answer such a query. With a relatively large amount of data in the cluster, the query will most likely just time out.
You can still run a similar query, but it is more complicated. You can either:
- Use Spark with the Spark Cassandra Connector (see the documentation in the `doc` folder);
- Scan the data efficiently on all nodes by splitting your query into multiple queries, each covering an individual token range, so that data on individual nodes can be processed in parallel without overloading the coordinator node. I have an example of Java code that you could use as a base, but Spark could be easier to implement.
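
To illustrate the token-range idea from the second option, here is a minimal sketch that only builds the per-range CQL strings. It is not the Java example mentioned above. It assumes the default `Murmur3Partitioner` (tokens are signed 64-bit values) and splits the full token space into equal slices, whereas a real implementation would normally take the token ranges from the driver's cluster metadata so that each slice maps to the node owning it. The class name `TokenRangeSplitter`, the table `ks.tbl`, and the partition key column `id` are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the Murmur3 token space into equal slices
// and builds one CQL query per slice. No Cassandra driver is used here.
class TokenRangeSplitter {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    // Returns `parts` contiguous (start, end] ranges covering the token space.
    static List<long[]> split(int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be >= 1");
        }
        // The full width (2^64 - 1) does not fit in a long, hence BigInteger.
        BigInteger width = MAX.subtract(MIN);
        BigInteger n = BigInteger.valueOf(parts);
        List<long[]> ranges = new ArrayList<>();
        BigInteger start = MIN;
        for (int i = 1; i <= parts; i++) {
            // The last slice always ends at the maximum token.
            BigInteger end = (i == parts)
                    ? MAX
                    : MIN.add(width.multiply(BigInteger.valueOf(i)).divide(n));
            ranges.add(new long[] {start.longValueExact(), end.longValueExact()});
            start = end;
        }
        return ranges;
    }

    // Builds the query for one slice. The lower bound is exclusive; this
    // assumes no partition is stored with token Long.MIN_VALUE.
    static String queryFor(String table, String partitionKey, long[] range) {
        return "SELECT * FROM " + table
                + " WHERE token(" + partitionKey + ") > " + range[0]
                + " AND token(" + partitionKey + ") <= " + range[1];
    }

    public static void main(String[] args) {
        // Table and column names are placeholders for illustration only.
        for (long[] range : split(4)) {
            System.out.println(queryFor("ks.tbl", "id", range));
        }
    }
}
```

Each generated query can then be executed independently (for example from a thread pool), with the condition on `col` applied to the rows of each slice, either on the client side or by appending it to the query together with `ALLOW FILTERING`.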
P.S. I would recommend starting with the basics of Cassandra to understand how it works; it will make your life easier if you know what you can and can't do. To begin, you can take the DS201 course on DataStax Academy.