0
votes

I am working on a sensor data(timeseries). Number of columns in a table is 3000.
for eg: nodeid,timestamp,sen1,sen2,.....sen-n. nodeid and timestamp are primary key with clustering order by timestamp.
Number of records are 10000.

When a SELECT query for single column(SELECT timestamp,sen1 FROM <table>) requ is requsted through cassandra datastax java driver 3.0 it is replies in 15 sec; i.e If I want read all the tags, one a tag at a time for all 3000 tags requires 3000*15 sec = 12 to 13 hours aproximately. It is on single node cluster with 16GB RAM.
I allocated 10GB for JVM. Still response time is not changed. I used LevelCompactionStragy at the time of table creation.

Hardware: Intel Core i7 and Normal Hard disk not SSD,8GB RAM How to reduce that read or query time on the single node cluster?

2
Can you provide hardware specs ? Nb of CPU cores ? Spinning hard disk or SSD ? - doanduyhai
one node cluster is a bad idea. It is better to have a SQL solution. If you want to really test Cassandra, try to add more nodes to the cluster and think a bit more about the architecture. - Whitefret

2 Answers

0
votes

Obviously, there is problem with data modelling. IMO, a table with 3000 columns is bad. if your use case is like "SELECT timestamp,sen1 FROM ", then you should model it as " Primary Key(Timestamp, SensorId) ".

"SELECT timestamp,sen1" in your model, cassandra will still read all other column values from disk into memory.

I am not sure what is 'nodeId' in your case.. I hope it's not cassandra node id..

0
votes

(SELECT timestamp,sen1 FROM table)

This is like getting all the data at once(in your case 10000 records). So getting 1 column or 3000 columns will make Cassandra server to read through all the SSTables. The point is it won't be 12 or 13 hours.

Still 15 seconds seems unbelievable. Did you also include the network latency and client side write in this measure?

As mentioned in one of the answers your model seems to be bad (If you put timestamp as partion key, the data becomes two sparse and getting a range of data will need to read from more than one partition. If you use only node_id as partition key, the partition will host too much data and can cross the C* limitation of 2 Billion). My advise is

  • Redesign your partition key. Please check this tutorial for a start. https://academy.datastax.com/resources/getting-started-time-series-data-modeling
  • Add more no. of nodes and increase replication factor to see better read latencies.
  • Try to design your read query such that it reads from only one partition at once. eg: SELECT * from Table where sensor_node_id = abc and year = 2016 and month = June

Hope this helps!