23 votes

I'm doing a student project involving building and querying a Cassandra data cluster.

When my cluster load was light (around 30 GB), my queries ran without a problem, but now that it's quite a bit bigger (0.5 TB), my queries are timing out.

I thought this problem might arise, so before I began generating and loading test data I changed this value in my cassandra.yaml file:

request_timeout_in_ms (Default: 10000) The default timeout for other, miscellaneous operations.

However, when I changed that value to 1000000, Cassandra seemingly hung on startup, though that could have just been the large timeout at work.
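
For reference, the change amounts to a single line in cassandra.yaml (the value shown is the increased one I tried; the default is 10000):

    # cassandra.yaml
    request_timeout_in_ms: 1000000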

My goal for data generation is 2 TB. How do I query that large a space without running into timeouts?

Queries:

SELECT  huntpilotdn 
FROM    project.t1 
WHERE   (currentroutingreason, orignodeid, origspan,  
        origvideocap_bandwidth, datetimeorigination)
        > (1,1,1,1,1)
AND      (currentroutingreason, orignodeid, origspan,    
         origvideocap_bandwidth, datetimeorigination)
         < (1000,1000,1000,1000,1000)
LIMIT 10000
ALLOW FILTERING;

SELECT  destcause_location, destipaddr
FROM    project.t2
WHERE   datetimeorigination = 110
AND     num >= 11612484378506
AND     num <= 45880092667983
LIMIT 10000;


SELECT  origdevicename, duration
FROM    project.t3
WHERE   destdevicename IN ('a','f', 'g')
LIMIT 10000
ALLOW FILTERING;

I have a demo keyspace with the same schemas but a far smaller data size (~10 GB), and these queries run just fine there.

All of the queried tables have millions of rows, with around 30 columns in each row.

Can you post an example of your query? – Aaron

4 Answers

18 votes

I'm going to guess that you are also using secondary indexes. You are finding out firsthand why secondary index queries and ALLOW FILTERING queries are not recommended...because those types of design patterns do not scale for large datasets. Rebuild your model with query tables that support primary-key lookups, as that is how Cassandra is designed to work.
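
As a minimal sketch of what that means for the third query above (the column types and the callid clustering column are assumptions for illustration, not the OP's actual schema), a query table partitions on the column being filtered:

    -- Hypothetical query table: partitioning on destdevicename means the
    -- read targets specific partitions instead of scanning the cluster.
    CREATE TABLE project.t3_by_destdevicename (
        destdevicename text,
        callid         timeuuid,   -- assumed clustering column for uniqueness
        origdevicename text,
        duration       int,
        PRIMARY KEY ((destdevicename), callid)
    );

    -- The same question, now answered by partition-key lookups:
    SELECT origdevicename, duration
    FROM   project.t3_by_destdevicename
    WHERE  destdevicename IN ('a', 'f', 'g')
    LIMIT  10000;

No ALLOW FILTERING is needed, because every row read lives in one of the three requested partitions.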

Edit

"The variables that are constrained are cluster keys."

Right...which means they are not partition keys. Without constraining your partition key(s), you are basically scanning your entire table, as clustering keys only order data within their partition.
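
To sketch the difference (assuming, purely for illustration, that datetimeorigination were the partition key of t1 and the other four columns its clustering keys), a query that scales anchors the partition first:

    -- Hypothetical key: PRIMARY KEY ((datetimeorigination),
    --     currentroutingreason, orignodeid, origspan, origvideocap_bandwidth)
    SELECT huntpilotdn
    FROM   project.t1
    WHERE  datetimeorigination = 110    -- partition key: a single-partition read
    AND    currentroutingreason > 1     -- clustering slice within that partition
    LIMIT  10000;

Cassandra can serve this by reading one partition in clustering order, instead of touching every node in the cluster.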

Edit 20190731

So while I may have the "accepted" answer, I can see that there are three additional answers here. They all focus on changing the query timeout, and two of them outscore my answer (one by quite a bit).

As this question continues to rack up page views, I feel compelled to address the aspect of increasing the timeout. Now, I'm not about to downvote anyone's answers, as that would look like "sour grapes" from a vote perspective. But I can articulate why I don't feel that solves anything.

First, the fact that the query times out at all is a symptom; it is not the main problem. Increasing the query timeout is therefore a band-aid solution that obscures the main problem.

The main problem, of course, is that the OP is trying to force the cluster to support a query that does not match the underlying data model. As long as this problem is ignored and worked around (instead of being dealt with directly), it will continue to manifest itself.

Secondly, look at what the OP is actually trying to do:

My goal for data generation is 2 TB. How do I query that large a space without running into timeouts?

Those query timeout limits are there to protect your cluster. If you were to run a full-table scan (which, to Cassandra, means a full-cluster scan) through 2 TB of data, the timeout threshold required would be quite large. In fact, if you did manage to find the right number to allow it, your coordinator node would tip over LONG before most of the data was assembled in the result set.

In summary, increasing query timeouts:

  1. Gives the appearance of "helping" by forcing Cassandra to work against how it was designed.

  2. Can potentially crash a node, putting the stability of the underlying cluster at risk.

Therefore, increasing the query timeouts is a terrible, TERRIBLE IDEA.

57 votes

If you are using the DataStax cqlsh, you can specify the client timeout in seconds as a command-line argument. The default is 10.

$ cqlsh --request-timeout=3600

DataStax documentation
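
For a one-off long-running statement, the flag can be combined with cqlsh's -e option; this sketch reuses the second query from the question, with the host left as a placeholder:

    $ cqlsh --request-timeout=3600 127.0.0.1 -e "SELECT destcause_location, destipaddr FROM project.t2 WHERE datetimeorigination = 110 AND num >= 11612484378506 AND num <= 45880092667983 LIMIT 10000;"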

9 votes

To change the client timeout limit in Apache Cassandra, there are two techniques:

Technique 1: This is a good technique:

1. Navigate to the following hidden directory under your home folder (create the hidden directory if it does not exist):

    $ pwd
    ~/.cassandra


2. Modify the cqlshrc file in it to set an appropriate time in seconds (create the file if it does not exist):

    Original Setting:

        $ more cqlshrc
        [connection]
        client_timeout = 10
        # Can also be set to None to disable:
        # client_timeout = None
        $

    New Setting:

        $ vi cqlshrc
        $ more cqlshrc
        [connection]
        client_timeout = 3600
        # Can also be set to None to disable:
        # client_timeout = None
        $

Note: The time here is in seconds. Since we want to increase the timeout to one hour, we set it to 3600 seconds.

Technique 2: This is not a good technique, since you are changing the setting in the client program (cqlsh) itself. Note: if you have already changed the timeout using Technique 1, that value will override the one specified using Technique 2, because profile settings have the highest priority.

1. Navigate to the path where the cqlsh program is located. You can find it using the which command:

    $ which cqlsh
    /opt/apache-cassandra-2.1.9/bin/cqlsh
    $ pwd
    /opt/apache-cassandra-2.1.9/bin
    $ ls -lrt cqlsh
    -rwxr-xr-x 1 abc abc 93002 Nov  5 12:54 cqlsh


2. Open the cqlsh program and modify the time specified by the client_timeout variable. Note that the time is specified in seconds.
$ vi cqlsh

In the __init__ function:
    def __init__(self, hostname, port, color=False,
                 username=None, password=None, encoding=None, stdin=None, tty=True,
                 completekey=DEFAULT_COMPLETEKEY, use_conn=None,
                 cqlver=DEFAULT_CQLVER, keyspace=None,
                 tracing_enabled=False, expand_enabled=False,
                 display_time_format=DEFAULT_TIME_FORMAT,
                 display_float_precision=DEFAULT_FLOAT_PRECISION,
                 max_trace_wait=DEFAULT_MAX_TRACE_WAIT,
                 ssl=False,
                 single_statement=None,
                 client_timeout=10,
                 connect_timeout=DEFAULT_CONNECT_TIMEOUT_SECONDS):

In the options.client_timeout setting:
    options.client_timeout = option_with_default(configs.get, 'connection', 'client_timeout', '10')

You can modify the value in both of these places. The second line picks up the client_timeout value from the cqlshrc file.

1 vote
  1. Increase read_request_timeout_in_ms (note: the setting is in milliseconds, not seconds) in the cassandra.yaml file, as sketched below.

  2. Modify the cqlsh.py program and change the default values of these module-level variables instead of editing them inside the function:

         DEFAULT_REQUEST_TIMEOUT_SECONDS = 100
         DEFAULT_CONNECT_TIMEOUT_SECONDS = 100
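
A minimal sketch of the server-side change from point 1 (the value is an example; the stock default is 5000 ms):

    # cassandra.yaml -- coordinator timeout for single-partition reads
    read_request_timeout_in_ms: 100000

Note that range scans (including ALLOW FILTERING queries like the OP's) are governed by the separate range_request_timeout_in_ms setting.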

It works for sure.