SELECT statements in PrestoDB v0.125, using the Cassandra connector against a DataStax Cassandra cluster, return only 200 rows, even when the table contains many more rows than that. Aggregate queries such as SELECT COUNT(*) over the same table also return a result of just 200.

(The behaviour is identical whether I query through the PyHive connector or the base Presto CLI.)

The documentation isn't much help, but I am guessing that the issue is pagination and that I need to set environment variables (which the documentation doesn't explain): https://prestodb.io/docs/current/installation/cli.html

Does anyone know how I can remove this limit of 200 rows returned? What specific environment variable setting do I need?

1 Answer

For those who come after: the solution is in the cassandra.properties connector configuration for Presto. The key setting is:

  • cassandra.limit-for-partition-key-select

This needs to be set higher than the total number of rows in the table you are querying. Otherwise SELECT queries return only a fraction of the stored data, because the connector has not located all of the partition keys.

Complete copy of my config file (which may help!):

connector.name=cassandra
# Comma separated list of contact points
cassandra.contact-points=host1,host2
# Port running the native Cassandra protocol
cassandra.native-protocol-port=9042
# Limit of rows to read for finding all partition keys.
cassandra.limit-for-partition-key-select=2000000000
# maximum number of schema cache refresh threads, i.e. maximum number of parallel requests
cassandra.max-schema-refresh-threads=10
# schema cache time to live
cassandra.schema-cache-ttl=1h
# schema refresh interval
cassandra.schema-refresh-interval=2m
# Consistency level used for Cassandra queries (ONE, TWO, QUORUM, ...)
cassandra.consistency-level=ONE
# fetch size used for Cassandra queries
cassandra.fetch-size=5000
# fetch size used for partition key select query
cassandra.fetch-size-for-partition-key-select=20000
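
After changing the setting, one way to confirm the limit is gone is to re-run the COUNT query and compare it with the row count you expect. The following is only a minimal sketch using PyHive (mentioned in the question), not a definitive check. The host, port, keyspace and table names are hypothetical placeholders, and it assumes the catalog is named "cassandra" (i.e. the config file is cassandra.properties) and that Presto has been restarted so the new properties are loaded.

```python
def count_rows(cursor, table):
    """Return the row count Presto reports for `table` via a DB-API cursor."""
    # Table names cannot be passed as bound parameters, so validate the
    # identifier parts before interpolating them into the SQL string.
    if not all(part.isidentifier() for part in table.split(".")):
        raise ValueError(f"unexpected table name: {table!r}")
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


def check_against_presto(table="my_table"):
    """Connect to Presto with PyHive and print the row count for `table`."""
    # Imported here so count_rows can be used without PyHive installed.
    from pyhive import presto

    # Hypothetical connection details; replace with your own.
    conn = presto.connect(
        host="localhost",
        port=8080,
        catalog="cassandra",
        schema="my_keyspace",
    )
    print(count_rows(conn.cursor(), table))
```

Calling check_against_presto() against your own cluster should then print the full row count rather than 200. If it still prints 200, the properties file being edited is probably not the one Presto is loading.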