
I have a C* column family that stores event-like data. The column family was created in CQL3 as follows:

CREATE TABLE event (
  hour text,
  stamp timeuuid,
  values map<text, text>,
  PRIMARY KEY (hour, stamp)
) WITH CLUSTERING ORDER BY (stamp DESC);

The partitioner is Murmur3. I then tried to query this data from Spark through the Calliope library, and ran into two problems:

  1. There are more than 1000 records per partition key (the 'hour' field), but the response contains only the first 1000 records per key. I can increase the page size in the query to receive more data, but as far as I understand, it should be the paginator's job to walk through the data and slice it.
  2. I receive each record more than once.

On the first problem, the Calliope author told me that the CQL3 driver should paginate the data, and recommended that I read the DataStax article. But I can't find there how to build a query that gives the driver the right instructions.

On the second problem, I found that it was a known issue with the Hadoop connector in Cassandra < 1.2.11. But I use C* 2.0.3 and rebuilt Spark with the required library versions. I also use Calliope version 0.9.0-C2-EA.

Could you point me to documentation or code samples that explain the right way to solve these problems, or demonstrate workarounds? I suspect that I am using the C*-to-Spark connector improperly, but I can't find a solution.

Thank you in advance.

It seems that the "WITH CLUSTERING ORDER" clause is the source of both problems. When I read from a similar table (just without WITH CLUSTERING ORDER BY), neither the result limit nor the record duplication appears. - Stinger.911

1 Answer


Right now it is impossible to use a non-default sort order for clustering keys. Everything works fine when the sort order for clustering keys is the default (ASC).

The workaround is to modify the data model to use compound keys with the default clustering order.
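
As a minimal sketch of that workaround, the table from the question could be recreated without the CLUSTERING ORDER BY clause, so that stamp falls back to the default ascending order. The hour value in the SELECT is a made-up example, and the query only illustrates that newest-first reads remain possible from cqlsh or a regular driver; it is not something the connector is confirmed to issue.

```cql
-- Same columns and compound primary key as in the question,
-- but with no CLUSTERING ORDER BY clause: stamp is stored in default ASC order.
CREATE TABLE event (
  hour text,
  stamp timeuuid,
  values map<text, text>,
  PRIMARY KEY (hour, stamp)
);

-- Newest-first reads within one partition can still be requested at query time.
-- The partition key value below is hypothetical.
SELECT * FROM event
WHERE hour = '2014-01-01-12'
ORDER BY stamp DESC
LIMIT 100;
```

Reversing the order per query costs a bit more than reading in the stored order, so this layout trades cheap newest-first reads for compatibility with the connector.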