3
votes

I have a Cassandra "table" like the following:

CREATE TABLE example
(
    result_id INT,
    evaluator_id INT,
    score DOUBLE,
    PRIMARY KEY (result_id, evaluator_id)
);

And a query like the following:

SELECT result_id, evaluator_id, score FROM example;

I understand that when querying a single partition key, the results will be sorted by the clustering key in the defined order. However, to support my data model I'm assuming that in the previous unrestricted query, the results will be grouped together by the partition key "result_id", i.e.,

lastResultId = None
for row in queryResults:
    resultId = row['result_id']
    if resultId == lastResultId:
        # append the score and evaluator id to a data structure
        pass
    else:
        # do something with the data structure, assuming we've now
        # received all scores for the given result_id
        pass
    lastResultId = resultId

Is this a valid assumption? It makes sense given the storage details, and works in prototype, but doesn't seem to be explicitly guaranteed anywhere. E.g., if I'm pulling data from multiple nodes, could the rows with different result IDs be hypothetically intermixed?

1
Do you mean that you're worried about the duplication of keys that occurs in big data applications? If so, would sorting based on partition key as described here work for you? issues.apache.org/jira/browse/CASSANDRA-4536 - catpaws
No, I'm not worried about duplication. I'm worried about the ordering of results that comes back from the CQL3 query. If I could sort by partition key, that would resolve my issue, but that's more than I want (I only want to group by partition key, not necessarily sort). The linked issue also doesn't address sorting exactly; it only covers selecting distinct partition keys, which doesn't solve my concern. - David Eklund

1 Answer

2
votes

Is this a valid assumption?

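Yes, results will always be grouped by partition key. That's because all CQL rows for a particular partition are stored together on disk. CQL rows with the same partition key hash to the same token value, and are all stored together on the nodes responsible for that particular token range. An unrestricted query walks the token ranges in order and returns each partition's rows together, so rows with different result IDs are not intermixed. Note that with a hashing partitioner this is token order, not sorted result_id order, so the rows are grouped but not sorted by partition key.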
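Given that guarantee, the grouping loop from the question can be sketched with itertools.groupby, which only merges consecutive rows sharing a key and therefore relies on exactly this contiguity. This is a minimal sketch, not driver code: query_results is a hypothetical stand-in list of dicts (a real result set may expose columns differently), and group_scores is an illustrative helper name.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-in for the query results: rows for one result_id are
# contiguous (token order), but the result_ids themselves are not sorted.
query_results = [
    {"result_id": 7, "evaluator_id": 1, "score": 0.5},
    {"result_id": 7, "evaluator_id": 2, "score": 0.9},
    {"result_id": 3, "evaluator_id": 1, "score": 0.1},
    {"result_id": 12, "evaluator_id": 4, "score": 0.7},
    {"result_id": 12, "evaluator_id": 9, "score": 0.3},
]


def group_scores(rows):
    """Yield (result_id, [(evaluator_id, score), ...]) per run of equal result_id."""
    # groupby does not sort; it only groups adjacent rows with the same key.
    for result_id, partition_rows in groupby(rows, key=itemgetter("result_id")):
        yield result_id, [(r["evaluator_id"], r["score"]) for r in partition_rows]


grouped = list(group_scores(query_results))
for result_id, scores in grouped:
    print(result_id, scores)
```

Unlike a hand-written loop that compares against the previous result_id, groupby also emits the final group without a separate flush step after the loop. Adapt the key function to however your driver exposes row columns.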