
We use Cassandra wide rows heavily to store time-series data, as they are a good fit for that use case.

create table user_events (
  user_id text,
  timestmp timestamp,
  event text,
  primary key ((user_id), timestmp)
);

Let’s assume I want to do the following aggregation using Spark: select all user_ids that had at least one event during the last month.

What is the most efficient way to do that?


1 Answer


Since you have user_id as the partition key, you need to know the user IDs up front in order to query this table efficiently. But your problem goes the other way: you want to derive the user IDs from a time range.
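For contrast, here is a rough sketch (keyspace name and the 30-day window are my assumptions) of what the original schema forces when read through the Spark Cassandra Connector: a full table scan, filtered on the Spark side.

import com.datastax.spark.connector._

// With user_id as the partition key there is no time-based access path,
// so every partition must be read and filtered client-side.
val monthAgo = new java.util.Date(System.currentTimeMillis() - 30L * 24 * 3600 * 1000)

val activeUsers = sc.cassandraTable("my_keyspace", "user_events")
  .select("user_id", "timestmp")
  .filter(_.getDate("timestmp").after(monthAgo))   // full scan, client-side filter
  .map(_.getString("user_id"))
  .distinct()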

Your table should be as follows:

create table user_events (
  start timestamp,
  eventtime timestamp,
  user_id text,
  event text,
  primary key ((start), eventtime, user_id)
);

You can partition the incoming events by day or week, depending on your events-per-second rate, so that you do not exceed Cassandra's wide-row limit (a hard limit of two billion cells per partition, with a much lower practical limit).
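For example, with daily buckets the start value can be computed from the event time at write time. A minimal sketch (the helper name is mine):

import java.time.Instant
import java.time.temporal.ChronoUnit

// Truncate an event time to midnight UTC; the result becomes the `start`
// partition key, so each partition holds at most one day of events.
def dayBucket(eventTime: Instant): Instant =
  eventTime.truncatedTo(ChronoUnit.DAYS)

// e.g. dayBucket of 2015-06-17T14:23:05Z is 2015-06-17T00:00:00Z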

And then in Spark, you can query using:

sc.cassandraTable(keyspace, table)
  .select("user_id")
  .where("start IN ?", <a set of partitions>)
  .where("eventtime > ?", <one month before timestamp>)