6 votes

I recently ran into a case where Cassandra fits perfectly: storing time-based events with a custom TTL per event type. (The alternatives would be to store the events in Hadoop and do the bookkeeping, TTLs and so on, by hand, which IMHO is a very complex idea, or to switch to HBase.) The question is how well Cassandra's MapReduce support works out of the box, without the DataStax Enterprise edition.
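For concreteness, the write path I have in mind looks roughly like this through the raw Thrift API (keyspace, column family, row key layout, and TTL value are placeholders I made up):

```java
import java.nio.ByteBuffer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class EventWriter {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("Events"); // hypothetical keyspace

        long now = System.currentTimeMillis();
        Column column = new Column();
        column.setName(Long.toString(now).getBytes("UTF-8")); // column name = event timestamp
        column.setValue("event payload".getBytes("UTF-8"));
        column.setTimestamp(now * 1000); // microseconds by convention
        column.setTtl(7 * 24 * 3600);    // per-event-type TTL, e.g. keep "click" events for 7 days

        // Row key = event type plus a time bucket (a made-up layout for illustration).
        client.insert(ByteBuffer.wrap("click:2012060112".getBytes("UTF-8")),
                new ColumnParent("events"), column, ConsistencyLevel.ONE);
        transport.close();
    }
}
```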

It seems that they invested a lot in CassandraFS, but I ask myself whether the plain Pig CassandraLoader is actively maintained and actually scales, since it seems to do nothing more than iterate over the rows in slices. Does this work for hundreds of millions of rows?
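As far as I can tell, that row iteration boils down to paged get_range_slices calls, roughly like this sketch (column family name and page sizes are arbitrary):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class RowScanner {
    // Walks every row of a column family, 1000 keys per page, in token order.
    static void scan(Cassandra.Client client) throws Exception {
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));

        byte[] start = new byte[0]; // empty key = start of the ring
        while (true) {
            KeyRange range = new KeyRange(1000);
            range.setStart_key(start);
            range.setEnd_key(new byte[0]);
            List<KeySlice> page = client.get_range_slices(
                    new ColumnParent("events"), predicate, range, ConsistencyLevel.ONE);
            for (KeySlice row : page) {
                // The start key is returned again on every page after the first; skip it.
                if (start.length > 0 && Arrays.equals(row.getKey(), start)) continue;
                // process row.getKey() / row.getColumns() here
            }
            if (page.size() < 1000) break; // short page = end of the ring
            start = page.get(page.size() - 1).getKey();
        }
    }
}
```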


2 Answers

1 vote

You can map/reduce using the RandomPartitioner, but of course the keys you get back are in random order. You probably want to use consistency level ONE (CL = 1) in Cassandra so you don't have to read from two nodes on every request during the map/reduce; that way each task should read its local data. I have not used Pig, though.
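Something like this job setup, sketched after the word_count example that ships with the Cassandra source (the ConfigHelper method names are from the 1.x line and may differ across versions; the keyspace/column family names and output path are placeholders):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCountJob {
    // Counts rows; each map task reads the token ranges local to its node.
    public static class RowCounter
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("rows"), new LongWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "event-count");
        job.setJarByClass(EventCountJob.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setMapperClass(RowCounter.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/event-count"));

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "localhost");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "Events", "events"); // placeholder names
        ConfigHelper.setInputSlicePredicate(conf, new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0),
                        false, Integer.MAX_VALUE)));
        ConfigHelper.setReadConsistencyLevel(conf, "ONE"); // read from the local replica only

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```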

-2 votes

Why not HBase? HBase is more suitable for time-series data: you can easily put billions of rows on a very small cluster and get up to 500k rows per second (up to 50 MB/s) on a small 3-node cluster with the WAL enabled. Cassandra has several flaws:

  1. In Cassandra you are effectively restricted by the number of row keys (imagine how long repair would run with billions of rows). So you will design a schema that 'shards' your time into buckets of, say, 1 hour, and stores the actual timestamps as columns (see the sketch after this list). But such a scheme doesn't scale well because of the high risk of 'huge rows' (rows with an enormous number of columns).
  2. Another problem: you can't map/reduce over a range of data in Cassandra unless you use the ordered partitioner, which is not an option at all because of its inability to balance well.
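To make point 1 concrete, the usual bucketing workaround looks roughly like this (bucket width and key layout are arbitrary choices for illustration):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeBucket {
    // Row key = event type + hour bucket; the column name inside the row
    // is the full event timestamp, so all events of one hour land in one row.
    static String rowKey(String eventType, long timestampMillis) {
        SimpleDateFormat hourFmt = new SimpleDateFormat("yyyyMMddHH");
        hourFmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return eventType + ":" + hourFmt.format(new Date(timestampMillis));
    }

    public static void main(String[] args) {
        System.out.println(rowKey("click", System.currentTimeMillis())); // e.g. click:2012060112
        // A busy event type still funnels a whole hour into a single row,
        // which is exactly the 'huge row' scaling risk described in point 1.
    }
}
```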