I have an application that writes several billion records into Cassandra and removes duplicates by key. It then groups them by other fields, such as title, in successive phases so that further processing can be done on groups of similar records. The application is distributed over a cluster of machines because I need it to finish in a reasonable time (hours, not weeks).
One phase of the application writes the records into Cassandra using the hector client, storing them in a column family keyed by each record's primary key. The column timestamp is set to the record's last-update date so that only the latest record for each key survives.
Later phases need to read everything back out of Cassandra, perform some processing on the records, and write them back to a different column family under various other keys so that they can be grouped.
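For reference, the write path looks roughly like this (a simplified sketch, not the exact production code; the "Records" column family and the field/parameter names are placeholders):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    // Sketch: column family name and record fields are placeholders.
    void writeRecord(Keyspace keyspace, String primaryKey, String title,
                     long lastUpdateMicros) {
        StringSerializer ss = StringSerializer.get();
        Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
        // Using the record's last-update date as the column clock means
        // Cassandra's last-write-wins reconciliation keeps only the newest
        // version of each record per key.
        mutator.addInsertion(primaryKey, "Records",
            HFactory.createColumn("title", title, lastUpdateMicros, ss, ss));
        mutator.execute();
    }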
I accomplish this batch reading by using Cassandra.Client.describe_ring() to figure out which machine in the ring is the primary replica for each TokenRange. I then compare each range's primary replica against the local host to find out which token ranges are owned by the local machine (remote reads are too slow for this kind of batch processing). Once I know which TokenRanges are local to each machine, I get evenly sized splits using Cassandra.Client.describe_splits().
Once I have a bunch of evenly sized splits that can be read from the local Cassandra instance, I read them as fast as I can using Cassandra.Client.get_range_slices() with ConsistencyLevel.ONE so that no remote reads are needed. I fetch 100 rows at a time, sequentially through the whole TokenRange (I have tried various batch sizes, and 100 works best for this app).
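The split discovery logic is essentially the following (again a simplified sketch against the 0.7-era Thrift API; the keyspace name, column family name, and keys-per-split figure are placeholders):

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.TokenRange;

    // Returns [startToken, endToken] pairs for splits whose primary replica
    // is this machine. Simplified: assumes the node has one address.
    List<String[]> findLocalSplits(Cassandra.Client client) throws Exception {
        String local = InetAddress.getLocalHost().getHostAddress();
        List<String[]> splits = new ArrayList<String[]>();
        for (TokenRange range : client.describe_ring("MyKeyspace")) {
            // endpoints lists the replicas; the first entry is the primary
            if (!range.getEndpoints().get(0).equals(local))
                continue; // ranges owned elsewhere: remote reads are too slow
            // describe_splits returns the boundary tokens of evenly sized
            // sub-ranges; consecutive pairs define the splits
            List<String> tokens = client.describe_splits(
                "Records", range.getStart_token(), range.getEnd_token(), 1000);
            for (int i = 0; i < tokens.size() - 1; i++)
                splits.add(new String[] { tokens.get(i), tokens.get(i + 1) });
        }
        return splits;
    }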
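Each batch read then looks roughly like this (simplified; assumes set_keyspace() has already been called on the client, and the column family name is a placeholder):

    import java.nio.ByteBuffer;
    import java.util.List;
    import org.apache.cassandra.thrift.*;

    // Reads one 100-row page of a split. Paging continues by setting
    // start_token to the token of the last key returned (its MD5 under
    // RandomPartitioner) and repeating until the split is exhausted.
    List<KeySlice> readPage(Cassandra.Client client, String startToken,
                            String endToken) throws Exception {
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
            ByteBuffer.allocate(0), ByteBuffer.allocate(0), false,
            Integer.MAX_VALUE)); // all columns of each row
        KeyRange range = new KeyRange(100); // 100 rows per batch worked best
        range.setStart_token(startToken);
        range.setEnd_token(endToken);
        // ConsistencyLevel.ONE keeps the read on the local replica
        return client.get_range_slices(new ColumnParent("Records"),
            predicate, range, ConsistencyLevel.ONE);
    }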
This all worked great on Cassandra 0.7.0 with a little bit of tuning to memory sizes and column family configs. I could read between 4000 and 5000 records per second in this way, and kept the local disks working about as hard as they could.
Here is an example of the splits and the speed I would see under Cassandra 0.7.0:
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 20253030905057371310864605462970389448 : 21603066481002044331198075418409137847
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 21603066481002044331198075418409137847 : 22954928635254859789637508509439425340
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 22954928635254859789637508509439425340 : 24305566132297427526085826378091426496
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 24305566132297427526085826378091426496 : 25656389102612459596423578948163378922
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 25656389102612459596423578948163378922 : 27005014429213692076328107702662045855
10/12/20 20:13:08 INFO m4.BulkCassandraReader: split - 27005014429213692076328107702662045855 : 28356863910078000000000000000000000000
10/12/20 20:13:18 INFO m4.TagGenerator: 42530 records read so far at a rate of 04250.87/s
10/12/20 20:13:28 INFO m4.TagGenerator: 90000 records read so far at a rate of 04498.43/s
10/12/20 20:13:38 INFO m4.TagGenerator: 135470 records read so far at a rate of 04514.01/s
10/12/20 20:13:48 INFO m4.TagGenerator: 183946 records read so far at a rate of 04597.16/s
10/12/20 20:13:58 INFO m4.TagGenerator: 232105 records read so far at a rate of 04640.62/s
When I upgraded to Cassandra 0.7.2 I had to rebuild the configs because there were a few new options, but I took care to carry over all of the relevant tuning settings from the 0.7.0 configs that had worked. However, with the new version of Cassandra I can barely read 50 records per second.
Here is an example of the splits and the speed I see now under Cassandra 0.7.2:
21:02:29.289 [main] INFO c.p.m.a.batch.BulkCassandraReader - split - 50626015574749929715914856324464978537 : 51655803550438151478740341433770971587
21:02:29.290 [main] INFO c.p.m.a.batch.BulkCassandraReader - split - 51655803550438151478740341433770971587 : 52653823936598659324985752464905867108
21:02:29.290 [main] INFO c.p.m.a.batch.BulkCassandraReader - split - 52653823936598659324985752464905867108 : 53666243390660291830842663894184766908
21:02:29.290 [main] INFO c.p.m.a.batch.BulkCassandraReader - split - 53666243390660291830842663894184766908 : 54679285704932468135374743350323835866
21:02:29.290 [main] INFO c.p.m.a.batch.BulkCassandraReader - split - 54679285704932468135374743350323835866 : 55681782994511360383246832524957504246
21:02:29.291 [main] INFO c.p.m.a.batch.BulkCassandraReader - split - 55681782994511360383246832524957504246 : 56713727820156410577229101238628035242
21:09:06.910 [Thread-0] INFO c.p.m.assembly.batch.TagGenerator - 100 records read so far at a rate of 00000.25/s
21:13:00.953 [Thread-0] INFO c.p.m.assembly.batch.TagGenerator - 10100 records read so far at a rate of 00015.96/s
21:14:53.893 [Thread-0] INFO c.p.m.assembly.batch.TagGenerator - 20100 records read so far at a rate of 00026.96/s
21:16:37.451 [Thread-0] INFO c.p.m.assembly.batch.TagGenerator - 30100 records read so far at a rate of 00035.44/s
21:18:35.895 [Thread-0] INFO c.p.m.assembly.batch.TagGenerator - 40100 records read so far at a rate of 00041.44/s
As you can probably see from the logs, the code moved to a different package, but other than that it has not changed. It is running on the same hardware, and all memory settings are the same.
I could understand some performance difference between Cassandra versions, but something as earth-shattering as this (roughly a 100x performance drop) suggests I must be missing something fundamental. Even before I tuned the column families and memory settings on 0.7.0, it was never THAT slow.
Does anyone know what could account for this? Is there some tuning setting I might be missing that would be likely to cause this? Did something undocumented change in the Cassandra functions that support Hadoop? Reading through the release notes, I can't find anything that would explain this. Any help fixing this, or even just an explanation of why it may have stopped working, would be appreciated.