Export large amount of data from Cassandra to CSV

Question

I'm using Cassandra 2.0.9 for store quite big amounts of data, let's say 100Gb, in one column family. I would like to export this data to CSV in fast way. I tried:

sstable2json - it produces quite big json files which are hard to parse - because tool puts data in one row and uses complicated schema (ex. 300Mb Data file = ~2Gb json), it takes a lot of time to dump and Cassandra likes to change source file names according its internal mechanism
COPY - causes timeouts on quite fast EC2 instances for big number of records
CAPTURE - like above, causes timeouts
reads with pagination - I used timeuuid for it, but it returns about 1,5k records per second

I use Amazon Ec2 instance with fast storage, 15 Gb of RAM and 4 cores

Is there any better option for export gigabytes of data from Cassandra to CSV?

Have you considered making your own little contraption for this ? Using datastax driver you could easily make requests that extract you data then serialize them in csv file(s) with little to no java code ? This would ensure you to get the exact result you want (for a little effort though). — Ar3s
Moreover, I don't get neither the method nor the problem on the "reads with pagination". — Ar3s
reads with pagination - using python driver I tried to read content using limit (tested values 100 - 10000, based on TimeUuid) and offset, it was really slow, Cassandra was able to read about 1,5k of records per second on 3 instances and replication factor 2 I cannot imagine, that simply using driver will makes possible to build fast read, because for each row Cassandra has to check on which node data are stored. — KrzysztofZalasa

Alex Ott Alex Ott · Accepted Answer · 2020-06-11T07:54:08

Update for 2020th: DataStax provides a special tool called DSBulk for loading and unloading of data from Cassandra (starting with Cassandra 2.1), and DSE (starting with DSE 4.7/4.8). In simplest case, the command line looks as following:

dsbulk unload -k keyspace -t table -url path_to_unload

DSBulk is heavily optimized for loading/unloading operations, and has a lot of options, including import/export from/to compressed files, providing the custom queries, etc.

There is a series of blog posts about DSBulk, that could provide more information & examples: 1, 2, 3, 4, 5, 6

Export large amount of data from Cassandra to CSV

3 Answers