I have 2 Cassandra clusters in different datacenters (note that these are 2 separate clusters, NOT a single cluster with multi-DC), and both clusters have the same keyspace and column family models. I wish to copy the data of column family C from cluster A to cluster B in the most efficient way. I was able to copy some other column families with get and put operations, since they were time series and the keys were sequential, but I could not copy this column family C that way. I'm using Thrift and pycassa. I've tried the CQL COPY command, but unfortunately the CF is too large and I get an rpc_timeout. How can I accomplish this?
3 Answers
If you just want to do this as a one-time thing, take a snapshot and use sstableloader to load it into the other cluster. If you want to keep loading new data over time, turn on incremental_backups, take a snapshot to load the initial data, and then periodically grab the sstables out of the incremental backups and run them through sstableloader to keep things up to date.
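A minimal sketch of the one-time path, assuming a keyspace named my_keyspace, a table my_table, and example target IPs 10.1.0.1,10.1.0.2 (all of these are placeholders, not from the question):

```bash
# On a node in cluster A: snapshot just the one table.
# The snapshot files land under the table's data directory,
# in a snapshots/copy_to_b subfolder.
nodetool snapshot -t copy_to_b -cf my_table my_keyspace

# Copy the snapshot's sstable files (e.g. with rsync/scp) to a machine
# that can reach cluster B, into a directory laid out as <keyspace>/<table>/,
# then stream them into cluster B:
sstableloader -d 10.1.0.1,10.1.0.2 /path/to/my_keyspace/my_table
```

sstableloader streams the data to all replicas according to cluster B's own topology, which is why it works across two independent clusters as long as the schema already exists on the target.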
From time to time I also need to copy data from one Cassandra cluster to another. I use this tool: https://github.com/masumsoft/cassandra-exporter. The export.js script exports data to JSON files, and the import.js script imports the exported data into Cassandra. You can do this for all tables in a specified keyspace or for a particular table only. The target keyspace and tables must exist before the import.
In the scripts you can adjust the batch size and readTimeout if you get a "read timeout error".
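Roughly what a run looks like; the host/keyspace values are placeholders and the flag spellings are assumptions, so check the repo's README for the exact invocation of your version:

```bash
git clone https://github.com/masumsoft/cassandra-exporter.git
cd cassandra-exporter
npm install

# Export every table in a keyspace to JSON files
# (--host/--keyspace flag names are assumptions -- see the repo README)
node export.js --host 10.0.0.1 --keyspace my_keyspace

# Import into the target cluster
# (the keyspace and tables must already exist there)
node import.js --host 10.1.0.1 --keyspace my_keyspace
```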
UPDATE: After a hint by Alex Ott I tried the DSBulk tool. It works great, but only for one table per run. If you want to process a full keyspace, you need a script that runs DSBulk for each table.
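A sketch of such a script, assuming dsbulk and cqlsh are on the PATH and using placeholder hosts and a hypothetical keyspace my_keyspace; it lists the tables from the source cluster's schema, then unloads each one to CSV and loads it into the target:

```bash
#!/usr/bin/env bash
# Placeholder hosts/keyspace -- substitute your own.
SRC=10.0.0.1
DST=10.1.0.1
KS=my_keyspace

# List the keyspace's tables via the source cluster's system schema.
# The tail/head trimming strips cqlsh's header and row-count lines and is
# approximate (GNU coreutils assumed); adjust for your cqlsh output.
tables=$(cqlsh "$SRC" -e \
  "SELECT table_name FROM system_schema.tables WHERE keyspace_name='$KS';" \
  | tail -n +4 | head -n -2)

for t in $tables; do
  # One DSBulk run per table: unload to CSV, then load into the target.
  dsbulk unload -h "$SRC" -k "$KS" -t "$t" -url "./export/$t"
  dsbulk load   -h "$DST" -k "$KS" -t "$t" -url "./export/$t"
done
```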