
I want to load a large CSV file into my Cassandra cluster (one node at the moment).

Based on: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
My data is transformed by CQLSSTableWriter into SSTable files, then I use sstableloader to load those SSTables into a Cassandra table that already contains some data.
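For reference, the second step can be sketched as follows (the output directory, keyspace, and table names here are placeholders, not taken from my actual setup):

```shell
# Step 1 is done in Java via CQLSSTableWriter, which writes the CSV rows
# out as SSTables under a directory laid out as <output>/<keyspace>/<table>/.

# Step 2: stream the generated SSTables into the cluster; -d takes one or
# more initial contact points.
sstableloader -d 127.0.0.1 ./output/myks/mytable
```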

The CSV file contains many different partition keys.
Now let's assume that a multi-node Cassandra cluster is used.

My questions:
1) Is the loading procedure I use also correct for a multi-node cluster?
2) Will the SSTable files be split by sstableloader and sent to the nodes responsible for the specific partition keys?

Thank you

How big is your CSV file? - Raman Yelianevich
Let's assume that my CSV has 100*10^6 rows - I mean that it is quite a big file, and using the CQLSH COPY command is not recommended (as described here: datastax.com/documentation/cql/3.1/cql/cql_reference/…): "COPY FROM is intended for importing small datasets (a few million rows or less) into Cassandra. For importing larger datasets, use the Cassandra bulk loader." - fuggy_yama
I use Cassandra v2.0.11 - fuggy_yama

2 Answers


1) Loading into a single-node cluster or a 100-node cluster works the same way. The only difference is that the data will be distributed around the ring if you have a multi-node cluster. The node where you run sstableloader becomes the coordinator (as @rtumaykin already stated) and will send the writes to the appropriate nodes.

2) No. As in my response above, the "splitting" is done by the coordinator. Think of the sstableloader utility as just another instance of a client sending writes to the cluster.

3) In response to your follow-up question: the sstableloader utility is not sending files to nodes, but writes of the rows contained in those SSTables. It reads the data and sends write requests to the cluster.
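To make the "just another client" analogy concrete, a typical invocation looks like this (hostnames and paths are placeholders):

```shell
# The nodes given to -d are only the initial contact points. From there,
# each row is forwarded to the nodes that own its partition key, exactly
# as with any other client write.
sstableloader -d node1.example.com,node2.example.com /path/to/myks/mytable
```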

  1. Yes.
  2. It will actually be done by the coordinator node, not by sstableloader.