4
votes

I have a single node instance of Cassandra. I have been using batch statements to insert a lot of data into it using the datastax driver in Java.

After a certain point during the insert I get a NoHostAvailableException, yet I can still connect to the node through cql and execute statements. The Cassandra logs warned me that the batches were too large. When I lowered the batch size to the recommended level I still got the same error, and there appear to be no other errors in the Cassandra log file.

Has anyone encountered this error before? I feel like there is something in cassandra.yaml that I am missing.

2
You should not use large batches. - jorgebg

2 Answers

7
votes

I had issues very similar to yours and resolved them here: Cassandra cluster with bad insert performance and insert stability.

The bottom line is that you are simply overloading your node, and that batch inserts are, counterintuitively, not faster than async inserts. Of course, you should limit the number of async inserts in flight with some throttling technique. Also, make sure your network can support your insert rate. I was connected through a low-powered switch, and about half of my problems vanished when I changed the route by which I connect to my server (which is a few rooms away from me).

If that does not help, you should use multiple nodes, with the number depending on your insert rate.
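
To illustrate what "limit your async inserts" could look like, here is a minimal sketch that caps the number of in-flight requests with a semaphore. It uses only the JDK so it runs on its own. The names `ThrottledInserts`, `insertAll`, `insertAsync` and `maxInFlight` are made up for this example; `insertAsync` stands in for whatever asynchronous call your driver version offers (for instance `executeAsync` on the session), adapted to a `CompletableFuture` if needed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class ThrottledInserts {

    /**
     * Sends every row through insertAsync while keeping at most maxInFlight
     * requests outstanding. Returns the number of failed inserts.
     * insertAsync is a hypothetical stand-in for the driver's async call.
     */
    static <T> int insertAll(List<T> rows,
                             Function<T, CompletableFuture<?>> insertAsync,
                             int maxInFlight) {
        if (maxInFlight <= 0) {
            throw new IllegalArgumentException("maxInFlight must be positive");
        }
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger failures = new AtomicInteger();

        for (T row : rows) {
            // Blocks the producer once maxInFlight requests are pending,
            // so the node is never flooded faster than it can answer.
            permits.acquireUninterruptibly();
            CompletableFuture<?> future;
            try {
                future = insertAsync.apply(row);
            } catch (RuntimeException e) {
                // The call failed before a request was even sent.
                failures.incrementAndGet();
                permits.release();
                continue;
            }
            future.whenComplete((result, error) -> {
                if (error != null) {
                    failures.incrementAndGet();
                }
                permits.release(); // frees a slot for the next insert
            });
        }

        // Once every permit can be taken, no request is still in flight.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
        return failures.get();
    }
}
```

A reasonable starting value for `maxInFlight` depends on your hardware and schema, so treat it as something to tune: start low and raise it until the node's latency or error rate begins to climb. In real code you would probably also retry or log the failed rows rather than only counting them.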

2
votes

The batch keyword in Cassandra is not a performance optimization for batching together large buckets of data for bulk loads.

Batches are used to group operations that you expect to occur together into one atomic unit (if one write fails, they all fail). Batches guarantee that if a single part of your batch is successful, the entire batch is successful.

Using batches will probably not make your mass ingestion run faster.

Cassandra uses a mechanism called batch logging to ensure a batch's atomicity. By specifying an unlogged batch, you turn this functionality off, so the batch is no longer atomic and may fail with partial completion. Naturally, there is a performance penalty for logging your batches and ensuring their atomicity; using unlogged batches removes this penalty.

There are some cases in which you may want to use unlogged batches to ensure that requests (inserts) that belong to the same partition are sent together. If you batch operations together and they need to be performed in different partitions or on different nodes, you are essentially creating more work for your coordinator. See specific examples of this in Ryan's blog:

Read this post
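
As a rough sketch of the "same partition" idea above, the following JDK-only helper groups rows by partition key and splits each group into small chunks, so that every resulting chunk could be sent as one unlogged batch that touches a single partition. The names `PartitionBatcher`, `groupByPartition`, `partitionKeyOf` and `maxBatchSize` are invented for this illustration, and building the actual batch statement from each chunk is left to your driver code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PartitionBatcher {

    /**
     * Groups rows by partition key, then cuts each group into chunks of at
     * most maxBatchSize rows. Every returned chunk contains rows from exactly
     * one partition, so a batch built from it stays on a single replica set.
     */
    static <K, R> List<List<R>> groupByPartition(List<R> rows,
                                                 Function<R, K> partitionKeyOf,
                                                 int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }

        // LinkedHashMap keeps partitions in first-seen order.
        Map<K, List<R>> byPartition = new LinkedHashMap<>();
        for (R row : rows) {
            byPartition
                .computeIfAbsent(partitionKeyOf.apply(row), k -> new ArrayList<>())
                .add(row);
        }

        List<List<R>> batches = new ArrayList<>();
        for (List<R> partitionRows : byPartition.values()) {
            for (int start = 0; start < partitionRows.size(); start += maxBatchSize) {
                int end = Math.min(start + maxBatchSize, partitionRows.size());
                // Copy the slice so the chunk is independent of the group list.
                batches.add(new ArrayList<>(partitionRows.subList(start, end)));
            }
        }
        return batches;
    }
}
```

Keeping `maxBatchSize` small matters here too: the "batch too large" warning from the question is based on the size of the batch, so even single-partition batches should stay well under the configured threshold.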