
I am running a Spark job that loads some data from a Cassandra table. From that data, I build some insert and delete statements and execute them (using forEach):

boolean deleteStatus = connector.openSession().execute(delete).wasApplied();
boolean insertStatus = connector.openSession().execute(insert).wasApplied();
System.out.println(delete + ":" + deleteStatus);
System.out.println(insert + ":" + insertStatus);

When I run it locally, I see the expected results in the table.

However, when I run it on a cluster, the changes are sometimes applied and sometimes not. In the stdout shown in the Spark web UI, both queries are printed along with `true`, and the data is loaded correctly. Yet sometimes only the insert is reflected in the table, sometimes only the delete, sometimes both, and most of the time neither.

Specifications:

  1. Spark slaves run on the same machines as the Cassandra nodes (each node runs two slave instances).
  2. The Spark master runs on a separate machine.
  3. Repair has been run on all nodes.
  4. Cassandra has been restarted.

1 Answer


boolean deleteStatus = connector.openSession().execute(delete).wasApplied();

boolean insertStatus = connector.openSession().execute(insert).wasApplied();

This is a known anti-pattern: you create a new Session object for each query, which is extremely expensive.

Just create the session once and re-use it for all the queries.
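For example, if the statements are executed inside a `foreachPartition`, the session could be opened once per partition and reused, roughly like this (a sketch only: the names `rdd`, `buildDelete`, and `buildInsert` are placeholders, not from your code):

```java
// Sketch: one Session per partition instead of one per statement.
// `connector` is the CassandraConnector from your code; `buildDelete`
// and `buildInsert` stand in for however you build the statements.
rdd.foreachPartition(rows -> {
    try (Session session = connector.openSession()) {   // opened once
        while (rows.hasNext()) {
            Row row = rows.next();
            boolean deleteStatus = session.execute(buildDelete(row)).wasApplied();
            boolean insertStatus = session.execute(buildInsert(row)).wasApplied();
            System.out.println(deleteStatus + ":" + insertStatus);
        }
    }  // session is closed automatically when the partition is done
});
```

The try-with-resources block also guarantees the session is closed even if a statement throws.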

To see which queries are being executed and sent to Cassandra, use the slow query logger feature as a hack: http://datastax.github.io/java-driver/manual/logging/#logging-query-latencies

The idea is to set the threshold to a ridiculously low value so that every query will be considered slow and displayed in the log.
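With the Java driver 3.x, this could look roughly like the following sketch (the 1 ms threshold is arbitrary; you also need to enable DEBUG for the `com.datastax.driver.core.QueryLogger.SLOW` logger in your logging configuration for the output to appear):

```java
// Sketch: register a QueryLogger with a very low constant threshold so
// that effectively every query is reported as "slow" and logged.
QueryLogger queryLogger = QueryLogger.builder()
        .withConstantThreshold(1)   // 1 ms: nearly every query exceeds this
        .build();
cluster.register(queryLogger);
```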

You should use this hack only for testing, of course.