
I'm working on loading a table from a Cassandra cluster into a Spark cluster with the DataStax Cassandra Spark Connector. The Spark program performs a simple map/reduce job that counts the number of rows in the Cassandra table. Everything is set up and run locally.

The Spark program works for a small Cassandra table whose only column is a string key. When we load another table that has a string id column and a blob column containing file data, we get several errors (a futures timeout error in the Spark workers and a Java out-of-memory exception on the stdout of the driver program).

My question is whether Spark can load rows containing blobs of around 1 MB from Cassandra and run map/reduce jobs on them, or whether such rows are supposed to be divided into much smaller pieces before being processed by a Spark map/reduce job.
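For reference, here is a minimal sketch of the kind of job I mean, assuming a local master, a Cassandra node on 127.0.0.1, and placeholder keyspace, table, and column names ("my_ks", "files", "id", "data") rather than the real ones:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object RowCount {
  def main(args: Array[String]): Unit = {
    // Placeholder local setup; host, master, and names are assumptions.
    val conf = new SparkConf()
      .setAppName("RowCount")
      .setMaster("local[4]")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Load the table as an RDD of CassandraRow via the connector.
    val rows = sc.cassandraTable("my_ks", "files")

    // Simple map/reduce over the table: count rows and total blob bytes.
    val count = rows.count()
    val totalBytes = rows
      .map(row => row.getBytes("data").remaining().toLong)
      .reduce(_ + _)

    println(s"rows=$count, blob bytes=$totalBytes")
    sc.stop()
  }
}
```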


1 Answer


Originally I was using 'sbt run' to start the application.

Once I launched the application with spark-submit instead, everything worked fine. So yes, files under 10 MB can be stored in a column of type blob. The Spark map/reduce job ran quickly on 200 rows.
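For reference, a hedged example of the kind of spark-submit invocation involved; the class name, jar path, master URL, and memory sizes below are placeholders, not my actual values:

```
spark-submit \
  --class RowCount \
  --master local[4] \
  --driver-memory 2g \
  --executor-memory 2g \
  target/scala-2.11/rowcount-assembly-0.1.jar
```

One likely reason this helps is that spark-submit sizes the driver JVM heap via --driver-memory before the driver starts, whereas 'sbt run' runs the driver inside sbt's own JVM with whatever heap sbt happens to have.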