I use:
- Cassandra 2.1.12 (3 nodes)
- Spark 1.6 (3 nodes)
- Spark Cassandra Connector 1.6
I use manually assigned tokens in Cassandra (not vnodes).
I am writing a simple job that reads data from a Cassandra table and prints its count. The table has around 70 million rows, and the job takes 15 minutes.
When I read the data and check the number of partitions of the RDD, it is around 21,000, which is far too many. How can I control this number?
I have tried splitCount and split.size.in.mbs, but both leave the number of partitions unchanged.
Any suggestions?
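For what it's worth, the connector also accepts these settings per RDD through its ReadConf class rather than the global SparkConf. A minimal sketch of what I would expect to work, assuming connector 1.6's ReadConf API (the splitCount value of 200 is just an example):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

object SplitCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196"))
    // splitCount, when set, takes precedence over splitSizeInMB and fixes
    // the number of Spark partitions the token ring is divided into.
    val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")
      .withReadConf(ReadConf(splitCount = Some(200))) // 200 is an arbitrary example
    println("number of partitions: " + rdd.partitions.length)
  }
}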
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object Hi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196")
      .set("spark.cassandra.input.split.size_in_mb", "64") // 64 MB is the connector default
    val sc = new SparkContext(conf)
    val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")
    // rdd.partitions is an Array, so print its length rather than the array reference
    println("number of partitions: " + rdd.partitions.length)
    println("row count: " + rdd.count)
  }
}
This is my code, for reference. After running nodetool compact I am now able to control the number of partitions, but the whole process still takes almost 6 minutes, which I think is too long. Any suggestions for improvement?
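Since the job only needs the row count, one option worth noting: the connector can push the count down to Cassandra so the full rows never travel to Spark. A sketch using cassandraCount, which connector 1.6 provides on the RDD returned by cassandraTable:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196"))
    // cassandraCount issues SELECT count(*) per token range inside Cassandra,
    // so only the per-range counts are shipped back to Spark and summed.
    val total = sc.cassandraTable("cw", "usedcareventsbydatecookienew").cassandraCount()
    println("row count: " + total)
  }
}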