I use:
- Cassandra 2.1.12 (3 nodes)
- Spark 1.6 (3 nodes)
- Spark Cassandra Connector 1.6
I use manually assigned tokens in Cassandra (not vnodes).
I am writing a simple job that reads data from a Cassandra table and prints its count. The table has around 70 million rows, and the job takes 15 minutes.
When I read the data and check the number of partitions of the RDD, it is around 21,000, which is far too many. How can I control this number?
I have tried splitCount and split.size.in.mbs, but both leave the number of partitions unchanged.
Any suggestions?
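For what it's worth, the connector also accepts these settings per RDD through its ReadConf class rather than the global SparkConf. A minimal sketch of what I would expect to work, assuming connector 1.6's ReadConf API (the splitCount value of 200 is just an example):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

object SplitCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196"))
    // splitCount, when set, takes precedence over splitSizeInMB and fixes
    // the number of Spark partitions the token ring is divided into.
    val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")
      .withReadConf(ReadConf(splitCount = Some(200))) // 200 is an arbitrary example
    println("number of partitions: " + rdd.partitions.length)
  }
}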
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object Hi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196")
      .set("spark.cassandra.input.split.size_in_mb", "64") // 64 MB is the connector default
    val sc = new SparkContext(conf)
    val rdd = sc.cassandraTable("cw", "usedcareventsbydatecookienew")
    // rdd.partitions is an Array, so print its length rather than the array reference
    println("number of partitions: " + rdd.partitions.length)
    println("row count: " + rdd.count)
  }
}
This is my code, for reference. After running nodetool compact I am now able to control the number of partitions, but the whole process still takes almost 6 minutes, which I think is too long. Any suggestions for improvement?
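Since the job only needs the row count, one option worth noting: the connector can push the count down to Cassandra so the full rows never travel to Spark. A sketch using cassandraCount, which connector 1.6 provides on the RDD returned by cassandraTable:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf(true)
      .set("spark.cassandra.connection.host", "172.16.4.196"))
    // cassandraCount issues SELECT count(*) per token range inside Cassandra,
    // so only the per-range counts are shipped back to Spark and summed.
    val total = sc.cassandraTable("cw", "usedcareventsbydatecookienew").cassandraCount()
    println("row count: " + total)
  }
}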