I need to scan a table that contains 100 million records in production. The search will be on the first clustering key: I need to find the unique partition keys for which the first clustering key matches a condition. The table looks like this:
Columns: employeeid, companyname, lastdateloggedin, floorvisted, swipetimestamp
Partition key: employeeid
Clustering keys: companyname, lastdateloggedin
In SQL terms, what I would like to fetch from the table is roughly: select distinct(employeeid), companyname, swipetimestamp from employeedetails where companyname = 'XYZ'. This is the Spark code I am using:
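import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraRow;
import com.datastax.spark.connector.japi.rdd.CassandraJavaRDD;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

SparkConf conf = new SparkConf()
    .set("spark.cassandra.connection.enabled", "true")
    .set("spark.cassandra.auth.username", "XXXXXXXXXX")
    .set("spark.cassandra.auth.password", "XXXXXXXXX")
    .set("spark.cassandra.connection.host", "hostname")
    .set("spark.cassandra.connection.port", "29042")
    // SparkConf.set takes String values, so the factory is passed by class name
    // (ConnectionFactory is our own class; its import is omitted here)
    .set("spark.cassandra.connection.factory", ConnectionFactory.class.getName())
    .set("spark.cassandra.connection.cluster_name", "ZZZZ")
    .set("spark.cassandra.connection.application_name", "ABC")
    .set("spark.cassandra.connection.local_dc", "DC1")
    .set("spark.cassandra.connection.cachedClusterFile", "/tmp/xyz/test.json")
    .set("spark.cassandra.connection.ssl.enabled", "true")
    .set("spark.cassandra.input.fetch.size_in_rows", "10000")
    .set("spark.driver.allowMultipleContexts", "true")
    .set("spark.cassandra.connection.ssl.trustStore.path", "sampleabc-spark-util/src/main/resources/x.jks")
    .set("spark.cassandra.connection.ssl.trustStore.password", "cassandrasam");

// Assumed: jsc is the JavaSparkContext built from the conf above
JavaSparkContext jsc = new JavaSparkContext(conf);

// Read only the needed columns and push the companyname filter down to Cassandra
CassandraJavaRDD<CassandraRow> ctable = javaFunctions(jsc)
    .cassandraTable("keyspacename", "employeedetails")
    .select("employeeid", "companyname", "swipetimestamp")
    .where("companyname = ?", "XYZ");

// distinct() shuffles the rows; collect() then pulls every remaining row into the driver
List<CassandraRow> cassandraRows = ctable.distinct().collect();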
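One thing I am unsure about in the snippet above: as far as I understand, distinct() compares whole rows, so an employee with several swipetimestamp values for the same company would still come back several times. The plain-Java sketch below only illustrates that difference on in-memory data. It does not use Spark or the connector, and the Row class, the method names and the sample values are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class DistinctSketch {

    // Hypothetical stand-in for the three selected columns of one table row.
    public static final class Row {
        public final String employeeId;
        public final String companyName;
        public final long swipeTimestamp;

        public Row(String employeeId, String companyName, long swipeTimestamp) {
            this.employeeId = employeeId;
            this.companyName = companyName;
            this.swipeTimestamp = swipeTimestamp;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Row)) return false;
            Row other = (Row) o;
            return employeeId.equals(other.employeeId)
                && companyName.equals(other.companyName)
                && swipeTimestamp == other.swipeTimestamp;
        }

        @Override
        public int hashCode() {
            return Objects.hash(employeeId, companyName, swipeTimestamp);
        }
    }

    // Whole-row distinct: two rows are duplicates only if every column matches.
    public static List<Row> distinctRows(List<Row> rows, String company) {
        LinkedHashSet<Row> seen = new LinkedHashSet<>();
        for (Row r : rows) {
            if (r.companyName.equals(company)) seen.add(r);
        }
        return new ArrayList<>(seen);
    }

    // Key-based distinct: one row per employeeId, keeping the latest swipe.
    public static List<Row> latestPerEmployee(List<Row> rows, String company) {
        Map<String, Row> latest = new LinkedHashMap<>();
        for (Row r : rows) {
            if (!r.companyName.equals(company)) continue;
            latest.merge(r.employeeId, r,
                (a, b) -> a.swipeTimestamp >= b.swipeTimestamp ? a : b);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("e1", "XYZ", 100L),
            new Row("e1", "XYZ", 200L),
            new Row("e2", "XYZ", 150L),
            new Row("e3", "ABC", 175L));
        System.out.println(distinctRows(rows, "XYZ").size());      // prints 3
        System.out.println(latestPerEmployee(rows, "XYZ").size()); // prints 2
    }
}
```

If one row per employeeid is what is actually needed, the Spark job would presumably have to key the rows by employeeid and reduce per key rather than call distinct() on the full rows.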
This code runs in non-production against close to 5 million records. Since this is production, I would like to approach the query with caution. Questions:
- What configuration should be present in my SparkConf?
- Could the Spark job ever bring down the database because of the large table?
- Could running the job starve Cassandra of threads while it runs?