I'm trying to run some basic data exploration using Spark on a Hive table (hosted on CFS via DataStax 4.6). My dataset is about 3.1 GB and I start the spark-shell with dse spark --executor-memory 16g (yes, I do have 16 GB available on my executors). So basically I would type the following into the spark-shell:
val dataset = hc.sql("SELECT * FROM my_hive_table") ;
val data_sample = dataset.sample(false,.01,0) ;
data_sample.cache
and then I would run a count to actually materialize the cache:
data_sample.count
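For reference, here is the whole sequence as a minimal, self-contained sketch. It assumes the HiveContext hc that dse spark provides and the table name from above; it also keeps a reference to the value returned by cache, which in Spark returns the RDD itself:

```scala
// Sketch of the spark-shell session (assumes DSE's HiveContext `hc`
// and that my_hive_table exists; names are taken from the question).
val dataset = hc.sql("SELECT * FROM my_hive_table")

// ~1% sample, without replacement, seed 0.
val data_sample = dataset.sample(false, 0.01, 0)

// cache() returns the same RDD, so the cached reference can be kept explicitly.
val cached = data_sample.cache()

// The first action should materialize the cache; subsequent counts
// would then be expected to read from memory rather than CFS.
cached.count()
cached.count()
```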
but when I check the Spark web UI I see no RDD persisted, and if I run the count again my whole dataset is read from CFS again.
So I tried accessing my dataset directly from CFS as a text file, like this:
val textFile = sc.textFile("cfs:/user/hive/warehouse/my_hive_table/aaaammjj=20150526")
and adapted the previous code to count the number of lines after caching the RDD. This time the RDD is indeed cached, using 7 GB across two workers! From the web UI:
cfs:/user/hive/warehouse/my_hive_table/aaaammjj=20150526 Memory Deserialized 1x Replicated
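For comparison, this is the text-file variant that does persist, again as a sketch (it requires the SparkContext sc from the shell; the path is the one from the question):

```scala
// Read the same partition directly from CFS as lines of text.
val textFile = sc.textFile("cfs:/user/hive/warehouse/my_hive_table/aaaammjj=20150526")

textFile.cache()
textFile.count()  // first count reads from CFS and populates the cache
textFile.count()  // second count is served from memory, per the web UI
```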
Is there any reason why my SchemaRDD is not cached when going through Hive? That would be much more practical, since a SchemaRDD provides... well, the schema.
Thanks for any help.
I also tried assigning the result, as in
val rdd_in_cache = data_sample.cache
but with no success, as well as .cache followed by .cache.setName(""), or .cache.setName("") alone. – Manu