In my use case, I have a Hive table that contains 100 thousand records. Each record represents a raw data file that has to be processed. Processing each raw data file generates a CSV file whose size varies between 10 MB and 500 MB. Ultimately, these CSV files are loaded into a Hive table as a separate process. In my enterprise cluster, it is not advisable to generate a huge amount of data in HDFS at once. Hence, I would prefer to combine these two separate processes into a single one that works through the table in batches of, let's say, 5,000 records at a time.
My question:
Given that my RDD refers to the entire Hive table, how do I execute the raw-data-processing step for every 5,000 records? (Something similar to a for loop that advances by 5,000 records on each iteration.)
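Conceptually, here is a rough PySpark sketch of what I am imagining. The table name `mydb.raw_files`, the column `file_path`, and the function `process_batch` are placeholders for my actual logic; I also don't know whether slicing by `row_number` like this is the right approach:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def process_batch(batch_df):
    # Placeholder: process each raw data file in this batch, then load the
    # generated CSVs into the target Hive table before the next batch starts.
    for row in batch_df.collect():
        pass  # e.g. process the file at row.file_path

# Read the Hive table that lists the raw data files (name is a placeholder).
df = spark.table("mydb.raw_files")

# Assign a sequential index so the table can be sliced into fixed-size batches.
# Ordering by a constant pushes everything through a single partition, which
# may be tolerable here since the table itself is only ~100k small records.
indexed = df.withColumn("row_idx", F.row_number().over(Window.orderBy(F.lit(1))))

batch_size = 5000
total = indexed.count()

# Process the table 5,000 records at a time.
for start in range(1, total + 1, batch_size):
    batch = indexed.filter(
        (F.col("row_idx") >= start) & (F.col("row_idx") < start + batch_size)
    )
    process_batch(batch)
```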