In my use case, I have a Hive table that contains 100 thousand records. Each record represents a raw data file that has to be processed. Processing each raw data file generates a CSV file whose size varies between 10 MB and 500 MB. Ultimately, these CSV files are loaded into a Hive table as a separate process. In my enterprise cluster, it is not advisable to generate a huge amount of data in HDFS at once. Hence, I would prefer to combine these two separate processes into a single one that works through the table in batches of, let's say, 5,000 records at a time.
My question:
Given that my RDD refers to the entire Hive table, how do I execute the raw-data-processing step for every 5,000 records? (Something similar to a for loop that advances by 5,000 records on each iteration.)
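Conceptually, here is a rough PySpark sketch of what I am imagining. The table name `mydb.raw_files`, the column `file_path`, and the function `process_batch` are placeholders for my actual logic; I also don't know whether slicing by `row_number` like this is the right approach:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def process_batch(batch_df):
    # Placeholder: process each raw data file in this batch, then load the
    # generated CSVs into the target Hive table before the next batch starts.
    for row in batch_df.collect():
        pass  # e.g. process the file at row.file_path

# Read the Hive table that lists the raw data files (name is a placeholder).
df = spark.table("mydb.raw_files")

# Assign a sequential index so the table can be sliced into fixed-size batches.
# Ordering by a constant pushes everything through a single partition, which
# may be tolerable here since the table itself is only ~100k small records.
indexed = df.withColumn("row_idx", F.row_number().over(Window.orderBy(F.lit(1))))

batch_size = 5000
total = indexed.count()

# Process the table 5,000 records at a time.
for start in range(1, total + 1, batch_size):
    batch = indexed.filter(
        (F.col("row_idx") >= start) & (F.col("row_idx") < start + batch_size)
    )
    process_batch(batch)
```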