I have a text file containing tens of GB of data, which I need to load from HDFS and parallelize as an RDD. The file describes items in the following format. Note that the parenthesized labels are not present in the actual file (the meaning of each row is implicit) and that a row can contain whitespace separating multiple values:
0001 (id)
1000 1000 2000 (dimensions)
0100 (weight)
0030 (amount)
0002 (id)
1110 1000 5000 (dimensions)
0220 (weight)
3030 (amount)
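For reference, this is roughly the structure I want each 4-line record parsed into (the Item case class and its field names and types are my own choice, not something dictated by the file):

```scala
// Assumed target structure for one 4-line record (names and types are my own choice)
case class Item(id: String, dimensions: Array[Int], weight: Int, amount: Int)

// Parse four consecutive lines (id, dimensions, weight, amount) into one Item
def parseItem(lines: Seq[String]): Item = {
  val Seq(idLine, dimsLine, weightLine, amountLine) = lines
  Item(
    id         = idLine.trim,
    dimensions = dimsLine.trim.split("\\s+").map(_.toInt),
    weight     = weightLine.trim.toInt,
    amount     = amountLine.trim.toInt
  )
}
```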
I reckon that the most immediate approach to parallelizing this file would be to upload it to HDFS from the local filesystem and then create an RDD by executing `sc.textFile(filepath)`. However, in that case the partitioning would depend on the HDFS splits corresponding to the file.
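Concretely, the naive version is just this (the path is a placeholder):

```scala
// Naive load: the partitioning follows the HDFS splits of the file,
// with no regard for the 4-line record boundaries
val lines = sc.textFile("hdfs:///data/items.txt")   // hypothetical path
println(lines.getNumPartitions)                     // determined by the splits, not by records
```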
The problem with this approach is that a partition can contain incomplete items. For example:
Partition 1
0001 (id)
1000 1000 2000 (dimensions)
0100 (weight)
0030 (amount)
0002 (id)
1110 1000 5000 (dimensions)
Partition 2
0220 (weight)
3030 (amount)
Thus, when we call a method on each partition and pass it the corresponding block of lines, it receives an incomplete specification of the item with id 0002, which leads to wrong results from the calculations performed inside that method.
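To make the failure concrete, this is the kind of per-partition processing I have in mind (reusing the parseItem sketch from above); grouping the lines in fours only works if every partition happens to start on a record boundary:

```scala
// Per-partition parsing: if a partition boundary falls inside a 4-line record,
// the groups are misaligned and the trailing partial group fails to match in parseItem
val items = lines.mapPartitions { iter =>
  iter.grouped(4).map(parseItem)
}
```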
What would be the most efficient way to partition or repartition this RDD to avoid this problem? Can I specify the number of lines in each partition to be a multiple of 4? If so, should that be done by Hadoop or by Spark?
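To illustrate what I mean by fixing the number of lines per partition, I was looking at Hadoop's NLineInputFormat. Something like the sketch below is what I imagine (same hypothetical path as above), but I am not sure whether it is the right mechanism, or whether it would be efficient for a file of this size:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat

// Ask Hadoop to build input splits of exactly 4 lines (one item per split);
// any multiple of 4 would presumably also keep records intact
val conf = new Configuration(sc.hadoopConfiguration)
conf.setInt("mapreduce.input.lineinputformat.linespermap", 4)

val itemLines = sc.newAPIHadoopFile(
  "hdfs:///data/items.txt",          // hypothetical path
  classOf[NLineInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
).map { case (_, line) => line.toString }
```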