I am working on parsing different types of files (text,xml,csv etc.) into a specific text file format using spark java API. This output file maintains the order of file header, start tag, data header, data and end tag. All of these element are extracted from input file at some point. I tried to achieve this in below 2 ways:
- Read file to RDD using sparks textFile and perform parsing by using map or mapPartions which returns new RDD.
- Read file using sparks textFile , reduce to 1 partition using coalesce and perform parsing by using mapPartions which returns new RDD.
While I am not concerned about sequencing of actual data, with first approach I am not able to keep the required order of File Header, Start Tag, Data Header and End Tag. The latter works for me, but I know it is not efficient way and may cause problem in case of BIG files.
Is there any efficient way to achieve this?