
I am working on parsing different types of files (text, XML, CSV, etc.) into a specific text file format using the Spark Java API. The output file maintains the order: file header, start tag, data header, data, and end tag. All of these elements are extracted from the input file at some point. I tried to achieve this in the following two ways (both sketched in code below):

  1. Read the file into an RDD using Spark's textFile and parse it with map or mapPartitions, which returns a new RDD.
  2. Read the file using Spark's textFile, reduce it to one partition using coalesce, and parse it with mapPartitions, which returns a new RDD.

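For reference, here is a minimal sketch of both approaches, assuming the Spark 2.x Java API; parseLine is a hypothetical stand-in for the actual parsing logic:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ParsingApproaches {
    // Hypothetical stand-in for the actual parsing logic.
    private static String parseLine(String line) {
        return line.trim();
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParsingApproaches");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input.txt");

            // Approach 1: parse lines in parallel across partitions.
            // Fast, but the relative placement of header/tag lines in the
            // output is no longer under your control.
            JavaRDD<String> parsedParallel =
                    lines.map(ParsingApproaches::parseLine);

            // Approach 2: collapse to a single partition, then parse.
            // Order is preserved, but one executor processes the whole file.
            JavaRDD<String> parsedSequential = lines.coalesce(1)
                    .mapPartitions((Iterator<String> it) -> {
                        List<String> out = new ArrayList<>();
                        while (it.hasNext()) {
                            out.add(parseLine(it.next()));
                        }
                        return out.iterator();
                    });

            parsedParallel.saveAsTextFile("out-parallel");
            parsedSequential.saveAsTextFile("out-sequential");
        }
    }
}
```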
While I am not concerned about the ordering of the actual data, with the first approach I am unable to keep the required order of file header, start tag, data header, and end tag. The second approach works for me, but I know it is not efficient and may cause problems with big files.

Is there any efficient way to achieve this?

1 Answer

You are correct in your assumptions. The second approach simply negates the distributed aspect of your application, so it is not scalable. As for the ordering issue: since partitions are processed asynchronously, there is no way to keep track of order once the data reside on different nodes. What you could do is some preprocessing that removes the need for ordering. That is, merge lines up to the point where line order no longer matters, and only then distribute your file. Unless you can make assumptions about the file structure, such as the number of lines that belong together, I would go with the above.
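As a rough illustration of that preprocessing idea (a sketch only, assuming the Spark 2.x Java API, a fixed-length header of HEADER_LINES lines, and a hypothetical parseLine function): read the small, order-sensitive prefix locally, distribute only the body, and reassemble in the required order when writing.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class OrderedParse {
    // Hypothetical parsing logic for a single data line.
    private static String parseLine(String line) {
        return line.trim();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the order-sensitive prefix (file header, start tag,
        // data header) is a known number of lines at the top of the file.
        final int HEADER_LINES = 3;

        // Read only the prefix with plain Java I/O, keeping its order.
        List<String> header = new ArrayList<>();
        try (BufferedReader br =
                Files.newBufferedReader(Paths.get("input.txt"))) {
            for (int i = 0; i < HEADER_LINES; i++) {
                header.add(br.readLine());
            }
        }

        SparkConf conf = new SparkConf().setAppName("OrderedParse");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Distribute the whole file, then drop the prefix lines by index;
            // the remaining data lines can be parsed in any order.
            JavaRDD<String> body = sc.textFile("input.txt")
                    .zipWithIndex()
                    .filter(t -> t._2() >= HEADER_LINES)
                    .keys();

            JavaRDD<String> parsed = body.map(OrderedParse::parseLine);
            parsed.saveAsTextFile("parsed-data");

            // The header (and an end tag, if any) can then be written
            // separately and concatenated with the parsed data in the
            // required order.
        }
    }
}
```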