1 vote

I have a JavaPairRDD of the following type:

Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>>

that denotes the following object:
(Table_name, Iterable(Tuple_ID, Iterable(Column_name, Column_value)))

This means each record in the RDD will create one Parquet file.

The idea is, as you may have guessed, to save each object as a new Parquet table called Table_name. In this table, there is one column called ID that stores the value Tuple_ID, and each column Column_name stores the value Column_value.
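To make that concrete, here is a minimal sketch of what one such table's schema could look like using Spark SQL's types. It assumes every column (including ID) is stored as a string and that columnNames holds the Column_name entries gathered for that table; both are assumptions for illustration, not something given in the question.

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: the schema of one generated table, i.e. an ID column plus one
    // string column per Column_name discovered at runtime.
    static StructType buildTableSchema(Iterable<String> columnNames) {
        List<StructField> fields = new ArrayList<>();
        fields.add(DataTypes.createStructField("ID", DataTypes.StringType, false));
        for (String columnName : columnNames) {
            fields.add(DataTypes.createStructField(columnName, DataTypes.StringType, true));
        }
        return DataTypes.createStructType(fields);
    }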

The challenge I'm facing is that each table's columns (the schema) are only known at runtime, and since nested RDDs are not possible in Spark, I can't create an RDD inside the previous RDD (one per record), convert it to a DataFrame, and save it to a Parquet file.

And I can't just convert the previous RDD to a DataFrame directly, for the obvious reason that I would still need to iterate over it to extract the column names and values.

As a temporary workaround, I flattened the RDD into a list of the same type using collect(), but this is not the proper way: the data could be larger than the memory available on the driver, causing an out-of-memory error.
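For reference, the shape of that workaround is roughly the sketch below. It is only a sketch: rdd stands for the pair RDD described above, and the per-table DataFrame construction is left as a comment.

    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    import java.util.List;

    // Sketch of the collect()-based workaround: collect() materializes every
    // record on the driver at once, which is why it only works while the data
    // still fits in the driver's memory.
    static void collectWorkaround(
            JavaPairRDD<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>> rdd) {

        List<Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>>> allTables =
                rdd.collect();

        for (Tuple2<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>> table : allTables) {
            String tableName = table._1();
            // For each table: build its schema and Row list locally, create a
            // DataFrame with spark.createDataFrame(rows, schema), and write it
            // with .write().parquet(outputDir + "/" + tableName).
        }
    }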

Any advice on how to achieve this? Please let me know if the question is not clear enough.


1 Answer

0 votes

Take a look at the answer to this question: Writing RDD partitions to individual parquet files in its own directory.

I used that answer to create a separate Parquet file (one or more) for each partition. I believe you can use the same technique to create a separate file for each table, each with its own schema, if you like.
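Since that question deals with plain RDD partitions rather than per-table schemas, here is a rough sketch of how the idea might be adapted to your case without collecting the data itself: only the table names and each table's column names come back to the driver, while the rows stay distributed. Everything below is an assumption built on the structure described in your question (string-typed columns, a SparkSession called spark, an outputDir placeholder), and it targets the Spark 2.x Java API, where FlatMapFunction returns an Iterator.

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;
    import scala.Tuple2;

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: one Parquet directory per table, each with its own runtime schema.
    // Only table names and column names are collected to the driver; the rows
    // themselves stay in the cluster.
    static void writeTablesPerKey(
            SparkSession spark,
            JavaPairRDD<String, Iterable<Tuple2<String, Iterable<Tuple2<String, String>>>>> rdd,
            String outputDir) {

        rdd.cache();  // reused once per table below
        List<String> tableNames = rdd.keys().distinct().collect();

        for (String tableName : tableNames) {
            // The (Tuple_ID, columns) pairs of this table only, still distributed.
            JavaRDD<Tuple2<String, Iterable<Tuple2<String, String>>>> tableRows =
                    rdd.filter(t -> t._1().equals(tableName))
                       .flatMap(t -> t._2().iterator());

            // Discover this table's column names (small enough to collect).
            List<String> columns = new ArrayList<>(tableRows
                    .flatMap(row -> {
                        List<String> names = new ArrayList<>();
                        for (Tuple2<String, String> col : row._2()) {
                            names.add(col._1());
                        }
                        return names.iterator();
                    })
                    .distinct()
                    .collect());

            // Schema: ID plus one nullable string column per discovered name.
            List<StructField> fields = new ArrayList<>();
            fields.add(DataTypes.createStructField("ID", DataTypes.StringType, false));
            for (String name : columns) {
                fields.add(DataTypes.createStructField(name, DataTypes.StringType, true));
            }
            StructType schema = DataTypes.createStructType(fields);

            // One Row per Tuple_ID, values placed by column position.
            JavaRDD<Row> rowRdd = tableRows.map(row -> {
                Object[] values = new Object[columns.size() + 1];
                values[0] = row._1();
                for (Tuple2<String, String> col : row._2()) {
                    values[columns.indexOf(col._1()) + 1] = col._2();
                }
                return RowFactory.create(values);
            });

            spark.createDataFrame(rowRdd, schema)
                 .write()
                 .mode("overwrite")
                 .parquet(outputDir + "/" + tableName);
        }
    }

One filter pass per table is not the cheapest possible plan, but it keeps each table's schema independent, which seems to be the hard requirement here.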