
I have several inhomogeneous structured files stored in a Hadoop cluster. The files contain a header line, but not all files contain the same columns.

file1.csv:

a,b,c
1,2,1

file2.csv:

a,b,d
2,2,2

What I need to do is look up all data in column a or column c and process it further (possibly with Spark SQL). So I expect something like:

a,b,c,d
1,2,1,
2,2,,2

Just doing

spark.read.format("csv").option("header", "true").load(CSV_PATH)

will miss all columns not present in the "first" file read.

How can I do this? Is a conversion to Parquet and its dataset feature a better approach?

1 Answer


Read the two files separately and create two DataFrames. Then do a full outer join between them, using a and b as the join keys. (An inner join would drop the rows that appear in only one file; the full outer join keeps them and fills the missing columns with nulls.)
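
A minimal PySpark sketch of that approach, assuming the two example files from the question and an existing SparkSession; the variable names df1, df2 and merged are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csv").getOrCreate()

# Read each file with its own header so no columns are dropped.
df1 = spark.read.format("csv").option("header", "true").load("file1.csv")  # a, b, c
df2 = spark.read.format("csv").option("header", "true").load("file2.csv")  # a, b, d

# Full outer join on the shared columns a and b; a row that exists in
# only one file gets nulls in the other file's columns.
merged = df1.join(df2, on=["a", "b"], how="full_outer")
merged.show()

With the sample data this should produce the rows (1, 2, 1, null) and (2, 2, null, 2) in columns a, b, c, d (row order is not guaranteed), which matches the expected layout.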