I have several inhomogeneously structured files stored in a Hadoop cluster. The files contain a header line, but not all files contain the same columns.
file1.csv:
a,b,c
1,2,1
file2.csv:
a,b,d
2,2,2
What I need to do is look up all data in columns a and c and process it further (possibly with Spark SQL). So I expect something like:
a,b,c,d
1,2,1,
2,2,,2
Just doing
spark.read.format("csv").option("header", "true").load(CSV_PATH)
will miss all columns that are not present in the "first" file read.
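One workaround I can think of is reading each file separately and unioning the resulting DataFrames by column name. A minimal PySpark sketch, assuming Spark 3.1+ (for allowMissingColumns) and hypothetical HDFS paths:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths -- in practice the file list would be discovered
# on HDFS (e.g. via a glob or the Hadoop FileSystem API).
paths = ["hdfs:///data/file1.csv", "hdfs:///data/file2.csv"]

# Read each file separately so every header yields its own schema.
frames = [spark.read.option("header", "true").csv(p) for p in paths]

# unionByName with allowMissingColumns=True (Spark 3.1+) aligns the
# columns by name and fills the missing ones with null.
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True),
                frames)

merged.select("a", "c").show()

But that means reading every file twice (once for the header, once for the data), so I am not sure it scales.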
How can I do this? Is converting to Parquet and using its schema-merging feature a better approach?
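For the Parquet route, I imagine something like converting each CSV and then reading everything back with schema merging enabled; a sketch under the same hypothetical path assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same hypothetical paths as above.
paths = ["hdfs:///data/file1.csv", "hdfs:///data/file2.csv"]

# Convert every CSV to its own Parquet directory, keeping its schema.
for i, p in enumerate(paths):
    (spark.read.option("header", "true").csv(p)
          .write.mode("overwrite")
          .parquet(f"hdfs:///data/parquet/part{i}"))

# Reading the directories back with mergeSchema=true reconciles the
# per-file schemas; columns absent from a file come back as null.
merged = (spark.read.option("mergeSchema", "true")
               .parquet("hdfs:///data/parquet/part*"))
merged.show()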