
Below is my Spark/Scala program to read my source file (a CSV file):

val csv = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")         // read the first line as the header
  // .option("mode", "DROPMALFORMED")
  .option("inferSchema", "true")
  .load("C:\\TestFiles\\SAP_ENT_INVBAL.csv") // or .csv("csv/file/path") with the Spark 2.0 API


csv.printSchema()
csv.show()

The output contains the file header, but for my processing I need a different naming convention than the file header.

I have tried a couple of options and both work well:

  1. Renaming the dataframe columns
  2. Using the add(StructField(...)) function

But I want to make my code generic: just pass a schema file while reading the file, and create the dataframe with columns according to that schema file.

Kindly help me solve this.


2 Answers


If you just need to rename the columns, you can use the toDF method, passing it the new names of the columns, e.g.

val csv = spark.read.option("header", "true")
  .csv("C:\\TestFiles\\SAP_ENT_INVBAL.csv")
  .toDF("newColAName", "newColBName", "newColCName")
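If the new names should come from a file rather than being hard-coded, the same `toDF` call can be made generic. A minimal sketch, assuming a hypothetical plain-text file `schema.txt` (the name and format are illustrative) with one column name per line, in the same order and count as the CSV's columns:

```scala
import scala.io.Source

// Hypothetical schema file: one new column name per line.
val newNames: Seq[String] =
  Source.fromFile("C:\\TestFiles\\schema.txt").getLines().map(_.trim).toSeq

// Splat the sequence into toDF's varargs parameter.
val renamed = csv.toDF(newNames: _*)
```

This keeps the reading code unchanged; only the schema file needs to change per source file.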

Here is an example from the spark-csv documentation on how to specify a custom schema.

You can manually specify the schema when reading data:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")
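To make this fully generic, the `StructType` itself can be built from a file instead of being hard-coded. A minimal sketch, assuming a hypothetical schema file `schema.txt` with one `name:type` pair per line (e.g. `year:int`); the file name, line format, and type mapping below are illustrative assumptions, not part of spark-csv:

```scala
import scala.io.Source
import org.apache.spark.sql.types._

// Map a type name from the schema file to a Spark SQL DataType;
// unknown names fall back to StringType.
def typeFor(name: String): DataType = name.trim.toLowerCase match {
  case "int"    => IntegerType
  case "double" => DoubleType
  case _        => StringType
}

// Build the schema dynamically from "name:type" lines.
val customSchema = StructType(
  Source.fromFile("C:\\TestFiles\\schema.txt").getLines().map { line =>
    val Array(colName, colType) = line.split(":", 2)
    StructField(colName.trim, typeFor(colType), nullable = true)
  }.toArray
)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("C:\\TestFiles\\SAP_ENT_INVBAL.csv")
```

With this, switching to a different source file only requires supplying a different schema file to the same reading code.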