
Below is my Spark/Scala program to read my source file (a CSV file):

val csv = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")         // read the first line as the header
  // .option("mode", "DROPMALFORMED")
  .option("inferSchema", "true")
  .load("C:\\TestFiles\\SAP_ENT_INVBAL.csv") // or .csv("csv/file/path") with the Spark 2.0 API


csv.printSchema()
csv.show()

The output contains the file header, but for my processing I need a different naming convention than the file header.

I have tried a couple of options and both work well:

  1. Renaming the dataframe columns
  2. Using the add(StructField(...)) function

But I want to make my code generic: just pass a schema file while reading the file, and create the dataframe with columns according to that schema file.

Kindly help me solve this.


2 Answers


If you just need to rename the columns, you can use the toDF method, passing it the new names of the columns, e.g.

val csv = spark.read.option("header", "true")
  .csv("C:\\TestFiles\\SAP_ENT_INVBAL.csv")
  .toDF("newColAName", "newColBName", "newColCName")
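If the new names should come from a file rather than being hard-coded, the same `toDF` call can be made generic. A minimal sketch, assuming a hypothetical plain-text file `schema.txt` (the name and format are illustrative) with one column name per line, in the same order and count as the CSV's columns:

```scala
import scala.io.Source

// Hypothetical schema file: one new column name per line.
val newNames: Seq[String] =
  Source.fromFile("C:\\TestFiles\\schema.txt").getLines().map(_.trim).toSeq

// Splat the sequence into toDF's varargs parameter.
val renamed = csv.toDF(newNames: _*)
```

This keeps the reading code unchanged; only the schema file needs to change per source file.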

Here is an example from the spark-csv documentation on how to specify a custom schema.

You can manually specify the schema when reading data:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")
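To make this fully generic, the `StructType` itself can be built from a file instead of being hard-coded. A minimal sketch, assuming a hypothetical schema file `schema.txt` with one `name:type` pair per line (e.g. `year:int`); the file name, line format, and type mapping below are illustrative assumptions, not part of spark-csv:

```scala
import scala.io.Source
import org.apache.spark.sql.types._

// Map a type name from the schema file to a Spark SQL DataType;
// unknown names fall back to StringType.
def typeFor(name: String): DataType = name.trim.toLowerCase match {
  case "int"    => IntegerType
  case "double" => DoubleType
  case _        => StringType
}

// Build the schema dynamically from "name:type" lines.
val customSchema = StructType(
  Source.fromFile("C:\\TestFiles\\schema.txt").getLines().map { line =>
    val Array(colName, colType) = line.split(":", 2)
    StructField(colName.trim, typeFor(colType), nullable = true)
  }.toArray
)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("C:\\TestFiles\\SAP_ENT_INVBAL.csv")
```

With this, switching to a different source file only requires supplying a different schema file to the same reading code.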