
What does the Databricks Delta Lake mergeSchema option do if a pre-existing column is appended with a different data type?

For example, given a Delta Lake table with schema foo INT, bar INT, what would happen when appending new data with schema foo INT, bar DOUBLE while specifying the option mergeSchema = true?
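For concreteness, a minimal reproduction sketch (the table path /tmp/delta/demo and the sample rows are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mergeSchemaDemo").getOrCreate()
import spark.implicits._

// Create the table with schema foo INT, bar INT
Seq((1, 2), (3, 4)).toDF("foo", "bar")
    .write.format("delta")
    .save("/tmp/delta/demo")

// Append rows where bar is DOUBLE instead of INT, with mergeSchema enabled
Seq((5, 6.0), (7, 8.0)).toDF("foo", "bar")
    .write.format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/tmp/delta/demo")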


2 Answers


The write fails (as of Delta Lake 0.5.0 on Databricks Runtime 6.3). mergeSchema only handles additive changes, such as columns that appear in the incoming data but not in the table; it does not change the type of an existing column, so appending bar as DOUBLE to a table where bar is INT raises an AnalysisException about the incompatible types.
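If the goal is to actually change the column's type, the documented route is to rewrite the table with overwriteSchema = true instead. A sketch, reusing the hypothetical path from the question above:

import org.apache.spark.sql.functions.col

// Cast the column and rewrite the whole table; overwriteSchema replaces the
// stored schema, which a mergeSchema append cannot do for type changes.
spark.read.format("delta").load("/tmp/delta/demo")
    .withColumn("bar", col("bar").cast("double"))
    .write.format("delta")
    .option("overwriteSchema", "true")
    .mode("overwrite")
    .save("/tmp/delta/demo")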


I think this is what you are looking for.

import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.functions.input_file_name

// Explicit schema for the pipe-delimited files; substitute your real column names.
val customSchema = StructType(Array(
    StructField("field1", StringType, true),
    StructField("field2", StringType, true),
    StructField("field3", StringType, true),
    StructField("field4", StringType, true),
    StructField("field5", StringType, true),
    StructField("field6", StringType, true),
    StructField("field7", StringType, true)))

val df = spark.read
    .format("csv")                      // built-in CSV source (Spark 2.x+)
    .option("header", "false")
    .option("sep", "|")
    .schema(customSchema)
    .load("/mnt/rawdata/corp/ABC*.gz")  // leading slash for the DBFS mount
    .withColumn("file_name", input_file_name())

Just rename 'field1', 'field2', etc., to your actual field names. Also, the 'ABC*.gz' pattern does a wildcard search: it matches files whose names begin with 'ABC', followed by any sequence of characters (the '*'), ending in '.gz', which indicates gzip-compressed files that Spark decompresses automatically. Yours could be different, of course, so just change the pattern to match your naming convention.
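Not part of the original answer, but since input_file_name() records the source path for every row, a quick way to sanity-check which files the wildcard matched:

// List the distinct source files that were loaded
df.select("file_name").distinct().show(false)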