Spark SQL: How to append new row to dataframe table (from another table)

Question

I am using Spark SQL with dataframes. I have an input dataframe, and I would like to append (or insert) its rows to a larger dataframe that has more columns. How would I do that?

If this were SQL, I would use INSERT INTO OUTPUT SELECT ... FROM INPUT, but I don't know how to do that with Spark SQL.

For concreteness:

var input = sqlContext.createDataFrame(Seq(
        (10L, "Joe Doe", 34),
        (11L, "Jane Doe", 31),
        (12L, "Alice Jones", 25)
        )).toDF("id", "name", "age")

var output = sqlContext.createDataFrame(Seq(
        (0L, "Jack Smith", 41, "yes", 1459204800L),
        (1L, "Jane Jones", 22, "no", 1459294200L),
        (2L, "Alice Smith", 31, "", 1459595700L)
        )).toDF("id", "name", "age", "init", "ts")


scala> input.show()
+---+-----------+---+
| id|       name|age|
+---+-----------+---+
| 10|    Joe Doe| 34|
| 11|   Jane Doe| 31|
| 12|Alice Jones| 25|
+---+-----------+---+

scala> input.printSchema()
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)


scala> output.show()
+---+-----------+---+----+----------+
| id|       name|age|init|        ts|
+---+-----------+---+----+----------+
|  0| Jack Smith| 41| yes|1459204800|
|  1| Jane Jones| 22|  no|1459294200|
|  2|Alice Smith| 31|    |1459595700|
+---+-----------+---+----+----------+

scala> output.printSchema()
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- init: string (nullable = true)
 |-- ts: long (nullable = false)

I would like to append all the rows of input to the end of output. At the same time, I would like to set the init column of the appended rows to an empty string '' and the ts column to the current timestamp, e.g. 1461883875L.

Any help would be appreciated.

By the way, distributed data has no "start" or "end", so you should forget about indexes. Besides, using vars in Scala is considered bad practice. What is the default value for ts, given that the schema you show marks it as not nullable? - Alberto Bonsanto

zero323 · Accepted Answer · 2016-04-28T23:10:20

Spark DataFrames are immutable, so it is not possible to append or insert rows in place. Instead, you can add the missing columns and use UNION ALL, which returns a new DataFrame:

output.unionAll(input.select($"*", lit(""), current_timestamp.cast("long")))
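The same idea can be spelled out with named columns. This is only a sketch, assuming that input and output are the DataFrames from the question and that it runs in spark-shell, where sqlContext.implicits._ (needed for the $"..." syntax) is already in scope. The names extended and combined are made up for illustration. It uses unix_timestamp() rather than casting current_timestamp, which should give the same kind of value (seconds since the epoch, as a long).

```scala
import org.apache.spark.sql.functions.{lit, unix_timestamp}

// Assumption: `input` and `output` are the DataFrames defined in the question.
// Add the two columns that `input` lacks, naming them to match `output`.
val extended = input.select(
  $"id", $"name", $"age",
  lit("").as("init"),        // constant empty string for every appended row
  unix_timestamp().as("ts")  // current time in seconds since the epoch (long)
)

// unionAll matches columns by position, not by name, so the select above
// lists the columns in the same order as output's schema.
// It returns a new DataFrame; `output` itself is left unchanged.
val combined = output.unionAll(extended)

combined.show()
```

On Spark 2.0 and later, union has the same UNION ALL semantics and can be used in place of unionAll. Since nothing is modified in place, the result has to be bound to a new val (or reassigned, if output stays a var).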
