
I am trying to automate loading random data into an empty DataFrame using Spark with Scala.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val df = spark.sql("select * from test.test")
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df.schema)

Here I am trying to create an empty DataFrame with the test table's schema, which in this case is (id int, name string). Then I try to add a row to this DataFrame:

val df2 = Seq((1, 2)).toDF("col1", "col2")
emptyDF.union(df2)

But if I change the table name, I have to update the data in Seq(data) and the columns in toDF(columns) manually. I want to change the code so that random data is added automatically and the schema is inferred from the table, for example as below:

val columninfo = "\"" + emptyDF.columns.mkString("\",\"") + "\""
val columncount = emptyDF.columns.size
val x = (1 to columncount).toList.mkString(",")

var df1 = Seq(x).toDF(columninfo)

But it's not working. Please let me know if there is another way to append random data to the empty DataFrame, how to automate the above operation, or any other approach that is suitable. Thanks in advance.


1 Answer


You can create a dummy DataFrame with a single record (whose value is ignored), and just call select on that DF, using the columns of the "empty" DataFrame as the column names and running integers as the column values:

import org.apache.spark.sql.functions._
import spark.implicits._

emptyDF.show()
// +----+----+
// |col1|col2|
// +----+----+
// +----+----+

List(1).toDF("dummy")
  .select(emptyDF.columns.zipWithIndex.map { case (name, value) => lit(value) as name }: _*)
  .show()
// +----+----+
// |col1|col2|
// +----+----+
// |   0|   1|
// +----+----+

NOTE: this assumes that all columns in emptyDF are of type Int. If that assumption doesn't hold, you'd need a more sophisticated solution that doesn't just use emptyDF.columns (which gives only the names) but instead maps over emptyDF.schema.
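For example, here is a minimal sketch of that schema-driven variant (the type cases and random-value choices below are my own assumptions, not part of the original answer): it maps over emptyDF.schema.fields and picks a random literal matching each column's declared data type.

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val randomRow = List(1).toDF("dummy").select(
  emptyDF.schema.fields.map { field =>
    // Pick a random literal matching the column's declared type;
    // fall back to a typed null for anything not handled here.
    val value = field.dataType match {
      case IntegerType => lit(Random.nextInt(100))
      case LongType    => lit(Random.nextLong())
      case DoubleType  => lit(Random.nextDouble())
      case StringType  => lit(Random.alphanumeric.take(8).mkString)
      case BooleanType => lit(Random.nextBoolean())
      case _           => lit(null).cast(field.dataType)
    }
    value as field.name
  }: _*
)
randomRow.show()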

As to your attempt:

  • It looks like you're trying to use code to write code. While this is technically possible (see: macros), it's almost never the right approach, and it involves much more than just passing String arguments containing code snippets to methods.
  • Also, you don't need the union: performing a union with an empty DataFrame is meaningless, since the result is just the non-empty DataFrame (see the sketch below).
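To see that last point concretely, here's a quick check, assuming emptyDF has the two Int columns shown above: the union's output is exactly df2's rows.

val df2 = Seq((1, 2)).toDF("col1", "col2")
emptyDF.union(df2).show()
// +----+----+
// |col1|col2|
// +----+----+
// |   1|   2|
// +----+----+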