
I'm new to the Scala/Spark world.

I have a Spark Dataset (created from a case class) called person:

scala> val person_with_contact = person.map(r => (
     | r.id,
     | r.name,
     | r.age
     | )).toDF()

Now, I want to add a list of address attributes (like apt_no, street, city, zip) to each record of this dataset. To get the address attributes, I have a function that takes a person's id as input and returns a map containing all the address attributes and their corresponding values.
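
For context, the function has roughly this shape (a simplified sketch; the real body does the actual lookup, and these values are made up):

def getAddress(id: Int): Map[String, String] = {
  // real implementation looks up the address attributes for this person id
  Map("apt_no" -> "12", "street" -> "Main Street", "city" -> "NY", "zip" -> "1234")
}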

I tried the following, along with a few other approaches suggested on Stack Overflow, but I couldn't solve it yet. (Ref - static column example - Spark, add new Column with the same value in Scala)

scala> val person_with_contact = person.map(r => (
    | r.id,
    | r.name,
    | r.age,
    | getAddress(r.id) 
    | )).toDF()

The final dataframe should have the following columns.

id, name, age, apt_no, street, city, zip

@Shaido, thanks for your reply. I already have a UDF function. I'm not sure how to return the list of address attributes from this UDF so that each attribute is added as an individual column to the new dataframe. - Manas Mukherjee
@HristoIliev, thanks for your reply. Each person has only one address, represented by 4 attributes. I have a UDF function that takes a person's id as input and returns the 4 attributes as a map. I would like to join id, name, and age with the address fields, i.e. apt_no, street, city, zip. Finally, it should be a single dataframe with all 7 attributes. - Manas Mukherjee
@ManasMukherjee, on a second read of your question I understood that you are adding a list of attributes, which is why I deleted my comment. Is person a DataFrame or an RDD? - Hristo Iliev
person is a Dataset created from a case class with id, name, and age as attributes. - Manas Mukherjee

2 Answers

1 vote

Use a UDF:

package yourpackage

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Person(id: Int, name: String, age: Int)

object MainDemo {

  // Each lookup takes the person's id and returns one address attribute.
  def getAddress(id: Int): String = {
    //do your things
    "address id:" + id
  }

  def getCity(id: Int): String = {
    //do your things
    "your city :" + id
  }

  def getZip(id: Int): String = {
    //do your things
    "your zip :" + id
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getSimpleName).master("local[3]").getOrCreate()
    import spark.implicits._

    val person = Seq(Person(1, "name_m", 21), Person(2, "name_w", 40))
    // Plain Scala map over the Seq, then toDF with explicit column names.
    val person_with_contact = person.map(r => (r.id, r.name, r.age, getAddress(r.id))).toDF("id", "name", "age", "street")
    person_with_contact.printSchema()
    //root
    // |-- id: integer (nullable = false)
    // |-- name: string (nullable = true)
    // |-- age: integer (nullable = false)
    // |-- street: string (nullable = true)

    // Wrap the remaining lookups in UDFs and add them as columns.
    val result = person_with_contact.select(
      col("id"),
      col("name"),
      col("age"),
      col("street"),
      udf { id: Int => getCity(id) }.apply(col("id")).as("city"),
      udf { id: Int => getZip(id) }.apply(col("id")).as("zip")
    )
    result.printSchema()
    //root
    // |-- id: integer (nullable = false)
    // |-- name: string (nullable = true)
    // |-- age: integer (nullable = false)
    // |-- street: string (nullable = true)
    // |-- city: string (nullable = true)
    // |-- zip: string (nullable = true)
    result.show()
    //+---+------+---+------------+------------+-----------+
    //| id|  name|age|      street|        city|        zip|
    //+---+------+---+------------+------------+-----------+
    //|  1|name_m| 21|address id:1|your city :1|your zip :1|
    //|  2|name_w| 40|address id:2|your city :2|your zip :2|
    //+---+------+---+------------+------------+-----------+

    spark.stop()
  }
}
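
If, as in the question, the lookup returns all four attributes in a single Map, the same pattern applies with one UDF per key. A sketch, with a hypothetical getAddressMap standing in for the asker's function:

// Hypothetical stand-in for the asker's map-returning lookup.
def getAddressMap(id: Int): Map[String, String] =
  Map("apt_no" -> "12", "street" -> "Main Street", "city" -> "NY", "zip" -> "1234")

// Builds a UDF that looks up the map and extracts a single key (null if absent).
def addrField(key: String) = udf { id: Int => getAddressMap(id).get(key).orNull }

val withAddress = person_with_contact.select(
  col("id"), col("name"), col("age"),
  addrField("apt_no")(col("id")).as("apt_no"),
  addrField("street")(col("id")).as("street"),
  addrField("city")(col("id")).as("city"),
  addrField("zip")(col("id")).as("zip")
)

Note that this calls the lookup once per extracted column; if the lookup is expensive, return a struct from a single UDF as in the other answer.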

0 votes

Given that you already have a function that returns the address as a map, you can create a UDF that converts the map into a struct and then select all of the struct's fields:

import org.apache.spark.sql.functions._

// For demo only
def getAddress(id: Int): Option[Map[String, String]] = {
  id match {
    case 1 => Some(Map("apt_no" -> "12", "street" -> "Main Street", "city" -> "NY", "zip" -> "1234"))
    case 2 => Some(Map("apt_no" -> "1", "street" -> "Back Street", "city" -> "Gotham", "zip" -> "G123"))
    case _ => None
  }
}

case class Address(apt_no: String, street: String, city: String, zip: String)

def getAddressUdf = udf((id: Int) => {
  getAddress(id) map (m =>
    Address(m("apt_no"), m("street"), m("city"), m("zip"))
  )
})

udf() turns a function that returns case class instances into a UDF that returns a struct column with the corresponding schema. Option[_] return types are automatically mapped to nullable data types. The fields of the struct column can then be expanded into multiple columns by selecting them with $"struct_col_name.*":

scala> val df = Seq(Person(1, "John", 32), Person(2, "Cloe", 27), Person(3, "Pete", 55)).toDS()
df: org.apache.spark.sql.Dataset[Person] = [id: int, name: string ... 1 more field]

scala> df.show()
+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 32|
|  2|Cloe| 27|
|  3|Pete| 55|
+---+----+---+
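
The struct column produced by getAddressUdf carries the schema described above, with nullable fields per the Option return (output from a local run; exact nullability flags may vary with your Spark version):

scala> df.withColumn("addr", getAddressUdf($"id")).printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- addr: struct (nullable = true)
 |    |-- apt_no: string (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zip: string (nullable = true)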

scala> df
     | .withColumn("addr", getAddressUdf($"id"))
     | .select($"id", $"name", $"age", $"addr.*")
     | .show()
+---+----+---+------+------------+------+-----+
| id|name|age|apt_no|      street|  city|  zip|
+---+----+---+------+------------+------+-----+
|  1|John| 32|    12| Main Street|    NY| 1234|
|  2|Cloe| 27|     1| Back Street|Gotham| G123|
|  3|Pete| 55|  null|        null|  null| null|
+---+----+---+------+------------+------+-----+
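
If you'd rather skip the intermediate Address case class, the UDF can also return the Map itself, and individual keys can be pulled out with the column's apply method. A sketch using the same getAddress (both Map[String, String] and Option are supported UDF return types, so this should yield the same table as above):

scala> val addrMapUdf = udf(getAddress _)

scala> df
     | .withColumn("addr", addrMapUdf($"id"))
     | .select($"id", $"name", $"age",
     |     $"addr"("apt_no").as("apt_no"), $"addr"("street").as("street"),
     |     $"addr"("city").as("city"), $"addr"("zip").as("zip"))
     | .show()

The struct version has the advantage of a fixed, typed schema, whereas the map version keeps all values as a single type.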