spark scala : Convert DataFrame OR Dataset to single comma separated string

Question

Below is the Spark Scala code that prints a one-column Dataset[Row]:

import org.apache.spark.sql.{Dataset, Row, SparkSession}
val spark: SparkSession = SparkSession.builder()
        .appName("Spark DataValidation")
        .config("SPARK_MAJOR_VERSION", "2").enableHiveSupport()
        .getOrCreate()

val kafkaPath:String="hdfs:///landing/APPLICATION/*"
val targetPath:String="hdfs://datacompare/3"
val pk:String = "APPLICATION_ID" 
val pkValues = spark
        .read
        .json(kafkaPath)
        .select("message.data.*")
        .select(pk)
        .distinct() 
pkValues.show()

Output of the above code:

+--------------+
|APPLICATION_ID|
+--------------+
|           388|
|           447|
|           346|
|           861|
|           361|
|           557|
|           482|
|           518|
|           432|
|           422|
|           533|
|           733|
|           472|
|           457|
|           387|
|           394|
|           786|
|           458|
+--------------+

Question :

How can I convert this DataFrame to a comma-separated String variable?

Expected output :

val data: String = "388,447,346,861,361,557,482,518,432,422,533,733,472,457,387,394,786,458"

Please suggest how to convert a DataFrame[Row] or Dataset to one String.

SCouto · Accepted Answer · 2018-02-20T19:23:48

I don't think that's a good idea, since a DataFrame is a distributed object and can be immense. collect brings all the data to the driver, so you should perform this kind of operation carefully.

Here is what you can do with a DataFrame (two options):

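// Option 1: extract the value through the RDD API
df.select("APPLICATION_ID").rdd.map(r => r(0)).collect.mkString(",")
// Option 2: collect the Rows, then extract the value (a bare Row prints as [388])
df.select("APPLICATION_ID").collect.map(_.get(0)).mkString(",")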

Result with a test dataFrame with only 3 rows:

String = 388,447,346
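To see why the field has to be pulled out of each Row before joining, here is a small sketch. The rows array is a hypothetical stand-in for what collect returns on a one-column DataFrame; it is not the question's actual data.

```scala
import org.apache.spark.sql.Row

// Hypothetical stand-in for the Array[Row] returned by
// df.select("APPLICATION_ID").collect on a three-row DataFrame.
val rows: Array[Row] = Array(Row(388L), Row(447L), Row(346L))

// Joining the Rows directly uses Row.toString, which wraps each row in brackets.
val bracketed: String = rows.mkString(",")

// Extracting field 0 first yields the plain values.
val plain: String = rows.map(_.get(0)).mkString(",")

println(bracketed) // [388],[447],[346]
println(plain)     // 388,447,346
```

So if the string needs to look like the expected output in the question, map each Row to its first field before calling mkString.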

Edit: With a typed Dataset (for example Dataset[String], not Dataset[Row]) you can do this directly:

ds.collect.mkString(",")
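As a rough sketch of the typed Dataset route, plus one possible alternative that builds the string inside Spark with concat_ws and collect_list: the local session, the sample DataFrame, and the value names below are assumptions for illustration and do not come from the question. In spark-shell you would reuse the existing spark instead of building one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Local session for illustration only; a real job would reuse its existing `spark`.
val spark: SparkSession = SparkSession.builder()
  .appName("csv-string-sketch")
  .master("local[1]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's pkValues DataFrame.
val sample: DataFrame = Seq(388L, 447L, 346L).toDF("APPLICATION_ID")

// Cast to string once so both variants work on text values.
val idAsString = col("APPLICATION_ID").cast("string")

// Variant 1: convert to a typed Dataset[String], then collect and join on the driver.
val viaTypedDataset: String =
  sample.select(idAsString).as[String].collect().mkString(",")

// Variant 2: let Spark build the string, so only a single row reaches the driver.
// The order of the collected values is not guaranteed.
val viaAggregation: String =
  sample.agg(concat_ws(",", collect_list(idAsString))).first().getString(0)
```

Variant 2 still produces one string holding every value, so the same caution about very large inputs applies; it only changes where the joining happens.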
