0
votes

I want to convert a Spark DataFrame into another DataFrame with a specific manner as follows:

I have Spark DataFrame:

+---------+------------+
|protocol |   count    |
+---------+------------+
|      TCP|    8231    |
|     ICMP|    7314    |
|      UDP|    5523    |
|     IGMP|    4423    |
|      EGP|    2331    |
+---------+------------+

And I want to turn it into:

+----------------------------------------------------------+
|Aggregated                                                |
+----------------------------------------------------------+
|{tcp: 8231, icmp: 7314, udp: 5523, igmp: 4423, egp: 2331} |
+----------------------------------------------------------+

The aggregated column can be both list or map, or string. Is this possible with DataFrame functions or do I need to create my own udf to aggregate this ?

3
Do you want to use all the rows in the dataframe when doing this? Since there are not that many it would probably be easier to collect the data and use pure Scala to convert it.Shaido
@Shaido Yes, all of the rows, and yet the example that given is not the only rows that exist. The rows will keep increasing by the timeDimas Rizky

3 Answers

2
votes

pivot and toJSON will give you what you need

import org.apache.spark.sql.functions.first

df.groupBy().pivot("protocol").agg(first("count")).toJSON.show(false)
// +----------------------------------------------------------+                    
// |value                                                     |
// +----------------------------------------------------------+
// |{"EGP":2331,"ICMP":7314,"IGMP":4423,"TCP":8321,"UDP":5523}|
// +----------------------------------------------------------+
0
votes

Since you want to convert all columns to a single one and it does not seem to be many columns to begin with, you can collect the dataframe to the driver and use pure Scala code to convert it into the format you want.

The following will give you a Array[String]:

val res = df.as[(String, Int)].collect.map{case(protocol, count) => protocol + ": " + count}

To convert it into a single string, simply do:

val str = res.mkString("{", ", ", "}")
0
votes

Concat columns in dataframe and create a new column:

var new_df = df.withColumn("concat", concat($"protocol", lit(" : "), $"count"))

To aggregate it into a single row as a list you can do this.

var new_df = new_df.groupBy().agg(collect_list("concat").as("aggregated"))
new_df.show

If you want to get the data into a string instead of dataframe, you can collect it as following.

new_df.select("concat").collect.map(x=> x.get(0)).mkString("{", ",", "}")