Use collect_list(), as others have suggested above.
# Creating the DataFrame (assumes an active SparkSession named `spark`, as used further below)
df = spark.createDataFrame(
    [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5), ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
     ('A', 'b', 'c', 'time_2', 2.2, 2.6, 2.9), ('A', 'b', 'd', 'time_0', 5.1, 5.5, 5.7),
     ('A', 'b', 'd', 'time_1', 6.1, 6.2, 6.3), ('A', 'b', 'e', 'time_0', 0.1, 0.5, 0.9),
     ('A', 'b', 'e', 'time_1', 0.2, 0.3, 0.6)],
    ['id_1', 'id_2', 'id_3', 'timestamp', 'thing1', 'thing2', 'thing3'])
df.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
|   A|   b|   c|   time_0|   1.2|   1.3|   2.5|
|   A|   b|   c|   time_1|   1.1|   1.5|   3.4|
|   A|   b|   c|   time_2|   2.2|   2.6|   2.9|
|   A|   b|   d|   time_0|   5.1|   5.5|   5.7|
|   A|   b|   d|   time_1|   6.1|   6.2|   6.3|
|   A|   b|   e|   time_0|   0.1|   0.5|   0.9|
|   A|   b|   e|   time_1|   0.2|   0.3|   0.6|
+----+----+----+---------+------+------+------+
In addition to using agg(), you can write familiar SQL syntax to operate on it, but first we have to register our DataFrame as a temporary SQL view:
df.createOrReplaceTempView("df_view")
df = spark.sql("""select id_1, id_2, id_3,
collect_list(timestamp) as timestamp,
collect_list(thing1) as thing1,
collect_list(thing2) as thing2,
collect_list(thing3) as thing3
from df_view
group by id_1, id_2, id_3""")
df.show(truncate=False)
+----+----+----+------------------------+---------------+---------------+---------------+
|id_1|id_2|id_3|timestamp               |thing1         |thing2         |thing3         |
+----+----+----+------------------------+---------------+---------------+---------------+
|A   |b   |d   |[time_0, time_1]        |[5.1, 6.1]     |[5.5, 6.2]     |[5.7, 6.3]     |
|A   |b   |e   |[time_0, time_1]        |[0.1, 0.2]     |[0.5, 0.3]     |[0.9, 0.6]     |
|A   |b   |c   |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
+----+----+----+------------------------+---------------+---------------+---------------+
Note: The triple quotes (""") around the query in spark.sql() are there so the statement can span multiple lines, for the sake of readability. A plain single-quoted string such as 'select id_1 ....' would not work if you tried to spread the statement over multiple lines. Needless to say, the final result is the same either way.
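The quoting behaviour can be checked in plain Python without Spark. The following is a small sketch with illustrative variable names (query_multiline, query_concat, broken_source) and a shortened query; it also shows one alternative, adjacent string literals inside parentheses, which is not used in the answer above.

```python
# Triple-quoted string: line breaks inside the literal are allowed.
query_multiline = """select id_1, id_2, id_3,
collect_list(thing1) as thing1
from df_view
group by id_1, id_2, id_3"""

# Alternative: adjacent string literals inside parentheses are concatenated.
query_concat = (
    "select id_1, id_2, id_3, "
    "collect_list(thing1) as thing1 "
    "from df_view "
    "group by id_1, id_2, id_3"
)

# A single-quoted literal broken across lines does not compile.
broken_source = "q = 'select id_1,\n id_2 from df_view'"
try:
    compile(broken_source, "<example>", "exec")
    single_quote_ok = True
except SyntaxError:
    single_quote_ok = False

print(single_quote_ok)                                  # False
print(query_multiline.split() == query_concat.split())  # True
```

The two working forms differ only in whitespace, which SQL ignores between tokens, so either string can be passed to spark.sql().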