
Desired outcome:

+---------+-----------------------------+
| ID PR   | Related Repeating Event(s)  |
+---------+-----------------------------+
| 1658503 | 1615764;1639329             |
+---------+-----------------------------+  

Is there a way to write such a query in SQL / Databricks without using a user-defined aggregate function (UDAF)? I've tried concat(), GROUP_CONCAT() and LISTAGG, but none of these work or they are not supported in Databricks ("This function is neither a registered temporary function nor a permanent function registered in the database 'default'.").

I found this description of user-defined aggregate functions (UDAFs) in the Databricks documentation, but I don't know how to implement it: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-udf-aggregate.html#user-defined-aggregate-functions-udafs&language-sql

Would anybody have a hint for me or a link?

What I have is this basic query:

%sql
SELECT
  pr_id,
  data_field_nm,
  field_value
FROM
  gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl --(18)
WHERE
 pr_id = 1658503
 AND data_field_nm = 'Related Repeating Deviation(s)'

Which gives as output:

+---------+--------------------------------+-------------+
| pr_id   | data_field_nm                  | field_value |
+---------+--------------------------------+-------------+
| 1658503 | Related Repeating Deviation(s) | 1615764     |
| 1658503 | Related Repeating Deviation(s) | 1639329     |
+---------+--------------------------------+-------------+

The correct answer is (thanks to @Alex Ott):

%sql
SELECT
  pr_id AS IDPR,
  concat_ws(';', collect_list(field_value)) AS RelatedRepeatingDeviations
FROM
  gms_us_mart.txn_pr_addtl_data_detail_trkw_glbl
WHERE
  data_field_nm = 'Related Repeating Deviation(s)'
  AND pr_id = 1658503
GROUP BY
  pr_id,
  data_field_nm;

This gives the desired outcome:

+---------+-----------------------------+
| IDPR    | RelatedRepeatingDeviations  |
+---------+-----------------------------+
| 1658503 | 1615764;1639329             |
+---------+-----------------------------+  
not a udaf aspect - thebluephantom
can u show code around it all pls, - thebluephantom
added basic query for more context - Wondarar
so you have n rows and want to make an rev from them. pls show input and expected output - thebluephantom
I've updated the question with all info I have, see query and its output. - Wondarar

1 Answer


Just use GROUP BY with collect_list and concat_ws, like this:

  • get the data:
from pyspark.sql import Row

df = spark.createDataFrame([Row(pr_id=1658503, data_field_nm='related', field_value=1615764),
                            Row(pr_id=1658503, data_field_nm='related', field_value=1639329)])
df.createOrReplaceTempView("abc")
  • and do the query:
%sql
select pr_id,
  data_field_nm,
  concat_ws(';', collect_list(field_value)) as combined
from abc
group by pr_id, data_field_nm

Note that this gives the column a fixed name (combined); use an alias (AS) in the SELECT if you want a different one.
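If it helps to see what the aggregation is doing conceptually, here is a plain-Python sketch of the same logic: group rows by the key columns, collect the values into a list (collect_list), and join them with ';' (concat_ws). The sample rows are made up for illustration and mirror the toy data above, not the actual table:

```python
# Sketch of GROUP BY + collect_list + concat_ws in plain Python.
from collections import defaultdict

# Hypothetical sample rows, mirroring the toy DataFrame above.
rows = [
    {"pr_id": 1658503, "data_field_nm": "related", "field_value": 1615764},
    {"pr_id": 1658503, "data_field_nm": "related", "field_value": 1639329},
]

# "GROUP BY pr_id, data_field_nm" + "collect_list(field_value)":
# map each group key to the list of its collected values.
groups = defaultdict(list)
for r in rows:
    groups[(r["pr_id"], r["data_field_nm"])].append(str(r["field_value"]))

# "concat_ws(';', ...)": join each group's values with a semicolon.
combined = {key: ";".join(vals) for key, vals in groups.items()}
print(combined)  # {(1658503, 'related'): '1615764;1639329'}
```

This is only a mental model; Spark does the same grouping in a distributed way, which is also why collect_list makes no ordering guarantee within a group.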