I have tried numerous approaches to turn the following:
Gender, Age, Value
1, 20, 21
2, 23, 22
1, 26, 23
2, 29, 24
into
Male_Age, Male_Value, Female_Age, Female_Value
20, 21, 23, 22
26, 23, 29, 24
What I need to do is group by gender and, instead of applying an aggregate like sum, count, or avg, build a List[age] and a List[value] per group. This should be possible because I am using a Dataset, which allows functional operations.
If the number of rows for males and females is not the same, the extra cells should be filled with nulls.
One approach I tried was to build a new DataFrame from columns of other DataFrames, like so:
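To pin down the padding semantics I want, here is a plain-Scala sketch (no Spark involved; sideBySide is a made-up helper, not an API): the shorter list is extended with None before zipping, which is where the nulls would come from.

```scala
// Sketch of the null-padding step only, over plain Scala collections.
// sideBySide is a hypothetical helper name, not part of any Spark API.
object PadSketch {
  def sideBySide[A](left: Seq[A], right: Seq[A]): Seq[(Option[A], Option[A])] = {
    val n = math.max(left.size, right.size)
    // padTo extends the shorter column with None so both have length n
    val l = left.map(Option(_)).padTo(n, None)
    val r = right.map(Option(_)).padTo(n, None)
    l.zip(r)
  }
}
```

For example, `PadSketch.sideBySide(Seq(20, 26), Seq(23, 29, 31))` pairs the two male ages with the first two female ages and leaves a `None` opposite the third female age.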
df.select(male.select("sex").where('sex === 1).col("sex"),
          female.select("sex").where('sex === 2).col("sex"))
However, this bizarrely produces output like so:
sex, sex
1, 1
2, 2
1, 1
2, 2
I can't see how that is possible.
I also tried using pivot, but it forces me to aggregate after the groupBy:
df.withColumn("sex2", df.col("sex"))
  .groupBy("sex")
  .pivot("sex2")
  .agg(
    sum('value).as("mean"),
    stddev('value).as("std. dev"))
  .show()
|sex|    1.0_mean|1.0_std. dev|   2.0_mean|    2.0_std. dev|
|1.0|0.4926065526|1.8110632697|           |                |
|2.0|            |            |0.951250372|1.75060275400785|
The following code does what I need in Oracle SQL, so it should be possible in Spark SQL too, I reckon:
drop table mytable;

CREATE TABLE mytable (
  gender number(10) NOT NULL,
  age    number(10) NOT NULL,
  value  number(10)
);

insert into mytable values (1, 20, 21);
insert into mytable values (2, 23, 22);
insert into mytable values (1, 26, 23);
insert into mytable values (2, 29, 24);
insert into mytable values (1, 30, 25);

select * from mytable;

SELECT A.VALUE AS MALE, B.VALUE AS FEMALE
FROM (select value, rownum RN from mytable where gender = 1) A
FULL OUTER JOIN (select value, rownum RN from mytable where gender = 2) B
  ON A.RN = B.RN;
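As a logic check, the rownum + full outer join idea can be mirrored over plain Scala collections: joining the two value lists by position (zipAll) is equivalent to joining on rownum, with None standing in for the NULLs the outer join would produce. Row and maleFemaleValues are illustrative names, not Spark API.

```scala
// Plain-Scala mirror of the Oracle query above: filter each gender,
// then join the two value lists by position. zipAll pads the shorter
// side with None, matching the NULLs from a FULL OUTER JOIN on rownum.
object RownumJoinSketch {
  case class Row(gender: Int, age: Int, value: Int)

  def maleFemaleValues(rows: Seq[Row]): Seq[(Option[Int], Option[Int])] = {
    val male   = rows.filter(_.gender == 1).map(r => Option(r.value))
    val female = rows.filter(_.gender == 2).map(r => Option(r.value))
    male.zipAll(female, None, None)
  }
}
```

Run against the five inserted rows, this yields (21, 22), (23, 24), (25, null), which is exactly what the Oracle query returns.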
