0
votes

I have 2 dataframes:

val df1 = sc.parallelize(Seq((123, 2.23, 1.12), (234, 2.45, 0.12), (456, 1.112, 0.234))).toDF("objid", "ra", "dec")

val df2 = sc.parallelize(Seq((4567, 123, "name1", "val1"), (2322, 456, "name2", "val2"), (3324, 555, "name3", "val3"), (5556, 123, "name4", "val4"), (3345, 123, "name5", "val5"))).toDF("specid", "objid", "name", "value")

They look like below:

df1.show()

+-----+-----+-----+                                                             
|objid|   ra|  dec|
+-----+-----+-----+
|  123| 2.23| 1.12|
|  234| 2.45| 0.12|
|  456|1.112|0.234|
+-----+-----+-----+

df2.show()
+------+-----+-----+-----+
|specid|objid| name|value|
+------+-----+-----+-----+
|  4567|  123|name1| val1|
|  2322|  456|name2| val2|
|  3324|  555|name3| val3|
|  5556|  123|name4| val4|
|  3345|  123|name5| val5|
+------+-----+-----+-----+

Now I want to nest df2 inside df1 as a nested column so the schema should look like below:

val new_schema = df1.schema.add("specs", df2.schema)

new_schema: org.apache.spark.sql.types.StructType = StructType(StructField(objid,IntegerType,false), StructField(ra,DoubleType,false), StructField(dec,DoubleType,false), StructField(specs,StructType(StructField(specid,IntegerType,false), StructField(objid,IntegerType,false), StructField(name,StringType,true), StructField(value,StringType,true)),true))

The reason I wanted to do it this way was because there is a one to many relationship between df1 and df2, which means there are more than 1 specs per objid. And I am not going to join only these two tables. There are about 50 tables that I want to ultimately join together to create a mega table. Most of those tables have 1 to n relationships and I was just thinking about a way to avoid having a lot of duplicate rows and null cells in the ultimate join result.

The ultimate result would look something like:

+-----+-----+-----+----------------------+                                                              
|                 |          specs       |
|objid|   ra|  dec| specid| name  | value|
+-----+-----+-----+------+----+-------+  |
|  123| 2.23| 1.12| 4567  | name1 | val1 |
|                 | 5556  | name4 | val4 |
|                 | 3345  | name5 | val5 |
+-----+-----+-----+----------------------+
|  234| 2.45| 0.12|                      |
+-----+-----+-----+----------------------+
|  456|1.112|0.234| 2322  | name2 | val2 |
+-----+-----+-----+----------------------+

I was trying to add the column to df1 using .withColumn but ran into errors.

What I actually wanted to do was to select all the columns from df2 with the condition where df2.objid = df1.objid to match the rows and make that the new column in df1 but I am not sure if that's the best approach. Even if so, I am not sure how to do that.

Could someone please tell me how to do this?

2
can you show the expected output in the above example? (i.e. make a artificial output of show). I suppose what you want is a simple join - Raphael Roth

2 Answers

0
votes

As per my knowledge, you cannot have dataframe inside another dataframe(same is the case with RDDs).

What you need is a join between two dataframes. You can perform different types of joins and join the rows from two dataframes(this is where you make nest df2 columns inside df1)

0
votes

You need to join both the dataframes based on the column objid like below

val join = df1.join(df2, "objid")
join.printSchema()

output:

root
 |-- objid: integer (nullable = false)
 |-- ra: double (nullable = false)
 |-- dec: double (nullable = false)
 |-- specid: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- value: string (nullable = true)

and when we say

join.show()

the output will be

+-----+-----+-----+------+-----+-----+
|objid|   ra|  dec|specid| name|value|
+-----+-----+-----+------+-----+-----+
|  456|1.112|0.234|  2322|name2| val2|
|  123| 2.23| 1.12|  4567|name1| val1|
+-----+-----+-----+------+-----+-----+

for more details you can check here

Update:

I think you are looking for something like this

df1.join(df2, df1("objid") === df2("objid"), "left_outer").show()

and the output is:

+-----+-----+-----+------+-----+-----+-----+
|objid|   ra|  dec|specid|objid| name|value|
+-----+-----+-----+------+-----+-----+-----+
|  456|1.112|0.234|  2322|  456|name2| val2|
|  234| 2.45| 0.12|  null| null| null| null|
|  123| 2.23| 1.12|  4567|  123|name1| val1|
|  123| 2.23| 1.12|  5556|  123|name4| val4|
|  123| 2.23| 1.12|  3345|  123|name5| val5|
+-----+-----+-----+------+-----+-----+-----+