I have 2 dataframes:
val df1 = sc.parallelize(Seq((123, 2.23, 1.12), (234, 2.45, 0.12), (456, 1.112, 0.234))).toDF("objid", "ra", "dec")
val df2 = sc.parallelize(Seq((4567, 123, "name1", "val1"), (2322, 456, "name2", "val2"), (3324, 555, "name3", "val3"), (5556, 123, "name4", "val4"), (3345, 123, "name5", "val5"))).toDF("specid", "objid", "name", "value")
They look like below:
df1.show()
+-----+-----+-----+
|objid| ra| dec|
+-----+-----+-----+
| 123| 2.23| 1.12|
| 234| 2.45| 0.12|
| 456|1.112|0.234|
+-----+-----+-----+
df2.show()
+------+-----+-----+-----+
|specid|objid| name|value|
+------+-----+-----+-----+
| 4567| 123|name1| val1|
| 2322| 456|name2| val2|
| 3324| 555|name3| val3|
| 5556| 123|name4| val4|
| 3345| 123|name5| val5|
+------+-----+-----+-----+
Now I want to nest df2 inside df1 as a nested column so the schema should look like below:
val new_schema = df1.schema.add("specs", df2.schema)
new_schema: org.apache.spark.sql.types.StructType = StructType(StructField(objid,IntegerType,false), StructField(ra,DoubleType,false), StructField(dec,DoubleType,false), StructField(specs,StructType(StructField(specid,IntegerType,false), StructField(objid,IntegerType,false), StructField(name,StringType,true), StructField(value,StringType,true)),true))
The reason I wanted to do it this way was because there is a one to many relationship between df1 and df2, which means there are more than 1 specs per objid. And I am not going to join only these two tables. There are about 50 tables that I want to ultimately join together to create a mega table. Most of those tables have 1 to n relationships and I was just thinking about a way to avoid having a lot of duplicate rows and null cells in the ultimate join result.
The ultimate result would look something like:
+-----+-----+-----+----------------------+
| | specs |
|objid| ra| dec| specid| name | value|
+-----+-----+-----+------+----+-------+ |
| 123| 2.23| 1.12| 4567 | name1 | val1 |
| | 5556 | name4 | val4 |
| | 3345 | name5 | val5 |
+-----+-----+-----+----------------------+
| 234| 2.45| 0.12| |
+-----+-----+-----+----------------------+
| 456|1.112|0.234| 2322 | name2 | val2 |
+-----+-----+-----+----------------------+
I was trying to add the column to df1 using .withColumn
but ran into errors.
What I actually wanted to do was to select all the columns from df2 with the condition where df2.objid = df1.objid
to match the rows and make that the new column in df1 but I am not sure if that's the best approach. Even if so, I am not sure how to do that.
Could someone please tell me how to do this?
show
). I suppose what you want is a simplejoin
- Raphael Roth