What would be the best Spark DataFrame equivalent of this SQL update?

update table1 set colx = "some value" where coly in (select coltab2 from table2 where <another condition>)
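For reference, here is a minimal sketch of the two tables involved (the SparkSession and the sample values are made up just so the snippets below are runnable; in reality both tables come from elsewhere):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("update-via-join").master("local[*]").getOrCreate()
import spark.implicits._

// table1: the table whose colx I want to update
val df1 = Seq(
  ("old", "a"),
  ("old", "b"),
  ("old", "c")
).toDF("colx", "coly")

// table2: provides the values for the "coly in (...)" part of the where clause
val table2 = Seq(
  ("a", 1),
  ("c", 2)
).toDF("coltab2", "othercol")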
I have a working solution, but I am really not satisfied with it. It looks cumbersome and I hope I am missing a simpler way.
First I get the values for the where clause (there could be thousands, so I don't want to use a collection):
val df2 = xxxx.select("coltab2")
df2: org.apache.spark.sql.DataFrame = [coltab2: string]
This DataFrame contains all the values I want to retain in the where clause.
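For contrast, this is the collect-and-isin variant I am deliberately avoiding, since the list pulled back to the driver could contain thousands of values (just a sketch):

// Collect the lookup values to the driver and push them back into an isin() test.
// Fine for small lists, but not what I want with thousands of values.
val values = df2.select("coltab2").as[String].collect()
val df_updated_isin = df1.withColumn(
  "colx",
  when($"coly".isin(values: _*), "some value").otherwise($"colx")
)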
Then I perform a left outer join with table1 (df1) to add coltab2, on df2.coltab2 = df1.coly. If the added coltab2 value is not null, it means that the row's coly was present in table2, so I use that condition to update colx in the original table1 (df1), and then drop the added coltab2 column, which only served as the update condition.
val df_updated = df1
  .join(df2, df1("coly") === df2("coltab2"), "left_outer")
  .withColumn("colx", when($"coltab2".isNotNull, "some value").otherwise(col("colx")))
  .drop(col("coltab2"))
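On the made-up sample data above, only the rows whose coly matches some coltab2 should end up with colx = "some value"; the rest should keep their original colx:

// Sanity check: only the matched rows should have colx rewritten
df_updated.show()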
Hope I am completely wrong and there is a more efficient way to do it ;)