I have two Spark DataFrames. The first has two columns, id and Tag. The second has an id column but is missing the Tag. The first DataFrame is essentially a dictionary: each id appears once, while in the second DataFrame an id may appear several times. I need to create a new column in the second DataFrame that holds the Tag looked up from the id in each row. I think this can be done by converting to RDDs first, but I thought there must be a more elegant way using DataFrames (in Java).

Example: given a df1 row (id: 0, Tag: "A"), a df2 row 1 (id: 0, Tag: null), and a df2 row 2 (id: 0, Tag: "B"), I need the Tag column in the resulting DataFrame df3 to equal df1's Tag for id 0 ("A") if the df2 Tag was null, but keep the original Tag if it is not null. This results in df3 row 1 (id: 0, Tag: "A") and df3 row 2 (id: 0, Tag: "B"). I hope the example is clear.
| ID | No. | Tag | new Tag Col |
|----|-----|-----|-------------|
| 1 | 10002 | A | A |
| 2 | 10003 | B | B |
| 1 | 10004 | null | A |
| 2 | 10005 | null | B |
A LEFT OUTER JOIN won't work for you? – zero323
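To make the suggestion concrete, here is a minimal plain-Java sketch of what a left outer join followed by a null-coalescing step does to the example rows. It models the DataFrames with ordinary collections rather than Spark, so it runs without a cluster. The names `TagFill`, `TaggedRow`, `fillTags`, and `dictTag` are made up for illustration. The Spark calls in the trailing comment are the ones I would expect to use, but they are untested here and should be checked against the API of your Spark version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TagFill {
    /** Hypothetical stand-in for one df2 row: an id and a possibly-null tag. */
    static final class TaggedRow {
        final int id;
        final String tag;

        TaggedRow(int id, String tag) {
            this.id = id;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return "TaggedRow(id=" + id + ", tag=" + tag + ")";
        }
    }

    /** Returns the first argument unless it is null, otherwise the second. */
    static String coalesce(String first, String second) {
        return first != null ? first : second;
    }

    /**
     * Mimics "df2 LEFT OUTER JOIN df1 ON id" plus coalesce(df2.Tag, df1.Tag).
     * Every df2 row is kept; an id missing from the dictionary yields a null lookup.
     */
    static List<TaggedRow> fillTags(List<TaggedRow> rows, Map<Integer, String> dictionary) {
        List<TaggedRow> result = new ArrayList<>();
        for (TaggedRow row : rows) {
            String dictTag = dictionary.get(row.id); // null when there is no match
            result.add(new TaggedRow(row.id, coalesce(row.tag, dictTag)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> df1 = new HashMap<>();
        df1.put(0, "A");

        List<TaggedRow> df2 = new ArrayList<>();
        df2.add(new TaggedRow(0, null)); // Tag missing, should become "A"
        df2.add(new TaggedRow(0, "B"));  // Tag present, should stay "B"

        // prints [TaggedRow(id=0, tag=A), TaggedRow(id=0, tag=B)]
        System.out.println(fillTags(df2, df1));
    }
}

// Roughly analogous Spark Java calls (an untested sketch; "dictTag" and "newTag"
// are hypothetical column names):
//   d = df1.withColumnRenamed("Tag", "dictTag");
//   df3 = df2.join(d, df2.col("id").equalTo(d.col("id")), "left_outer")
//            .withColumn("newTag", functions.coalesce(df2.col("Tag"), d.col("dictTag")));
```

Because each id appears only once in df1, the join cannot multiply rows in df2, so df3 should have exactly as many rows as df2.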