val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", 555-5555, "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")

I'm trying to run a function on each row of a DataFrame (in streaming). This function will contain a combination of Scala code and Spark DataFrame API code. For example, I want to take the three features from df and use them to filter a second DataFrame called df2. My understanding is that a UDF can't accomplish this. Right now I have all of the filtering code working just fine, but without the ability to apply it to each row of df.

My goal is to be able to do something like

df.select("ID","preferences").map(row => ( //filter df2 using row(0), row(1) and row(3) ))

The dataframes can't be joined; there is no joinable relationship between them.

Although I'm using Scala, an answer in Java or Python would probably be fine.

I'm also fine with alternative ways of accomplishing this. If I could extract the data from the rows into separate variables (keep in mind this is streaming), that's also fine.

We still don't know which case it is. Both streaming, or what? – eliasah
Both are streaming DataFrames coming from Kafka topics. – Brian

1 Answer


My understanding is that a UDF can't accomplish this.

That is correct, but neither can map (local Datasets seem to be an exception; see Why does this Spark code make NullPointerException?). Nested logic like this can be expressed only with joins:

  • If both Datasets are streaming, it has to be an equi-join. That means that even though:

    The dataframes can't be joined; there is no joinable relationship between them.

    you have to derive a join key in some way that approximates the filter condition well (see the sketch after this list).

  • If one Dataset is not streaming, you can brute-force things with crossJoin followed by filter (also sketched below), but that is of course hardly recommended.
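
A minimal sketch of the first option, reusing the toy dataframes from the question so the snippet stays runnable as a batch job; the same withColumn and join calls would be applied to the two streaming DataFrames read from Kafka. The key-derivation rule (taking the trailing digits of "client-ID" so they line up with "office-ID") and the final age filter are assumptions made for illustration, not something stated in the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("derived-equijoin").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", "555-5555", "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")

// Derive a joinable key on each side so the join can be an equi-join.
val left  = df1.withColumn("joinKey", regexp_extract(col("client-ID"), "(\\d+)$", 1))
val right = df2.withColumn("joinKey", col("office-ID"))

// Equi-join on the derived key; whatever conditions the key does not capture
// go into a post-join filter.
val joined = left.join(right, Seq("joinKey"))
  .where(col("age") > 18)   // placeholder for the rest of the filter logic
joined.show()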
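
And a rough sketch of the second option on the same toy data: if df2 were a small static DataFrame, crossJoin followed by filter can express an arbitrary per-row condition, at the cost of comparing every pair of rows. The endsWith condition is a hypothetical stand-in for the real filter logic.

// Continues from the SparkSession and toy dataframes defined above.
// In the real pipeline only df1 would be streaming; df2 would be static.
val bruteForced = df1.crossJoin(df2)
  .where(col("client-ID").endsWith(col("office-ID")))
bruteForced.show()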