val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", 555-5555, "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")

I'm trying to run a function on each row of a DataFrame (in streaming). This function will contain a combination of Scala code and Spark DataFrame API code. For example, I want to take the three features from df and use them to filter a second DataFrame called df2. My understanding is that a UDF can't accomplish this. Right now I have all of the filtering code working just fine, but without the ability to apply it to each row of df.

My goal is to be able to do something like

df.select("ID","preferences").map(row => ( //filter df2 using row(0), row(1) and row(3) ))

The dataframes can't be joined; there is no joinable relationship between them.

Although I'm using Scala, an answer in Java or Python would probably be fine.

I'm also fine with alternative ways of accomplishing this. If I could extract the data from the rows into separate variables (keep in mind this is streaming), that's also fine.

We still don't know which case it is. Both streaming, or what? – eliasah
Both are streaming DataFrames coming from Kafka topics. – Brian

1 Answer


My understanding is that a UDF can't accomplish this.

That is correct, but neither can map (local Datasets seem to be an exception; see Why does this Spark code make NullPointerException?). Nested logic like this can be expressed only with joins:

  • If both Datasets are streaming, it has to be an equi-join. That means that even though:

    The dataframes can't be joined; there is no joinable relationship between them.

    you have to derive a join key in some way that approximates the filter condition well (see the sketch after this list).

  • If one Dataset is not streaming, you can brute-force things with crossJoin followed by filter (also sketched below), but that is of course hardly recommended.
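
A minimal sketch of the first option, reusing the toy dataframes from the question so the snippet stays runnable as a batch job; the same withColumn and join calls would be applied to the two streaming DataFrames read from Kafka. The key-derivation rule (taking the trailing digits of "client-ID" so they line up with "office-ID") and the final age filter are assumptions made for illustration, not something stated in the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("derived-equijoin").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", "555-5555", "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")

// Derive a joinable key on each side so the join can be an equi-join.
val left  = df1.withColumn("joinKey", regexp_extract(col("client-ID"), "(\\d+)$", 1))
val right = df2.withColumn("joinKey", col("office-ID"))

// Equi-join on the derived key; whatever conditions the key does not capture
// go into a post-join filter.
val joined = left.join(right, Seq("joinKey"))
  .where(col("age") > 18)   // placeholder for the rest of the filter logic
joined.show()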
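
And a rough sketch of the second option on the same toy data: if df2 were a small static DataFrame, crossJoin followed by filter can express an arbitrary per-row condition, at the cost of comparing every pair of rows. The endsWith condition is a hypothetical stand-in for the real filter logic.

// Continues from the SparkSession and toy dataframes defined above.
// In the real pipeline only df1 would be streaming; df2 would be static.
val bruteForced = df1.crossJoin(df2)
  .where(col("client-ID").endsWith(col("office-ID")))
bruteForced.show()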