1 vote

I want to join two DataFrames based on the following condition: df1.col("name") == df2.col("name") and df1.col("starttime") is greater than df2.col("starttime").

The first part of the condition is fine; I use the "equalTo" method of the Column class in Spark SQL. But for the "greater than" condition, I use the following syntax in Java:

df1.col("starttime").gt(df2.col("starttime"))

It does not work. It seems the "gt" function of Column in Spark SQL only accepts numerical values; it does not work properly when you pass a Column as its input parameter. The program finishes normally but the results are wrong: it does not find any rows in the DataFrame that satisfy my condition, although I know such rows exist.

Any idea how I should implement a comparison between two columns in Spark SQL (e.g. whether one column is greater than a column in another DataFrame)?
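For reference, here is roughly the join I am trying to express, as a minimal sketch in Java (df1 and df2 are assumed to already hold the name and starttime columns):

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.DataFrame;

    // Combine the equality and the greater-than conditions into one join expression.
    Column joinCondition = df1.col("name").equalTo(df2.col("name"))
            .and(df1.col("starttime").gt(df2.col("starttime")));

    DataFrame joined = df1.join(df2, joinCondition);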

It's not clear what you mean by "one column is greater than another column". Please add an example. – pheeleeppoo
How about using a join expression and passing a string with a Hive/SQL condition? – summerbulb
How can I use a join expression and pass a string with a SQL condition? – A.B.
My mistake. It can't be done in a string. – summerbulb

2 Answers

0 votes

Try applying .gt after first converting your columns with org.apache.spark.sql.functions.to_utc_timestamp:

rdd1.toDF("date1", ...)
    .join(rdd2.toDF("date2", ...), to_utc_timestamp('date1, tz).gt(to_utc_timestamp('date2, tz)))

where tz is the time zone ID of the timestamp strings (to_utc_timestamp takes a time zone, not a format pattern, as its second argument). Converting both columns to UTC timestamps ensures .gt compares them as timestamps.
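In Java, the same idea would look roughly like this (a sketch; it assumes the two DataFrames from the snippet above are bound to df1 and df2, and "UTC" is an assumed time zone ID; use the zone your data was recorded in):

    import static org.apache.spark.sql.functions.to_utc_timestamp;

    // Cast both string columns to timestamps normalized to UTC, then compare.
    // "UTC" here is an assumed zone; substitute the zone of your timestamps.
    DataFrame joined = df1.join(df2,
            to_utc_timestamp(df1.col("date1"), "UTC")
                    .gt(to_utc_timestamp(df2.col("date2"), "UTC")));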

0 votes

I ran the following code:

    HiveContext sqlContext = new HiveContext(sc);

    List<Event> list = new ArrayList<>();
    list.add(new Event(1, "event1", Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-03 00:00:00")));
    list.add(new Event(2, "event2", Timestamp.valueOf("2017-01-02 00:00:00"), Timestamp.valueOf("2017-01-03 00:00:00")));

    List<Event> list2 = new ArrayList<>();
    list2.add(new Event(1, "event11", Timestamp.valueOf("2017-01-02 00:00:00"), Timestamp.valueOf("2017-01-10 00:00:00")));
    list2.add(new Event(2, "event22", Timestamp.valueOf("2017-01-01 00:00:00"), Timestamp.valueOf("2017-01-15 00:00:00")));

    DataFrame df1 = getDF(sc, sqlContext, list);
    DataFrame df2 = getDF(sc, sqlContext, list2);

    // Join only on startTime being strictly greater, to isolate the gt behavior.
    df1.join(df2, df1.col("startTime").gt(df2.col("startTime"))).show();

And here is the result I got:

+---+------+--------------------+--------------------+---+-------+--------------------+--------------------+
| id|  name|           startTime|             endTime| id|   name|           startTime|             endTime|
+---+------+--------------------+--------------------+---+-------+--------------------+--------------------+
|  2|event2|2017-01-02 00:00:...|2017-01-03 00:00:...|  2|event22|2017-01-01 00:00:...|2017-01-15 00:00:...|
+---+------+--------------------+--------------------+---+-------+--------------------+--------------------+

Seems to me like it works as expected.

Also, the Spark source code (version 1.6 here) tells the same story: Column.gt takes Any, so passing another Column is supported.
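For completeness, the answer does not show the Event bean or the getDF helper; a plausible reconstruction, assuming a standard JavaBean and schema inference via createDataFrame, would be:

    import java.io.Serializable;
    import java.sql.Timestamp;
    import java.util.List;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    // A plain JavaBean: Spark's bean-based schema inference needs a no-arg
    // constructor plus getters and setters for every field.
    public static class Event implements Serializable {
        private int id;
        private String name;
        private Timestamp startTime;
        private Timestamp endTime;

        public Event() {}

        public Event(int id, String name, Timestamp startTime, Timestamp endTime) {
            this.id = id;
            this.name = name;
            this.startTime = startTime;
            this.endTime = endTime;
        }

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Timestamp getStartTime() { return startTime; }
        public void setStartTime(Timestamp startTime) { this.startTime = startTime; }
        public Timestamp getEndTime() { return endTime; }
        public void setEndTime(Timestamp endTime) { this.endTime = endTime; }
    }

    static DataFrame getDF(JavaSparkContext sc, HiveContext sqlContext, List<Event> events) {
        // createDataFrame infers the schema from the Event bean's properties.
        return sqlContext.createDataFrame(sc.parallelize(events), Event.class);
    }

This is just one way to build the DataFrames; any source with a timestamp-typed startTime column would behave the same under the join.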