I have two large dataframes [a] one which has all events identified by an id [b] a list of ids. I want to filter [a] based on the ids in [b] using the stat.bloomFilter implementation in spark 2.0.0
However I don't see any operations in the dataset API to join the bloom filter to the data frame [a]
val in1 = spark.sparkContext.parallelize(List(0, 1, 2, 3, 4, 5))
val df1 = in1.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val in2 = spark.sparkContext.parallelize(List(0, 1, 2))
val df2 = in2.map(x => (x)).toDF("c1")
val expectedNumItems: Long = 1000
val fpp: Double = 0.005
val sbf = df.stat.bloomFilter($"c1", expectedNumItems, fpp)
val sbf2 = df2.stat.bloomFilter($"c1", expectedNumItems, fpp)
What is the best way to filter 'df1' based on values in df2?
Thanks!