I am getting below error in below code snippet -
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Below is my input schema
val schema = new StructType()
.add("product",StringType)
.add("org",StringType)
.add("quantity", IntegerType)
.add("booked_at",TimestampType)
Creating streaming source dataset
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
Creating another streaming dataframe where aggregation is done and then joining it with original source dataframe to filter out records
payload_df.createOrReplaceTempView("orders")
val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")
val total_qty = spark.sql(
"select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at ")
Finally, I was trying to display results on console with Append output mode. I am not able to figure out where I need to add watermark or how to resolve this. My objective is to filter out only those events in every trigger which have higher timestamp then the maximum timestamp received in any of the earlier triggers
total_qty
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()