Spark offers some great streaming functionality. Recently, R gained streaming capabilities as well, via sparklyr (https://spark.rstudio.com/guides/streaming/), which builds on Spark Structured Streaming.
Structured Streaming (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) supports many join variants, watermarked within a certain window.
How can I use these windowing capabilities with sparklyR?
Edit:
I am interested in the following cases:
Windowed aggregation
(scala)
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()
(R)
stream_watermark(df, column = "timestamp", threshold = "10 minutes")
replaces the .withWatermark("timestamp", "10 minutes") call.
But where can I find an equivalent of window($"timestamp", "10 minutes", "5 minutes")?
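Here is a minimal sketch of what I would hope works, assuming sparklyr's dplyr backend passes the unrecognized window() call through verbatim to Spark SQL, which has a built-in window() function (untested; the source path and column names are made up):
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# hypothetical streaming source with `timestamp` and `word` columns
words <- stream_read_csv(sc, "words-in/")

windowed_counts <- words %>%
  stream_watermark(column = "timestamp", threshold = "10 minutes") %>%
  # window() is not an R function; the hope is that dplyr leaves the
  # call untranslated and Spark SQL resolves it as its built-in window()
  group_by(win = window(timestamp, "10 minutes", "5 minutes"), word) %>%
  summarise(count = n())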
Streaming joins
How is the inner join with optional watermarking (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking) ported to sparklyr?
// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)
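My best guess so far is a sketch along these lines, assuming the range condition cannot be expressed with dplyr's equality joins, so both streams are registered as temporary views and joined via raw Spark SQL (untested; the paths are made up and the watermark thresholds just mirror the Spark docs example):
impressions <- stream_read_parquet(sc, "impressions/") %>%
  stream_watermark(column = "impressionTime", threshold = "2 hours")
clicks <- stream_read_parquet(sc, "clicks/") %>%
  stream_watermark(column = "clickTime", threshold = "3 hours")

# make both streams visible to Spark SQL
sdf_register(impressions, "impressions")
sdf_register(clicks, "clicks")

# join with event-time constraints, written as raw Spark SQL
joined <- sdf_sql(sc, "
  SELECT *
  FROM clicks JOIN impressions ON
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
")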
Triggers
How can I set triggers as defined in https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers?
stream_trigger_interval can specify fixed trigger intervals, but what about the unspecified (default), run-once, or continuous modes?
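The sparklyr reference also lists stream_trigger_continuous, so two of the modes seem covered; a sketch of passing both to a stream writer (untested, paths made up, and I found no helper for run-once):
# fixed-interval micro-batches, firing every 5 seconds
stream_read_csv(sc, "in/") %>%
  stream_write_csv("out/",
    trigger = stream_trigger_interval(interval = 5000))

# continuous processing with a 5-second checkpoint interval
stream_read_csv(sc, "in/") %>%
  stream_write_csv("out/",
    trigger = stream_trigger_continuous(checkpoint = 5000))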