
I'm new to Dataflow, so this is probably an easy question.

I want to try out the Sessions windowing strategy. According to the windowing documentation, windowing is not applied until we've done a GroupByKey, so I'm trying to do that.

However, when I look at my pipeline in Google Cloud Platform, I can see that MapElements returns elements, but no elements are returned by GroupByKey ("Elements Added: -"). What am I doing wrong when grouping by key?

Here's the most relevant part of the code:

// Window events into sessions that close after a 3-second gap.
events = events
  .apply(Window.named("eventsSessionsWindowing")
    .<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3)))
  );

// Key each event so it can be grouped per key within each session window.
PCollection<KV<String, MyEvent>> eventsKV = events
  .apply(MapElements
    .via((MyEvent e) -> KV.of(ExtractKey(e), e))
    .withOutputType(new TypeDescriptor<KV<String, MyEvent>>() {}));

// Group the keyed events; the session windowing takes effect here.
PCollection<KV<String, Iterable<MyEvent>>> eventsGrouped = eventsKV
  .apply(GroupByKey.<String, MyEvent>create());
A few questions to help debug. Is this a batch or streaming pipeline? What runner are you using (direct, dataflow, spark, flink, ...)? Does the issue reproduce in the Direct runner? Do you have other evidence that there are no elements in the GBK, apart from the "Elements added" message - e.g. if you add another ParDo following it that writes the grouped KVs to TextIO, does it end up with empty output? – jkff
It's a streaming pipeline (we're reading from Pub/Sub), using DataflowPipelineRunner. Haven't tried the Direct runner yet. I have tried to output debug information in a ParDo after the GroupByKey, but nothing is output. I'm using LOG.info(...) (which I know works) to debug. Although no elements are added in GroupByKey, it sometimes says "1 elements/s" inside the GroupByKey box in the Google Cloud console. – Håkon Skaarud Karlsen
I now found out that the GroupByKey actually works - it's just very slow. I had to wait a couple of minutes for the log output to appear. Are GroupByKeys supposed to take that long, or is it more likely that I've done something stupid? At the moment, we're not giving Dataflow a lot of data. – Håkon Skaarud Karlsen
I seem to have solved the delay issue by adding triggering. – Håkon Skaarud Karlsen
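
For reference, here is a minimal sketch of the kind of debug ParDo discussed in the comments above, assuming the Dataflow Java SDK 1.x shown in the question and the slf4j logging already in use (LogGroupedEventsFn is a hypothetical name):

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs each grouped pane emitted by the GroupByKey, as evidence that
// elements are (eventually) flowing through it.
static class LogGroupedEventsFn extends DoFn<KV<String, Iterable<MyEvent>>, Void> {
  private static final Logger LOG = LoggerFactory.getLogger(LogGroupedEventsFn.class);

  @Override
  public void processElement(ProcessContext c) {
    int count = 0;
    for (MyEvent e : c.element().getValue()) {
      count++;
    }
    LOG.info("Key {} grouped {} events", c.element().getKey(), count);
  }
}

// Attach it downstream of the GroupByKey:
eventsGrouped.apply(ParDo.named("logGroupedEvents").of(new LogGroupedEventsFn()));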

1 Answer


A GroupByKey fires according to a triggering strategy, which determines when the system thinks all data for a given key/window has been received and it's time to group it and pass it to downstream transforms. The default strategy is:

The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window.

Please see Default Trigger for details. The couple-of-minutes delay you were seeing corresponded to the progression of Cloud Pub/Sub's watermark. If you don't want to wait for the watermark, you can add early firings to the trigger, as sketched below.
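
For example, here is a minimal sketch of adding early firings on top of the session windowing from the question, assuming the Dataflow Java SDK 1.x used there; the 10-second early-firing delay is an arbitrary illustrative value, and once you specify a trigger you must also choose an allowed lateness and an accumulation mode:

import com.google.cloud.dataflow.sdk.transforms.windowing.AfterProcessingTime;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;
import com.google.cloud.dataflow.sdk.transforms.windowing.Sessions;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

events = events
  .apply(Window.named("eventsSessionsWindowing")
    .<MyEvent>into(Sessions.withGapDuration(Duration.standardSeconds(3)))
    // Still emit a final result when the watermark passes the end of
    // the session window...
    .triggering(AfterWatermark.pastEndOfWindow()
      // ...but also emit speculative early results 10 seconds (an
      // illustrative value) after the first element arrives in a pane,
      // so downstream transforms see data without the watermark delay.
      .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))))
    // Required once a trigger is set: drop data that arrives after the
    // watermark passes the window, and don't re-emit elements from
    // earlier panes when the trigger fires again.
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes());

With early firings like this, grouped panes show up within seconds rather than only after the watermark advances, which matches the fix described in the comments on the question.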