0
votes

Beam's GroupByKey groups records by key across all partitions and outputs a single iterable per-key-per-window. This "brings associated data together into one location"

Is there a way I can groups records by key locally, so that I still get a single iterable per-key-per-window as its output, but only over the local records in the partition instead of a global group-by-key over all locations?

1
How do you define your partition in Beam? - Rui Wang

1 Answers

0
votes

If I understand your question correctly, you don't want to transfer a data over network if a part of it (partition) was processed on the same machine and then can be grouped locally.

Normally, Beam doesn't provide you details where and how your code will be running since it may vary depending on runner/engine/resource manager. Though, if you can fetch some uniq information about your worker (like hostname, ip or mac address) then you can use it as a part of your key and group all related data by this. Quite likely that in this case these data partitions won't be moved to other machines since all needed input data is already sitting on the same machine and can be processed locally. Though, afaik, there is no 100% guarantee about that.