I'm building a pipeline using the Apache Beam Java SDK, starting from the PubSubToBigQuery template supplied by Google (the pipeline will be executed in Google Cloud Dataflow).

I'm using Windowing to aggregate data and save grouped data. For example:

1) a_id: 1 b_id: 2 c_id: 3 name: name1 value: 1
2) a_id: 1 b_id: 1 c_id: 3 name: name2 value: 1
3) a_id: 1 b_id: 2 c_id: 3 name: name3 value: 2
4) a_id: 1 b_id: 1 c_id: 3 name: name4 value: 1
5) a_id: 1 b_id: 1 c_id: 3 name: name5 value: 4
6) a_id: 2 b_id: 1 c_id: 3 name: name6 value: 1

I receive this block of data in my 1-minute window, and I want to group it by a_id, b_id and c_id and count the rows, so I would expect this aggregation result:

1) a_id: 1 b_id: 2 c_id: 3 count: 2
2) a_id: 1 b_id: 1 c_id: 3 count: 3
3) a_id: 2 b_id: 1 c_id: 3 count: 1

How can I use the GroupByKey transform to do this kind of grouping with multiple keys?

1 Answer

It looks like the records you wish to aggregate have a three-part key. I am imagining a structure that contains the following fields (sketched as a Java class after this list):

  • a_id
  • b_id
  • c_id
  • name
  • value

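As a point of reference, here is a minimal sketch of such a record as a plain Java class. The class name Record and the camel-case field names are assumptions for illustration, not taken from your pipeline:

    import java.io.Serializable;

    // Hypothetical record type; the fields mirror the question's data.
    // Implementing Serializable lets Beam fall back to SerializableCoder
    // when encoding KV<String, Record> values between transforms.
    public class Record implements Serializable {
      public final int aId;
      public final int bId;
      public final int cId;
      public final String name;
      public final int value;

      public Record(int aId, int bId, int cId, String name, int value) {
        this.aId = aId;
        this.bId = bId;
        this.cId = cId;
        this.name = name;
        this.value = value;
      }
    }
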
To aggregate data in Beam, you first convert each record into a key/value pair (Beam's KV type).

How you compose your keys is completely up to you. For the aggregation you describe, you could create a key composed of the a_id, b_id and c_id fields. Consider using a ParDo or MapElements to convert each record into a KV whose key is "[a_id]:[b_id]:[c_id]" (or any other unique key structure built from the fields you need), as in the sketch below.
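
Here is a minimal sketch of that approach, assuming the hypothetical Record class above and a PCollection<Record> named windowedRecords produced by your 1-minute windowing step (both names are illustrative):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Keys each record by "[a_id]:[b_id]:[c_id]".
    static class ToCompositeKeyFn extends DoFn<Record, KV<String, Record>> {
      @ProcessElement
      public void processElement(@Element Record r,
                                 OutputReceiver<KV<String, Record>> out) {
        String key = r.aId + ":" + r.bId + ":" + r.cId;
        out.output(KV.of(key, r));
      }
    }

    PCollection<KV<String, Record>> keyed = windowedRecords
        .apply("AssignCompositeKey", ParDo.of(new ToCompositeKeyFn()));

    // GroupByKey gathers, per window, every record sharing a composite key...
    PCollection<KV<String, Iterable<Record>>> grouped =
        keyed.apply("GroupByCompositeKey", GroupByKey.<String, Record>create());

    // ...and a second ParDo counts the rows in each group.
    PCollection<KV<String, Long>> counts = grouped.apply("CountRows",
        ParDo.of(new DoFn<KV<String, Iterable<Record>>, KV<String, Long>>() {
          @ProcessElement
          public void processElement(@Element KV<String, Iterable<Record>> kv,
                                     OutputReceiver<KV<String, Long>> out) {
            long n = 0;
            for (Record ignored : kv.getValue()) {
              n++;
            }
            out.output(KV.of(kv.getKey(), n));
          }
        }));

If you only need the counts and not the grouped records themselves, Count.perKey() (from org.apache.beam.sdk.transforms.Count) replaces the GroupByKey and the counting ParDo with a single combining transform that produces the same PCollection<KV<String, Long>> and can combine partial counts more efficiently.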