0
votes

I have a PCollection which consists of an ID column and seven value columns. There are several rows for each ID.

I would like to compute the mean of the seven columns per unique ID.

Is there a way to achieve this without going through each element programmatically and creating key/value pair per each element?

1

1 Answers

0
votes
table_pcoll = ....

def per_column_average(rows, ignore_elms=[ID_INDEX]):
  return [sum([row[idx] if idx not in ignore_elms else 0 
               for row in rows])/len(row[0]) 
          for idx, _ in enumerate(rows[0])]

keyed_averaged_elm = (table_pcoll 
                      | beam.Map(lambda x: (x[ID_INDEX], x))
                      | beam.GroupByKey()
                      | beam.Map(lambda x: (x[0], per_column_average(rows))

Sorry about the nasty one-liner. I hope that helps.