
I am using Apache Beam with Cloud Dataflow. Imagine millions of products arriving to the pipeline. After some steps (filtering, mapping and so on) I want to partition the data by some field. I tried to use Partition transform, and I guess it does the partitioning right. However, I have no idea how to proceed further. Here's what confuses me:

My goal is to partition the data by some field and write all that data to different tables. Let's say that the partition transform sliced the data in 18 PCollections, then there should be 18 files. However, the partition transform returns PCollectionList, and I can't apply TextIO transform to it. I tried iterating it and applying TextIO transform to each PCollection, but it didn't work.

How do I write all the parts to file after Partition transform?

Thanks,

Comment: Can you add some code and point at the problem in code? – boden

1 Answer


The function passed to the Partition transform must return the index of the partition each element belongs to. The rest of the pipeline then simply references the resulting PCollections. An example in Python:

import apache_beam as beam

with beam.Pipeline() as p:
    even, odd = (p
                 | "Create Numbers" >> beam.Create(range(10))
                 | "Odd or Even" >> beam.Partition(lambda n, num_partitions: n % 2, 2))
    # Partition(fn, num_partitions): fn returns the partition index (0..num_partitions-1)
    # even receives elements where fn returns 0, odd where it returns 1

    even | "even write" >> beam.io.textio.WriteToText('Output/Even')
    odd | "odd write" >> beam.io.textio.WriteToText('Output/Odd')