How do I write to multiple files in Apache Beam?

Question

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is a PCollection<KV<String, String>>, and I want to write each value to a file that corresponds to its key.

For example, let's say the result consists of

(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)

Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.

And in my case:

The key set is determined while the pipeline is running, not when the pipeline is constructed.
The key set may be quite small, but the number of values for each key may be very large.

Any ideas?

Side outputs - beam.apache.org/documentation/programming-guide/… — Graham Polley
@GrahamPolley I think side outputs are decided at graph construction time. But my case needs them decided while the pipeline is running. :-( — abcdabcd987
Yup, that's right. Beam doesn't support dynamic side outputs (or inputs) yet. — Graham Polley
@GrahamPolley yeah, I know. issues.apache.org/jira/browse/BEAM-92 still unsolved. So I'm wondering if there are some workarounds. — abcdabcd987

CasualT · Accepted Answer · 2017-04-11T21:15:18

Handily, I wrote a sample of this case just the other day.

This example is written in the Dataflow 1.x SDK style.

Basically, you group by key and then write each group with a custom transform that connects to Cloud Storage. The caveat is that the list of lines for each file shouldn't be massive: it has to fit into memory on a single instance. Considering you can run high-memory instances, that limit is pretty high.

    ...
    PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
        .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
    readyToWrite.apply(
        new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
    ...
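
The snippet above relies on AccumulatorOfWords.getCombineFn(), which the answer does not show. As a rough sketch only, the logic such a combiner would need is "collect every value for a key into one list". The class below is hypothetical and written in plain Java so it runs without the SDK; its method names mirror the four steps of Combine.CombineFn (createAccumulator, addInput, mergeAccumulators, extractOutput), but it is not the author's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the AccumulatorOfWords combiner, which is not shown in the answer.
// Plain Java (no SDK dependency); the method names mirror the steps of Combine.CombineFn.
final class ListAccumulatorSketch {

    // Start each key with an empty accumulator.
    static List<String> createAccumulator() {
        return new ArrayList<>();
    }

    // Fold one value into the accumulator for its key.
    static List<String> addInput(List<String> accumulator, String input) {
        accumulator.add(input);
        return accumulator;
    }

    // Merge partial accumulators that were built independently.
    static List<String> mergeAccumulators(Iterable<List<String>> accumulators) {
        List<String> merged = createAccumulator();
        for (List<String> partial : accumulators) {
            merged.addAll(partial);
        }
        return merged;
    }

    // The per-key output is simply the collected list.
    static List<String> extractOutput(List<String> accumulator) {
        return accumulator;
    }
}
```

Because the output for a key is one in-memory list, this is exactly where the memory caveat mentioned above comes from.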

And then the transform doing most of the work is:

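    // Imports assume the Dataflow 1.x SDK and the google-cloud-storage client.
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.PTransform;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.SerializableFunction;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.stream.Collectors;
    import org.slf4j.Logger;

    // Logging and MimeTypes are not shown in this answer. With plain SLF4J,
    // LoggerFactory.getLogger(...) and the literal "text/plain" would serve instead.
    public class PTransformWriteToGCS
        extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

        private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);

        // Static, so the client is created per JVM rather than serialized with the transform.
        private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

        private final String bucketName;

        // Maps a key to the object path that the key's values are written to.
        private final SerializableFunction<String, String> pathCreator;

        public PTransformWriteToGCS(final String bucketName,
            final SerializableFunction<String, String> pathCreator) {
            this.bucketName = bucketName;
            this.pathCreator = pathCreator;
        }

        @Override
        public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {

            return input
                .apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

                    @Override
                    public void processElement(
                        final DoFn<KV<String, List<String>>, Void>.ProcessContext c)
                        throws Exception {
                        final String key = c.element().getKey();
                        final List<String> values = c.element().getValue();
                        // One element is one key: join all of its values into one file body.
                        final String toWrite = values.stream().collect(Collectors.joining("\n"));
                        final String path = pathCreator.apply(key);
                        BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                            .setContentType(MimeTypes.TEXT)
                            .build();
                        LOG.info("blob writing to: {}", blobInfo);
                        // Upload the whole body in a single call.
                        STORAGE.create(blobInfo, toWrite.getBytes(StandardCharsets.UTF_8));
                    }
                }));
        }
    }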
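
The per-element work in the transform above (derive a path from the key, join the values with newlines, write the bytes) can be tried out without GCS. The following is a minimal local-filesystem sketch of that step, not the answer's code: the class name PerKeyFileWriterSketch, the writeGroup method and the use of java.util.function.Function in place of SerializableFunction are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

// Local-filesystem sketch of the per-key write step; every name here is hypothetical.
final class PerKeyFileWriterSketch {

    // Joins the values with newlines and writes them to the path derived from the key.
    // Returns the path of the file that was written.
    static Path writeGroup(Path outputDir, Function<String, String> pathCreator,
                           String key, List<String> values) throws IOException {
        String toWrite = String.join("\n", values);
        Path target = outputDir.resolve(pathCreator.apply(key));
        // Files.write creates the file, or truncates it if it already exists.
        return Files.write(target, toWrite.getBytes(StandardCharsets.UTF_8));
    }
}
```

For the data in the question, calling writeGroup(dir, k -> k + ".txt", "key1", Arrays.asList("value1", "value3", "value4")) would produce key1.txt with one value per line. Since the path depends only on the key, repeating the call in this sketch overwrites the same file rather than creating a second one.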
