
a) When reading from a bounded source, how big can a PCollection be when running in Dataflow? b) When dealing with big data, say a PCollection of about 50 million records that needs to look up another PCollection of about 10 million records: can that be done, and how well does Beam/Dataflow perform? Given that a ParDo takes a single input and produces a single output, how can a lookup be performed across 2 input datasets? I am trying to view Dataflow/Beam like any other ETL tool, where a simple lookup can be used to create a new PCollection. Please provide any code snippets that might help.

I have also seen the side input functionality, but can a side input really hold such a big dataset, if that is how the lookup should be accomplished?


1 Answer


You can definitely do this with side inputs, as a side input may be arbitrarily large.

In Java you'd do something like this:

Pipeline pipeline = Pipeline.create(options);
PCollectionView<Map<...>> lookupCollection = pipeline
   .apply(new ReadMyLookupCollection())
   .apply(View.asMap());


PCollection<..> mainCollection = pipeline
    .apply(new ReadMyPCollection())
    .apply(
        ParDo.of(new JoinPCollsDoFn()).withSideInputs(lookupCollection));

class JoinPCollsDoFn<...> extends DoFn<...> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Map<...> siMap = c.sideInput(lookupCollection);
    String lookupKey = c.element().lookupKey;
    AugmentedElement result = c.element().mergeWith(siMap.get(lookupKey));
    c.output(result);
  }
}

FWIW, this is a bit pseudo-codey, but it is a snippet of what you'd like to do. Let me know if you want further clarification.
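To make the per-element logic concrete without the Beam boilerplate: inside the DoFn, the side input behaves like an in-memory `Map`, and each main-input element does one lookup and merges the result. Below is a minimal plain-Java sketch of just that merge step; the `Element` class and its `mergeWith` method are hypothetical stand-ins for the `<...>` placeholders above, not Beam APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class SideInputJoinSketch {

    // Stand-in for a main-collection element carrying a lookup key.
    static final class Element {
        final String lookupKey;
        final String payload;

        Element(String lookupKey, String payload) {
            this.lookupKey = lookupKey;
            this.payload = payload;
        }

        // Equivalent of mergeWith(...) in the pseudocode: combine this
        // element with the value found in the side-input map.
        String mergeWith(String lookupValue) {
            return payload + "|" + lookupValue;
        }
    }

    // Equivalent of processElement: one map lookup per element,
    // with a fallback when the key is absent from the side input.
    static String process(Element e, Map<String, String> siMap) {
        return e.mergeWith(siMap.getOrDefault(e.lookupKey, "MISSING"));
    }

    public static void main(String[] args) {
        Map<String, String> siMap = new HashMap<>();
        siMap.put("k1", "v1");
        System.out.println(process(new Element("k1", "data"), siMap));
    }
}
```

Because each worker sees the side input as a whole map, the cost per element is a single hash lookup, which is why this pattern scales to large main collections.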