a) When reading from a bounded source, how large can a PCollection be when the pipeline runs on Dataflow?

b) With big data, say a PCollection of about 50 million records that needs to look up values in another PCollection of about 10 million records: can that be done, and how well does Beam/Dataflow perform? In a ParDo function, given that we can pass only one input and get back one output, how can a lookup be performed across two input datasets? I am approaching Dataflow/Beam like any other ETL tool, where a simple lookup step can be used to create a new PCollection. Any code snippets that might help would be appreciated.
I have also seen the side input functionality, but can a side input really hold a dataset that large, if that is how the lookup is supposed to be done?
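
To make the question concrete, the lookup I have in mind behaves roughly like the plain-Python sketch below. This is not Beam code; the function name `lookup_join`, the keys, and the sample records are made up for illustration, and the in-memory dictionary is exactly the part I doubt would scale to 10 million records.

```python
from typing import Iterable, List, Optional, Tuple


def lookup_join(
    main_records: Iterable[Tuple[str, str]],
    lookup_records: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, Optional[str]]]:
    """Enrich each (key, value) main record with the matching lookup value.

    Records without a match get None, like a left outer join.
    """
    # Hypothetical in-memory lookup table built from the smaller dataset.
    lookup_table = dict(lookup_records)
    return [(key, value, lookup_table.get(key)) for key, value in main_records]


# Tiny stand-ins for the 50 million and 10 million record datasets.
orders = [("c1", "order-1"), ("c2", "order-2"), ("c9", "order-3")]
customers = [("c1", "Alice"), ("c2", "Bob")]

enriched = lookup_join(orders, customers)
```

What I am looking for is the Beam/Dataflow equivalent of this, where both inputs are PCollections and the result is a new PCollection.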