
I have a use case where I need to read a BigQuery table into a Dataflow pipeline, then iterate over each row of that PCollection to construct a graph data structure, and finally pass the graph as a side input to further transform steps that take this graph and another BigQuery table PCollection as arguments. Below is what I have right now:

Pipeline pipeline = Pipeline.create(options);

//Read from big query
PCollection<TableRow> bqTable = pipeline.apply("ReadFooBQTable", BigQueryIO.Read.from("Table"));

//Loop over the PCollection to build the "graph" -- still need to figure this out

//Pass the graph as a side input
pCol.apply("Process", ParDo.withSideInputs(graph).of(new BlueKai.ProcessBatch(graph)))
    .apply("Write", Write.to(new DecoratedFileSink<String>(standardBucket, "csv", TextIO.DEFAULT_TEXT_CODER,
        null, null, WriterOutputGzipDecoratorFactory.getInstance())).withNumShards(numChunks));

1 Answer


The problem is going to be how to serialize the graph so it can be passed between machines. If you define a Coder that describes how to serialize an element representing the graph, then you can use it as a side input as you describe.
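For example, if the graph type is Serializable, the SDK's built-in SerializableCoder can serve as that Coder. A minimal sketch, where the Graph class and its addRow/mergeFrom methods are hypothetical placeholders for your own structure:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;
import com.google.cloud.dataflow.sdk.coders.SerializableCoder;

import java.io.Serializable;

// Hypothetical graph type; the key point is that it can be encoded,
// here by making it Serializable and declaring SerializableCoder as its default Coder.
@DefaultCoder(SerializableCoder.class)
class Graph implements Serializable {
  void addRow(TableRow row) { /* add the node/edge this row represents */ }
  void mergeFrom(Graph other) { /* union another partial graph into this one */ }
}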

Assuming the graph can be encoded, you would just use it as a singleton side input. This assumes the graph built from all the rows fits on a single machine. You may need to define a CombineFn<TableRow, Graph, Graph> that computes the graph from the table rows. Assuming two graphs can be combined (e.g., merging them is an associative and commutative operation), you could use a Combine plus asSingletonView, as sketched below.
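A sketch of that CombineFn and the resulting singleton view, reusing the hypothetical Graph class above (GraphCombineFn, "BuildGraph", and the DoFn body are illustrative, not SDK names):

import com.google.cloud.dataflow.sdk.transforms.Combine;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollectionView;

class GraphCombineFn extends Combine.CombineFn<TableRow, Graph, Graph> {
  @Override
  public Graph createAccumulator() { return new Graph(); }

  @Override
  public Graph addInput(Graph accumulator, TableRow row) {
    accumulator.addRow(row);          // fold one row into the partial graph
    return accumulator;
  }

  @Override
  public Graph mergeAccumulators(Iterable<Graph> partials) {
    Graph merged = new Graph();
    for (Graph partial : partials) {  // relies on the merge being associative and commutative
      merged.mergeFrom(partial);
    }
    return merged;
  }

  @Override
  public Graph extractOutput(Graph accumulator) { return accumulator; }
}

// Build the graph once from the BigQuery rows and expose it as a singleton side input.
final PCollectionView<Graph> graphView =
    bqTable.apply("BuildGraph", Combine.globally(new GraphCombineFn()).asSingletonView());

pCol.apply("Process", ParDo.withSideInputs(graphView).of(new DoFn<TableRow, String>() {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    Graph graph = c.sideInput(graphView);  // the fully constructed graph
    // ... use graph together with c.element(), then c.output(...) ...
  }
}));

The DoFn is typed DoFn<TableRow, String> here only to line up with the text sink in your snippet; adjust the output type to whatever your Process step actually emits.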

An alternative would be to use a List<TableRow> as the side input and have each machine construct the graph itself.
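A sketch of that alternative, again using the hypothetical Graph class: ship the raw rows with View.asList() and build the graph lazily on each worker.

import java.util.List;

import com.google.cloud.dataflow.sdk.transforms.View;

final PCollectionView<List<TableRow>> rowsView =
    bqTable.apply("RowsAsList", View.<TableRow>asList());

pCol.apply("Process", ParDo.withSideInputs(rowsView).of(new DoFn<TableRow, String>() {
  private transient Graph graph;  // built once per DoFn instance, then reused

  @Override
  public void processElement(ProcessContext c) throws Exception {
    if (graph == null) {
      graph = new Graph();
      for (TableRow row : c.sideInput(rowsView)) {
        graph.addRow(row);          // reconstruct the graph on this worker
      }
    }
    // ... use graph together with c.element(), then c.output(...) ...
  }
}));

This trades encoding a Graph for rebuilding it on every worker, so it mainly makes sense when the graph is awkward to serialize or the rows are cheap to re-process.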