I have a compressed file in a Google Cloud Storage bucket. The archive contains a big CSV file and a small XML-based metadata file. I would like to extract both files, read the metadata, and use it to process the CSV file. I am using the Python SDK, and the pipeline will run on Google Cloud Dataflow at some point.
My current solution uses a Google Cloud Function to extract both files and start the pipeline with the parameters parsed from the XML file.
I would like to eliminate the Google Cloud Function and handle the compressed file in Apache Beam itself. The pipeline should process the XML file first and then process the CSV file.
However, I am stuck at extracting the two files into separate PCollections. I would like to know whether this approach is flawed, and if not, to see an example of how to handle different files inside a single compressed file.
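For reference, here is roughly the shape I am aiming for. This is only a minimal sketch under my own assumptions: the archive is a ZIP, the `gs://my-bucket/input/*.zip` pattern is a placeholder, and `parse_metadata` is a hypothetical stand-in for whatever the real metadata parsing turns out to be. The idea is to read each matched archive inside a `DoFn`, unpack it with `zipfile`, and emit each member on a tagged output so the XML and CSV end up in separate PCollections:

```python
import io
import xml.etree.ElementTree as ET
import zipfile

import apache_beam as beam
from apache_beam.io import fileio


class ExtractArchiveMembers(beam.DoFn):
    """Reads a ZIP archive and emits each member's bytes on a tagged output."""

    def process(self, readable_file):
        # readable_file is a fileio.ReadableFile yielded by ReadMatches;
        # read() returns the raw archive bytes.
        with zipfile.ZipFile(io.BytesIO(readable_file.read())) as archive:
            for name in archive.namelist():
                payload = archive.read(name)
                if name.endswith('.xml'):
                    yield beam.pvalue.TaggedOutput('xml', payload)
                elif name.endswith('.csv'):
                    yield beam.pvalue.TaggedOutput('csv', payload)


def parse_metadata(xml_bytes):
    # Hypothetical: collect whatever parameters the metadata file carries.
    root = ET.fromstring(xml_bytes)
    return {child.tag: child.text for child in root}


with beam.Pipeline() as pipeline:
    members = (
        pipeline
        | fileio.MatchFiles('gs://my-bucket/input/*.zip')  # placeholder pattern
        | fileio.ReadMatches()
        | beam.ParDo(ExtractArchiveMembers()).with_outputs('xml', 'csv')
    )

    metadata = members.xml | beam.Map(parse_metadata)

    # Split the CSV bytes into lines and process them with the parsed
    # metadata available as a side input.
    rows = (
        members.csv
        | beam.FlatMap(lambda blob: blob.decode('utf-8').splitlines())
        | beam.Map(
            lambda line, meta: (meta, line),
            meta=beam.pvalue.AsSingleton(metadata))
    )
```

One thing I already suspect about this sketch: because the whole archive is unpacked inside a single `DoFn`, the big CSV lands on one worker, so I assume a `beam.Reshuffle()` after extraction would be needed to parallelize downstream processing; also, `AsSingleton` only works here because each run handles exactly one archive and therefore exactly one metadata file.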