
I am trying to write an ETL job that will be scheduled to pickup CSV files from Google Cloud Storage, merge them and write to BigQuery.

I was able to figure out how to read the CSV files, but I am stuck at merging, as the Dataflow documentation does not explain the merge options well.

PCollection<String> file1 = p.apply(TextIO.Read.from("gs://**/DataFile1.csv"));
PCollection<String> file2 = p.apply(TextIO.Read.from("gs://**/DataFile2.csv"));

I want to merge the file1 and file2 contents and write them to a BigQuery table that is already defined.

File 1 example:

Order,Status,Follow,substatus
Order1, open, Yes, staged
Order2, InProcess, No, withbackoffice

File 2 Example:

Order,Status,Follow,substatus
Order3, open, Yes, staged
Order4, InProcess, No, withbackoffice

The BigQuery table should end up with the CSV header as columns and these rows:

Order,Status,Follow,substatus
Order1, open, Yes, staged
Order2, InProcess, No, withbackoffice
Order3, open, Yes, staged
Order4, InProcess, No, withbackoffice

I know how to merge with plain Java, but I am unable to figure out the proper PTransform to do this in Cloud Dataflow. Kindly help! Thanks.

What exactly do you mean by merge? A cross join / Cartesian product? You can look into using the CoGroupByKey transform: cloud.google.com/dataflow/model/group-by-key#join – Tudor Marian
Thanks for the response. I mean a union, as laid out in the example above. File 1 has n rows, File 2 has m rows. The CSV header is the same, so the schema is the same. The output to BigQuery should have the CSV header as columns and n+m rows. A simple union. Also, I looked at the link before as well; it doesn't provide examples or help on this topic. – Raju
I believe you can simply write the two PCollections to BigQuery as described here: cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery (likely using BigQueryIO.Write.WriteDisposition.WRITE_APPEND). – Tudor Marian

1 Answer


It seems like you're asking how to "concatenate" two PCollections into one. The answer to that is the Flatten transform. You can then write the flattened collection to BigQuery the usual way.
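A minimal sketch against the Dataflow 1.x SDK used in the question. The table spec `project:dataset.table`, the field names, and the line-parsing DoFn are assumptions for illustration (in particular, filtering out the CSV header line is not shown):

```java
// Read the two CSV files (as in the question).
PCollection<String> file1 = p.apply(TextIO.Read.from("gs://**/DataFile1.csv"));
PCollection<String> file2 = p.apply(TextIO.Read.from("gs://**/DataFile2.csv"));

// Flatten concatenates the two PCollections into one.
PCollection<String> merged = PCollectionList.of(file1).and(file2)
    .apply(Flatten.<String>pCollections());

// Convert each CSV line to a TableRow and append to the existing table.
// Header-line filtering is omitted here; add a Filter or a check in the
// DoFn if the files contain header rows.
merged
    .apply(ParDo.of(new DoFn<String, TableRow>() {
      @Override
      public void processElement(ProcessContext c) {
        String[] parts = c.element().split(",");
        c.output(new TableRow()
            .set("Order", parts[0].trim())
            .set("Status", parts[1].trim())
            .set("Follow", parts[2].trim())
            .set("substatus", parts[3].trim()));
      }
    }))
    .apply(BigQueryIO.Write
        .to("project:dataset.table")  // hypothetical table spec
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```

Since both files share the same schema, Flatten gives exactly the n+m-row union described in the comments; alternatively, `TextIO.Read.from("gs://**/DataFile*.csv")` with a glob would read both files into a single PCollection in one step.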