I am trying to write an ETL job that will be scheduled to pick up CSV files from Google Cloud Storage, merge them, and write the result to BigQuery.
I was able to figure out how to read the CSV files, but I am stuck at the merge step because the Dataflow documentation does not make the merge options clear to me.
PCollection<String> File1 = p.apply(TextIO.Read.from("gs://**/DataFile1.csv"));
PCollection<String> File2 = p.apply(TextIO.Read.from("gs://**/DataFile2.csv"));
I want to merge the contents of File1 and File2 and write the result to a BigQuery table that is already defined.
File 1 example:
Order,Status,Follow,substatus
Order1, open, Yes, staged
Order2, InProcess,No, withbackoffice
File 2 example:
Order,Status,Follow,substatus
Order3, open, Yes, staged
Order4, InProcess,No, withbackoffice
The BigQuery table should end up with these columns and rows:
Order,Status,Follow,substatus
- Order1, open, Yes, staged
- Order2, InProcess,No, withbackoffice
- Order3, open, Yes, staged
- Order4, InProcess,No, withbackoffice
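To be concrete about what I mean by "merge", here is a minimal plain-Java sketch of the result I am after. The class and method names are made up for illustration, and it assumes the first line of each file is the header row, which is dropped so that only data rows remain:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: CsvMergeSketch and mergeSkippingHeaders are hypothetical
// names, not part of the Dataflow SDK.
class CsvMergeSketch {

    // Concatenates the data rows of both files in order.
    // Assumption: the first line of every non-empty file is a header and is skipped.
    static List<String> mergeSkippingHeaders(List<String> file1, List<String> file2) {
        List<String> merged = new ArrayList<>();
        for (List<String> file : Arrays.asList(file1, file2)) {
            for (int i = 1; i < file.size(); i++) {
                merged.add(file.get(i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList(
                "Order,Status,Follow,substatus",
                "Order1, open, Yes, staged",
                "Order2, InProcess,No, withbackoffice");
        List<String> file2 = Arrays.asList(
                "Order,Status,Follow,substatus",
                "Order3, open, Yes, staged",
                "Order4, InProcess,No, withbackoffice");

        // Prints the four data rows, one per line, without either header.
        for (String row : mergeSkippingHeaders(file1, file2)) {
            System.out.println(row);
        }
    }
}
```

This is just a row-wise concatenation with no join or de-duplication; what I am looking for is the equivalent operation on the two PCollection<String> objects.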
I know how to do this merge in plain Java, but I cannot figure out which PTransform does the same thing in Cloud Dataflow. Kindly help! Thanks.