There is NO documentation regarding how to convert pCollections into the pCollections necessary for input into .CoGroupByKey()
Context Essentially I have two large pCollections and I need to be able to find differences between the two, for type II ETL changes (if it doesn't exist in pColl1 then add to a nested field found in pColl2), so that I am able to retain history of these records from BigQuery.
Pipeline Architecture:
- Read BQ Tables into 2 pCollections: dwsku and product.
- Apply a CoGroupByKey() to the two sets to return --> Results
- Parse results to find and nest all changes in dwsku into product.
Any help would be recommended. I found a java link on SO that does the same thing I need to accomplish (but there's nothing on the Python SDK).
Convert from PCollection<TableRow> to PCollection<KV<K,V>>
Is there a documentation / support for Apache Beam, especially Python SDK?