
I am new to Dataflow and Pub/Sub in GCP.

I need to migrate a current on-prem process to GCP.

The current process is as follows:

We have two types of data feeds:

  1. Full feed: an ad hoc job. The full XML is ~100 GB (a single, very complex XML file containing the complete data set). An ETL job processes this XML and loads it into ~60 tables.
  • Separate ETL jobs process the full feed. They create load-ready files, and all tables are truncated and reloaded.
  2. Delta feed: every 30 minutes we need to process delta files (XML files containing only the changes made within the last 30 minutes).
  • The source system pushes XML files every 30 minutes (more than one file, each carrying a timestamp). A scheduled ETL process picks up all the files produced by the source system, processes them, and creates three load-ready files (insert, delete, and update) for each table.
  • Schedule: the ETL jobs are scheduled to run every 5 minutes. If the current run takes longer than 5 minutes, the next run is not triggered until the current one completes.
  • The order of file processing is very important (the ETL job takes care of this). All files must be processed in sequence.
  • At the end of the ETL process, the load-ready files are loaded into the tables (mainframe).
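To make the ordering requirement concrete, here is a minimal sketch of how the delta files could be put into processing order before anything else touches them. It assumes the timestamp is part of the file name and uses a hypothetical naming convention (`delta_<YYYYMMDD>T<HHMMSS>.xml`); the real convention of the source system may differ, and the function names are only for illustration.

```python
import re
from datetime import datetime

# Assumption: file names look like delta_20240101T103000.xml.
# This pattern is hypothetical and must be adapted to the real feed.
_NAME_PATTERN = re.compile(r"delta_(\d{8}T\d{6})\.xml$")


def file_timestamp(name: str) -> datetime:
    """Extract the timestamp embedded in a delta file name."""
    match = _NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")


def in_processing_order(names):
    """Return the file names sorted oldest first."""
    return sorted(names, key=file_timestamp)


if __name__ == "__main__":
    pending = [
        "delta_20240101T110000.xml",
        "delta_20240101T103000.xml",
    ]
    for name in in_processing_order(pending):
        print(name)
```

Whatever service ends up doing the work, the files would then be handled one at a time in this order, which mirrors what the current ETL job does.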

I was asked to propose a design to migrate this to GCP. We need two processes in GCP as well, full and delta, and my proposed solution should be suitable for both feeds.

Initially I thought of the design below.

Pub/Sub -> Dataflow -> MySQL/BigQuery

Then I learned that Pub/Sub does not guarantee that files are processed in sequence. After some research, I found that Google recently introduced ordering keys for Pub/Sub, which ensure that messages with the same ordering key are delivered in order. The Google Cloud docs mention that this feature is in Beta.

I have a few questions:

  • Has anyone used Pub/Sub ordering keys in a production environment? If so, did you face any challenges while implementing them?
  • Is this design suitable for the above requirements, or is there a better solution in GCP?
  • Is there any alternative to Dataflow?
  • I learned that Pub/Sub can handle messages of at most 10 MB, but each of our XML files is larger than ~5 GB.
Google Cloud Beta products have production-grade quality. You simply don't have an SLA (and its financial counterpart) in case of an outage. You can use it in production, and yes, it's suitable for your process. – guillaume blaquiere
Google Cloud Pub/Sub ordered delivery is now generally available. – Kamal Aboul-Hosn

1 Answer


As @guillaume blaquiere mentioned, the Beta launch phase brings some restrictions, but they mostly relate to product support:

At beta, products or features are ready for broader customer testing and use. Betas are often publicly announced. There are no SLAs or technical support obligations in a beta release unless otherwise specified in product terms or the terms of a particular beta program. The average beta phase lasts about six months.

In general, the Cloud Pub/Sub message ordering feature works as intended. If you run into something that needs the developers' attention, a report via the Google Issue Tracker is highly appreciated.
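As a rough illustration of how ordering keys might be used given the 10 MB message limit raised in the question, the sketch below publishes a small notification that points at a file in Cloud Storage instead of the XML itself. It assumes the files are first uploaded to a bucket, that the `google-cloud-pubsub` package is installed, and that message ordering is enabled on the subscription. The helper names, the `"delta"` ordering key, and the payload fields are hypothetical choices, not something the question or the docs prescribe.

```python
import json

# Pub/Sub documents a 10 MB maximum message size; treated here as 10,000,000 bytes.
MAX_MESSAGE_BYTES = 10 * 1000 * 1000


def build_notification(bucket, object_name, feed="delta"):
    """Build a small message that references the file rather than carrying it.

    Returns (data, ordering_key). Using one key per feed is an assumption:
    it keeps every delta file of that feed in a single ordered stream.
    """
    data = json.dumps({"bucket": bucket, "name": object_name}).encode("utf-8")
    if len(data) > MAX_MESSAGE_BYTES:
        raise ValueError("notification exceeds the Pub/Sub message size limit")
    return data, feed


def publish_in_order(project_id, topic_id, bucket, object_names):
    """Publish one notification per file, in the order given."""
    # Imported lazily so build_notification can be used without the client library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient(
        publisher_options=pubsub_v1.types.PublisherOptions(
            enable_message_ordering=True
        )
    )
    topic_path = publisher.topic_path(project_id, topic_id)
    for name in object_names:
        data, key = build_notification(bucket, name)
        future = publisher.publish(topic_path, data, ordering_key=key)
        future.result()  # wait for the publish to succeed before sending the next
```

The subscriber (Dataflow or anything else) would then read the referenced object from the bucket, so the message size limit applies only to the pointer, not to the multi-gigabyte XML.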