I am new to Dataflow and Pub/Sub in GCP.
I need to migrate our current on-prem process to GCP.
The current process is as follows:
We have two types of data feeds:
- Full feed: an ad hoc job. The full feed is a single, very complex XML file of ~100 GB containing the complete data. An ETL job processes this XML and loads it into ~60 tables.
  - Separate ETL jobs process the full feed and create load-ready files. All tables are truncated and reloaded.
- Delta feed: every 30 minutes we need to process delta files (XML files that contain only the changes from the last 30 minutes).
  - The source system pushes XML files every 30 minutes (more than one, each with a timestamp). A scheduled ETL process picks up all the files produced by the source system, processes them, and creates three load-ready files (insert, delete, and update) for each table.
  - Schedule: the ETL jobs are scheduled to run every 5 minutes. If the current run takes longer than 5 minutes, the next run is not triggered until the current one completes.
  - The order of file processing is very important (the ETL job takes care of this). All files must be processed in sequence.
  - At the end of the ETL process, the load-ready files are loaded into the tables (mainframe).
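
To make the delta-feed rules concrete (process files strictly in timestamp order, and never start a run while the previous one is still active), here is a minimal Python sketch. It is only an illustration, not our actual ETL code: the `delta_YYYYMMDDHHMMSS.xml` naming convention, the lock-file approach, and the `process` callback are all assumptions made for the example.

```python
import os
import re
from datetime import datetime

# Hypothetical naming convention: delta_YYYYMMDDHHMMSS.xml
FILENAME_PATTERN = re.compile(r"^delta_(\d{14})\.xml$")


def files_in_order(filenames):
    """Return the delta files sorted by the timestamp in their names."""
    stamped = []
    for name in filenames:
        match = FILENAME_PATTERN.match(name)
        if match:  # ignore anything that is not a delta file
            stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
            stamped.append((stamp, name))
    return [name for _, name in sorted(stamped)]


def run_once(filenames, process, lock_path):
    """Process pending files in order unless another run holds the lock.

    Returns False when the trigger is skipped because a run is still active.
    """
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    try:
        for name in files_in_order(filenames):
            process(name)
        return True
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A scheduler would call `run_once` every 5 minutes. A call that returns `False` corresponds to the case where the previous run has not finished yet.
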
I was asked to propose a design to migrate this to GCP. We need two processes in GCP as well, full and delta, and my proposed solution should be suitable for both feeds.
Initially I thought of the design below:
Pub/Sub -> Dataflow -> MySQL/BigQuery
Then I learned that Pub/Sub does not guarantee that the files will be processed in sequence. After some research, I found that Google recently introduced ordering keys for Pub/Sub, which make sure that messages are processed in order. The Google Cloud docs mention that this feature is in beta.
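
For reference, this is roughly how I understand publishing with an ordering key using the `google-cloud-pubsub` Python client. It is a sketch under assumptions, not tested against our setup: the `delta-feed` key and the idea of publishing one small payload per file are made up for the example, and the subscription also has to be created with message ordering enabled for the order to be preserved on delivery.

```python
def build_messages(payloads, ordering_key="delta-feed"):
    """Pair each payload with the same (hypothetical) ordering key.

    Messages that share an ordering key are delivered in publish order,
    so the list order here is the order subscribers should see.
    """
    return [(text.encode("utf-8"), ordering_key) for text in payloads]


def publish_in_order(project_id, topic_id, payloads):
    """Publish the payloads one by one with message ordering enabled."""
    # Imported here so build_messages can be used without the client library.
    from google.cloud import pubsub_v1

    options = pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
    publisher = pubsub_v1.PublisherClient(publisher_options=options)
    topic_path = publisher.topic_path(project_id, topic_id)

    for data, key in build_messages(payloads):
        future = publisher.publish(topic_path, data, ordering_key=key)
        future.result()  # wait for this publish to succeed before the next one
```

Using a single ordering key serializes the whole feed, which matches the "strict sequence" requirement but limits throughput for that key.
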
I have two questions:
- Has anyone used Pub/Sub ordering keys in a production environment? If so, did you face any challenges while implementing them?
- Is this design suitable for the above requirements, or is there a better solution in GCP?
  - Is there any alternative to Dataflow?
  - I learned that Pub/Sub can handle messages of at most 10 MB, but each of our XML files is more than ~5 GB.
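
To illustrate the size constraint: the XML itself cannot travel in a message, but a small message that only points at the file could. The sketch below assumes the files are first uploaded to a Cloud Storage bucket, which is not part of the current process. The bucket name, object name, and JSON field names are hypothetical, and the 10 MB limit is treated as decimal megabytes.

```python
import json

# Message size limit mentioned above (assumed to mean 10 * 10^6 bytes).
MAX_PUBSUB_MESSAGE_BYTES = 10 * 1000 * 1000


def build_pointer_message(bucket, object_name, size_bytes):
    """Build a small JSON payload that references a file instead of embedding it."""
    payload = json.dumps(
        {"bucket": bucket, "name": object_name, "size_bytes": size_bytes},
        sort_keys=True,
    ).encode("utf-8")
    if len(payload) > MAX_PUBSUB_MESSAGE_BYTES:
        raise ValueError("payload exceeds the Pub/Sub message size limit")
    return payload


# Example: a ~5 GB delta file is described by a message of well under 1 KB.
message = build_pointer_message(
    "example-delta-bucket", "delta_20200102030000.xml", 5 * 10**9
)
```

The consumer would then read the referenced file from storage, so the message size stays constant no matter how large the XML is.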