I am currently studying for the GCP Data Engineer exam and have struggled to understand when to use Cloud Scheduler and whe to use Cloud Composer.
From reading the docs, I have the impression that Cloud Composer should be used when there is interdependencies between the job, e.g. we need the output of a job to start another whenever the first finished, and use dependencies coming from first job. You can then chain flexibly as many of these "workflows" as you want, as well as giving the opporutnity to restart jobs when failed, run batch jobs, shell scripts, chain queries and so on.
For the Cloud Scheduler, it has very similar capabilities in regards to what tasks it can execute, however, it is used more for regular jobs, that you can execute at regular intervals, and not necessarily used when you have interdependencies in between jobs or when you need to wait for other jobs before starting another one. Therefore, seems to be more tailored to use in "simpler" tasks.
These thoughts came after attempting to answer some exam questions I found. However, I was surprised with the "correct answers" I found, and was hoping someone could clarify if these answers are correct and if I understood when to use one over another.
Here are the example questions that confused me in regards to this topic:
Question 1
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?
A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Functions
D. Cloud Composer
Correct Answer: A
Question 2
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?
A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Cloud Dataproc
Correct Answer: D
Question 3
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc
Correct Answer: D
Any insight on this would be greatly appreciated. Thank you !