6
votes

I am currently studying for the GCP Data Engineer exam and have struggled to understand when to use Cloud Scheduler and when to use Cloud Composer.

From reading the docs, I have the impression that Cloud Composer should be used when there are interdependencies between jobs, e.g. when one job must start as soon as another finishes and consume its output. You can then flexibly chain as many of these "workflows" as you want, and you also get the opportunity to restart failed jobs, run batch jobs and shell scripts, chain queries, and so on.

Cloud Scheduler has very similar capabilities in terms of the tasks it can execute. However, it is meant more for jobs that run at regular intervals, and not necessarily for cases where jobs depend on each other or must wait for other jobs before starting. It therefore seems to be tailored to "simpler" tasks.

These thoughts came after attempting to answer some exam questions I found. However, I was surprised by the "correct answers" listed, and was hoping someone could clarify whether these answers are correct and whether I have understood when to use one service over the other.

Here are the example questions that confused me on this topic:

Question 1

You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

A. Cloud Scheduler

B. Cloud Dataflow

C. Cloud Functions

D. Cloud Composer

Correct Answer: A

Question 2

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

A. cron

B. Cloud Composer

C. Cloud Scheduler

D. Workflow Templates on Cloud Dataproc

Correct Answer: D

Question 3

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?

A. Cloud Dataflow

B. Cloud Composer

C. Cloud Dataprep

D. Cloud Dataproc

Correct Answer: D

Any insight on this would be greatly appreciated. Thank you!

2 Answers

11
votes

Your assumptions are correct. Cloud Composer is a managed Apache Airflow service that works well for orchestrating interdependent pipelines, while Cloud Scheduler is just a managed cron service.

I don't know where you got these questions and answers, but I assure you (and I just got the GCP Data Engineer certification last month) that the correct answer is Cloud Composer for each one of them. Just ignore these supposedly correct answers and move on.
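To make "orchestrating interdependent pipelines" concrete, here is a toy sketch in plain Python. It is not Airflow or Composer code; `run_pipeline`, the task names, and the flaky task are all hypothetical. It only illustrates the two things a cron-style trigger does not give you by itself: running steps in dependency order and retrying a failed step a fixed number of times.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, deps, max_retries=2):
    """Run callables in dependency order with a fixed number of retries.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> set of task names it depends on
    Returns the task names in the order they completed.
    """
    completed = []
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1 + max_retries):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted, downstream tasks never run
        completed.append(name)
    return completed


# Hypothetical job: shell script -> Hadoop job -> BigQuery query.
attempts = {"hadoop_job": 0}


def flaky_hadoop_job():
    # Fails on the first attempt, succeeds on the second.
    attempts["hadoop_job"] += 1
    if attempts["hadoop_job"] < 2:
        raise RuntimeError("transient failure")


tasks = {
    "shell_script": lambda: None,
    "hadoop_job": flaky_hadoop_job,
    "bigquery_query": lambda: None,
}
deps = {
    "shell_script": set(),
    "hadoop_job": {"shell_script"},
    "bigquery_query": {"hadoop_job"},
}

order = run_pipeline(tasks, deps)
print(order)
```

In Airflow the same ideas appear as task dependencies inside a DAG and a per-task retry setting, with scheduling, monitoring, and a web UI on top.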

3
votes

Cloud Scheduler is essentially cron-as-a-service. All you need to provide is a schedule and an endpoint (Pub/Sub topic, HTTP, or App Engine route). Cloud Scheduler has built-in retry handling, so you can set a fixed number of retries, and it doesn't have time limits for requests. The functionality is much simpler than Cloud Composer's.
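As a rough illustration of how little a Cloud Scheduler job needs, here is a sketch of a job definition written as a plain Python dict. The field names follow the Cloud Scheduler REST `Job` resource as I understand it; the project, location, job name, and URL are made-up placeholders, and nothing here actually calls the API.

```python
import json

# Hypothetical job definition: a schedule, a target, and a retry count.
job = {
    "name": "projects/my-project/locations/us-central1/jobs/nightly-trigger",
    "schedule": "0 2 * * *",  # unix-cron format: every day at 02:00
    "timeZone": "Etc/UTC",
    "httpTarget": {
        "uri": "https://example.com/run",  # placeholder endpoint
        "httpMethod": "POST",
    },
    "retryConfig": {"retryCount": 3},  # retry a failed attempt up to 3 times
}

print(json.dumps(job, indent=2))
```

Note that there is no place to express "run step B after step A succeeds": a job fires one request on a schedule, and anything beyond that has to live in whatever the endpoint triggers.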

Cloud Composer is managed Apache Airflow that "helps you create, schedule, monitor and manage workflows. Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command line tools, so you can focus on your workflows and not your infrastructure." (https://cloud.google.com/composer/docs/) Airflow is aimed at data pipelines and comes with the tooling they need.