I have a sequence of MapReduce jobs that need to be run. Is there any advantage to using Oozie for that, instead of having "one big driver" that runs the whole sequence?

I know that Oozie can run multiple actions of different types (e.g. Pig script, shell script, MR job), but my specific question is: should I split my two jobs and run them using Oozie, or keep a single jar that runs both?

1 Answer

Oozie is a scheduler - crude, poorly documented, but a scheduler.

  • if you don't need scheduling per se, or if cron on an edge node is sufficient,
  • if you want to handle your workflow logic yourself (e.g. conditional branching, parallel executions that wait for stragglers, calling generic sub-workflows with ad hoc parameters, e-mail alerts on errors, <insert your pet feature here>), or don't need any fancy logic,
  • if you handle your execution logs and state history yourself, or don't care about history

... well, don't use a scheduler.
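To give an idea of what "handling it yourself" means in the simplest case, here is a minimal sketch of a sequential driver: it runs the steps in order, stops at the first failure, keeps an in-memory history, and calls an alert hook. Everything in it is an assumption for illustration: the function name `run_sequence`, the `alert` callback, and the jar, class, and path names in `HADOOP_STEPS` are all hypothetical, and it uses Python with the `hadoop jar` command line rather than a Java driver.

```python
import subprocess
import sys
from datetime import datetime, timezone


def run_sequence(steps, alert=print):
    """Run (name, argv) steps in order and stop at the first failure.

    Returns a history list of (name, exit_code, finished_at) tuples.
    `alert` is a hypothetical hook: any callable that takes a message.
    """
    history = []
    for name, argv in steps:
        # Block until the step finishes; a non-zero exit code means failure.
        code = subprocess.run(argv).returncode
        finished_at = datetime.now(timezone.utc).isoformat()
        history.append((name, code, finished_at))
        if code != 0:
            alert(f"step {name!r} failed with exit code {code}")
            break  # later steps depend on this one, so do not run them
    return history


# Hypothetical jar, class, and path names, for illustration only.
HADOOP_STEPS = [
    ("job1", ["hadoop", "jar", "my-jobs.jar", "com.example.Job1", "/in", "/tmp/stage1"]),
    ("job2", ["hadoop", "jar", "my-jobs.jar", "com.example.Job2", "/tmp/stage1", "/out"]),
]
```

On an edge node you would call `run_sequence(HADOOP_STEPS)` from cron. Note what this sketch does not give you: the history lives only in memory, and if the driver process itself is killed, nothing calls `alert`. Those are exactly the pieces you would have to build, or get from a scheduler.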

PS: you also have Luigi (Spotify) and Azkaban (LinkedIn) as alternative Hadoop schedulers.

[edit] One extra point to consider: if your "driver" crashes for whatever reason, it may not get a chance to send an alert; but if it is run from Oozie, the crash will be detected eventually (this may take as long as 30 minutes in a corner case, e.g. the AM job self-destructing because of a YARN RM failover).