0
votes

I need to run the following workflow on my Hadoop cluster.

  • New files are added to an HDFS directory, /export/ (multiple times a day)
  • Files are in two formats: *_A.csv and *_B.csv
  • Copy all *_A.csv into /hive/dumptable_a/
  • Copy all *_B.csv into /hive/dumptable_b/
  • Run a Hive insert query to load partitioned table A from dumptable_a
  • Run a Hive insert query to load partitioned table B from dumptable_b
  • Delete data from /hive/dumptable_a/ and /hive/dumptable_b/

Can Oozie be set up to monitor /export/ for new files and kick off the workflow? If Oozie cannot do this, or if it is not the right tool, what is the best alternative?

1
Possible duplicate of Oozie file based coordinator - Rahul Sharma

1 Answer

0
votes

Yes, as Rahul mentioned, please look at Oozie file based coordinator, where you can find an example on how to use the <datasets> and <input-events> elements.

Alternatively, see the coordinator example in the Oozie documentation.
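For illustration, here is a rough sketch of what a file-based coordinator using <datasets> and <input-events> might look like. It is not taken from the linked question. Note that Oozie datasets are time-based: the coordinator waits for a specific dataset instance (a directory, optionally with a done-flag file) to appear, rather than watching a flat directory for arbitrary new files. The sketch therefore assumes the exports land in hourly subdirectories such as /export/2015010112, which is not stated in the question. The names export-watch-coord, exportDir, exportInput and inputDir, and the properties ${startTime}, ${endTime}, ${nameNode} and ${workflowAppPath}, are all hypothetical.

```xml
<!-- Sketch only: assumes hourly subdirectories under /export/ (not stated in the question). -->
<!-- ${startTime}, ${endTime}, ${nameNode}, ${workflowAppPath} are hypothetical job properties. -->
<coordinator-app name="export-watch-coord" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One dataset instance per hour, resolved from the uri-template. -->
    <dataset name="exportDir" frequency="${coord:hours(1)}"
             initial-instance="${startTime}" timezone="UTC">
      <uri-template>${nameNode}/export/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
      <!-- Empty done-flag: the directory's existence marks the instance as ready. -->
      <!-- If the element is omitted, Oozie waits for a _SUCCESS file instead. -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The action is held back until the current hour's instance exists. -->
    <data-in name="exportInput" dataset="exportDir">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- Workflow that does the copy, Hive insert and cleanup steps. -->
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <property>
          <!-- Passes the resolved input directory to the workflow. -->
          <name>inputDir</name>
          <value>${coord:dataIn('exportInput')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

If the files really do arrive directly in /export/ with no time-stamped layout, this pattern does not apply as-is. In that case the producer would need to write into dated subdirectories (or drop a done-flag file), or the workflow would have to be triggered by some other means.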