I need to run the following workflow on my Hadoop cluster:
- New files are added to an HDFS directory, /export/ (multiple times a day)
- Files are in two formats: *_A.csv and *_B.csv
- Copy all *_A.csv into /hive/dumptable_a/
- Copy all *_B.csv into /hive/dumptable_b/
- Run a Hive INSERT query to load partitioned table A from dumptable_a
- Run a Hive INSERT query to load partitioned table B from dumptable_b
- Delete data from /hive/dumptable_a/ and /hive/dumptable_b/
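
To make the routing in the copy steps concrete, here is a minimal sketch of how I picture it. The function names (plan_copies, copy_command) and the sample file names are made up for illustration; only the patterns and directories come from the list above. It only builds the `hdfs dfs -cp` command lines and does not run anything, and the Hive and cleanup steps are left out.

```python
from fnmatch import fnmatchcase
import posixpath

# File-name pattern -> staging directory, taken from the steps above.
ROUTES = {
    "*_A.csv": "/hive/dumptable_a/",
    "*_B.csv": "/hive/dumptable_b/",
}


def plan_copies(paths):
    """Return (source, destination_dir) pairs for files matching a known pattern.

    Files that match neither pattern are skipped.
    """
    plan = []
    for path in paths:
        name = posixpath.basename(path)
        for pattern, dest in ROUTES.items():
            # fnmatchcase keeps the match case-sensitive on every platform.
            if fnmatchcase(name, pattern):
                plan.append((path, dest))
                break
    return plan


def copy_command(src, dest):
    """Build the argument list for an HDFS copy (not executed here)."""
    return ["hdfs", "dfs", "-cp", src, dest]


if __name__ == "__main__":
    # Hypothetical listing of /export/.
    listing = ["/export/sales_A.csv", "/export/sales_B.csv", "/export/notes.txt"]
    for src, dest in plan_copies(listing):
        print(" ".join(copy_command(src, dest)))
```

What I am missing is the part that notices new files in /export/ and triggers this, followed by the Hive loads and the cleanup.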
Can Oozie be set up to monitor /export/ for new files and kick off the workflow when they arrive? If Oozie cannot do this, or is not the right tool, what is the best alternative?