How to pick Dynamic File Name from HDFS while inserting into Hive Table

Question

I have a Hive table. I need to write a workflow in which a daily job looks for a file at a location of the form:

/data/data_YYYY-mm-dd.csv
like
/data/data_2015-07-07.csv
/data/data_2015-07-08.csv
...

So each day the workflow should automatically pick up that day's file name and load the data into the Hive table (MyTable).

My load script is as follows:

LOAD DATA INPATH "/data/${filepath}" OVERWRITE INTO TABLE MyTable;

When I run this as a plain Hive job, I can set filepath to data_2015-07-07.csv, but how do I do that in an Oozie coordinator so that it automatically picks the path whose name contains the date?

I tried to set the workflow parameter from the Oozie coordinator:

clicklog_${YYYY}-{MONTH}-{DAY}.csv

Your question is a bit hard to follow. When you say you are setting the workflow parameter from the Oozie coordinator, what does your coordinator look like? Also, what do you mean by what happens to the already existing file? Could you elaborate in your question, please? - Vinayak Ponangi

Abhishek Choudhary · Accepted Answer · 2015-07-08T11:29:41

After checking the Oozie coordinator documentation, I found the solution. It's simple and straightforward: whatever value you already set for the parameter in the Hive workflow is ignored, and the Oozie coordinator fills it in.

My Hive workflow was:

<workflow-app name="Workflow__" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-cfc5"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive-cfc5">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/hive-site.xml</job-xml>
            <script>/user/sub/create.hql</script>
        </hive>
        <ok to="hive-2ade"/>
        <error to="Kill"/>
    </action>
    <action name="hive-2ade">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>/user/hive-site.xml</job-xml>
            <script>/user/sub/load_query.hql</script>
            <param>filepath=test_2015-06-26.csv</param>
        </hive>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
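
If the hard-coded value in the load action is not overridden in your setup, one common pattern is to have the `<param>` reference a workflow property instead of a literal file name. The fragment below is only a sketch of the load action, assuming the property is called filepath (the name used in the coordinator configuration) and that everything else in the workflow stays as posted.

```xml
<!-- Sketch of the load action only; the rest of the workflow is unchanged. -->
<action name="hive-2ade">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <job-xml>/user/hive-site.xml</job-xml>
        <script>/user/sub/load_query.hql</script>
        <!-- Assumption: filepath is supplied as a workflow property by the coordinator -->
        <param>filepath=${filepath}</param>
    </hive>
    <ok to="End"/>
    <error to="Kill"/>
</action>
```

With this form, the `${filepath}` inside load_query.hql resolves to whatever value the coordinator passes for each run.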

I then scheduled the same workflow in my Oozie coordinator, simply by setting the filepath parameter to:

test_${YYYY}-{MONTH}-{DAY}.csv

<coordinator-app name="My_Coordinator"
  frequency="*/60 * * * *"
  start="${start_date}" end="${end_date}" timezone="America/Los_Angeles"
  xmlns="uri:oozie:coordinator:0.2"
  >
  <controls>
    <execution>FIFO</execution>
  </controls>
  <action>
    <workflow>
      <app-path>${wf_application_path}</app-path>
      <configuration>
        <property>
          <name>filepath</name>
          <value>test_${YYYY}-{MONTH}-{DAY}.csv</value>
        </property>
        <property>
          <name>oozie.use.system.libpath</name>
          <value>True</value>
        </property>
        <property>
          <name>start_date</name>
          <value>2015-07-07T14:50Z</value>
        </property>
        <property>
          <name>end_date</name>
          <value>2015-07-14T07:23Z</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
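
The `${YYYY}-{MONTH}-{DAY}` placeholder above is reproduced as posted. In hand-written Oozie coordinator XML, one documented way to build a date string for an action property is to combine the `coord:nominalTime()` and `coord:formatTime()` EL functions. The fragment below is a sketch of just the filepath property, assuming the date in the file name matches the nominal (scheduled) time of the coordinator action, which Oozie expresses in UTC.

```xml
<!-- Sketch: replaces only the filepath property inside <configuration> -->
<property>
    <name>filepath</name>
    <!-- Assumption: the file's date matches the action's nominal (scheduled) time -->
    <value>test_${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}.csv</value>
</property>
```

For an action whose nominal time falls on 2015-07-08, this would resolve to test_2015-07-08.csv.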

The coordinator uses a cron-style frequency (*/60 * * * *) to run the workflow every 60 minutes and check whether a file matching the above pattern is available.
