
We have a two-node Hadoop YARN cluster (Hadoop 2.2) on which we have scheduled two actions in a single Oozie workflow: the first action is a Python map-reduce streaming action, and the second is a Sqoop export job, which transfers the output of the streaming action to a MySQL database.

The streaming action executes successfully and triggers the Sqoop job, but the Sqoop job then keeps running forever.

The stdout log shows the following:

Sqoop command arguments :
         export
         --connect
         jdbc:mysql://localhost/database
         --username
         root
         --password
         root
         --table
         tableName
         --direct
         --export-dir
        /user/hduser/oozieProject/workflow/output

 =================================================================
Invoking Sqoop command line now >>>

2137 [main] WARN  org.apache.sqoop.tool.SqoopTool  - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2158 [main] INFO  org.apache.sqoop.Sqoop  - Running Sqoop version: 1.4.4.2.0.6.1-102
2170 [main] WARN  org.apache.sqoop.tool.BaseSqoopTool  - Setting your password on the command-line is insecure. Consider using -P instead.
2178 [main] WARN  org.apache.sqoop.ConnFactory  - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2197 [main] INFO  org.apache.sqoop.manager.MySQLManager  - Preparing to use a MySQL streaming resultset.
2197 [main] INFO  org.apache.sqoop.tool.CodeGenTool  - Beginning code generation
2464 [main] INFO  org.apache.sqoop.manager.SqlManager  - Executing SQL statement:     SELECT t.* FROM `missedCalls` AS t LIMIT 1
2483 [main] INFO  org.apache.sqoop.manager.SqlManager  - Executing SQL statement: SELECT t.* FROM `missedCalls` AS t LIMIT 1
2485 [main] INFO  org.apache.sqoop.orm.CompilationManager  - HADOOP_MAPRED_HOME is /usr/local/hadoop
3838 [main] INFO  org.apache.sqoop.orm.CompilationManager  - Writing jar file: /tmp/sqoop-hduser/compile/21bd1d5fe13adeed4f46a09f8b3d38fe/missedCalls.jar
3847 [main] INFO  org.apache.sqoop.mapreduce.ExportJobBase  - Beginning export of missedCalls
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat

The job.properties file is as follows:

nameNode=hdfs://master:54310
jobTracker=master:8035
queueName=default

oozie.libpath=${nameNode}/user/hduser/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true

oozieProjectRoot=${nameNode}/user/hduser/oozieProject
appPath=${oozieProjectRoot}/workflow
oozie.wf.application.path=${appPath}
oozieLibPath=${oozie.libpath}

mapred.tasktracker.map.tasks.maximum=4
mapred.tasktracker.reduce.tasks.maximum=4

inputDir=${oozieProjectRoot}/data/*
outputDir=${appPath}/output

The workflow XML is as follows:

<!--Oozie workflow file: workflow.xml --> 
<workflow-app name="WorkflowStreamingMRAction-Python" xmlns="uri:oozie:workflow:0.1">
<start to="streamingAction"/>
<action name="streamingAction">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="${outputDir}"/>
        </prepare>
        <streaming>
            <mapper>python mapper.py</mapper>
            <reducer>python reducer.py</reducer>
        </streaming>
        <configuration>
            <property>
                <name>oozie.libpath</name>
                <value>${oozieLibPath}/mapreduce-streaming</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>${inputDir}</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>${outputDir}</value>
            </property>
            <property>
                <name>mapred.reduce.tasks</name>
                <value>4</value>
            </property>
        </configuration>
        <file>${appPath}/mapper.py#mapper.py</file>
        <file>${appPath}/reducer.py#reducer.py</file>
    </map-reduce>
    <ok to="sqoopAction"/>
    <error to="killJobAction"/>
    </action>

    <action name="sqoopAction">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>export --connect jdbc:mysql://localhost/database --username root --password myPwd --table tableName --direct --export-dir /user/hduser/oozieProject/workflow/output</command>
        </sqoop>        
    <ok to="end"/>
    <error to="killJobAction"/>
    </action>
<kill name="killJobAction">
<message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end"/>
</workflow-app>

Please advise: what could have gone wrong?

Thank you


1 Answer


It is not running forever. You just need to wait.

Firstly, the Sqoop job you see above is just the Oozie launcher job, and the Heart beat messages mean it is still running; you just need to wait. You can go to the YARN ResourceManager page (usually http://$namenode:8088/cluster) to find the "real" Sqoop export job. (I guess the default number of mappers is 4.)
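
If you prefer the command line, you can also confirm the export job is progressing with the YARN client. A minimal check, assuming the yarn command is available on a cluster node (the application id below is a placeholder):

    # List running applications; both the Oozie launcher
    # and the actual Sqoop export job should show up here.
    yarn application -list

    # Check the progress of a specific application, using an id
    # taken from the list above (this one is a placeholder).
    yarn application -status application_1396622345677_0001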

Secondly, Sqoop performs the export with INSERT statements, one row at a time by default, so it is relatively slow. I would not suggest using Sqoop export when the table is large, for example when it has more than 1 million rows.

Thirdly, since I notice you are exporting to MySQL, you can try batch mode, which packs multiple rows into a single INSERT statement of the form INSERT INTO <TABLE> VALUES (<ROW1>), (<ROW2>), and so on.
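
To make the difference concrete, here is a sketch of the statements Sqoop would issue in each mode (the values are placeholders standing in for your exported rows):

    -- Default mode: one INSERT per exported row.
    INSERT INTO tableName VALUES ('row1_val1', 'row1_val2');
    INSERT INTO tableName VALUES ('row2_val1', 'row2_val2');

    -- With sqoop.export.records.per.statement=1000: up to
    -- 1000 rows are packed into each multi-row INSERT.
    INSERT INTO tableName VALUES
        ('row1_val1', 'row1_val2'),
        ('row2_val1', 'row2_val2'),
        -- ... up to 1000 rows per statement ...
        ('row1000_val1', 'row1000_val2');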

So you can change your command to: export -D sqoop.export.records.per.statement=1000 --connect jdbc:mysql://localhost/database --username root --password myPwd --table tableName --direct --export-dir /user/hduser/oozieProject/workflow/output --batch (note that inside an Oozie <command> element the leading "sqoop" is omitted, as in your workflow above).
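
Wired back into the workflow, the sqoop action would look roughly like this (a sketch based on your existing action, with only the command changed):

    <action name="sqoopAction">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- -D options must come right after the tool name, before the tool-specific options -->
            <command>export -D sqoop.export.records.per.statement=1000 --connect jdbc:mysql://localhost/database --username root --password myPwd --table tableName --direct --export-dir /user/hduser/oozieProject/workflow/output --batch</command>
        </sqoop>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>

One caveat: --direct makes Sqoop hand the MySQL export to mysqlimport rather than JDBC, so the JDBC batching options above may not take effect; if batch mode makes no difference, try dropping --direct.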