
I want to create an Apache Flink standalone cluster with several TaskManagers. I would like to use HDFS and Hive, so I have to add some Hadoop dependencies.

After reading the documentation, the recommended way is to set the HADOOP_CLASSPATH environment variable. But how do I add the Hadoop files? Should I download Hadoop into some directory like /opt/hadoop on the TaskManagers and point the variable at that path?
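
For example, is something like this what's intended? Just a sketch of what I have in mind, assuming a Hadoop distribution unpacked under /opt/hadoop on every TaskManager:

# on each taskmanager, before the Flink daemon starts
export HADOOP_HOME=/opt/hadoop
export PATH="$HADOOP_HOME/bin:$PATH"
# let Hadoop assemble the classpath itself instead of listing single jars
export HADOOP_CLASSPATH=`hadoop classpath`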

I only know the old, deprecated way of downloading an uber-JAR with the dependencies and placing it under the /lib folder.


1 Answer


Normally you'd do a standard Hadoop installation, since for HDFS you need DataNodes running on every server (with appropriate configuration), plus the NameNode running on your master server.
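
As a quick sanity check (assuming the Hadoop binaries are already on the PATH), you can verify the HDFS daemons are up before wiring Flink to them:

# on the master: should list NameNode (and DataNode, if co-located)
jps
# cluster status as reported by the NameNode
hdfs dfsadmin -report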

So then you can do something like this on the master server where you're submitting your Flink workflow:

export HADOOP_CLASSPATH=`hadoop classpath`
export HADOOP_CONF_DIR=/etc/hadoop/conf
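
Note that each Flink process picks up HADOOP_CLASSPATH from its own environment, so the same variables need to be visible on every JobManager/TaskManager node before the daemons start. One way to do that (a sketch, assuming you can drop a profile script on each machine and that your Hadoop configs live in /etc/hadoop/conf) is:

# /etc/profile.d/flink-hadoop.sh -- on every jobmanager/taskmanager node
export HADOOP_CLASSPATH=`hadoop classpath`
export HADOOP_CONF_DIR=/etc/hadoop/conf

# then start the standalone cluster from the master node
./bin/start-cluster.sh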