
I'm trying to write a DataFrame to a CSV file and put that CSV file on a remote machine. The Spark job runs on YARN on a Kerberos-secured cluster.

Below is the error I get when the job tries to write the CSV file to the remote machine:

diagnostics: User class threw exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=dev, access=WRITE, inode="/data/9/yarn/local/usercache/dev/appcache/application_1532962490515_15862/container_e05_1532962490515_15862_02_000001/tmp/spark_sftp_connection_temp178/_temporary/0":hdfs:hdfs:drwxr-xr-x

To write this CSV file, I'm using the following options in a method that writes the file over SFTP:

import org.apache.spark.sql.DataFrame

def writeToSFTP(df: DataFrame, path: String): Unit = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    .option("delimiter", ",")
    .save(path)
}
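
For context, the method is called like this once the DataFrame is built (the DataFrame name and target path below are just placeholders):

// Hypothetical call site: resultDf is the DataFrame built earlier,
// and the path is the destination on the remote SFTP host
writeToSFTP(resultDf, "/remote/data/export.csv")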

I'm using the Spark SFTP connector library described at https://github.com/springml/spark-sftp
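
The connector jar also has to be available to the job; one way is --packages at submit time (the artifact coordinates and version below are an assumption on my part, check the project page for the right ones):

spark-submit --packages com.springml:spark-sftp_2.11:1.1.3 ...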

The script used to launch the job is:

#!/bin/bash

kinit -kt /home/spark/dev.keytab [email protected]

spark-submit --class fr.edf.dsp.launcher.LauncherInsertion \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 5g \
--executor-memory 5g \
--queue dev \
--files /home/spark/dev.keytab#user.keytab,\
/etc/krb5.conf#krb5.conf,\
/home/spark/jar/dev-application-SNAPSHOT.conf#app.conf \
--conf "spark.executor.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Dapp.config.path=./app.conf -Djava.security.auth.login.config=./jaas.conf" \
/home/spark/jar/dev-SNAPSHOT.jar > /home/spark/out.log 2>&1&
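
The jaas.conf referenced by the extraJavaOptions is a standard keytab-based login configuration. A minimal sketch of what it typically contains (the entry name and principal are placeholders, and the file itself has to be shipped alongside the keytab):

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="./user.keytab"
  principal="dev@MY.REALM"
  useTicketCache=false;
};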

The CSV files are not written to HDFS; once the DataFrame is built, I try to send it directly to the remote machine. I suspect a Kerberos issue with the SFTP Spark connector: YARN can't contact the remote machine...

Any help is welcome, thanks.


1 Answer


Add a temporary location where you have write access, and don't worry about cleaning it up: once the SFTP transfer is done, these files are deleted.

def writeToSFTP(df: DataFrame, path: String): Unit = {
  df.write
    .format("com.springml.spark.sftp")
    .option("host", "hostname.test.fr")
    .option("username", "test_hostname")
    .option("password", "toto")
    .option("fileType", "csv")
    // stage the file somewhere the submitting user can write
    .option("hdfsTempLocation", "/user/currentuser/")
    .option("delimiter", ",")
    .save(path)
}
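
Before re-running the job, it is worth checking that the chosen temp location exists and is writable by the submitting user, for example (the path is just the one from the snippet above):

# create the temp location if needed and verify its permissions
hdfs dfs -mkdir -p /user/currentuser/
hdfs dfs -ls -d /user/currentuser/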