We launch PySpark with JAR files placed on HDFS, on a YARN cluster, as shown below.
#######################
source *****.sh # setting environment variables
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
/bin/pyspark --master yarn --deploy-mode client \
  --driver-memory *g --executor-memory *g \
  --executor-cores * --num-executors * \
  --conf spark.yarn.archive=hdfs://node-master:/***/spark-libs.jar \
  --jars hdfs://node-master:/***/spark-sftp_2.11-1.1.5.jar
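For context, one thing we considered: unlike --packages, the --jars option does not resolve transitive dependencies, so every jar spark-sftp needs must be listed explicitly, comma-separated. A sketch of such a launch (the sftp.client and jsch jar names and the /path/ prefix are assumptions based on the spark-sftp README, not our actual setup):

```shell
# Sketch only: pass spark-sftp AND its dependency jars explicitly.
# --jars takes a comma-separated list; jar names/versions here are guesses.
/bin/pyspark --master yarn --deploy-mode client \
  --jars hdfs://node-master:/path/spark-sftp_2.11-1.1.5.jar,hdfs://node-master:/path/sftp.client-1.0.3.jar,hdfs://node-master:/path/jsch-0.1.53.jar
```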
############################ Spark code ###########################
conf = spark.sparkContext._conf.setAll([
    ("spark.app.name", "***"),
    ("spark.sql.execution.arrow.enabled", "true"),
    ("spark.hadoop.fs.sftp.impl", "org.apache.hadoop.fs.sftp.SFTPFileSystem"),
])
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()
sc = spark.sparkContext
SFTP_HOST = 'host'
SFTP_USER = 'user'
SFTP_PEM = 'passkey'    # location of the private key file
file_path = 'full_path'

df = spark.read \
    .format("com.springml.spark.sftp") \
    .option("host", SFTP_HOST) \
    .option("username", SFTP_USER) \
    .option("pem", SFTP_PEM) \
    .option("fileType", "csv") \
    .option("multiLine", "true") \
    .option("header", "true") \
    .load(file_path)
################################## error #############################
Py4JJavaError: An error occurred while calling o52.load. : java.lang.NoClassDefFoundError: com/springml/sftp/client/SFTPClient
Questions:
- How do we resolve this error? Is there anything we are skipping?
- Is using a jar file the right approach for spark-sftp? We can't use --packages, because it downloads the dependent artifacts, which we are trying to avoid.

We also tried cluster mode and followed the link below, but we are still stuck at the same point:
https://github.com/springml/spark-sftp/blob/master/README.md ("Use spark-sftp jar in pyspark")