Handling remote dependencies for spark-submit in Spark 2.3 with Kubernetes

Question

I'm trying to run spark-submit against a Kubernetes cluster using a Spark 2.3 Docker container image.

The challenge I'm facing is that the application has a mainapplication.jar plus other dependency files and jars located in a remote location such as AWS S3. The Spark 2.3 documentation describes a Kubernetes init-container that downloads remote dependencies, but I'm not creating any PodSpec in which to include init-containers; according to the documentation, Spark 2.3 on Kubernetes creates the pods (driver, executor) internally. So I'm not sure how to use an init-container with spark-submit when there are remote dependencies.

https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-remote-dependencies

Please suggest an approach.

It looks like you just need to include the references to the remote jars and remote files (if any) in the --jars and --files options to spark-submit. Then when the spark runtime creates the pods to run your job, those pods will include an init-container to retrieve the remote dependencies. — Jonah Benton
I don't use spark, but the implication is that the driver handles it, as long as the dependencies are specified in those options to spark-submit. If it doesn't work it might be because all of the spark on k8s machinery is still in experimental state. In that case, it appears that one could instead package the dependencies in the docker image containing the job, in accordance with the discussion of local: dependencies. Docker is able to retrieve remote resources and write them to the file system at build time, so they would be available in the image without having to have spark download them. — Jonah Benton
@shiv455 I am also facing a similar situation, how did you achieve it? — Ajeet

Johal Johal · Accepted Answer · 2018-10-03T15:45:57

It works as it should with s3a:// URLs. Unfortunately, getting S3A running on the stock spark-hadoop2.7.3 build is problematic (mainly authentication), so I opted to build Spark with Hadoop 2.9.1, since S3A has seen significant development there.
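As a rough sketch of what such a submission might look like, the Python helper below assembles a spark-submit command line that passes s3a:// URLs through --jars and --files. The API server address, image name, main class, and bucket paths are all hypothetical placeholders, not values from my setup; only the flag names and the spark.kubernetes.container.image property come from the Spark 2.3 documentation.

```python
from typing import List, Sequence


def build_spark_submit(
    master_url: str,
    image: str,
    main_class: str,
    app_jar: str,
    jars: Sequence[str] = (),
    files: Sequence[str] = (),
) -> List[str]:
    """Assemble a spark-submit argv for cluster mode on Kubernetes."""
    cmd = [
        "spark-submit",
        "--master", f"k8s://{master_url}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    # Remote dependencies are passed as comma-separated URL lists.
    if jars:
        cmd += ["--jars", ",".join(jars)]
    if files:
        cmd += ["--files", ",".join(files)]
    # The main application jar goes last and may itself be a remote URL.
    cmd.append(app_jar)
    return cmd


# Hypothetical values, for illustration only.
cmd = build_spark_submit(
    master_url="https://example-apiserver:6443",
    image="example-registry/spark:2.3-hadoop2.9",
    main_class="com.example.Main",
    app_jar="s3a://example-bucket/jars/mainapplication.jar",
    jars=[
        "s3a://example-bucket/jars/dep-a.jar",
        "s3a://example-bucket/jars/dep-b.jar",
    ],
    files=["s3a://example-bucket/conf/app.properties"],
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run, or printed and pasted into a shell. Whether the s3a:// URLs actually resolve depends on the image containing working S3A libraries and credentials.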

I have created a gist with the steps needed to:

build Spark with the new Hadoop dependencies
build the Docker image for k8s
push the image to ECR

The script also creates a second Docker image with the S3A dependencies added and base configuration settings that enable S3A with IAM credentials, so running in AWS doesn't require putting an access key or secret key in configuration files or arguments.
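The exact base settings are not reproduced here, so the following is only a guess at what a minimal IAM-based S3A configuration could look like. It assumes the instance-profile credentials provider from the AWS SDK and the standard fs.s3a.* Hadoop keys passed through Spark's spark.hadoop. prefix; the helper names are hypothetical.

```python
from typing import Dict


def s3a_iam_conf() -> Dict[str, str]:
    """Spark properties that point S3A at instance-profile (IAM) credentials.

    Assumption: the node or pod can reach the EC2 instance metadata service.
    """
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    }


def to_spark_defaults(conf: Dict[str, str]) -> str:
    """Render properties in spark-defaults.conf form: one 'key value' per line."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


defaults_text = to_spark_defaults(s3a_iam_conf())
print(defaults_text)
```

Baking text like this into the image's conf/spark-defaults.conf means no access key or secret key has to appear in job arguments; the same pairs could instead be passed as --conf options to spark-submit.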

I haven't run any production Spark jobs using the image yet, but I have tested that basic saving and loading to s3a URLs does work.

I have yet to experiment with S3Guard, which uses DynamoDB to ensure that S3 writes and reads are consistent, similar to EMRFS.
