4
votes

Im trying to run spark-submit to kubernetes cluster with spark 2.3 docker container image

The challenge im facing is application have a mainapplication.jar and other dependency files & jars which are located in Remote location like AWS s3 ,but as per spark 2.3 documentation there is something called kubernetes init-container to download remote dependencies but in this case im not creating any Podspec to include init-containers in kubernetes, as per documentation Spark 2.3 spark/kubernetes internally creates Pods (driver,executor) So not sure how can i use init-container for spark-submit when there are remote dependencies.

https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-remote-dependencies

Please suggest

2
It looks like you just need to include the references to the remote jars and remote files (if any) in the --jars and --files options to spark-submit. Then when the spark runtime creates the pods to run your job, those pods will include an init-container to retrieve the remote dependencies. - Jonah Benton
@JonahBenton so i dont need to create any init-container ? - shiv455
I don't use spark, but the implication is that the driver handles it, as long as the dependencies are specified in those options to spark-submit. If it doesn't work it might be because all of the spark on k8s machinery is still in experimental state. In that case, it appears that one could instead package the dependencies in the docker image containing the job, in accordance with the discussion of local: dependencies. Docker is able to retrieve remote resources and write them to the file system at build time, so they would be available in the image without having to have spark download them. - Jonah Benton
@shiv455 I am also facing a similar situation, how did you achieve it? - Ajeet

2 Answers

2
votes

It works as it should with s3a:// urls. Unfortunatly getting s3a running on the stock spark-hadoop2.7.3 is problematic (authentication mainly), so I opted for building spark with Hadoop 2.9.1, since S3A has seen significant development there

I have created a gist with the steps needed to

  • build spark with new hadoop dependencies
  • build the docker image for k8s
  • push image to ECR

The script also creates a second docker image with the S3A dependencies added and base conf settings for enabling S3A using IAM credentials so running in AWS doesn't require putting access/secretkey in conf files/args

I havn't run any production spark jobs yet using the image, but have tested that basic saving and loading to s3a urls does work.

I have yet to experiment with S3Guard which uses DynamoDB to ensure that S3 writes/reads are consistent - similarly to EMRFS

0
votes

The Init container is created automatically for you by Spark.

For example, you can use

kubectl describe pod [name of your driver svc] and you'll see the Init container named spark-init.

You can also acccess the logs from the init-container via a command like:

kubectl logs [name of your driver svc] -c spark-init

Caveat: I'm not running in AWS, but a custom K8S. My init-container successfully runs a downloads dependencies from an HTTP server (but not S3, strangely).