0
votes

I'm creating an EMR cluster (emr-5.24.0) with Terraform, deployed into a private subnet, that includes Spark, Hive and JupyterHub.

I've added a configuration JSON to the deployment that should persist the Jupyter notebooks to S3 (instead of locally on disk).

The overall architecture includes a VPC endpoint to S3 and I'm able to access the bucket I'm trying to write the notebooks to.

When the cluster is provisioned, the JupyterHub server is unable to start.

Logging into the master node and starting or restarting the JupyterHub Docker container does not help.

The configuration for this persistence looks like this:

[
    {
        "Classification": "jupyter-s3-conf",
        "Properties": {
            "s3.persistence.enabled": "true",
            "s3.persistence.bucket": "${project}-${suffix}"
        }
    },
    {
        "Classification": "spark-env",
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "/usr/bin/python3"
                }
            }
        ]
    }
]

In the Terraform EMR resource definition, this is then referenced:

configurations         = "${data.template_file.configuration.rendered}"

This is read from:

data "template_file" "configuration" {
  template = "${file("${path.module}/templates/cluster_configuration.json.tpl")}"

  vars = {
    project  = "${var.project_name}"
    suffix   = "bucket"
  }
}

When I don't enable persistence for the notebooks, everything works fine and I am able to log into JupyterHub.

I'm fairly certain it's not an IAM policy issue, since the Allow action in the EMR cluster role policy is defined as "s3:*".

Are there any additional steps that need to be taken for this to work?

/K

2 Answers

0
votes

It seems that Jupyter on EMR uses S3ContentsManager to connect to S3.

https://github.com/danielfrg/s3contents

I dug into S3ContentsManager a bit and found that it uses the public S3 endpoints (as expected). Since the S3 endpoint is public, Jupyter needs internet access, but you are running EMR in a private subnet, which I guess cannot reach that endpoint.

You might need to use a NAT gateway in a public subnet or create an S3 endpoint for your VPC.
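
For reference, an S3 gateway endpoint can be sketched in Terraform roughly like this. This is only a sketch: `aws_vpc.main` and `aws_route_table.private` are hypothetical names standing in for whatever your VPC and the private subnet's route table are called, and I'm assuming the region comes from the provider via the `aws_region` data source.

```hcl
# Assumption: region is read from the provider. Declare this data source
# only once per module; skip it if you already have one.
data "aws_region" "current" {}

# Gateway endpoint so instances in the private subnet can reach S3
# without a NAT gateway. Resource names below are hypothetical.
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = "${aws_vpc.main.id}"
  service_name    = "com.amazonaws.${data.aws_region.current.name}.s3"
  route_table_ids = ["${aws_route_table.private.id}"]
}
```

The endpoint only takes effect for subnets whose route table is listed in `route_table_ids`, so the table associated with the EMR subnet has to be included.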

0
votes

Yup, we ran into this too. Add an S3 VPC endpoint, then (this came from AWS support) add a JupyterHub notebook config:

{
    "Classification": "jupyter-notebook-conf",
    "Properties": {
        "config.S3ContentsManager.endpoint_url": "\"https://s3.${aws_region}.amazonaws.com\"",
        "config.S3ContentsManager.region_name": "\"${aws_region}\""
    }
},
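
Since that block uses an `${aws_region}` placeholder, the template needs a matching variable when it is rendered. A minimal sketch, assuming the config is rendered with the same `template_file` data source as in the question and that the region is taken from the provider (the `aws_region` data source is my assumption, a plain variable would work as well):

```hcl
# Assumption: region is read from the provider. Declare this data source
# only once per module; skip it if you already have one.
data "aws_region" "current" {}

data "template_file" "configuration" {
  template = "${file("${path.module}/templates/cluster_configuration.json.tpl")}"

  vars = {
    project    = "${var.project_name}"
    suffix     = "bucket"
    # New: fills in ${aws_region} in the jupyter-notebook-conf block
    aws_region = "${data.aws_region.current.name}"
  }
}
```

Without the extra entry in `vars`, rendering the template should fail on the unknown `aws_region` variable rather than produce a broken config.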

hth