I'm creating an EMR cluster (emr-5.24.0) with Terraform, deployed into a private subnet, that includes Spark, Hive and JupyterHub.
I've added an extra configuration JSON to the deployment, which should persist the Jupyter notebooks to S3 (instead of locally on disk).
The overall architecture includes a VPC endpoint to S3 and I'm able to access the bucket I'm trying to write the notebooks to.
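For context, a gateway endpoint of this kind is typically declared along the following lines. This is only a sketch, not the exact definition from this deployment: the resource name and the variables `vpc_id`, `region` and `private_route_table_id` are placeholders assumed for illustration.

```hcl
# Sketch of an S3 gateway endpoint for the VPC that hosts the private subnet.
# All variable names below are hypothetical placeholders.
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = "${var.vpc_id}"
  service_name = "com.amazonaws.${var.region}.s3"

  # The endpoint only takes effect for subnets whose route table is listed here.
  route_table_ids = ["${var.private_route_table_id}"]
}
```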
When the cluster is provisioned, the JupyterHub server is unable to start.
Logging into the master node and starting or restarting the Docker container for JupyterHub does not help.
The configuration for the persistence looks like this:
    [
      {
        "Classification": "jupyter-s3-conf",
        "Properties": {
          "s3.persistence.enabled": "true",
          "s3.persistence.bucket": "${project}-${suffix}"
        }
      },
      {
        "Classification": "spark-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "PYSPARK_PYTHON": "/usr/bin/python3"
            }
          }
        ]
      }
    ]
In the Terraform EMR resource definition, the rendered template is then referenced:
    configurations = "${data.template_file.configuration.rendered}"
This is read from:
    data "template_file" "configuration" {
      template = "${file("${path.module}/templates/cluster_configuration.json.tpl")}"
      vars = {
        project = "${var.project_name}"
        suffix  = "bucket"
      }
    }
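To rule out a templating problem, the rendered JSON can be exposed as an output and inspected after `terraform apply`. This is a minimal sketch; the output name `rendered_emr_configuration` is made up for illustration, while the data source is the one defined above.

```hcl
# Expose the rendered configuration so the substituted bucket name can be checked.
# The output name is a hypothetical choice, not part of the original setup.
output "rendered_emr_configuration" {
  value = "${data.template_file.configuration.rendered}"
}
```

Running `terraform output rendered_emr_configuration` should then show `s3.persistence.bucket` set to the value of `var.project_name` followed by `-bucket`.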
When I don't enable persistence for the notebooks, everything works fine and I am able to log into JupyterHub.
I'm fairly certain it's not an IAM policy issue, since the EMR cluster role policy allows the action "s3:*".
Are there any additional steps that need to be taken for this to work?
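For reference, a policy of that shape could be expressed roughly as below. This is a sketch under assumptions: the role reference `aws_iam_role.emr_ec2_instance_role`, the policy name, and the `Resource` ARNs are hypothetical, and which role actually carries the policy in this setup is not shown here.

```hcl
# Sketch of an inline policy allowing all S3 actions on the notebook bucket.
# aws_iam_role.emr_ec2_instance_role is assumed to be defined elsewhere (hypothetical name).
resource "aws_iam_role_policy" "notebook_bucket_access" {
  name = "notebook-bucket-access"
  role = "${aws_iam_role.emr_ec2_instance_role.id}"

  # The bucket name mirrors the template variables: <project_name>-bucket.
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::${var.project_name}-bucket",
        "arn:aws:s3:::${var.project_name}-bucket/*"
      ]
    }
  ]
}
EOF
}
```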
/K