
I am trying to set up an EMR cluster with JupyterHub and S3 persistence. I have the following classification:

    {
        "Classification": "jupyter-s3-conf",
        "Properties": {
            "s3.persistence.enabled": "true",
            "s3.persistence.bucket": "my-persistence-bucket"
        }
    }
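For reference, this is roughly how I pass that classification at cluster creation with the AWS CLI, with the classification above (wrapped in a JSON array) saved to a file named configurations.json. The cluster name, release label and instance settings below are placeholders, not my exact setup:

    aws emr create-cluster \
      --name "jupyterhub-s3-persistence" \
      --release-label emr-5.26.0 \
      --applications Name=JupyterHub Name=Spark \
      --configurations file://configurations.json \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles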

I am installing dask with the following step (otherwise, opening the notebook would result in a 500 error):

  • command-runner.jar
  • Arguments: /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install dask
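For completeness, this is roughly how that step can be added with the AWS CLI (the cluster ID is a placeholder):

    aws emr add-steps \
      --cluster-id j-XXXXXXXXXXXXX \
      --steps 'Type=CUSTOM_JAR,Name=InstallDask,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[/usr/bin/sudo,/usr/bin/docker,exec,jupyterhub,conda,install,dask]'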

However, when I then open a new notebook, it is not persisted and the bucket stays empty. The cluster DOES have access to S3: a Spark job on the same cluster, with the same configuration, can read from and write to that same bucket.

However, when I look into the Jupyter log on the master node, I see this:

[E 2019-08-07 12:27:14.609 SingleUserNotebookApp application:574] Exception while loading config file /etc/jupyter/jupyter_notebook_config.py
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 562, in _load_config_files
        config = loader.load_config()
      File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 457, in load_config
        self._read_file_as_dict()
      File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
        py3compat.execfile(conf_filename, namespace)
      File "/opt/conda/lib/python3.6/site-packages/ipython_genutils/py3compat.py", line 198, in execfile
        exec(compiler(f.read(), fname, 'exec'), glob, loc)
      File "/etc/jupyter/jupyter_notebook_config.py", line 5, in <module>
        from s3contents import S3ContentsManager
      File "/opt/conda/lib/python3.6/site-packages/s3contents/__init__.py", line 15, in <module>
        from .gcsmanager import GCSContentsManager
      File "/opt/conda/lib/python3.6/site-packages/s3contents/gcsmanager.py", line 8, in <module>
        from s3contents.gcs_fs import GCSFS
      File "/opt/conda/lib/python3.6/site-packages/s3contents/gcs_fs.py", line 3, in <module>
        import gcsfs
      File "/opt/conda/lib/python3.6/site-packages/gcsfs/__init__.py", line 4, in <module>
        from .dask_link import register as register_dask
      File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 56, in <module>
        register()
      File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 51, in register
        dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
    AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'

What am I missing and what is going wrong?

What version of EMR? 5.24 and later work well without dask; I am using it right now. - Lamanus
5.26, and I get Error 500 when not including dask. - rabejens
Update: I started a "blank" cluster and it works there. So it might be an incompatibility with my additional libs. - rabejens

1 Answer


It turned out to be a chain reaction: upgrading and installing custom packages broke compatibility. I install additional packages in my cluster with the command-runner, and that is where I had issues - I could only run one conda install command; the second one failed with "no module named 'conda'".

So I first updated Anaconda by running /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda with the command-runner. After that, jinja2 could not find markupsafe. Installing markupsafe pulled jupyterhub up to 1.0.0, which broke even more things.

So here is how I got it to work (executed in order with command-runner.jar); a consolidated sketch follows the list:

  1. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda updates Anaconda.
  2. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install --freeze-installed markupsafe installs markupsafe which is needed after step 1.
  3. Installed my desired additional packages into the container, always with the --freeze-installed option to avoid breaking anything installed by EMR.
  4. A custom bootstrap action runs a script from S3 that installs the packages from step 3 with pip-3.6 as well, so they also work for PySpark (for that, they have to be installed directly on all nodes).
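Putting it together, here is a rough sketch of the whole sequence with the AWS CLI. The cluster ID, bucket name and script name are placeholders, and dask stands in for whichever additional packages you actually need:

    # Steps 1-3, submitted in order and run inside the jupyterhub container
    aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
      'Type=CUSTOM_JAR,Name=UpdateConda,ActionOnFailure=CANCEL_AND_WAIT,Jar=command-runner.jar,Args=[/usr/bin/sudo,/usr/bin/docker,exec,jupyterhub,conda,update,-n,base,conda]' \
      'Type=CUSTOM_JAR,Name=InstallMarkupsafe,ActionOnFailure=CANCEL_AND_WAIT,Jar=command-runner.jar,Args=[/usr/bin/sudo,/usr/bin/docker,exec,jupyterhub,conda,install,--freeze-installed,markupsafe]' \
      'Type=CUSTOM_JAR,Name=InstallExtraPackages,ActionOnFailure=CANCEL_AND_WAIT,Jar=command-runner.jar,Args=[/usr/bin/sudo,/usr/bin/docker,exec,jupyterhub,conda,install,--freeze-installed,dask]'

Step 4 is just a small script in S3 (hypothetical name install-python-libs.sh), registered at cluster creation with --bootstrap-actions Path=s3://my-bucket/install-python-libs.sh so it runs on every node:

    #!/bin/bash
    # Install the same additional packages for Python 3.6 so PySpark can use them on all nodes
    sudo pip-3.6 install dask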