
I am spinning up short-lived clusters and destroying them as soon as they have served their purpose. However, I would like to persist the notebooks created in the /datalab/notebooks directory and copy them back to the same directory when a new cluster is created, so that all the notebooks created on the previous cluster are available.

I am able to copy the notebooks to a GCS bucket before shutting down, but I am unable to copy them back from GCS to /datalab/notebooks after the new cluster is created, because that directory only comes into existence while my startup script is running, or after the initialization script datalab.sh completes.

When is this directory created, and how can I copy the notebooks from my GCS bucket into /datalab/notebooks?

The key requirement is that /datalab/notebooks must already exist when this copy takes place.
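For context, the save step before tearing a cluster down can be sketched as below; the save_notebooks helper and the bucket name are illustrative assumptions, not part of the original setup:

```shell
#!/bin/bash
# Hypothetical helper: copy all notebooks out of the Datalab notebook
# directory into a GCS bucket before the cluster is deleted.
# The function name and bucket are illustrative assumptions.
save_notebooks() {
  local src_dir="$1"   # e.g. /datalab/notebooks
  local bucket="$2"    # e.g. my-bucket
  # -m parallelizes the copy across files.
  gsutil -m cp "${src_dir}"/*.ipynb "gs://${bucket}/notebooks/"
}
```

You would run this on the master node just before `gcloud dataproc clusters delete`.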

Update

My cluster creation failed with the error below.

gsutil cp 'gs://dataproc-datalab-srinid/notebooks/*' /datalab/notebooks/
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.

However, when I log in to the master and review the dataproc-initialization-script-2.log log, the copy was successful (see below).

+ '[' -d /datalab/notebooks ']'
+ echo 'Sleeping since /datalab/notebooks doesnt exist yet...'
Sleeping since /datalab/notebooks doesnt exist yet...
+ sleep 50
+ '[' -d /datalab/notebooks ']'
+ gsutil cp 'gs://dataproc-datalab-srinid/notebooks/*' /datalab/notebooks/
Copying gs://dataproc-datalab-srinid/notebooks/BABA_notebook.ipynb...
Copying gs://dataproc-datalab-srinid/notebooks/Untitled Notebook.ipynb...
Copying gs://dataproc-datalab-srinid/notebooks/hello.ipynb...
Copying gs://dataproc-datalab-srinid/notebooks/test-Copy1.ipynb...
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://dataproc-datalab-srinid/notebooks/test.ipynb...
Operation completed over 5 objects/70.7 KiB.

Code

if [ -d /datalab/notebooks ]; then
  # Quote the wildcard so gsutil, not the local shell, expands it.
  gsutil cp "gs://${BUCKET}/notebooks/*" /datalab/notebooks/
else
  echo "Sleeping since /datalab/notebooks doesn't exist yet..."
  sleep 50
  if [ -d /datalab/notebooks ]; then
    gsutil cp "gs://${BUCKET}/notebooks/*" /datalab/notebooks/
  else
    echo "Even after 50 secs the directory is not found, waiting another 30 secs..."
    sleep 30
    gsutil cp "gs://${BUCKET}/notebooks/*" /datalab/notebooks/
  fi
fi
Since your command succeeded on the master, it sounds like the error came from the worker nodes. Presumably you only have the /datalab/notebooks directory on the master node, so simply wrap the copy in a ROLE == 'Master' condition. I updated my answer to show the syntax for running only on the master. – Dennis Huo

2 Answers

1 vote

I assume you are trying to do the copying as part of an initialization action. If that is not the case, let us know how you are running the commands, as that will affect how they need to be run.

Inside the Docker container for Datalab, the "/datalab" directory is ephemeral. For anything you want to persist, you should use the "/content/datalab" directory instead. However, this requires some special care:

For an init action, the "/content/datalab" directory inside the Datalab container maps to the "/root/datalab" directory on the VM (this is defined here).

So, to copy notebooks from GCS into the "/content/datalab/notebooks" directory, first create the "/root/datalab/notebooks" directory (e.g. "mkdir -p ${HOME}/datalab/notebooks", assuming you run the setup in an init action, where ${HOME} is /root), and then copy the notebooks from GCS to that location.
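A minimal sketch of that init action follows; the stage_notebooks helper and the bucket name are illustrative assumptions, not part of the original answer:

```shell
#!/bin/bash
# Hypothetical init action helper: stage notebooks into the host directory
# that the Datalab container mounts as /content/datalab.
stage_notebooks() {
  local bucket="$1"   # e.g. my-bucket (assumed name)
  local dest="$2"     # e.g. /root/datalab/notebooks
  mkdir -p "${dest}"
  # Quote the wildcard so gsutil, not the local shell, expands it.
  gsutil -m cp "gs://${bucket}/notebooks/*" "${dest}/"
}

# In the init action you would call, for example:
# stage_notebooks "my-bucket" /root/datalab/notebooks
```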

1 vote

If you run your copy-from-GCS command as an init action as well, instead of from a GCE startup script, then you control the order in which init actions run, so you can simply put your copy-from-gcs init action after the datalab init action:

--initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh,gs://your-bucket/copy-notebooks-from-gcs.sh
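For example, the cluster creation command would look something like the following sketch (my-cluster and your-bucket are placeholders):

```shell
# Sketch only: my-cluster and your-bucket are placeholder names.
gcloud dataproc clusters create my-cluster \
  --initialization-actions \
  gs://dataproc-initialization-actions/datalab/datalab.sh,gs://your-bucket/copy-notebooks-from-gcs.sh
```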

Alternatively, if the creation of that directory is asynchronous, you can add an init action (or a startup script) which sleeps until the directory is available. Assuming you're using an init action, you probably only want this to run on the master node:

#!/bin/bash

readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"

if [[ "${ROLE}" == 'Master' ]]; then
  # Poll until the Datalab init action has created the directory.
  until [[ -d '/datalab/notebooks' ]]; do
    echo "Sleeping since /datalab/notebooks doesn't exist yet..."
    sleep 5
  done
  gsutil cp "${GCS_NOTEBOOK_DIRECTORY}" /datalab/notebooks
fi