I am spinning up short-lived clusters and destroying them as soon as they have served their purpose. However, I would like to persist the notebooks created in the /datalab/notebooks directory and copy them back to the same directory when a new cluster is created, so that all the notebooks from the previous cluster are available.
I am able to copy the notebooks to a GCS bucket before shutting down, but I cannot copy them back from GCS to /datalab/notebooks after the new cluster is created, because the directory /datalab/notebooks does not yet exist when my startup script runs, or even right after the initialization script datalab.sh completes.
Where is this directory created, and how can I copy the notebooks from my GCS bucket to /datalab/notebooks?
The key is that /datalab/notebooks must exist when the copy takes place.
Update
My cluster creation failed with the error below.
gsutil cp 'gs://dataproc-datalab-srinid/notebooks/*' /datalab/notebooks/
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
However, when I log in to the master and review the dataproc-initialization-script-2.log log, it shows that the copy succeeded (see below).
+ '[' -d /datalab/notebooks ']'
+ echo 'Sleeping since /datalab/notebooks doesnt exist yet...'
Sleeping since /datalab/notebooks doesnt exist yet...
+ sleep 50
+ '[' -d /datalab/notebooks ']'
+ gsutil cp 'gs://dataproc-datalab-srinid/notebooks/*' /datalab/notebooks/
Copying gs://dataproc-datalab-srinid/notebooks/BABA_notebook.ipynb...
/ [0 files][ 0.0 B/ 40.8 KiB] ^M/ [1 files][ 40.8 KiB/ 40.8 KiB] ^MCopying gs://dataproc-datalab-srinid/notebooks/Untitled Notebook.ipynb...
/ [1 files][ 40.8 KiB/ 67.7 KiB] ^M/ [2 files][ 67.7 KiB/ 67.7 KiB] ^MCopying gs://dataproc-datalab-srinid/notebooks/hello.ipynb...
/ [2 files][ 67.7 KiB/ 68.7 KiB] ^M/ [3 files][ 68.7 KiB/ 68.7 KiB] ^MCopying gs://dataproc-datalab-srinid/notebooks/test-Copy1.ipynb...
/ [3 files][ 68.7 KiB/ 69.7 KiB] ^M/ [4 files][ 69.7 KiB/ 69.7 KiB] ^M
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.
Copying gs://dataproc-datalab-srinid/notebooks/test.ipynb...
/ [4 files][ 69.7 KiB/ 70.7 KiB] ^M-^M- [5 files][ 70.7 KiB/ 70.7 KiB] ^M
Operation completed over 5 objects/70.7 KiB.
Code
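# Copy the saved notebooks from GCS once /datalab/notebooks exists.
# The URL is quoted so that gsutil, not the shell, expands the wildcard.
if [ -d '/datalab/notebooks' ]; then
  gsutil cp "gs://${BUCKET}/notebooks/*" /datalab/notebooks/
else
  echo 'Sleeping since /datalab/notebooks doesnt exist yet...'
  sleep 50
  if [ -d '/datalab/notebooks' ]; then
    gsutil cp "gs://${BUCKET}/notebooks/*" /datalab/notebooks/
  else
    echo "Even after 50secs, the directory is not found, waiting for another 30secs.."
    sleep 30
    # Last attempt: runs without re-checking that the directory exists.
    gsutil cp "gs://${BUCKET}/notebooks/*" /datalab/notebooks/
  fi
fi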
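One variation I am considering is to replace the fixed sleeps with a polling loop, so that the copy is attempted only if the directory actually appears and is skipped otherwise. This is only a rough sketch: the function names wait_for_dir and restore_notebooks, the 300-second timeout, and the 5-second interval are my own placeholders, not anything provided by Dataproc or Datalab.

```shell
#!/usr/bin/env bash

# wait_for_dir DIR [TIMEOUT_SECS] [INTERVAL_SECS]   (hypothetical helper)
# Polls until DIR exists. Returns 0 if it appears, 1 once the timeout elapses.
wait_for_dir() {
  local dir="$1"
  local timeout="${2:-300}"   # assumed default, tune as needed
  local interval="${3:-5}"    # assumed default, tune as needed
  local waited=0

  while [ ! -d "$dir" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Gave up waiting for ${dir} after ${waited}s" >&2
      return 1
    fi
    echo "Sleeping since ${dir} doesn't exist yet..."
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# restore_notebooks BUCKET DEST   (hypothetical helper)
# Copies the saved notebooks only if DEST shows up in time.
restore_notebooks() {
  local bucket="$1"
  local dest="$2"
  if wait_for_dir "$dest"; then
    # Quote the URL so gsutil, not the shell, expands the wildcard.
    gsutil -m cp "gs://${bucket}/notebooks/*" "${dest}/"
  else
    return 1
  fi
}

# Example call (BUCKET must be set by the surrounding script):
# restore_notebooks "${BUCKET}" /datalab/notebooks
```

With this shape, gsutil is never run against a destination that does not exist, which is the situation that seems to produce the "Destination URL must name a directory" error above.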