4
votes

I'd like to set up Google Cloud Datalab to read my notebooks from a GitHub repo and push them back there as I make changes.

I was able to set up a Cloud Source repo that mirrored my GitHub project, but when I used ungit in Datalab to push changes, it said that connected cloud repos are read-only.

How have others set this up?


5 Answers

5
votes

This works even with two-factor auth and doesn't involve fighting with a web user interface.

  • ssh into the Datalab VM. You can do this from the Google web console by going to Compute Engine | Instances and clicking the SSH button.
  • Your notebooks are on the persistent disk attached to this VM (/mnt/disks/datalab-pd/content/datalab/notebooks), so cd to that directory and git clone the repository there.
  • (optional) Set up password-less git by following the steps in https://help.github.com/articles/connecting-to-github-with-ssh/
  • Work with your notebooks in Datalab (/mnt/disks/datalab-pd/content is mapped to the home directory in Datalab).
  • To commit, go back to the ssh window and use git from the command line there.
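The steps above boil down to ordinary command-line git. A minimal sketch of the clone/commit/push cycle, using throwaway local repositories as stand-ins for the GitHub remote and for /mnt/disks/datalab-pd/content/datalab/notebooks (the stand-in paths and the user.name/user.email values are assumptions for illustration only):

```shell
#!/bin/sh
set -e

# Stand-ins for the real endpoints (assumptions for illustration):
#   $REMOTE -> your GitHub repository
#   $WORK   -> /mnt/disks/datalab-pd/content/datalab/notebooks
SEED=$(mktemp -d)/seed
git init -q "$SEED"
(cd "$SEED" && echo '{}' > old.ipynb && git add old.ipynb \
  && git -c user.email=you@example.com -c user.name=you commit -q -m "Existing notebook")
REMOTE=$(mktemp -d)/notebooks.git
git clone -q --bare "$SEED" "$REMOTE"

# Clone the repository into the notebooks directory.
WORK=$(mktemp -d)
cd "$WORK"
git clone -q "$REMOTE" notebooks
cd notebooks

# After editing in Datalab, commit and push from the ssh window.
echo '{}' > analysis.ipynb
git add analysis.ipynb
git -c user.email=you@example.com -c user.name=you commit -q -m "Add notebook"
git push -q origin HEAD
```

On the real VM, $REMOTE is replaced by your GitHub clone URL and the identity settings come from your normal git config.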
5
votes

I first set up a Datalab instance with the datalab create <INSTANCE_NAME_HERE> command and the --no-create-repository flag so that no Cloud Source repo is created. Then I followed these steps to clone a git repo:

  1. Connect to the instance.
  2. Click the ungit icon at the top right.
  3. Using the "address bar" within ungit, navigate to: /content/
  4. Put the git URL you want to clone in the 'clone from' section.
  5. You will be asked to authenticate (I use LastPass, so the credentials are saved); otherwise, I'm afraid you have to enter your credentials every time you push or pull in the future.
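If re-entering HTTPS credentials on every push is a nuisance, git's built-in credential cache can hold them in memory for a while, so git (and therefore ungit) only prompts once per session. A hedged sketch; the one-hour timeout is an arbitrary choice:

```shell
# Cache HTTPS credentials in memory for one hour (3600 s).
git config --global credential.helper 'cache --timeout=3600'
```

This only caches credentials in memory on the instance; it does not store them on disk.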
3
votes

Connected Cloud Source Repositories sync only one way, from GitHub/Bitbucket into the Cloud Source repo; notice the comment at the top here.

Datalab automatically integrates with a Cloud Source repo that is not a mirror, so you can pull from and push to that separately. If you need to work with a GitHub repo, you'll need to set up your credentials inside the Datalab container on the VM hosting the Datalab instance. Be sure you're the only one who has access to that cloud project, though, as VMs are accessible to all project readers.

1
vote

In GitHub:

  1. Set up an SSH deploy (public) key (generate it with ssh-keygen)
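Generating the deploy key pair could look like this (the file name and the ed25519 key type are arbitrary choices; RSA also works for GitHub deploy keys):

```shell
# Create a key pair with no passphrase; the .pub half is what you
# paste into the GitHub repo's Settings -> Deploy keys page.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t ed25519 -N "" -f ~/.ssh/datalab_deploy_key
cat ~/.ssh/datalab_deploy_key.pub
```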

In GCP:

  1. Go to Compute Engine | Instances and click on the Datalab virtual machine
  2. Click Edit
  3. Scroll down to the "Custom Metadata" section and edit the value of the user-data key (the startup script). It contains the definition of the systemd datalab service; modify it to add a /root dir mount to the docker run command:

    -v /mnt/disks/datalab-pd/root:/root
    
  4. Save the modification.

  5. ssh into the Cloud Datalab VM instance
  6. Create the directory:

    mkdir -p /mnt/disks/datalab-pd/root/.ssh
    
  7. In the .ssh directory, put the previously generated private SSH key and a git config file (~/.ssh/config)

  8. The config file should look like this:

    host github.com
     HostName github.com
     IdentityFile ~/.ssh/id_rsa
     User git
    

After rebooting the Datalab instance you should be able to push and pull to the git repo.
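One gotcha with step 7: ssh ignores a private key whose file permissions are too open, so it is worth tightening them after copying the files in. A sketch, run here against a scratch directory standing in for the real path on the VM (/mnt/disks/datalab-pd/root/.ssh):

```shell
# A scratch directory stands in for /mnt/disks/datalab-pd/root/.ssh here.
SSH_DIR=$(mktemp -d)/.ssh
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/id_rsa" "$SSH_DIR/config"

# ssh refuses group/world-readable private keys, so lock these down.
chmod 700 "$SSH_DIR"
chmod 600 "$SSH_DIR/id_rsa" "$SSH_DIR/config"
```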

1
vote

I tried the various answers above and could not get any of them to work. However, this method did work.

  1. This was tested using a GitHub repo. I had to turn off two-factor authentication to get it to work. You can switch that off and back on if it's a problem.

  2. Connect a GCP Source repo to your GitHub account. Click new repo, select connect to external repo, and follow the wizard.

  3. Create a Datalab instance with the datalab create your-dl-instance command.

  4. In the repo browser on the cloud console, display the command for cloning the repo. It should look something like this: gcloud source repos clone github_yourusername_someproj --project=someproj. To be honest, I am not sure whether this step is necessary, but it mirrors your GitHub repo in GCP: later, when I push changes to the remote, I see this repo updated whenever the GitHub repo is updated.

  5. Connect to Datalab.

  6. Create a notebook and name it; I called mine add_repo. Create a cell that uses bash magic and have it clone the GitHub repo over HTTPS. If you use the gcloud command shown above instead, it will print a message about the repo being a mirror and tell you what to do. Do this instead:

    %%bash

    git clone https://github.com/youruserid/yourgithubrepo

  7. Now you have a git repo inside your larger Datalab notebook git repo.

  8. Open ungit inside the Datalab notebook using the top-right menu-bar icon.

  9. You will see a new directory, which you cannot push changes to at this point.

  10. In ungit, click the submodules pull-down and select add submodule.

  11. For the path, enter yourgithubrepo, which corresponds to the directory you see in your Datalab notebook after the clone. It should not be prefixed with anything like /content/datalab/notebooks; just use the repo name, which matches the new directory.

  12. For the URL, enter the clone URL used above after the git clone command.

  13. When you click OK, you will see that .gitmodules is updated in ungit. Commit this change to your /content/datalab/notebooks repo.

  14. Create a test notebook in the subdirectory for your GitHub repo and save it. Ungit should show a change in the directory corresponding to the submodule, but not the notebook itself; i.e., ungit shows yourgithubrepo as modified. If you hover your mouse over the change, the tooltip will say that the subproject is dirty.

  15. Click the submodules pull-down and select yourgithubrepo. Ungit now changes to /content/datalab/notebooks/yourgithubrepo and shows a yellow banner saying "This is a submodule". Commit and push your change. You will be prompted for a GitHub login and password. Assuming you have disabled two-factor authentication, this push will show up in GitHub and Google Cloud; if two-factor authentication is enabled, ungit will not be able to handle it and you will get an error saying authentication failed.
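The ungit clicks in steps 10-13 map onto plain git submodule commands. A sketch using throwaway local repositories in place of the GitHub URL and the /content/datalab/notebooks repo (the stand-in paths and identity settings are assumptions for illustration; the protocol.file.allow override is only needed because the example uses local paths, not HTTPS URLs):

```shell
#!/bin/sh
set -e

# Stand-ins (assumptions for illustration):
#   $SUB_REMOTE -> https://github.com/youruserid/yourgithubrepo
#   $NOTEBOOKS  -> the /content/datalab/notebooks repo
SEED=$(mktemp -d)/seed
git init -q "$SEED"
(cd "$SEED" && echo notes > README.md && git add README.md \
  && git -c user.email=you@example.com -c user.name=you commit -q -m "Initial commit")
SUB_REMOTE=$(mktemp -d)/yourgithubrepo.git
git clone -q --bare "$SEED" "$SUB_REMOTE"

# Steps 10-13: register the repo as a submodule of the notebooks repo.
# The path is relative to the repo root, as in step 11.
NOTEBOOKS=$(mktemp -d)
cd "$NOTEBOOKS"
git init -q .
# Newer git blocks file:// submodules by default; a real HTTPS URL
# would not need the protocol.file.allow override.
git -c protocol.file.allow=always submodule add "$SUB_REMOTE" yourgithubrepo
git -c user.email=you@example.com -c user.name=you commit -q -m "Add submodule"
```

After this, committing inside the yourgithubrepo directory and pushing (step 15) updates the submodule's own remote, and the outer notebooks repo records only the new submodule commit pointer.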