GCP - Cannot SSH into fresh GPU Deep Learning VM instance

Question

If I create a fresh GCE VM instance with a GPU and a GPU-optimized Debian image, I cannot SSH into it, either via the browser SSH window or with a third-party SSH client (after uploading my public key).

I have tried the suggestion here but it did not help.

If I create the instance without a GPU and with a standard Ubuntu image, everything works fine out of the box.

Is there something I am missing about GPU Deep Learning instances?

Edit:

GCloud command to recreate:

gcloud beta compute --project=avid-compound-233309 instances create instance-1 \
  --zone=us-central1-a \
  --machine-type=n1-standard-1 \
  --subnet=default \
  --network-tier=PREMIUM \
  --maintenance-policy=TERMINATE \
  --service-account=105060870131-compute@developer.gserviceaccount.com \
  --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --image=c0-common-gce-gpu-image-20191213 \
  --image-project=ml-images \
  --boot-disk-size=50GB \
  --boot-disk-type=pd-standard \
  --boot-disk-device-name=instance-1 \
  --reservation-affinity=any

And yes, it happens right after the VM is created, and the Serial Port 1 log contains a long list of errors. A short excerpt:

[    9.393769] google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
[    9.394022] google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
[    9.394250] google_accounts_daemon[692]: Remainder of file ignored
[    9.394504] google_accounts_daemon[692]: Traceback (most recent call last):
[    9.394767] google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
[    9.395108] google_accounts_daemon[692]:     from pkg_resources import load_entry_point
[    9.395344] google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
[    9.395502] google_accounts_daemon[692]:     from pkg_resources.extern import six
[    9.395719] google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "/usr/lib/python3.5/site.py", line 173, in addpackage
Dec 23 19:40:05 localhost google_accounts_daemon[692]:       exec(line)
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<string>", line 1, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Remainder of file ignored
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Traceback (most recent call last):
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources import load_entry_point
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources.extern import six
Dec 23 19:40:05 localhost google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'

Howdy ... first thing I'd do is see if the VM started correctly by examining the Serial Console. If the VM failed to start, then there is nothing to login to ... see ... cloud.google.com/compute/docs/instances/… — Kolban
You also said that when you create a fresh VM this happens straight away? Can you post a gcloud command that creates such an instance and we can try and recreate and see what happens? — Kolban
I updated the question. It seems to be a bug on the GCloud side. In the meantime, I created a VM with standard Ubuntu 18.04 and installed the NVIDIA and CUDA drivers manually. — Peter Jung
Thanks for reporting this! We're working on the fix (GCP Deep Learning VM/Notebooks team) — Zain Rizvi
@PeterJung The engineering team has deployed a fix; the image is usable and you can SSH into it now. — Maxim

mebius99 · Accepted Answer · 2019-12-24T09:23:43

It seems the freshly published image "GPU Optimized Debian m32 (with CUDA 10.0) (c0-common-gce-gpu-image-20191213)" contains a damaged ext4 filesystem. Directories, configuration files, and scripts contain garbage, so the initial configuration at first boot fails, as the serial console output shows:

Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ 4.880071] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 144, inode_bitmap = 4718608
[ 4.883559] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 145, inode_bitmap = 4718609
[ 4.887054] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 146, inode_bitmap = 4718610
...
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ecdsa_key.pub is not a public key file.
localhost dhclient[516]: 
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ed25519_key.pub is not a public key file.
localhost ssh-generate-hostk[  OK  ] Started Getty on tty1.
...
keys[485]: /etc/ssh/ssh_host_rsa_key.pub is not a public key file.

There is a recently created public issue at the Public Issue Tracker: https://issuetracker.google.com/146807209

It should be fixed soon.
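
If you want to check whether another instance is hitting the same problem, one option is to filter its serial console output for the failure signatures quoted in this thread. The following is only a minimal sketch: the function name scan_serial_log and the choice of patterns are my own assumptions for illustration, not anything the image or gcloud provides, and the list of patterns is not exhaustive.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of gcloud):
# reads serial console text on stdin and prints only the lines that match
# the failure signatures seen in this thread.
scan_serial_log() {
    # "|| true" keeps the exit status at 0 when nothing matches,
    # so the helper is safe to use in scripts running under "set -e".
    grep -E 'EXT4-fs error|is not a public key file|ImportError|AttributeError' || true
}

# Example usage, with the instance name and zone from the question:
#   gcloud compute instances get-serial-port-output instance-1 \
#     --zone=us-central1-a | scan_serial_log
```

Empty output only means none of these particular patterns appeared; it does not prove that the boot was healthy.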
