6
votes

I'm desperate for help here. I have a Compute Engine instance that hosts a lot of websites. These are the steps I took:

  1. Go to Compute Engine > Snapshots and take a snapshot of my instance

  2. Click on the newly created snapshot and click Create Instance.

  3. The new instance has all the configuration of the currently running instance (a rough gcloud sketch of these steps is included below).
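(For reference, roughly the same steps from the command line; the disk/instance names and zone are placeholders, and this assumes the gcloud CLI is installed and authenticated:)

    # Snapshot the live instance's boot disk
    gcloud compute disks snapshot my-live-disk --zone=us-central1-a \
        --snapshot-names=my-live-snapshot

    # Create a disk from the snapshot and boot a new instance from it
    gcloud compute disks create my-clone-disk --zone=us-central1-a \
        --source-snapshot=my-live-snapshot
    gcloud compute instances create my-clone-vm --zone=us-central1-a \
        --disk=name=my-clone-disk,boot=yes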

Then when I tried to access the new instance via SSH, it wouldn't work. Error message:

"Connection Failed We are unable to connect to the VM on port 22. Learn more about possible causes of this issue."

Clicking on Learn more gets me to https://cloud.google.com/compute/docs/ssh-in-browser#ssherror

The instance is booting up and sshd is not yet running - Not sure how to check this

The instance is not running sshd - Not sure how to check this either

sshd is listening on a port other than the one you are connecting to - My current instance has ssh running on port 22, so I guess this is fine?

There is no firewall rule allowing SSH access on the port - Again, I can ssh to my current instance, so I don't think it's because of the firewall, right?

The firewall rule allowing SSH access is enabled, but is not configured to allow connections from GCP Console services. - Same as above

The instance is shut down - Instance is still running.
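(A few commands that might help check the points above; the internal IP is a placeholder for the new instance's address, and the gcloud commands assume an authenticated gcloud CLI:)

    # List firewall rules and look for one that allows tcp:22
    # into the network the new instance is on
    gcloud compute firewall-rules list

    # From another VM in the same VPC network, test whether anything
    # is listening on port 22 of the new instance
    nc -zv INTERNAL_IP 22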

The strange thing is that if I create a fresh instance from scratch and then follow the same steps to clone it, the resulting instance can be accessed normally via SSH.

Can anyone show me how to fix this if possible? Or show me how to see logs and check what went wrong, etc.? I tried to google it but got pretty confused by all the jargon and where to find things. Sorry for the wall of text. Thanks

**Edit #1**: I got technical support from Google. The steps below might help someone else, but they didn't help me: when I reached step 7, I waited forever and never got to the login prompt.

1.) Go to the VM instances page and click on the Instance name of your VM.

2.) Click the Edit button at the top of the page.

3.) Under Custom metadata, click Add item.

4.) Set 'Key' to 'startup-script' and set 'Value' to this script:

    #! /bin/bash
    # Create USERNAME with sudo group membership and set its password
    useradd -G sudo USERNAME
    echo 'USERNAME:PASSWORD' | chpasswd

NOTE: change the value of USERNAME and PASSWORD to the name and password of your choice.

5.) Check the "Enable connecting to serial ports" box (below the SSH button).

6.) Click Save and then click RESET at the top of the page. Wait some time for the instance to reboot.

7.) Click 'Connect to serial port' on the page. In the new window you might need to wait a bit and press Enter once; then you should see the login prompt.

8.) Log in using the USERNAME and PASSWORD you provided.

Note: for your own data security, please do not share this username and password with anyone.
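(The same steps can also be done from the command line. This is a rough sketch with placeholder names; it assumes the gcloud CLI is set up and that startup.sh contains the script above:)

    # Attach the startup script and enable the serial console via metadata
    gcloud compute instances add-metadata my-clone-vm --zone=us-central1-a \
        --metadata-from-file startup-script=startup.sh
    gcloud compute instances add-metadata my-clone-vm --zone=us-central1-a \
        --metadata serial-port-enable=TRUE

    # Reboot the instance so the startup script runs, then open the serial console
    gcloud compute instances reset my-clone-vm --zone=us-central1-a
    gcloud compute connect-to-serial-port my-clone-vm --zone=us-central1-a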

As those steps couldn't help me, and the Google support representative looked at the logs but didn't see anything wrong, she suggested debugging SSH by following this guide https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-ssh#use_your_disk_on_a_new_instance, which I will do when I have time. I feel like I'm writing an essay. Will keep this posted.
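(For reference, that guide's approach is roughly to move the broken boot disk onto a throwaway debug VM and inspect it there. A rough sketch with placeholder names; the broken instance has to be stopped before its boot disk can be detached:)

    # Stop the broken instance, detach its boot disk, and attach it to a debug VM
    gcloud compute instances stop my-clone-vm --zone=us-central1-a
    gcloud compute instances detach-disk my-clone-vm --disk=my-clone-disk --zone=us-central1-a
    gcloud compute instances attach-disk debug-vm --disk=my-clone-disk --zone=us-central1-a

    # On debug-vm, mount the disk read-only and look at its logs and ssh config
    sudo mkdir -p /mnt/broken
    sudo mount -o ro /dev/sdb1 /mnt/broken    # device name may differ; check lsblk
    sudo less /mnt/broken/var/log/syslog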

3
Don't panic; as long as the disk is not deleted, the data is not lost. Some suggestions: 1) can you try cloning the broken instance? 2) create another instance, ssh to the broken instance via its internal IP and check sshd. – Dagang
No, I'm not panicking, thanks :) I was talking about the clone of my currently live instance, although I would panic if I suddenly couldn't ssh to my live instance. To clarify: 1) "can you try cloning the broken instance?" - the broken instance was cloned from the live instance, and I have tried cloning the live instance many times, but the cloned instance has never worked for me. 2) "create another instance, ssh to the broken instance via internal IP and check sshd" - I will google how to do this. For now, I'm contacting Google support to see if they can help me get this resolved. Fingers crossed. – traudong

3 Answers

1
vote

The troubleshooting steps that you can follow are:

  1. Use the serial console to view your instance logs and check whether the new instance you created from the snapshot failed to reach the run level at which the ssh daemon is started. If sshd was not started, you will not have ssh access to the instance.

  2. If it doesn't affect production, you can try restarting the instance and then attempt ssh access again. Some issue might have prevented the instance from starting up properly, and a restart could fix it.

  3. You can try creating another VM instance from the snapshot in case the previous instance wasn’t created properly.

  4. If creating a new VM instance from the snapshot doesn’t fix the issue, it might be that the snapshot itself wasn’t created properly. You can read this documentation guide, in particular the section Understanding snapshot best practices, and try creating another snapshot and a new VM instance from it.
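As a rough command-line sketch of steps 1 and 2 (the instance name and zone are placeholders, and this assumes the gcloud CLI):

    # Step 1: dump the serial console output and look for boot or sshd errors
    gcloud compute instances get-serial-port-output my-clone-vm --zone=us-central1-a

    # Step 2: if a restart is acceptable, reset the instance and try ssh again
    gcloud compute instances reset my-clone-vm --zone=us-central1-a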
0
votes

I had the same problem, and after a lot of searching I found an answer from user Peripheral on ServerFault that worked for me.

I found the fix for me. A recent update has a known issue where it removes the default gateway route. To fix it, I had to go to the instance and select Edit, scroll down, and under Custom Metadata put the following:

key: startup-script
value: route add default gw <gatewayIP> eth0

Save and restart the VM.

Source

All credit to them; I just want to share it to help others find their solution faster.
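If you want to confirm this is the cause before restarting, a quick check from the serial console might look like this (the gateway IP is only an example; use your own subnet's gateway, usually the .1 address):

    # Show the routing table; on an affected VM there is no "default" line
    ip route show

    # Temporarily re-add the default gateway by hand (example gateway 10.128.0.1)
    sudo route add default gw 10.128.0.1 eth0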

0
votes

I had the same issue. I eventually figured out that it was because I had attached a persistent disk and added an entry for it to the /etc/fstab file. This entry is supposed to automatically mount the attached disk when the instance restarts.

However, when I created a snapshot of the boot disk, I didn't remove the /etc/fstab entry. So creating a new instance from this snapshot always caused a boot error, because the boot process tried to mount a disk that was not attached. This information is present in the documentation.
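One way to make such snapshots safe to clone is the nofail mount option in /etc/fstab, which lets the instance finish booting even when the disk is absent. A sketch, where the UUID and mount point are placeholders for your own disk:

    # /etc/fstab entry for the extra persistent disk (placeholder UUID/mount point);
    # "nofail" keeps boot from failing when the disk is not attached
    UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee /mnt/disks/data ext4 discard,defaults,nofail 0 2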