My company is transitioning our operations to Google Cloud, and we have several instances running in Google Compute Engine. I have now had three instances (running Ubuntu 14.04) where I lose the ability to SSH in after weeks of everything working fine. Here is the output from several attempts to connect:
SSH from another instance on the same network:
ssh: connect to host 130.211.137.231 port 22: Connection refused
SSH from the Google Dev Console:
We are unable to connect to the VM on port 22. Learn more about possible causes of this issue.
SSH from PuTTY client: Network error: Connection refused
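If it matters, the serial console output is still retrievable without SSH with something like the following (the instance name and zone here are placeholders for ours):

# pull the serial console output for the affected instance
gcloud compute instances get-serial-port-output broken-instance --zone us-central1-a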
The most recent time this happened, the instance was still running. I have an NFS-shared directory that FTP'd files get written to, and those files are still being updated, so NFS is still mounted and exported, and cron jobs are still running.
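For reference, the export can still be queried from another instance with something like this (the internal address below is a placeholder for the broken instance's):

# confirm the broken instance is still serving its NFS export
showmount -e 10.240.0.5
rpcinfo -p 10.240.0.5 | grep -w nfs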
Running nmap from another instance on the same network gives the following:
vwadmin@vw-server:~$ nmap -Pn 130.211.137.231
Starting Nmap 6.40 ( http://nmap.org ) at 2015-03-09 15:41 UTC
Nmap scan report for 231.137.211.130.bc.googleusercontent.com (130.211.137.231)
Host is up (0.0019s latency).
Not shown: 997 filtered ports
PORT STATE SERVICE
22/tcp closed ssh
3389/tcp closed ms-wbt-server
8008/tcp closed http
Nmap done: 1 IP address (1 host up) scanned in 4.18 seconds
vwadmin@vw-server:~$
SSH was lost sometime late Friday evening. On Saturday evening I created a snapshot of the drive for troubleshooting. Looking at the log files, syslog and auth.log both stopped being written to on Friday evening (I'm guessing around the time we lost SSH).

Where/what should I be looking for in the system logs that could stop logs from being written and close all ports, yet leave NFS working and cron jobs running fine? This particular instance, which has broken twice, is currently only running a handful of lftp-type cron jobs. Please keep in mind that this is a cloud environment, so SSH is my only way into the instance itself; all I can do right now is look through the logs from the snapshot.
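In case it's useful, this is roughly how I'm getting at those logs from the snapshot (the disk/snapshot names, zone, and device path below are placeholders): the snapshot becomes a new disk, that disk gets attached to a healthy instance, and it is mounted read-only.

# turn the troubleshooting snapshot into a disk and attach it to a working instance
gcloud compute disks create debug-disk --source-snapshot broken-instance-snap --zone us-central1-a
gcloud compute instances attach-disk vw-server --disk debug-disk --zone us-central1-a

# on vw-server: find the new device (often /dev/sdb) and mount it read-only
lsblk
sudo mkdir -p /mnt/debug
sudo mount -o ro /dev/sdb1 /mnt/debug
less /mnt/debug/var/log/syslog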