3 votes

What happens when one of several local SSDs attached to a Compute Engine instance has a hardware failure? Specifically:

  1. Is the failure automatically detected by Google Cloud Platform?
  2. Is there a notification, such as by email?
  3. How long does it take for the drive to be replaced?
  4. Is the VM stopped and restarted after the replacement, or is it a hot-swap?
  5. Obviously, the data on that SSD is lost; however, what happens to the data on the other SSDs attached to the same virtual machine?

Edit: I am aware of the "ephemeral" nature of local SSDs, and plan to replicate my data on multiple machines across different zones in my primary region, with at least one replica in a completely different region. The database I plan to use is "data-center/rack aware". I am specifically looking for documentation/information about how Google Cloud Platform handles hardware failures in local SSDs.

Does this answer your question? Google Cloud - Local SSD hardware failure? - Martin Zeitler
@MartinZeitler Not really. I am aware of the "ephemeral" nature of local SSDs, and I will have data replicated across multiple zones, possibly even across multiple regions. I am looking for more information about what happens when a local SSD fails; I couldn't find anything in the GCP documentation. - user2101712
If the local SSD fails, the instance fails. A new instance will be launched with blank SSDs, and all data stored on all SSDs will be lost. You will need to set up Stackdriver monitoring and alerting to be notified (see the sketch after these comments). The drive is not replaced on the same VM instance. - John Hanley
@JohnHanley Thanks. Is this behavior documented anywhere by Google? While I do not doubt your knowledge on this topic, a link to an official document would be much appreciated! - user2101712
There is no document that I am aware of. I am a Google GDE, and my comment comes from personal experience and knowledge. I would have posted an answer if I had an authoritative reference to link to. - John Hanley
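
For reference, a minimal sketch of the kind of detection described above, assuming the google-cloud-logging Python client and a hypothetical project ID. The filter targets compute.instances.hostError entries in the system event audit log; verify the exact log name and method name against your own project's logs before relying on this:

    from google.cloud import logging

    # Hypothetical project ID -- replace with your own.
    client = logging.Client(project="my-project")

    # Assumption: host errors surface as compute.instances.hostError
    # entries in the system event audit log. Check your own project's
    # system event logs to confirm the exact filter.
    log_filter = (
        'resource.type="gce_instance" AND '
        'logName="projects/my-project/logs/'
        'cloudaudit.googleapis.com%2Fsystem_event" AND '
        'protoPayload.methodName="compute.instances.hostError"'
    )

    for entry in client.list_entries(filter_=log_filter):
        # Each entry identifies the affected instance and the event time,
        # which is enough to drive an email or pager notification.
        print(entry.timestamp, entry.resource.labels.get("instance_id"))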

3 Answers

2 votes

You might want to use persistent disks instead, because local SSDs may not fit your use case.

As the Adding Local SSDs documentation reads:

Local SSDs are suitable only for temporary storage such as caches, processing space, or low value data. If you store important data in a local SSD device, you must also store that same data in a durable storage option.
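
If you do keep important data on a local SSD, here is a minimal sketch of the "also store that same data in a durable storage option" part, using the google-cloud-storage Python client with a hypothetical bucket name and a hypothetical file path on the local SSD mount:

    from google.cloud import storage

    # Hypothetical names -- replace with your own bucket and paths.
    client = storage.Client()
    bucket = client.bucket("my-durable-backup-bucket")

    # Copy a file from the local SSD mount to Cloud Storage, so a host
    # error that wipes the local SSD does not take the only copy with it.
    blob = bucket.blob("backups/data.db")
    blob.upload_from_filename("/mnt/disks/local-ssd/data.db")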

1 vote
  1. Yes
  2. It depends. Block-level failures are just that and are passed directly through to the guest, so you might see read errors in your dmesg or similar. If an entire device fails, you get a hostError in your Cloud Logging logs for the instance. What happens next depends on your maintenance policy (see the sketch below).
  3. Drives are not replaced from the user's point of view; you can only get a new instance. (Of course, Google internally replaces broken hardware, but this is not exposed to the customer.)

Points 4 and 5 are a bit tricky to answer: when an automatic restart after a hostError happens, you have a 60-minute recovery timeout. In practice, this can mean your instance spends 60 minutes in a RUNNING but not booted state while Compute Engine tries to get a broken local SSD back, only to eventually fail and boot up with blank local SSDs.
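
To see which maintenance policy applies to your instance, a sketch using the google-cloud-compute Python client, with hypothetical project, zone, and instance names:

    from google.cloud import compute_v1

    # Hypothetical identifiers -- replace with your own.
    project, zone, instance_name = "my-project", "us-central1-a", "my-vm"

    client = compute_v1.InstancesClient()
    instance = client.get(project=project, zone=zone, instance=instance_name)

    # on_host_maintenance and automatic_restart together determine what
    # Compute Engine does with the VM after a host error.
    print(instance.scheduling.on_host_maintenance)  # e.g. "MIGRATE" or "TERMINATE"
    print(instance.scheduling.automatic_restart)    # True => restarted automatically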

Overall, I would recommend treating the instance as the failure domain rather than the individual disks, as any sort of issue is likely to lead to a hostError for the whole instance rather than a partial failure.

0 votes

I'd like to clarify #5.

If your VM experiences a host error, the Google documentation states:

If the host system experiences a host error, Compute Engine makes a best effort to reconnect to the VM and preserve the local SSD data, but might not succeed. If the attempt is successful, the VM restarts automatically. However, if the attempt to reconnect fails, the VM restarts without the data.

This means you aren't guaranteed to get your data back. That isn't fun, so plan accordingly and store your data in more reliable solutions such as persistent disks or Cloud Storage buckets.
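
One simple way to act on this at boot time, sketched with a hypothetical marker file on a hypothetical local SSD mount point: write the marker once the disk holds good data, and on every boot check whether it survived. If it is gone, the VM restarted without the data and re-replication from another node should be triggered:

    import os
    import sys

    # Hypothetical mount point and marker path -- adjust to your setup.
    MARKER = "/mnt/disks/local-ssd/.data-intact"

    def local_ssd_survived() -> bool:
        # The marker is written after the node is fully replicated.
        # If a host error wiped the local SSD, the marker is gone too.
        return os.path.exists(MARKER)

    if __name__ == "__main__":
        if local_ssd_survived():
            print("Local SSD data survived the restart.")
        else:
            print("Local SSD came back blank; trigger re-replication.")
            sys.exit(1)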