Azure SQL failover group, what does the grace period mean?

Question

I am currently reading this: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-auto-failover-group, and I have a hard time understanding the automatic failover policy:

By default, a failover group is configured with an automatic failover policy. The SQL Database service triggers failover after the failure is detected and the grace period has expired. The system must verify that the outage cannot be mitigated by the built-in high availability infrastructure of the SQL Database service due to the scale of the impact. If you want to control the failover workflow from the application, you can turn off automatic failover.

When defining the failover group in an ARM template:

{
  "condition": "[equals(parameters('redundancyId'), 'pri')]",
  "type": "Microsoft.Sql/servers",
  "kind": "v12.0",
  "name": "[variables('sqlServerPrimaryName')]",
  "apiVersion": "2014-04-01-preview",
  "location": "[parameters('location')]",
  "properties": {
    "administratorLogin": "[parameters('sqlServerPrimaryAdminUsername')]",
    "administratorLoginPassword": "[parameters('sqlServerPrimaryAdminPassword')]",
    "version": "12.0"
  },
  "resources": [
    {
      "condition": "[equals(parameters('redundancyId'), 'pri')]",
      "apiVersion": "2015-05-01-preview",
      "type": "failoverGroups",
      "name": "[variables('sqlFailoverGroupName')]",
      "properties": {
        "serverName": "[variables('sqlServerPrimaryName')]",
        "partnerServers": [
          {
            "id": "[resourceId('Microsoft.Sql/servers/', variables('sqlServerSecondaryName'))]"
          }
        ],
        "readWriteEndpoint": {
          "failoverPolicy": "Automatic",
          "failoverWithDataLossGracePeriodMinutes": 60
        },
        "readOnlyEndpoint": {
          "failoverPolicy": "Disabled"
        },
        "databases": [
          "[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]"
        ]
      },
      "dependsOn": [
        "[variables('sqlServerPrimaryName')]",
        "[resourceId('Microsoft.Sql/servers/databases', variables('sqlServerPrimaryName'), variables('sqlDatabaseName'))]",
        "[resourceId('Microsoft.Sql/servers', variables('sqlServerSecondaryName'))]"
      ]
    },
    {
      "condition": "[equals(parameters('redundancyId'), 'pri')]",
      "name": "[variables('sqlDatabaseName')]",
      "type": "databases",
      "apiVersion": "2014-04-01-preview",
      "location": "[parameters('location')]",
      "dependsOn": [
        "[variables('sqlServerPrimaryName')]"
      ],
      "properties": {
        "edition": "[variables('sqlDatabaseEdition')]",
        "requestedServiceObjectiveName": "[variables('sqlDatabaseServiceObjective')]"
      }
    }
  ]
},
{
  "condition": "[equals(parameters('redundancyId'), 'pri')]",
  "type": "Microsoft.Sql/servers",
  "kind": "v12.0",
  "name": "[variables('sqlServerSecondaryName')]",
  "apiVersion": "2014-04-01-preview",
  "location": "[variables('sqlServerSecondaryRegion')]",
  "properties": {
    "administratorLogin": "[parameters('sqlServerSecondaryAdminUsername')]",
    "administratorLoginPassword": "[parameters('sqlServerSecondaryAdminPassword')]",
    "version": "12.0"
  }
}

I specify the readWriteEndpoint like this:

    "readWriteEndpoint": {
      "failoverPolicy": "Automatic",
      "failoverWithDataLossGracePeriodMinutes": 60
    }

With a failoverWithDataLossGracePeriodMinutes set to 60 minutes.

What does this mean? I cannot find a clear answer anywhere. Does it mean that:

When an outage is happening in my primary region where my primary database resides, the read/write endpoint points to the primary and only after 60 minutes it fails over to my secondary, which becomes the new primary. In the 60 minutes, the only way to read my data is to use the readOnlyEndpoint directly? OR
My read/write endpoint is turned instantly, if they somehow can detect that there was no data to be synced

I think it boils down to: do I have to manually make the failover, if I detect an outage, if I don't care about data loss, but I want to be able to write to my database?

Bonus question: is the reason why the grace period is present because there can be unsynced data on the primary, that will be overwritten, or tossed away, if the secondary becomes the new primary (if i switch manually)?

Sorry, I can't keep it to only one question. I have read a lot and I really need to know this.

David Browne - Microsoft David Browne - Microsoft · Accepted Answer · 2019-06-15T19:25:31

What does this mean?

It means that:

"when a outage is happening in my primary region where my primary database resides, the read/write endpoint points to the primary and only after 60 minutes it fails over to my secondary, which becomes the new primary. "

It can't failover automatically even when the data is synced because the high-availability solution in the primary region is trying to do the same thing, and almost all of the time your primary database will come back quickly in the primary region. Performing an automatic cross-region fail-over would interfere with this.

And

"the reason why the grace period is present, is that because the there can be unsynced data on the primary, that will be overwritten, or tossed away, if the secondary becomes the new primary"

And to allow time for the database to failover within the primary region.

Azure SQL failover group, what does the grace period mean?

1 Answers