I have tried several times, on several subscriptions and with a couple of different accounts, and I keep running into exactly the same issue when deploying a new Service Fabric cluster through the Azure portal. I tried both secure and unsecure clusters (to rule out my certificate setup), and both 5-node clusters and single-node test clusters. In all cases the error was exactly the same.

At step 4, in all cases, the portal indicates that the portal-generated ARM template is valid and allows me to start the deployment. After about 10 minutes I get the dreaded Deployment Failed icon on my dashboard for the 20th time!

Clicking on the icon takes me to the error logs, which indicate that there was an issue with "Write Deployments".

I also see that all the necessary resource types have been generated (Storage Accounts, VM Scale Sets, etc.).

However, looking at the VM Scale Set, I see another (more descriptive) issue: a provisioning error with the code "ProvisioningState/failed/InternalDiskManagementError", stating that an internal disk management error occurred.

I am at a complete loss. I am not doing anything custom; this is all done in the Azure portal. As I mentioned, I tried both simple test clusters without security or logging and 5-node clusters with security and logging enabled. In all cases I get exactly the same error, and this is on 3 different Azure accounts.

The only other thing that I might try is different regions (I've only been targeting West US 2) and maybe some variants on the VM size (been targeting A0 for cost).

Has anyone else run into similar issues? I was able to deploy clusters a few months back, but ever since then I keep getting stopped by this bug!

UPDATE 1

I attempted a deployment in West US 2 using the A1_V2 VM size and again got the Write Deployment failure, but this time the VM Scale Set shows a different error:

ProvisioningState/failed/VMExtensionHandlerNonTransientError

Handler 'Microsoft.Azure.Diagnostics.IaaSDiagnostics' has reported failure for VM Extension 'VMDiagnosticsVmExt_vmNodeType0Name' with terminal error code '1007' and error message: 'Install failed for plugin (name: Microsoft.Azure.Diagnostics.IaaSDiagnostics, version 1.10.0.0) with exception Command C:\Packages\Plugins\Microsoft.Azure.Diagnostics.IaaSDiagnostics\1.10.0.0\DiagnosticsInstall.cmd of Microsoft.Azure.Diagnostics.IaaSDiagnostics has not exited on time! Killing it...'

UPDATE 2

I made a deployment in Central US using a D sized VM and it deployed just fine. At this point it seems that either the region or the VM size is causing the issue. I'm going to make a few more deployments using various VM sizes and regions and will keep updating here with my findings.

UPDATE 3

Was able to create a single node Standard_D1_v2 cluster in West US 2.

UPDATE 4

Was able to create a 3 node Standard_A2_v2 cluster in West US 2.

So the region is not the issue.

UPDATE 5

A second attempt at deploying an A1_V2 VM in West US 2 resulted in the same error as the last time this VM size was used:

ProvisioningState/failed/VMExtensionHandlerNonTransientError

FINAL UPDATE

The issue seems to be that the VMs I was using are underpowered.

I hope that Microsoft updates the portal so the next developer does not run into the same issue. Right now the portal makes you think that your setup is valid (it even passes the validation in step 4) and then fails without any clear explanation. I opened a support ticket, and even the Azure techs are giving me the runaround and having me check my Resource Provider settings! They have no clue that I'm using insufficient VM sizes!

I also think it's way too expensive for developers to have to pay so much just to get some test nodes up in the cloud. And I'm still perplexed that I was once able to get a 5-node A0 cluster up and running, but no longer can! Maybe there has been a Service Fabric software update since then?

1 Answer

  • The recommended VM SKU is Standard D3 or Standard D3_V2 or equivalent with a minimum of 14 GB of local SSD.
  • The minimum supported use VM SKU is Standard D1 or Standard D1_V2 or equivalent with a minimum of 14 GB of local SSD.
  • Partial core VM SKUs like Standard A0 are not supported for production workloads.
  • Standard A1 SKU is not supported for production workloads for performance reasons.

Source

These errors are usually caused by using unsupported VM sizes. As a workaround for test clusters, you can first deploy using something like D3_V2 and, after a successful deployment, scale down.
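
Since the portal validation doesn't flag this, a quick pre-flight check before deploying may save a failed run. Below is a minimal sketch in Python that encodes only the four SKU rules quoted above. The function name `classify_sku` and the lookup tables are hypothetical, made up for illustration; they are not part of any Azure SDK, and the list is not complete (for example, A1_V2 from the question is not covered by the quoted rules, so it comes back as unknown).

```python
# Pre-flight check built only from the SKU guidance quoted in this answer.
# The tables are illustrative assumptions, not an official or complete list.

RECOMMENDED = {"STANDARD_D3", "STANDARD_D3_V2"}
MINIMUM_SUPPORTED = {"STANDARD_D1", "STANDARD_D1_V2"}
UNSUPPORTED = {
    "STANDARD_A0": "partial core SKU",
    "STANDARD_A1": "performance reasons",
}


def normalize(sku: str) -> str:
    """Turn inputs like 'A0', 'Standard D3' or 'standard_d1_v2' into one form."""
    key = sku.strip().upper().replace(" ", "_")
    if not key.startswith("STANDARD_"):
        key = "STANDARD_" + key
    return key


def classify_sku(sku: str) -> str:
    """Classify a VM size against the quoted Service Fabric guidance."""
    key = normalize(sku)
    if key in RECOMMENDED:
        return "recommended"
    if key in MINIMUM_SUPPORTED:
        return "minimum supported"
    if key in UNSUPPORTED:
        return f"unsupported for production ({UNSUPPORTED[key]})"
    # Anything not named in the quoted rules needs a manual check.
    return "unknown: check the documentation"


if __name__ == "__main__":
    for size in ["A0", "A1_V2", "Standard_D1_v2", "Standard D3_V2"]:
        print(f"{size}: {classify_sku(size)}")
```

An "unknown" result only means the size is not named in the quoted rules; it says nothing about whether the deployment will actually succeed.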