I have an Azure Function (v2) that accesses Cosmos DB, but not through a binding (we need to use custom serialization settings). I've followed the example here for setting up an object that should then be available to all instances of the activity function. Mine is a little different because our custom CosmosDb object requires an await for setup.

public static class AnalyzeActivityTrigger
{
    private static readonly Lazy<Task<CosmosDb>> LazyCosmosDb = new Lazy<Task<CosmosDb>>(InitializeDocumentClient);
    private static Task<CosmosDb> CosmosDb => LazyCosmosDb.Value;

    private static Task<CosmosDb> InitializeDocumentClient()
    {
        return StorageFramework.CosmosDb.GetCosmosDb(DesignUtilities.Storage.CosmosDbContainerDefinitions, DesignUtilities.Storage.CosmosDbMigrations);
    }

    [FunctionName(nameof(AnalyzeActivityTrigger))]
    public static async Task<Guid> Run(
        [ActivityTrigger]DurableActivityContext context,
        ILogger log)
    {
        var analyzeActivityRequestString = context.GetInput<string>();
        var analyzeActivityRequest = StorageFramework.Storage.Deserialize<AnalyzeActivityRequest>(analyzeActivityRequestString);
        var componentDesign = StorageFramework.Storage.Deserialize<ComponentDesign>(analyzeActivityRequest.ComponentDesignString);

        var (analysisSet, _, _) = await AnalysisUtilities.AnalyzeComponentDesignAndUploadArtifacts(componentDesign,
            LogVariables.Off, new AnalysisLog(), Stopwatch.StartNew(), analyzeActivityRequest.CommitName, await CosmosDb);

        return analysisSet.AnalysisReport.Guid;
    }
}

We fan out, calling this activity function in parallel. Our documents are fairly large, so updating them is expensive, and that happens as part of this code.

I sometimes get this error when container.ReplaceItemAsync is called:

Response status code does not indicate success: 408 Substatus: 0 Reason: (Message: Request timed out. ...

The obvious thing to do seems to be to increase the timeout, but could this be indicative of some other problem? Increasing the timeout seems like addressing the symptom rather than the problem. We have code that scales up our RUs before all this happens, too. I'm wondering if it has to do with Azure Functions fanning out and that putting too much load on it. So I've also played around with adjusting the host.json settings for durableTask like maxConcurrentActivityFunctions and maxConcurrentOrchestratorFunctions, but to no avail so far.
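For reference, here is a minimal host.json sketch showing where those two durableTask settings live, assuming the Functions v2 schema. The values of 5 are placeholders for illustration, not the values we tried and not a recommendation.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 5,
      "maxConcurrentOrchestratorFunctions": 5
    }
  }
}
```

Lower values reduce how many activity functions run at once on a single host instance, which in turn limits how many Cosmos DB writes that instance issues in parallel.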

How should I approach this 408 error? What steps can I consider to mitigate it other than increasing the request timeout?

Update 1: I increased the default request timeout to 5 minutes and now I'm getting 503 responses.

Update 2: Pointing to a clone published to an Azure Function on the Premium plan seems to work after multiple tests.

Update 3: We weren't testing it hard enough. The problem is exhibited on the Premium plan as well. GitHub Issue forthcoming.

Update 4: We seem to have solved this by a combination of using Gateway mode in connecting to Cosmos and increasing RUs.
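A minimal sketch of the Gateway mode part of Update 4, assuming the custom CosmosDb object wraps a v3 SDK CosmosClient. The CosmosClientFactory class and CreateGatewayClient method are hypothetical names for illustration, not part of our actual code.

```csharp
using Microsoft.Azure.Cosmos;

// Hypothetical helper; the real setup lives inside our custom CosmosDb object.
public static class CosmosClientFactory
{
    public static CosmosClient CreateGatewayClient(string connectionString)
    {
        var options = new CosmosClientOptions
        {
            // Gateway mode sends requests over HTTPS through the Cosmos DB gateway
            // instead of opening direct TCP connections to the backend replicas.
            ConnectionMode = ConnectionMode.Gateway
        };

        return new CosmosClient(connectionString, options);
    }
}
```

Gateway mode typically opens fewer connections per client than Direct mode, which may be why it helped under heavy fan-out, though we have not confirmed the root cause.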

A 408 error indicates a timeout, not exceeded throughput (which usually returns a 429), so the first thing I would check is whether throughput is really the problem and not something else. The default request timeout is 60 seconds. What do you have your retry policy set to? - Nathan Bierema
@NathanBierema The retry policy has to do with throttling for exceeded throughput, not the timeout, right? That is what the doc seems to say, and the property MaxRetryAttemptsOnThrottledRequests seems to indicate that purpose as well. - Scotty H

1 Answer

A timeout can indeed signal issues regarding instance resources. Reference: https://docs.microsoft.com/azure/cosmos-db/troubleshoot-dot-net-sdk#request-timeouts

If you are running on Functions, take a look at the number of connections. Also check CPU usage on the instances. High CPU increases request latency, and requests can end up timing out.
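To tell a client-side timeout apart from other failures, it can help to catch the 408 explicitly and log the full exception. The following is only a sketch, assuming the v3 SDK; the CosmosTimeoutLogging class and ReplaceWithTimeoutLoggingAsync method are hypothetical names, and the id and partition key are whatever your documents use.

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Logging;

// Hypothetical wrapper around the ReplaceItemAsync call that is timing out.
public static class CosmosTimeoutLogging
{
    public static async Task<T> ReplaceWithTimeoutLoggingAsync<T>(
        Container container, T item, string id, PartitionKey partitionKey, ILogger log)
    {
        try
        {
            ItemResponse<T> response = await container.ReplaceItemAsync(item, id, partitionKey);

            // The request charge shows how expensive each large-document replace is.
            log.LogInformation("Replace of {Id} consumed {RequestCharge} RUs", id, response.RequestCharge);
            return response.Resource;
        }
        catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.RequestTimeout)
        {
            // 408 is a timeout, which is distinct from 429 throttling.
            // Log the full exception text, then rethrow so the activity still fails.
            log.LogError(ex, "Cosmos DB request timed out for item {Id}: {Details}", id, ex.ToString());
            throw;
        }
    }
}
```

Correlating the timestamps of these log entries with the CPU and connection metrics of the Function instances should show whether the timeouts line up with resource pressure on the client.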

For Functions, you can certainly use DI to avoid the whole Lazy declaration: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/AzureFunctions

Create a Startup.cs file with:

using System;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Functions.Extensions.DependencyInjection;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

[assembly: FunctionsStartup(typeof(YourNameSpace.Startup))]

namespace YourNameSpace
{
    public class Startup : FunctionsStartup
    {
        public override void Configure(IFunctionsHostBuilder builder)
        {
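            // Register a single CosmosClient for the lifetime of the host so that
            // every function invocation reuses the same instance and its connections.
            builder.Services.AddSingleton((s) => {
                CosmosClient cosmosClient = new CosmosClient("connection string");

                return cosmosClient;
            });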
        }
    }
}

Then make your Function class non-static and inject the client through the constructor:

public class AnalyzeActivityTrigger
{
    private readonly CosmosClient cosmosClient;
    public AnalyzeActivityTrigger(CosmosClient cosmosClient)
    {
        this.cosmosClient = cosmosClient;
    }

    [FunctionName(nameof(AnalyzeActivityTrigger))]
    public async Task<Guid> Run(
        [ActivityTrigger]DurableActivityContext context,
        ILogger log)
    {
        var analyzeActivityRequestString = context.GetInput<string>();
        var analyzeActivityRequest = StorageFramework.Storage.Deserialize<AnalyzeActivityRequest>(analyzeActivityRequestString);
        var componentDesign = StorageFramework.Storage.Deserialize<ComponentDesign>(analyzeActivityRequest.ComponentDesignString);

        var (analysisSet, _, _) = await AnalysisUtilities.AnalyzeComponentDesignAndUploadArtifacts(componentDesign,
            LogVariables.Off, new AnalysisLog(), Stopwatch.StartNew(), analyzeActivityRequest.CommitName, this.cosmosClient);

        return analysisSet.AnalysisReport.Guid;
    }
}