
I want to delete 20-30k items in bulk. Currently I am using the method below to delete these items, but it is taking 1-2 minutes.

private async Task DeleteAllExistingSubscriptions(string userUUId)
{
    var subscriptions = await _repository
        .GetItemsAsync(x => x.DistributionUserIds.Contains(userUUId), o => o.PayerNumber);

    if (subscriptions.Any())
    {
        List<Task> bulkOperations = new List<Task>();
        foreach (var subscription in subscriptions)
        {
            bulkOperations.Add(_repository
                .DeleteItemAsync(subscription.Id.ToString(), subscription.PayerNumber)
                .CaptureOperationResponse(subscription));
        }
        await Task.WhenAll(bulkOperations);
    }
}

Cosmos client: as you can see, I have already set AllowBulkExecution = true.

private static void RegisterCosmosClient(IServiceCollection serviceCollection, IConfiguration configuration)
{
    string cosmosDbEndpoint = configuration["CosmoDbEndpoint"];

    Ensure.ConditionIsMet(cosmosDbEndpoint.IsNotNullOrEmpty(),
        () => new InvalidOperationException("Unable to locate configured CosmosDB endpoint"));

    var cosmosDbAuthKey = configuration["CosmoDbAuthkey"];

    Ensure.ConditionIsMet(cosmosDbAuthKey.IsNotNullOrEmpty(),
        () => new InvalidOperationException("Unable to locate configured CosmosDB auth key"));

    serviceCollection.AddSingleton(s => new CosmosClient(cosmosDbEndpoint, cosmosDbAuthKey,
        new CosmosClientOptions { AllowBulkExecution = true }));
}

Is there any way to delete these items in a batch with the CosmosDB SDK 3.0 in less time?

I do it exactly how you do it, but I never do it for that many items or that often to want to optimize my routines. Hopefully one of those CosmosDB masters shows up. It would be interesting if there is a better way to do this. – Andy

1 Answer


Please check the metrics to see whether the operations you are sending are getting throttled because your provisioned throughput is not enough.

Bulk just improves the client-side aspect of sending that data by optimizing how it flows from your machine to the account, but if your container is not provisioned to handle that volume of operations, then operations will get throttled and the overall time to complete will be longer.
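To see throttling from the client side, you can catch HTTP 429 responses on the individual delete tasks. A rough sketch, assuming direct SDK v3 calls against a Container field named _container and an item type named Subscription (stand-ins for the repository abstraction in the question):

```csharp
using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sketch only: detect HTTP 429 (throttling) while awaiting the bulk deletes.
var deleteTasks = subscriptions.Select(async s =>
{
    try
    {
        await _container.DeleteItemAsync<Subscription>(
            s.Id.ToString(), new PartitionKey(s.PayerNumber));
    }
    catch (CosmosException ce) when (ce.StatusCode == HttpStatusCode.TooManyRequests)
    {
        // Throttled: the container's provisioned RU/s is too low for this volume.
        Console.WriteLine($"429 deleting {s.Id}, retry after {ce.RetryAfter}");
    }
});
await Task.WhenAll(deleteTasks);
```

If you see many 429s here (or throttled-request spikes in the portal metrics), the fix is on the provisioning side, not in the client code.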

As with any data flow scenario, the bottlenecks are:

  • The source environment cannot process the data as fast as you want, which would show up as a bottleneck/spike on the machine's CPU (processing more data requires more CPU).
  • The network's bandwidth has limitations; in some cases the network has limits on the amount of data it can transfer or even the number of connections it can open. If the machine you are running the code on has such limitations (for example, Azure VMs have SNAT limits, Azure App Service has TCP connection limits) and you are running into them, new connections might get delayed, thus increasing latency.
  • The destination has limits on the number of operations it can process (in the form of provisioned throughput in this case).
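For the last point, one way to size the destination's provisioned throughput is to read the RU charge the SDK reports on each response and sum it over the batch. A minimal sketch, again assuming direct SDK v3 calls rather than the question's repository wrapper:

```csharp
using Microsoft.Azure.Cosmos;

// Sketch: each response exposes the request units consumed, which you can
// aggregate to estimate the RU/s the container needs during the delete window.
ItemResponse<Subscription> response = await _container.DeleteItemAsync<Subscription>(
    subscription.Id.ToString(), new PartitionKey(subscription.PayerNumber));
double ruCharge = response.RequestCharge; // RUs consumed by this single delete
```

Multiplying the per-delete charge by the rate of deletes you want per second gives a rough lower bound on the RU/s the container must be provisioned with.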