0
votes

I've been getting some odd issues with Cosmos DB as part of a data migration. The migration consisted of deleting and recreating our production collection and then using the Azure Cosmos DB Data Migration Tool to copy documents from our development collection. I wanted a full purge of the data already in the production collection rather than copying the new documents on top, so I followed these steps:

  1. Deleted the production collection, named “Production_Products”
  2. Recreated the production collection with the same name and partition key
  3. Using the Azure Cosmos DB Data Migration Tool, I copied the documents from our development collection into the newly created and empty production collection “Production_Products”
  4. Once the migration was complete we tested the website and we kept getting the following error…

    Microsoft.Azure.Documents.NotFoundException: at Microsoft.Azure.Documents.AddressResolver.EnsureRoutingMapPresent

This was very confusing as we could query the data from Azure no problem. After multiple application restarts and checking the config we created a new collection “Production_Products_Test” and repeated the migration steps.

This worked fine. Later in the day we tried to revert our changes by creating a new collection with the original name “Production_Products”, and that failed. We had to go back to using the “_Test” collection.

Can anyone offer any insight into why this is happening?

1
Was the website running while the collection was deleted/recreated and was the website holding a singleton instance of the Cosmosclient? Which version of the SDK and language are you using? - Matias Quaranta
@MatiasQuaranta The website is a microservices app orchestrated using Kubernetes. There were at least 2 C# .NET Core services that had a singleton instance against that Cosmos DB collection when it was deleted and recreated with the same name. The SDK is Microsoft.Azure.DocumentDB.Core 2.13. This database is replicated across 2 regions. There were 3 instances running per region, each holding a client to the collection. We restarted the services in one region (West Europe) but not North Europe. Is it possible that the services in North Europe are holding on to the old version of the collection? - JGilmartin

1 Answer

2
votes

Based on the comments:

The DocumentClient maintains address caches. If you delete and recreate the collection externally (not through the DocumentClient, or at least not through that particular DocumentClient instance, since you describe many services), the address cache held by that instance can become invalid. Newer versions of the SDK contain fixes that react to this and refresh the cache (see the change log here https://docs.microsoft.com/azure/cosmos-db/sql-api-sdk-dotnet).

The SDK 2.1.3 is rather old (more than 2 years) and the recommendation would be to update it (2.10.3 is the latest at this point).

The reason for the invalidation of those caches is that when you delete and recreate, the new collection has a different ResourceId.
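If you want to see this for yourself, a minimal sketch along these lines prints the identifiers of a collection addressed by name. It assumes the v2 SDK (Microsoft.Azure.DocumentDB.Core) and an already configured DocumentClient; the class name `CollectionIdentity` and the `databaseId` parameter are made up for illustration. Running it from a freshly started process before and after the delete and recreate should show the same Id but a different ResourceId.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class CollectionIdentity
{
    // Reads the collection by name and prints its identifiers.
    // databaseId is a hypothetical parameter: pass your own database name.
    public static async Task PrintAsync(DocumentClient client, string databaseId, string collectionId)
    {
        // Name-based link, independent of the collection's ResourceId.
        Uri collectionUri = UriFactory.CreateDocumentCollectionUri(databaseId, collectionId);

        ResourceResponse<DocumentCollection> response =
            await client.ReadDocumentCollectionAsync(collectionUri);
        DocumentCollection collection = response.Resource;

        Console.WriteLine($"Id:         {collection.Id}");          // the name you chose
        Console.WriteLine($"ResourceId: {collection.ResourceId}");  // assigned by the service
        Console.WriteLine($"SelfLink:   {collection.SelfLink}");    // built from ResourceIds
    }
}
```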

Having said that, there is one scenario that won't be easily fixed: your code uses ResourceIds (for example, through SelfLinks) instead of the names/ids to do operations. In that case, if you are caching or holding a reference to the ResourceId of the previous collection, those requests will fail once the collection has been deleted and recreated. Instead, you would need to use the names/ids through UriFactory.
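As a rough sketch of name-based addressing with the v2 SDK: the snippet below builds the document link with UriFactory on every call instead of caching a SelfLink. The collection name is taken from the question; the `ProductReader` class, the database id `ProductsDb`, and the assumption that you know the partition key value are all hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class ProductReader
{
    // Hypothetical database id; the collection name comes from the question.
    private const string DatabaseId = "ProductsDb";
    private const string CollectionId = "Production_Products";

    public static async Task<Document> ReadProductAsync(
        DocumentClient client, string documentId, string partitionKeyValue)
    {
        // Name-based link built from ids only. It carries no ResourceId,
        // so it stays the same after the collection is recreated.
        Uri documentUri = UriFactory.CreateDocumentUri(DatabaseId, CollectionId, documentId);

        ResourceResponse<Document> response = await client.ReadDocumentAsync(
            documentUri,
            new RequestOptions { PartitionKey = new PartitionKey(partitionKeyValue) });

        return response.Resource;
    }
}
```

The pattern to avoid is storing something like `collection.SelfLink` in a static field at startup and reusing it for later requests, because that string is tied to the ResourceId of the collection that existed at that moment.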

Normally in these cases, knowing the full stack trace of the exception (not just the name of the type) helps to understand exactly what is going on.
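One way to capture that, sketched under the assumption that the failure surfaces as (or derives from) DocumentClientException in the v2 SDK: wrap the call and log `ex.ToString()`, which includes the message, inner exceptions, and stack trace. The `CosmosDiagnostics` helper is a hypothetical name, and writing to the console stands in for whatever logger your services use.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;

public static class CosmosDiagnostics
{
    // Runs a Cosmos DB operation, logs full exception details on failure,
    // then rethrows so the caller's behavior is unchanged.
    public static async Task<T> RunAndLogAsync<T>(Func<Task<T>> operation)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex)
        {
            if (ex is DocumentClientException dce)
            {
                // StatusCode and ActivityId are useful when opening a support case.
                Console.Error.WriteLine($"StatusCode: {dce.StatusCode}, ActivityId: {dce.ActivityId}");
            }

            // ToString() includes the type, message, inner exceptions, and stack trace.
            Console.Error.WriteLine(ex.ToString());
            throw;
        }
    }
}
```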