
I am able to successfully copy data from Azure SQL to Azure Cosmos DB using Data Factory. After the first copy I want to keep it updated from the same SQL source, but here is the problem:

Let's assume I have a SQL table like this:

Col1   Col2
val11  val21
val12  val22
val13  val21
val11  val23
  1. In the Data Factory copy activity I chose SQL as the source and Cosmos DB as the sink. I chose the upsert option for copying.

  2. In the mapping section of Azure Data Factory I set up Col1 mapped to /primarykey of my Cosmos DB collection and Col2 mapped to Col2 (see the sketch after this list).

  3. Now I executed the Data Factory pipeline. It copied successfully and the items look like the following:

   ID            PartitionKey         Document
   `SomeGuid1`      val11              { Col2: val21 }
   `SomeGuid2`      val12              { Col2: val22 }
   `SomeGuid3`      val13              { Col2: val21 }
   `SomeGuid4`      val11              { Col2: val23 }
  4. Up to this point it is fine. But let's say the SQL data has changed and the first row has become (val11, val25). Now I want to copy this SQL table to Cosmos DB again.
  5. When I run the Data Factory pipeline again, it copies all the rows again and duplicates the data:
   ID            PartitionKey         Document
   `SomeGuid1`      val11              { Col2: val21 }
   `SomeGuid2`      val12              { Col2: val22 }
   `SomeGuid3`      val13              { Col2: val21 }
   `SomeGuid4`      val11              { Col2: val23 }
   `SomeGuid5`      val11              { Col2: val25 }  <-- changed value
   `SomeGuid6`      val12              { Col2: val22 }
   `SomeGuid7`      val13              { Col2: val21 }
   `SomeGuid8`      val11              { Col2: val23 }
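
For reference, here is a minimal sketch of what the copy activity JSON looks like on my side. The dataset names are placeholders and the exact translator syntax may differ by Data Factory version; the relevant parts are the upsert write behavior and the column mappings from steps 1 and 2:

    {
        "name": "CopySqlToCosmos",
        "type": "Copy",
        "inputs": [ { "referenceName": "AzureSqlTableDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "CosmosDbCollectionDataset", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "AzureSqlSource" },
            "sink": {
                "type": "CosmosDbSqlApiSink",
                "writeBehavior": "upsert"
            },
            "translator": {
                "type": "TabularTranslator",
                "mappings": [
                    { "source": { "name": "Col1" }, "sink": { "path": "$['primarykey']" } },
                    { "source": { "name": "Col2" }, "sink": { "path": "$['Col2']" } }
                ]
            }
        }
    }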

But I don't want the data duplicated like this. I want the copy to replace the old data, so after the second copy the data should be:

   ID            PartitionKey         Document
   `SomeGuid5`      val11              { Col2: val25 }  <-- changed value
   `SomeGuid6`      val12              { Col2: val22 }
   `SomeGuid7`      val13              { Col2: val21 }
   `SomeGuid8`      val11              { Col2: val23 }

I think it is duplicating all the rows because upsert matches on the id property, but since the source doesn't provide an id, Data Factory generates one; on the second copy it generates new ids again and adds new documents instead of replacing the existing ones.
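
To make that concrete: with no id mapped, the stored documents look something like this (the ids shown are the placeholder GUIDs from the tables above), so the first and second copies of the same source row never match on id:

    { "id": "SomeGuid1", "primarykey": "val11", "Col2": "val21" }
    { "id": "SomeGuid5", "primarykey": "val11", "Col2": "val25" }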

The way we do this in SQL is to copy into a temp table and then swap the table name with the original table after the full copy. But Cosmos DB doesn't support renaming a container: https://feedback.azure.com/forums/263030-azure-cosmos-db/suggestions/19466575-add-ability-to-rename-databases-and-collections

Any help will be highly appreciated. We like Cosmos DB and would love to use it.

This doesn't appear to have any relation to SQL Server, so why have you tagged it? – Larnu
I am copying data from SQL Server, which is the source for this. Someone who knows both SQL and Cosmos may be able to answer. I have removed the sql tag now. – Gyan Prakash
"I am copying data from SQL Server, which is the source for this." You don't actually mention that in your question. If that is the case, you should, and you should say where that SQL Server source is (is it in Azure as well? Is it a local instance or on a host on the network?). – Larnu
It is Azure SQL. I will edit the question to reflect that. – Gyan Prakash

1 Answer


https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-cosmos-db#azure-cosmos-db-sql-api-as-sink

Note: Data Factory automatically generates an ID for a document if an ID isn't specified either in the original document or by column mapping. This means that you must ensure that, for upsert to work as expected, your document has an ID.

If "Col1" is your unique key by which you identify duplicates, you should map it to the id field. You don't need to use Guids for the id field.

https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping