
Note: I asked a very similar question to this previously, but was not clear enough on exactly what I was looking for, and marked an answer too aggressively. I am looking for a confirmed yes/no on a specific point.

I want to build an automated job that performs offline processing on DocumentDb documents by querying the DocumentDb on a schedule, looking for documents that have changed since the last time the check was performed.

Given the metadata available in DocumentDb, it looks like the way to do this would be the following:

  • The first time the process runs, retrieve all documents.
  • Store the largest _ts value from the result set as highWatermark, along with the IDs and eTags of the documents that have that particular value as their _ts value.
  • For each subsequent query, include a "WHERE _ts >= highWatermark" clause. Filter out the previously-recorded documents whose eTags have not changed. The result will be the set of all changes since the last time the query ran.
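To make the three steps above concrete, here is a minimal in-memory sketch. It is not DocumentDb client code: `poll_changes` and the plain-dict documents are hypothetical stand-ins, and the list comprehension only mimics what the "WHERE _ts >= highWatermark" query would return. Each document is assumed to carry `id`, `_ts` and `_etag`.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Doc = dict  # assumed shape: {"id": str, "_ts": int, "_etag": str}


def poll_changes(
    docs: Iterable[Doc],
    high_watermark: Optional[int],
    seen_at_watermark: Dict[str, str],
) -> Tuple[List[Doc], Optional[int], Dict[str, str]]:
    """Return (changed docs, new watermark, id -> eTag of docs at the new watermark)."""
    # Stand-in for "WHERE _ts >= highWatermark"; the first run retrieves everything.
    candidates = [
        d for d in docs if high_watermark is None or d["_ts"] >= high_watermark
    ]

    # Filter out documents recorded on the previous run whose eTag is unchanged.
    changed = [d for d in candidates if seen_at_watermark.get(d["id"]) != d["_etag"]]

    if not candidates:
        # Nothing returned: keep the previous state as it is.
        return changed, high_watermark, seen_at_watermark

    # Store the largest _ts plus the IDs and eTags of the documents that have it.
    new_watermark = max(d["_ts"] for d in candidates)
    new_seen = {d["id"]: d["_etag"] for d in candidates if d["_ts"] == new_watermark}
    return changed, new_watermark, new_seen
```

The job would persist the returned watermark and the id-to-eTag map between runs and pass them back in on the next poll. Whether this is complete still depends on the _ts ordering guarantee asked about below.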

My question is: is this guaranteed to work? Is it guaranteed that this will not miss any documents? As far as I can tell, it comes down to the transactional semantics around _ts within DocumentDb's implementation, which are not documented to this level of detail. I want to know whether it's guaranteed that no document can be updated with a _ts value that is lower than the largest _ts value returned by a query that returns the most-recently changed document in the collection.

EDIT, prompted by David's comment:

To be a little more precise, with a couple of specific scenarios:

  1. If updates for two documents, D0 and D1, are applied to the database at T0 and T1 (where T1 > T0, such that an arbitrary query may return D0 but not D1), is it possible that D0._ts > D1._ts? The use of strictly-greater-than is intentional, as my proposed implementation deals with multiple updates receiving the same _ts but only some of them being retrieved by a query.
  2. Assume I execute my implementation's query at time T0, and the query takes a long time to run, and/or requires a couple of ExecuteNextAsync() calls to pull multiple batches from the server. During that period, 2 different documents (D1 and D2) are updated, getting _ts values of T1 and T2 (where T1 < T2). Is it possible for D2 to appear in the result set? More importantly, if it does, is D1 guaranteed to be included?
1
Let's say you have a very large set of documents to go through. You set the highwater mark for _ts, but during processing, one of the previously-processed documents gets updated by another process, thereby having a newer timestamp than your highwater mark. Wouldn't this be an edge case where you'd miss document updates in a future processing pass? - David Makogon
@DavidMakogon, I have added some precision to my question. Your scenario represents part of what I'm trying to figure out - if a document D gets updated while query results are being returned, is it possible for D._ts to be strictly less than the largest _ts in the result set? You get these kinds of guarantees with rowversion in SQL Server because rowversion values are guaranteed to monotonically increase when committed, but since _ts is based on a wall-clock timestamp I don't know what its guarantees are in these kinds of scenarios. - nlawalker

1 Answer


With the default consistency level this is not guaranteed to work, because a document with a lower _ts can show up later. However, if you can guarantee that your update requests are far enough apart (say, 60 seconds), then the risk is very low.
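To illustrate the failure mode, here is a small simulation of that ordering. It says nothing about how DocumentDb actually assigns or exposes _ts; `query_since` is a hypothetical stand-in for the "WHERE _ts >= highWatermark" query, and the sequence of events is simply assumed.

```python
def query_since(visible_docs, high_watermark):
    """Stand-in for: SELECT * FROM c WHERE c._ts >= highWatermark."""
    return [d for d in visible_docs if d["_ts"] >= high_watermark]


# Poll 1: the reader sees only D1 (stamped 101). D0 was stamped 100
# but, in this assumed scenario, is not yet visible to the reader.
visible = [{"id": "D1", "_ts": 101}]
first = query_since(visible, 0)
high_watermark = max(d["_ts"] for d in first)

# D0 shows up afterwards with a _ts below the recorded watermark.
visible.append({"id": "D0", "_ts": 100})

# Poll 2: the >= filter excludes D0, so it is never picked up.
second = query_since(visible, high_watermark)

returned_ids = {d["id"] for d in first + second}
missed = [d["id"] for d in visible if d["id"] not in returned_ids]
print(missed)  # ['D0']
```

Spacing updates far apart, or querying with some overlap below the watermark and deduplicating on eTag, narrows this window but does not turn it into a guarantee.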

I don't think David's edge case is a worry so long as you treat every document with a higher _ts as new.

You might also want to consider an append-only approach using something like Richard Snodgrass' temporal model. That makes the idempotency semantics easier.
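As a rough sketch of what "append-only" could look like here (one possible reading only, not Snodgrass' full temporal model): every update is written as a new immutable version document, and the processor remembers which version IDs it has handled. The names `append_version` and `process_new` and the `entity_id:version` ID scheme are hypothetical.

```python
def append_version(store, entity_id, body):
    """Record an update as a new immutable document instead of overwriting."""
    version = 1 + sum(1 for d in store if d["entity_id"] == entity_id)
    doc = {
        "id": f"{entity_id}:{version}",  # assumed ID scheme, unique per version
        "entity_id": entity_id,
        "version": version,
        "body": body,
    }
    store.append(doc)
    return doc


def process_new(store, processed_ids):
    """Return versions not handled yet; re-running over the same data is harmless."""
    fresh = [d for d in store if d["id"] not in processed_ids]
    processed_ids.update(d["id"] for d in fresh)
    return fresh
```

Because a version never changes once written, the job can query with a generous overlap and skip anything whose ID it has already processed, rather than relying on exact _ts ordering.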