Note: I asked a very similar question previously, but I was not clear enough about exactly what I was looking for, and I accepted an answer too hastily. I am looking for a confirmed yes/no on a specific point.
I want to build an automated job that performs offline processing on DocumentDb documents by querying the DocumentDb on a schedule, looking for documents that have changed since the last time the check was performed.
Given the metadata available in DocumentDb, it looks like the way to do this would be the following:
- The first time the process runs, retrieve all documents.
- Store the largest _ts value from the result set as highWatermark, along with the IDs and eTags of the documents that have that particular value as their _ts value.
- For each subsequent query, include a "WHERE _ts >= highWatermark" clause. Filter out the previously recorded documents whose eTags have not changed. The result should be the set of all changes since the last time the query ran.
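To make the proposal concrete, here is a minimal Python sketch of the steps above. It runs against an in-memory list rather than a real collection, and the names `Checkpoint` and `poll_changes` are made up for illustration; they are not part of any DocumentDb SDK. The query in the docstring is only my assumption of what the real filter would look like.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    # Largest _ts seen so far; -1 means "never ran", so the first run returns everything.
    high_watermark: int = -1
    # id -> _etag for the documents whose _ts equals high_watermark.
    etags_at_watermark: dict = field(default_factory=dict)


def poll_changes(docs, checkpoint):
    """Return (changed_docs, new_checkpoint) for one polling run.

    `docs` is a hypothetical stand-in for the collection. A real job would
    presumably issue something like:
        SELECT * FROM c WHERE c._ts >= @highWatermark
    """
    # Step 1: everything at or above the watermark (>= so ties are re-read).
    candidates = [d for d in docs if d["_ts"] >= checkpoint.high_watermark]

    # Step 2: drop documents already recorded at the watermark with the same eTag.
    changed = [
        d for d in candidates
        if checkpoint.etags_at_watermark.get(d["id"]) != d["_etag"]
    ]

    if not candidates:
        return changed, checkpoint

    # Step 3: advance the watermark and remember every document sitting on it.
    new_high = max(d["_ts"] for d in candidates)
    new_etags = {d["id"]: d["_etag"] for d in candidates if d["_ts"] == new_high}
    return changed, Checkpoint(new_high, new_etags)
```

Each run feeds the checkpoint returned by the previous run back in. Whether this is complete depends entirely on the guarantee I am asking about below.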
My question is: is this guaranteed to work? Is it guaranteed that this will not miss any documents? As far as I can tell, it comes down to the transactional semantics around _ts within DocumentDb's implementation, which are not documented to this level of detail. I want to know whether it is guaranteed that no document can be updated with a _ts value that is lower than the largest _ts value returned by a query that returns the most recently changed document in the collection.
EDIT, prompted by David's comment:
To be a little more precise, with a couple of specific scenarios:
- If updates for two documents, D0 and D1, are applied to the database at T0 and T1 (where T1 > T0, such that an arbitrary query may return D0 but not D1), is it possible that D0._ts > D1._ts? The use of strictly-greater-than is intentional, as my proposed implementation deals with multiple updates receiving the same _ts but only some of them being retrieved by a query.
- Assume I execute my implementation's query at time T0, and the query takes a long time to run and/or requires several ExecuteNextAsync() calls to pull multiple batches from the server. During that period, two different documents (D1 and D2) are updated, getting _ts values of T1 and T2 (where T1 < T2). Is it possible for D2 to appear in the result set? More importantly, if it does, is D1 guaranteed to be included?
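The failure mode I am worried about in the second scenario can be simulated in a few lines of Python. This is purely hypothetical: it does not show that DocumentDb behaves this way, only what my approach would do if D2 became visible to a query before D1 did. The helper `changed_since` and the sample values are invented for the illustration.

```python
def changed_since(docs, high_watermark, seen_etags):
    """Docs with _ts >= high_watermark whose _etag was not already recorded."""
    return [
        d for d in docs
        if d["_ts"] >= high_watermark and seen_etags.get(d["id"]) != d["_etag"]
    ]


# Run 1: only D2 (the later update, _ts = T2 = 105) is visible to the query.
visible = [{"id": "D2", "_ts": 105, "_etag": "e2"}]
first_run = changed_since(visible, -1, {})
high_watermark = max(d["_ts"] for d in first_run)
seen = {d["id"]: d["_etag"] for d in first_run if d["_ts"] == high_watermark}

# Hypothetical: D1 only becomes visible afterwards, with the smaller _ts = T1 = 104.
visible.append({"id": "D1", "_ts": 104, "_etag": "e1"})
second_run = changed_since(visible, high_watermark, seen)

# D1 sits below the watermark, so no later run will ever pick it up.
processed_ids = {d["id"] for d in first_run + second_run}
missed = [d["id"] for d in visible if d["id"] not in processed_ids]
print(missed)  # ['D1']
```

If the answer to "is D1 guaranteed to be included?" is yes, this situation cannot arise and the watermark approach is safe; if not, D1 is silently lost.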