What are we doing?
We have raw data in Cloud Datastore, which we clean and extract before loading it into GBQ (Google BigQuery) for analysis.
Because the dataset is very large, we clean it in batches and persist a cursor string so that each batch can resume exactly where the previous one left off.
Code snippet
# Read the persisted cursor string and recreate the cursor object
from google.appengine.datastore.datastore_query import Cursor

start_cursor = Cursor(urlsafe=tag_generated_till_cursor_string)
entities_list, next_cursor, more = ndbEntity.query().order(ndbEntity.updated_date)\
                                            .fetch_page(500, start_cursor=start_cursor)
if next_cursor:
    # persist next_cursor.urlsafe() for the next batch
    pass
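The overall batching pattern can be sketched with a small self-contained simulation (plain Python standing in for ndb: integer offsets play the role of url-safe cursor strings, and the `saved_cursor` variable is a hypothetical stand-in for wherever the cursor string is persisted):

```python
# Toy model of the batching pattern above: integers stand in for ndb
# cursors, and `saved_cursor` stands in for the persisted cursor string.
DATA = list(range(12))          # pretend these are 12 datastore entities
PAGE_SIZE = 5

def fetch_page(page_size, start_cursor):
    """Mimics ndb's fetch_page: returns (entities, next_cursor, more)."""
    start = start_cursor or 0
    page = DATA[start:start + page_size]
    end = start + len(page)
    more = end < len(DATA)
    return page, (end if more else None), more

saved_cursor = None             # what a real run would persist between batches
processed = []

while True:
    page, next_cursor, more = fetch_page(PAGE_SIZE, saved_cursor)
    processed.extend(page)      # cleaning + inserting into GBQ would happen here
    if next_cursor:
        saved_cursor = next_cursor  # persist, as in the snippet above
    if not more:
        break

print(processed)                # every entity handled exactly once
```

Within a single run this works because the loop stops on the `more` flag; the trouble described below only appears when the job is split across separate runs that share nothing but the persisted cursor string.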
Looks good so far?
Now the Issue?
We are having trouble handling the end of the entity list, that is, the point at which all entities of this kind have been processed.
After reaching the end, next_cursor comes back as None, so there are two things we can do:
- Persist None
- Skip persisting when next_cursor is None
The issue with the first option is that the next batch run would start from the beginning, and we would end up re-processing all the raw data.
The issue with the second is that the last page's entities would be processed multiple times, because the cursor string is never updated after the final batch.
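Both failure modes can be reproduced with the same toy model (plain Python in place of ndb; integer offsets stand in for cursor strings, and all names here are hypothetical): persisting None makes the second run restart from the front, while skipping the update makes the second run re-fetch the final page.

```python
DATA = list(range(12))          # pretend these are 12 datastore entities
PAGE_SIZE = 5

def fetch_page(page_size, start_cursor):
    """Toy stand-in for ndb's fetch_page: returns (entities, next_cursor, more)."""
    start = start_cursor or 0
    page = DATA[start:start + page_size]
    end = start + len(page)
    more = end < len(DATA)
    return page, (end if more else None), more

def run_batches(saved_cursor, persist_none):
    """One full pass over the data; returns (processed, persisted cursor)."""
    processed = []
    while True:
        page, next_cursor, more = fetch_page(PAGE_SIZE, saved_cursor)
        processed.extend(page)
        if next_cursor or persist_none:
            saved_cursor = next_cursor   # option 1 also persists None
        if not more:
            break
    return processed, saved_cursor

# Option 1: persist None -> the next run restarts from the very beginning.
_, cursor1 = run_batches(None, persist_none=True)
rerun1, _ = run_batches(cursor1, persist_none=True)
print(len(rerun1))   # 12: all raw data re-processed

# Option 2: skip persisting -> the cursor still points at the last page.
_, cursor2 = run_batches(None, persist_none=False)
rerun2, _ = run_batches(cursor2, persist_none=False)
print(rerun2)        # [10, 11]: last page processed a second time
```

This is only a sketch of the dilemma as stated, not a fix; the real code would additionally need some end-of-data marker (distinct from both None and a valid cursor) for a later run to tell "finished" apart from "never started".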
Neither option works for us, and there is no efficient way to check in GBQ whether an entity has already been processed before re-inserting it.
Also, there isn't much documentation about cursors that could help us avoid re-processing entities. What can help us overcome this issue? Is there something that can save us from this?