
I am running an experimental application on Google App Engine (Python) which consists of an API and a Worker that executes some long-running tasks.

The API is a standard module with Cloud Endpoints, and the Worker is a Managed VM module.

The user creates a Task in the database, which the API also puts as a task on a pull queue. The Worker pulls that task and executes it indefinitely by extending the lease. The Worker also sets the status of the task in the database (e.g. "running" or "error").

[sequence diagram of the GAE interaction]

The goal is to execute the Task forever until the user stops it again. I implemented this by setting the status to "stop" and calling a handler on the Worker that stops the task (the workerUrl is saved on the task model). The Worker then sets the status to "stopped" and deletes the task it leased.
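The Worker loop I describe could be sketched roughly like this (a minimal sketch; all function names here are placeholders I chose for illustration, not real App Engine APIs, and the collaborators are injected so the control flow is visible):

```python
# Hypothetical sketch of the Worker's main loop: work happens in small
# steps, the pull-queue lease is extended after each step, and the shared
# status is checked so a user-issued "stop" ends the loop cleanly.

def run_leased_task(get_status, do_work_step, extend_lease,
                    delete_task, set_status):
    while True:
        if get_status() == "stop":   # the user asked the API to stop this Task
            set_status("stopped")    # report back to the database
            delete_task()            # remove the leased pull-queue task
            return "stopped"
        do_work_step()
        extend_lease()               # keep ownership so no other worker leases it
```

In a real Worker, `get_status` would be an Endpoints call, `extend_lease` a `modify_task_lease` on the pull queue, and `do_work_step` one bounded unit of the long-running job.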

This works so far, with some minor issues (e.g. the task occasionally gets put on the queue twice). But I wonder if there is a better approach to this challenge? Basically the question is how to synchronize remote worker execution with a database model.

I also have the problem that if I stop the application, the tasks should be put back on the queue depending on their status (they could have been stopped intentionally). On the live server this is not a big deal since the app basically never stops, but on the dev server it is. Maybe this is suited for a cron job?

Update 1:

The information about the task is actually stored in the database. The task on the queue only contains the id of the model in the database; note that in the picture the status gets updated via the API.

The communication between the Worker (Managed VM) and the API happens via Cloud Endpoints, for which I use the Python client on the Worker side. This works quite reliably.

It's true, to avoid duplicated tasks I should use task names. The only problem I encountered is that you cannot reuse the same name for quite a while, so I need to generate unique task names, maybe including a timestamp.
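A timestamped name could look like this sketch (my own helper, not code from the app). Task names may only contain letters, digits, underscores and hyphens, so the timestamp's decimal point has to be replaced:

```python
import re
import time

# Allowed task-name alphabet on App Engine: [A-Za-z0-9_-], up to 500 chars.
TASK_NAME_RE = re.compile(r"^[A-Za-z0-9_-]{1,500}$")

def make_task_name(model_id, now=None):
    """Build a per-attempt unique task name by appending a timestamp.

    Appending the time sidesteps the tombstone period during which a
    recently used name cannot be reused.
    """
    stamp = "%.6f" % (time.time() if now is None else now)
    return "task-%s-%s" % (model_id, stamp.replace(".", "-"))
```

Note that tombstoned names still protect you against *accidental* duplicates; once the name varies with time, that protection only covers retries that reuse the exact same name.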

The reason I chose the task queue over a database-only approach is the failover mechanism that comes with it. As my workers are likely to fail, another worker will pick up the task. Plus it reduces database calls and API calls.

I think I will refactor the "task status" into its own model to have a kind of command model. This will also serve as a log for the Task, so one can see what happened at an execution level. Maybe this works along with the task queue; if not, I will see if I can remove the queue.


1 Answer


It sounds like you just want the Worker and the API frontend to share storage, and your solution is to use the task queue as that storage. Better would be to use the Cloud Datastore's REST API from the Worker. You can still use the pull queue to initiate work (and avoid leasing the same task to multiple workers), but then use a corresponding datastore record to communicate updates between the API frontend and the Worker. For example, the API frontend stores the "stop" request, and the Worker polls for it as it runs.
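The poll-for-stop pattern could look like this (a sketch under my own naming; `fetch_record` stands in for a Cloud Datastore REST read of the shared record, injected here so the flow is self-contained):

```python
import time

# Sketch of the poll-for-stop pattern: the Worker re-reads the shared
# datastore record between units of work and exits when the API frontend
# has written a "stop" status.

def poll_for_stop(fetch_record, work_step, interval=0.0, max_iterations=1000):
    for _ in range(max_iterations):
        record = fetch_record()          # re-read the shared record each cycle
        if record.get("status") == "stop":
            return "stopped"             # the API frontend requested a stop
        work_step()
        time.sleep(interval)             # back off between polls; tune for real use
    return "running"                     # safety bound for this sketch
```

The poll interval trades responsiveness against datastore read cost; a few seconds is usually plenty for a "stop" signal.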

Note that the API frontend on App Engine can commit to both the datastore and the task queue in the same transaction. This allows for more robust setup of shared records and tasks.
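The shape of that transactional setup is roughly this (a sketch with names of my choosing; on App Engine the transaction runner would be an ndb transaction and the enqueue a transactional taskqueue add, but here both are injected so the rollback behavior is visible). One caveat: App Engine does not allow custom task names on transactionally added tasks.

```python
# Sketch: the record put and the task enqueue run as one unit, so either
# both happen or neither does.

def create_task_atomically(put_record, enqueue_task, run_in_transaction):
    def txn():
        record_id = put_record()     # create the shared Task record
        enqueue_task(record_id)      # enqueued only if the transaction commits
        return record_id
    return run_in_transaction(txn)
```

This closes the window where a task exists on the queue without its record (or vice versa), which is exactly the inconsistency the question is fighting.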

As for avoiding duplicate tasks, consider task names. You can assign names to tasks to prevent two tasks from being created with the same name.

I'm not sure how to best set up the GAE API frontend and managed VM worker backend during development if they're sharing a datastore. Maybe they both need to use the Datastore REST API and a dev-specific project, and the API frontend needs an "if a dev server use REST else use the native datastore API" wrapper. (If there's a better way, I'd enjoy knowing it. I'm still new to managed VMs.) To your question, I'll add that the dev server will persist its own local simulated datastore between runs unless told not to, so that may help persist worker status between runs.

I assume your technique of leaving tasks on the task queue and extending leases indefinitely is meant to allow interrupted work to be picked up by a new worker while still preventing multiple workers from doing the same work. With infinite tasks, you might be just as well off building a similar mechanism based on the datastore and using a keep-alive process to restart work as needed. I'm hard pressed to argue this is much better, but it might simplify the dev environment issues.
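One way that keep-alive could work (an assumption on my part, not a concrete design from above): each running worker periodically refreshes a heartbeat field on its record, and a cron-style sweep re-queues any work whose heartbeat has gone stale.

```python
STALE_AFTER = 60.0  # seconds without a heartbeat before work is restarted

def find_stale(records, now, stale_after=STALE_AFTER):
    """Return the ids of 'running' records whose heartbeat is too old.

    The caller (e.g. a cron handler) would then re-enqueue or restart
    each returned task.
    """
    return [r["id"] for r in records
            if r["status"] == "running" and now - r["heartbeat"] > stale_after]
```

The threshold should be comfortably longer than the worker's heartbeat interval, so a briefly slow worker isn't restarted while still alive.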