6 votes

One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow actually writes the results (sharded) in JSON format to GCS, and then kicks off a BigQuery load job to import that data.
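To illustrate the mechanism, the equivalent manual flow with the standalone CLI tools would look roughly like this (the bucket, dataset, and path names are placeholders; this is just a sketch of what I understand the service to be doing, not what it literally runs):

# Step 1: sharded JSON results are staged in GCS (done by Dataflow)
# Step 2: a BigQuery load job imports them from the staged files
bq load --source_format=NEWLINE_DELIMITED_JSON <dataset>.<table> "gs://<bucket>/<temp_path>/*.json" ./schema.json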

However, we've noticed that some JSON files are not deleted after the job finishes, regardless of whether it succeeds or fails. There is no warning or suggestion in the error message that the files will not be deleted. When we noticed this, we had a look at our bucket and it contained hundreds of large JSON files from failed jobs (mostly from development runs).
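To gauge the impact, here is a quick sketch for finding the leftovers (the bucket name is a placeholder, and it assumes the stranded files all end in .json, which matches what we saw):

# List the leftover JSON files with human-readable sizes
gsutil du -h gs://<bucket> | grep '\.json$'
# Total their size in GiB
gsutil du gs://<bucket> | grep '\.json$' | awk '{sum += $1} END {printf "%.2f GiB\n", sum / 1024 / 1024 / 1024}'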

I would have thought that Dataflow should handle the cleanup even if the job fails, and when it succeeds those files should definitely be deleted. Leaving these files around after the job has finished incurs significant storage costs!

Is this a bug?

Example job id of a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089

[screenshots: the "Unable to delete temporary files" errors reported by the job]

Stephen Gildea: It is not by design. Thank you for pointing this issue out to us. We are looking into it.
Graham Polley: Is there any update on this bug, @Stephen Gildea?
Stephen Gildea: This issue has been fixed. Does it clean up for you now?
Graham Polley: I'll test and let you know.
Graham Polley: It still does not delete them, so it's not fixed yet.

3 Answers

5 votes

Because this is still happening, we decided to clean up ourselves after the pipeline has finished executing. We run the following command to delete everything in the bucket that is not a JAR or a ZIP:

# Delete every object in the bucket whose name does not end in .zip or .jar
gsutil ls -p <project_id> gs://<bucket> | grep -Ev '\.(zip|jar)$' | xargs -n 1 gsutil -m rm -r
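If your pipeline stages its files under a dedicated temp path, a narrower variant (the <temp_path> prefix is a placeholder for whatever tempLocation you configured) avoids touching anything else in the bucket:

# Remove only the objects under the pipeline's temp prefix
gsutil -m rm -r gs://<bucket>/<temp_path>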

5 votes

Another possible cause of leftover files is cancelled jobs. Currently, Dataflow does not delete files from cancelled jobs; in other cases, files should be cleaned up.
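If you do cancel a job, plan on cleaning up after it yourself for now. A minimal sketch (the job ID is a placeholder, and the temp path depends on how your pipeline is configured):

# Cancel the job, then delete whatever it staged
gcloud dataflow jobs cancel <job_id>
gsutil -m rm -r gs://<bucket>/<temp_path>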

Also, the error listed in the first post, "Unable to delete temporary files", is the result of a logging issue on our side and should be resolved within a week or two. Until then, feel free to ignore these errors, as they do not indicate leftover files.

2 votes

This was a bug in which the Dataflow service would sometimes fail to delete the temporary JSON files after a BigQuery import job completed. We have fixed the issue internally and rolled out a release with the fix.