One of our Dataflow jobs writes its output to BigQuery. My understanding of how this is implemented under the hood is that Dataflow first writes the results as sharded JSON files to GCS, and then kicks off a BigQuery load job to import that data.
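For reference, the write itself is nothing exotic; this is a minimal sketch of the kind of write we mean, using the Dataflow Java SDK's `BigQueryIO.Write` (the table spec and schema below are just placeholders, not our real ones):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.TableRowJsonCoder;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;

import java.util.Arrays;

public class BigQueryWriteSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder schema: two columns, word (STRING) and count (INTEGER).
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("word").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INTEGER")));

    p.apply(Create.of(new TableRow().set("word", "hello").set("count", 1))
            .withCoder(TableRowJsonCoder.of()))
     // The BigQuery write: as far as we understand, this stages sharded JSON
     // files in GCS and then issues a BigQuery load job to import them.
     .apply(BigQueryIO.Write
         .named("WriteToBigQuery")
         .to("my-project:my_dataset.my_table")  // placeholder table spec
         .withSchema(schema)
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```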
However, we've noticed that some of these JSON files are not deleted after the job finishes, regardless of whether it succeeds or fails. There is no warning or hint in the error messages that the files will be left behind. When we noticed this, we had a look at our bucket and found hundreds of large JSON files from failed jobs (mostly during development).
I would have thought that Dataflow should handle any cleanup, even if the job fails, and that when it succeeds those files should definitely be deleted. Leaving these files around after the job has finished incurs significant storage costs!
Is this a bug?
Example job id of a job that "succeeded" but left hundreds of large files in GCS: 2015-05-27_18_21_21-8377993823053896089