When using GCS in place of HDFS, can a Dataproc job keep running (as it does with native HDFS) when the GCS file it's reading is updated or temporarily deleted?
I can run some tests too, but just wondering if anyone knows offhand.
The behavior is broadly similar to HDFS: much of the time an in-flight read will happen to run to completion correctly. That's because the GoogleCloudStorageReadChannel used by Dataproc's GCS connector usually holds a single unbounded-range stream open for the entire duration of the read, and that stream survives, at least temporarily, after the file's metadata is deleted.
However, there is no guarantee the stream will run to completion once the file has technically been deleted. Even when a single stream would otherwise finish, a transient error can cause the channel to issue explicit low-level retries, and those retries are guaranteed to fail if the file no longer exists.
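If you do want to test this empirically, a rough sketch using the Hadoop FileSystem API against a gs:// path is below. The bucket and object names are placeholders, and it assumes you run it somewhere the GCS connector is installed (e.g. a Dataproc node) so the gs:// scheme resolves; it simply opens a read, deletes the object mid-stream, and sees how far the read gets.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsDeleteDuringRead {
  public static void main(String[] args) throws Exception {
    // Placeholder bucket/object; point this at a reasonably large test file.
    Path path = new Path("gs://my-bucket/big-input.dat");
    Configuration conf = new Configuration();
    FileSystem fs = path.getFileSystem(conf);

    byte[] buf = new byte[8 * 1024 * 1024];
    long bytesAfterDelete = 0;

    try (FSDataInputStream in = fs.open(path)) {
      // The first read forces the underlying GCS read stream to open.
      in.read(buf);

      // Delete the object mid-read (same effect as `gsutil rm` from another machine).
      fs.delete(path, false);

      // Keep draining the stream: with a single long-lived stream this often
      // completes, but any transient error triggers retries that will 404.
      int n;
      while ((n = in.read(buf)) > 0) {
        bytesAfterDelete += n;
      }
    }
    System.out.println("Bytes read after delete: " + bytesAfterDelete);
  }
}
```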
As for updates: if you rewrite a file with strictly more data (for example, simulating an append by rewriting the whole file without changing its existing prefix), the behavior should be correct. The job reads up to the size the file had when the channel was first opened, since Hadoop's split calculations cause workers to read only up to that limit even if the file is replaced with a larger one mid-job.
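A minimal sketch of that split-bounding idea is below; it is not the actual FileInputFormat code, just an illustration of why a rewrite with strictly more data is safe. The path is a placeholder, and the "planning-time" length here stands in for the split length computed when the job starts.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BoundedGcsRead {
  public static void main(String[] args) throws Exception {
    Path path = new Path("gs://my-bucket/input.dat");  // placeholder path
    Configuration conf = new Configuration();
    FileSystem fs = path.getFileSystem(conf);

    // Record the length at "planning" time, the way split calculation does.
    long lengthAtPlanning = fs.getFileStatus(path).getLen();

    byte[] buf = new byte[1024 * 1024];
    long remaining = lengthAtPlanning;

    try (FSDataInputStream in = fs.open(path)) {
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) break;  // unexpected EOF (e.g. the file shrank instead)
        remaining -= n;
      }
    }
    // Even if the object is rewritten with a strictly larger file while this
    // runs, reads stop at lengthAtPlanning, so the result matches the
    // contents that existed when the read began.
    System.out.println("Read " + (lengthAtPlanning - remaining)
        + " of " + lengthAtPlanning + " bytes");
  }
}
```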