When using GCS in place of HDFS, can a Dataproc job keep running (as it does with native HDFS) when the GCS file it's reading is updated or temporarily deleted?
I can run some tests too, but just wondering if anyone knows offhand.
The behavior is broadly similar to HDFS: much of the time an in-flight read will happen to run to completion correctly. That's because the GoogleCloudStorageReadChannel used by Dataproc's GCS connector usually holds a single unbounded-range stream open for the entire duration of the read, and that stream survives, at least temporarily, after the file's metadata is deleted.
However, there is no guarantee the stream will run to completion once the file has technically been deleted. Even when a single stream would otherwise finish, a transient error can cause the channel to issue explicit low-level retries, and those retries are guaranteed to fail if the file no longer exists.
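If you do want to test this empirically, a rough sketch using the Hadoop FileSystem API against a gs:// path is below. The bucket and object names are placeholders, and it assumes you run it somewhere the GCS connector is installed (e.g. a Dataproc node) so the gs:// scheme resolves; it simply opens a read, deletes the object mid-stream, and sees how far the read gets.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsDeleteDuringRead {
  public static void main(String[] args) throws Exception {
    // Placeholder bucket/object; point this at a reasonably large test file.
    Path path = new Path("gs://my-bucket/big-input.dat");
    Configuration conf = new Configuration();
    FileSystem fs = path.getFileSystem(conf);

    byte[] buf = new byte[8 * 1024 * 1024];
    long bytesAfterDelete = 0;

    try (FSDataInputStream in = fs.open(path)) {
      // The first read forces the underlying GCS read stream to open.
      in.read(buf);

      // Delete the object mid-read (same effect as `gsutil rm` from another machine).
      fs.delete(path, false);

      // Keep draining the stream: with a single long-lived stream this often
      // completes, but any transient error triggers retries that will 404.
      int n;
      while ((n = in.read(buf)) > 0) {
        bytesAfterDelete += n;
      }
    }
    System.out.println("Bytes read after delete: " + bytesAfterDelete);
  }
}
```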
As for updates: if you rewrite a file with strictly more data (for example, simulating an append by rewriting the whole file without changing its existing prefix), the behavior should be correct. The job reads up to the size the file had when the channel was first opened, since Hadoop's split calculations cause workers to read only up to that limit even if the file is replaced with a larger one mid-job.
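A minimal sketch of that split-bounding idea is below; it is not the actual FileInputFormat code, just an illustration of why a rewrite with strictly more data is safe. The path is a placeholder, and the "planning-time" length here stands in for the split length computed when the job starts.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BoundedGcsRead {
  public static void main(String[] args) throws Exception {
    Path path = new Path("gs://my-bucket/input.dat");  // placeholder path
    Configuration conf = new Configuration();
    FileSystem fs = path.getFileSystem(conf);

    // Record the length at "planning" time, the way split calculation does.
    long lengthAtPlanning = fs.getFileStatus(path).getLen();

    byte[] buf = new byte[1024 * 1024];
    long remaining = lengthAtPlanning;

    try (FSDataInputStream in = fs.open(path)) {
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) break;  // unexpected EOF (e.g. the file shrank instead)
        remaining -= n;
      }
    }
    // Even if the object is rewritten with a strictly larger file while this
    // runs, reads stop at lengthAtPlanning, so the result matches the
    // contents that existed when the read began.
    System.out.println("Read " + (lengthAtPlanning - remaining)
        + " of " + lengthAtPlanning + " bytes");
  }
}
```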