Our Dataflow jobs read from a GCS multi-regional bucket that contains files of interest. These files are also moved to an archive bucket, so GCS list operations sometimes return files that have already been moved (as you would expect, since listing is eventually consistent).
Unfortunately, our jobs blow up when a `FileBasedSource` tries to read one of these "ghost" files. It seems both Google's Dataflow SDK and Apache Beam have made the methods that open GCS files final (`createReader` and `startImpl` in `FileBasedSource`), so we can't override them.
Other than not moving files, are there any recommendations for working around this? This Stack Overflow question indicates others have hit similar issues, but the response there seems to have been that blowing up is the expected behavior.
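For context, the behavior we'd like is roughly the following: tolerate a file that disappears between the list call and the read, instead of failing the whole job. This is only a sketch of the desired semantics using local files via `java.nio` as a stand-in for the GCS channel API (the class and method names here are hypothetical, not part of either SDK):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class GhostFileFilter {
    // Read each listed file, silently skipping entries that vanished
    // after listing (the "ghost" files), rather than propagating an error.
    static List<String> readableContents(List<Path> listed) {
        List<String> contents = new ArrayList<>();
        for (Path p : listed) {
            try {
                contents.add(new String(Files.readAllBytes(p)));
            } catch (NoSuchFileException e) {
                // File was moved/deleted between list and read: skip it.
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
        return contents;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ghost-demo");
        Path live = Files.write(dir.resolve("live.txt"), "data".getBytes());
        Path ghost = dir.resolve("ghost.txt"); // listed but never created
        List<String> out = readableContents(Arrays.asList(live, ghost));
        System.out.println(out); // prints "[data]"
    }
}
```

If `createReader`/`startImpl` weren't final, the equivalent catch-and-skip could live inside a `FileBasedSource` subclass; as it stands, it apparently has to happen outside the source entirely.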