1 vote

Our Dataflow jobs read from a GCS multi-regional bucket that contains files of interest. These files also get moved to an archive bucket, so sometimes GCS list operations return files that have already been moved (which is expected, since listing is an eventually consistent operation).

Unfortunately, our jobs blow up when FileBasedSource tries to read these "ghost" files. It seems both Google's Dataflow SDK and Apache Beam have made the methods that open GCS files final (createReader and startImpl in FileBasedSource), so we can't override them.

Other than not moving the files, are there any recommendations for working around this? This Stack Overflow question indicates others have hit similar issues, but the response there amounted to "blowing up is expected".

1
I edited my answer because GCS object listing has since been made strongly consistent. – jkff

1 Answer

3 votes

Right now, Google Cloud Storage object listing operations are strongly consistent, so the original issue no longer applies.

It still applies when using an eventually consistent filesystem such as S3; see the Beam JIRA issue that tracks this.
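This was not part of the original answer, but if you are stuck on such a filesystem, one possible mitigation is to separate matching from reading and re-check each listed file just before it is read, so entries that have already been archived drop out instead of failing the whole job. Below is a minimal sketch using the Beam Java SDK's FileIO, FileSystems, and TextIO APIs; the bucket path and transform names are hypothetical, and a file can still disappear between the re-check and the actual read, so this only narrows the race window rather than eliminating it.

```java
import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class SkipGhostFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> lines =
        p.apply("MatchInput",
                // Hypothetical input location.
                FileIO.match().filepattern("gs://my-input-bucket/incoming/*"))
         // Re-check each listed file just before reading. Files that were already
         // moved to the archive bucket are filtered out here instead of causing
         // the read to fail. A file can still be moved between this check and the
         // read, so treat this as best-effort.
         .apply("DropGhostFiles", ParDo.of(
             new DoFn<MatchResult.Metadata, MatchResult.Metadata>() {
               @ProcessElement
               public void process(@Element MatchResult.Metadata file,
                                   OutputReceiver<MatchResult.Metadata> out)
                   throws IOException {
                 MatchResult recheck = FileSystems.match(file.resourceId().toString());
                 if (recheck.status() == MatchResult.Status.OK
                     && !recheck.metadata().isEmpty()) {
                   out.output(file);
                 }
               }
             }))
         .apply("ToReadableFiles", FileIO.readMatches())
         .apply("ReadLines", TextIO.readFiles());

    p.run().waitUntilFinish();
  }
}
```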