
My Dataflow job always creates a new default Google Cloud Storage bucket, even though stagingLocation, tempLocation, and gcpTempLocation are all defined. After looking through the Apache Beam (2.6.0) library code, these lines in GcpOptions.java appear to be the culprit:

String tempLocation = options.getTempLocation();
if (isNullOrEmpty(tempLocation)) {
  tempLocation =
      tryCreateDefaultBucket(
          options,
          newCloudResourceManagerClient(options.as(CloudResourceManagerOptions.class))
              .build());
  options.setTempLocation(tempLocation);
}

However, this code should only execute when tempLocation is not defined in the PipelineOptions, and I have set tempLocation both via the command-line argument and via the setTempLocation method. I have defined the Options interface by extending GcpOptions and adding a few additional option getters and setters.
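
For reference, the Options interface looks roughly like this (exampleOption is a hypothetical placeholder for my actual custom options):

import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
import org.apache.beam.sdk.options.Description;

public interface Options extends GcpOptions {
    // Hypothetical custom option; the real interface defines a few of these.
    @Description("An example custom pipeline option")
    String getExampleOption();
    void setExampleOption(String value);
}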

If I don't define the tempLocation getter and setter in the Options interface, it appears that tempLocation automatically defaults to the newly created default bucket's temp folder (gs://dataflow-staging-us-central1-job_number/temp), even though I have passed the tempLocation command-line argument.

Here is the main method of the Dataflow job:

import java.io.FileInputStream;
import java.io.IOException;

import com.google.auth.oauth2.GoogleCredentials;
import com.google.common.collect.Lists;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).as(Options.class);

    // Load a service account key from the bundled credentials file.
    GoogleCredentials credentials = null;
    try {
        credentials = GoogleCredentials.fromStream(new FileInputStream("./src/main/resources/credentials.json"))
                .createScoped(Lists.newArrayList("https://www.googleapis.com/auth/cloud-platform"));
    } catch (IOException e) {
        e.printStackTrace();
    }
    options.setGcpCredential(credentials);

    // Point both temp locations at an existing bucket.
    options.setTempLocation("gs://example_bucket/temp");
    options.setGcpTempLocation("gs://example_bucket/temp");

    run(options);
}

Can anyone explain why the default bucket is always created and how I can avoid this?

Edit:

It appears that if I deploy the Dataflow job directly from the command line with the same arguments, instead of generating the Dataflow template and then deploying the job through the console interface, tempLocation is set correctly and no additional bucket is created. This resolves the issue I was having, but I am not entirely sure why it works.

Comments:
Did you also try supplying the DataflowPipelineOptions interface for the pipeline options? – Nick_Kh
@mk_sta I just tried extending DataflowPipelineOptions in the Options interface, but the same issue occurs; I originally extended GcpOptions instead. If I don't define the tempLocation getter and setter in the Options interface, tempLocation appears to automatically default to the newly created default bucket's temp folder (gs://dataflow-staging-us-central1-job_number/temp). – eagerbeaver
Let me do some basic troubleshooting. Can you confirm that the bucket you're setting as the temp location has already been created? Can you confirm that you're supplying the options to the Pipeline like so: Pipeline p = Pipeline.create(options);? – Daniel Oliveira
@Daniel Oliveira I can confirm that the bucket I am setting as the tempLocation already exists on Google Cloud Storage, and I am supplying the options to the Pipeline with the code you included, inside the run method. Looking at the generated template file, I can also see that both options are correctly set in the template's PipelineOptions. After deploying the Dataflow job through the Google Cloud console, however, the default gcpTempLocation is still created. I have added an edit to my original post, as my issue is now resolved, but the solution confuses me. – eagerbeaver

1 Answer


I guess this behavior of these particular PipelineOptions is well explained in the GCP documentation:

gcpTempLocation - A Cloud Storage path for Dataflow to stage any temporary files. You must create this bucket ahead of time, before running your pipeline. In case you don't specify gcpTempLocation, you can specify the pipeline option tempLocation and then gcpTempLocation will be set to the value of tempLocation. If neither are specified, a default gcpTempLocation will be created.

So it seems this fallback is hard-coded into the GCP Dataflow managed service.
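
For illustration, here is a minimal sketch of that resolution order (the bucket path is an example, not a real one):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TempLocationExample {
    public static void main(String[] args) {
        DataflowPipelineOptions opts =
                PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

        // 1. If gcpTempLocation is set explicitly, it is used as-is.
        opts.setGcpTempLocation("gs://example_bucket/temp");

        // 2. If only tempLocation is set, gcpTempLocation falls back to its value.
        opts.setTempLocation("gs://example_bucket/temp");

        // 3. If neither is set, a default bucket is created for you
        //    (e.g. gs://dataflow-staging-us-central1-<project_number>).
    }
}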