3 votes

We've got multiple Google Cloud Dataflow jobs (written in Java / Kotlin), and they can be run in two different ways:

  1. Initiated from a user's Google Cloud account
  2. Initiated from a service account (with the required policies and permissions)

When running the Dataflow job from a user's account, Dataflow assigns the default controller service account to the workers; it does not propagate the authorized user to the workers.

When running the Dataflow job from the service account, I would imagine that the service account set using setGcpCredential is propagated to the worker VMs that Dataflow uses in the background. The JavaDocs don't mention any of this, but they do mention that the credentials are used to authenticate to GCP services.
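For context, we set the credential roughly like this; a minimal sketch, where the key file path and the class name are just placeholders:

import com.google.auth.oauth2.GoogleCredentials;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Collections;

public class CredentialSetup {  // hypothetical class name
    static DataflowPipelineOptions buildOptions() throws IOException {
        // Load an explicit service account key; the path is a placeholder.
        GoogleCredentials credentials = GoogleCredentials
                .fromStream(new FileInputStream("/path/to/service-account-key.json"))
                .createScoped(Collections.singletonList(
                        "https://www.googleapis.com/auth/cloud-platform"));

        DataflowPipelineOptions options =
                PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        // This credential is what the launcher uses to call GCP APIs when
        // submitting the job; whether it reaches the workers is the question.
        options.setGcpCredential(credentials);
        return options;
    }
}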

In most of our use cases, we run the Dataflow job in project A while reading from BigQuery in project B. Hence, we grant reader access on the BigQuery dataset in project B both to the user and to the service account used in the second approach described above. That same service account also has the BigQuery roles jobUser and dataViewer in project A.
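To illustrate the cross-project setup, the read side looks roughly like this (a sketch; the class, dataset, and table names are placeholders):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class CrossProjectRead {  // hypothetical class name
    public static void main(String[] args) {
        // Project A, the runner, temp locations, etc. come in via the options.
        Pipeline pipeline = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // The job runs in project A, while the table lives in project B.
        PCollection<TableRow> rows = pipeline.apply("ReadFromProjectB",
                BigQueryIO.readTableRows()
                        .from("project-b:dataset_b.table_b"));  // placeholders

        pipeline.run();
    }
}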

Now, the issue is that in both cases we apparently need to grant the default controller service account access to the BigQuery dataset used in the Dataflow job. If we don't, we get a permission denied (403) from BigQuery when the job tries to access the dataset in project B. In the second case I'd expect Dataflow to be independent of the default controller service account. My hunch is that Dataflow does not propagate the service account set in the PipelineOptions to the workers.

In general, we provide project, region, zone, temporary locations (gcpTempLocation, tempLocation, stagingLocation), the runner type (in this case DataflowRunner), and the gcpCredential as PipelineOptions.
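Concretely, a sketch of how those options are set; the project, region, zone, and bucket names below are all placeholders:

import com.google.auth.Credentials;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OptionsSetup {  // hypothetical class name
    static DataflowPipelineOptions build(Credentials credentials) {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setProject("project-a");                     // placeholder
        options.setRegion("europe-west1");                   // placeholder
        options.setZone("europe-west1-b");                   // placeholder
        options.setGcpTempLocation("gs://some-bucket/gcp-temp");
        options.setTempLocation("gs://some-bucket/temp");
        options.setStagingLocation("gs://some-bucket/staging");
        options.setRunner(DataflowRunner.class);
        options.setGcpCredential(credentials);  // as in the earlier sketch
        return options;
    }
}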

So, does Google Cloud Dataflow actually propagate the provided service account to the workers?

Update

We first tried setting options.setServiceAccount, as indicated by Magda, without adding any IAM permissions. This resulted in the following error in the Dataflow logs:

{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : " Current user cannot act as service account [email protected]. Causes: Current user cannot act as service account [email protected]..",
    "reason" : "forbidden"
  } ],
  "message" : " Current user cannot act as service account [email protected].. Causes: Current user cannot act as service account [email protected].",
  "status" : "PERMISSION_DENIED"
}

After that, we tried adding roles/iam.serviceAccountUser to this service account. Unfortunately, that resulted in the same error. The service account already had the IAM roles Dataflow Worker and BigQuery Job User. The default Compute Engine controller service account [email protected] only has the Editor role; we did not add any other IAM roles or permissions.

how did you solve this? – Raj Saxena

1 Answer

3 votes

I think you need to set the controller service account too. You can use options.setServiceAccount("hereYourControllerServiceAccount@yourProject.iam.gserviceaccount.com") in the Dataflow pipeline options.
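For example, a minimal sketch; the class, service account, and project names are placeholders:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ControllerAccountSetup {  // hypothetical class name
    static DataflowPipelineOptions withController() {
        DataflowPipelineOptions options =
                PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        // Workers then run as this account instead of the default
        // <project-number>-compute@developer.gserviceaccount.com.
        options.setServiceAccount(
                "controller-sa@your-project.iam.gserviceaccount.com");
        return options;
    }
}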

You will need to add some additional permissions:

  • For the controller service account: Dataflow Worker and Storage Object Admin.

  • For the executor (the account that launches the job): Service Account User.

That's what I found in Google's documentation and tried out myself.

I think this might give you some insight:

For the BigQuery source and sink to operate properly, the following two accounts must have access to any BigQuery datasets that your Cloud Dataflow job reads from or writes to:

  • The GCP account you use to execute the Cloud Dataflow job

  • The controller service account running the Cloud Dataflow job

For example, if the project number of the project where you execute the Cloud Dataflow job is 123456789, then both your GCP account and 123456789-compute@developer.gserviceaccount.com must be granted access to the BigQuery datasets used.

More on: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#controller_service_account