We've got multiple Google Cloud Dataflow jobs (written in Java / Kotlin), and they can be run in two different ways:
- Initiated from a user's Google Cloud account
- Initiated from a service account (with the required policies and permissions)
When running the Dataflow job from a user's account, Dataflow provides the default controller service account to the workers; it does not propagate the authorized user's credentials to the workers.
When running the Dataflow job from the service account, I would expect the service account set via setGcpCredential to be propagated to the worker VMs that Dataflow uses in the background. The JavaDocs don't mention any of this, but they do state that the credentials are used to authenticate against GCP services.
In most of our use cases for Dataflow, we run the Dataflow job in project A while we read from BigQuery in project B. Hence, we grant reader access to the BigQuery dataset in project B both to the user and to the service account used in the second scenario described above. That same service account also has the BigQuery roles jobUser and dataViewer in project A.
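To illustrate the cross-project read, the pipelines contain a step roughly like the sketch below. The project, dataset, and table names are placeholders, not our actual identifiers:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.values.PCollection;

class ReadSketch {
  static PCollection<TableRow> readFromProjectB(PipelineOptions options) {
    // The Dataflow job itself runs in project A (see the options further below),
    // but the table being read lives in project B.
    Pipeline pipeline = Pipeline.create(options);
    return pipeline.apply(
        "ReadFromProjectB",
        BigQueryIO.readTableRows()
            .from("project-b:some_dataset.some_table"));
  }
}
```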
Now the issue is that, in both cases, we seem to need to grant the default controller service account access to the BigQuery dataset used in the Dataflow job. If we don't, we get a permission denied (403) from BigQuery when the job tries to access the dataset in project B. For the second scenario described above, I'd expect Dataflow to be independent of the default controller service account. My hunch is that Dataflow does not propagate the service account set in the PipelineOptions to the workers.
In general, we provide project, region, zone, temporary locations (gcpTempLocation, tempLocation, stagingLocation), the runner type (in this case DataflowRunner), and the gcpCredential as PipelineOptions.
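For completeness, the wiring looks roughly like the sketch below. The project ID, region, zone, bucket paths, and key file path are placeholders, and the exact setup may differ slightly per job:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Collections;

import com.google.auth.oauth2.GoogleCredentials;
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class OptionsSketch {
  static DataflowPipelineOptions buildOptions() throws IOException {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);

    options.setProject("project-a");        // project the Dataflow job runs in
    options.setRegion("europe-west1");      // placeholder region
    options.setZone("europe-west1-b");      // placeholder zone
    options.setRunner(DataflowRunner.class);

    options.setGcpTempLocation("gs://some-bucket/gcp-temp");
    options.setTempLocation("gs://some-bucket/temp");
    options.setStagingLocation("gs://some-bucket/staging");

    // Credentials of the service account used in the second scenario above.
    options.setGcpCredential(
        GoogleCredentials.fromStream(new FileInputStream("/path/to/sa-key.json"))
            .createScoped(
                Collections.singletonList("https://www.googleapis.com/auth/cloud-platform")));
    return options;
  }
}
```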
So, does Google Cloud Dataflow really propagate the provided service account to the workers?
Update
We first tried adding options.setServiceAccount, as indicated by Magda, without adding any IAM permissions (see the sketch after the error output below). This resulted in the following error in the Dataflow logs:
```json
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : " Current user cannot act as service account [email protected]. Causes: Current user cannot act as service account [email protected]..",
    "reason" : "forbidden"
  } ],
  "message" : " Current user cannot act as service account [email protected].. Causes: Current user cannot act as service account [email protected].",
  "status" : "PERMISSION_DENIED"
}
```
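For reference, the change that produced this error was essentially just the extra line below, on top of the options we already set (the controller service account email is a placeholder):

```java
// Added to the DataflowPipelineOptions configured earlier; the email is a placeholder.
// setServiceAccount sets the controller service account that the Dataflow worker VMs run as.
options.setServiceAccount("dataflow-controller@project-a.iam.gserviceaccount.com");
```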
After that, we tried adding roles/iam.serviceAccountUser to this service account. Unfortunately, that resulted in the same error. This service account already had the IAM roles Dataflow Worker and BigQuery Job User.
The default Compute Engine controller service account [email protected] only has the Editor role, and we did not add any other IAM roles or permissions.