1 vote

Can we use multiple service accounts within one Dataproc cluster?

Let's say I have 3 buckets: Service account A has r/w access to bucket A, with r access to buckets B and C. Service account B has r/w access to bucket B, with r access to buckets A and C. Service account C has r/w access to bucket C, with r access to buckets A and B.

Can I have a cluster spun up with service account D, but use each of the above defined service accounts (A, B and C) within the jobs to get appropriate access to the buckets?


2 Answers

3 votes

The GCS Connector for Hadoop can be configured to use a different service account than the one provided by the GCE metadata server. Using this mechanism, it is possible to access different buckets with different service accounts.

To use a JSON keyfile instead of the metadata server, set the configuration key "google.cloud.auth.service.account.json.keyfile" to the location of a JSON keyfile that is local to each node in the cluster. How to set that key depends on the context in which the filesystem is accessed. For a standard MR job that only accesses a single bucket, you can set that key/value pair on the JobConf. If you're accessing GCS via the Hadoop FileSystem interface, you can specify that key/value pair in the Configuration object used when acquiring the appropriate FileSystem instance.
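As a concrete sketch, here is one way you might assemble those key/value pairs into a `--properties` flag for `gcloud dataproc jobs submit spark`. The keyfile property name comes from the answer above; the `spark.hadoop.` prefix (which forwards a setting into the Hadoop configuration), the companion `...service.account.enable` key, the keyfile path, and the helper name are assumptions for illustration:

```python
def gcs_keyfile_properties(keyfile_path):
    """Build job properties that point the GCS connector at a JSON keyfile
    instead of the VM's default (metadata-server) service account.
    The keyfile must already exist at this path on every cluster node."""
    return {
        # Assumed companion switch: tell the connector to use keyfile auth.
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        # Property named in the answer, prefixed for Spark-to-Hadoop forwarding.
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }

# Hypothetical keyfile for service account A, staged on each node:
props = gcs_keyfile_properties("/etc/keys/service-account-a.json")

# Render as a single --properties argument for gcloud:
flag = ",".join(f"{k}={v}" for k, v in props.items())
print(flag)
```

A job submitted with this flag would then read and write GCS as service account A rather than the cluster's service account D.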

That said, Dataproc does not attempt to segregate individual applications from each other. So if your intent is a multi-tenant cluster, there are not sufficient security boundaries around individual jobs to guarantee that a job will not maliciously attempt to grab credentials from another job.

If your intent is not a multi-tenant cluster, consider creating a task-specific service account that is allowed read or write access to all the buckets it will need to interact with. For example, if you have a job 'meta-analysis' that reads and writes to multiple buckets, you can create a service account meta-analysis that has the permissions required for that job.

0 votes

With this relatively new feature (in GA for about 6 months at the time of writing), you can try Dataproc cooperative multi-tenancy to map the user accounts submitting jobs to the cluster onto service accounts. Here is an excellent article by Google's engineers: https://cloud.google.com/blog/topics/developers-practitioners/dataproc-cooperative-multi-tenancy