1
votes

I am trying to create my first Google Cloud Dataproc cluster using the following command:

gcloud dataproc clusters create hive-cluster \
    --scopes sql-admin \
    --image-version 1.3 \
    --initialization-actions "gs://goog-dataproc-${PROJECT}:${REGION}:hive-metastore" \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 15 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 15 \
    --region us-east1 \
    --zone us-east1-b

However, I get the following error:

    Dataproc could not validate the initialization action using the service-owned service accounts. Cluster creation may still succeed if the initialization action is accessible from GCE VMs.
    Reason: service-1456309104734317@dataproc-accounts.iam.gserviceaccount.com does not have storage.objects.get access to goog-dataproc-initialization-actions-us-east1/cloud-sql-proxy/cloud-sql-proxy.sh.
    Waiting for cluster creation operation...done.
    ERROR: (gcloud.dataproc.clusters.create) Operation [projects/traits-seater-824109/regions/us-east1/operations/5b36fb82-ade2-3d5f-a6bd-cb1a206bb54e] failed: Multiple Errors:
     - Error downloading script 'gs://goog-dataproc-initialization-actions-us-east1/cloud-sql-proxy/cloud-sql-proxy.sh': [email protected] does not have storage.objects.get access to goog-dataproc-initialization-actions-us-east1/cloud-sql-proxy/cloud-sql-proxy.sh.

I checked the permissions in IAM and granted the Storage Object Viewer role to the service accounts mentioned in the error message above, but I still get the same error. Any suggestions on how to get past this error?
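For reference, the grant I applied through the console should be equivalent to something like the following (the project ID and service account are copied from the error message above; roles/storage.objectViewer is the role shown as "Storage Object Viewer" in the console):

    # grant the Dataproc service account read access to GCS objects at the project level
    gcloud projects add-iam-policy-binding traits-seater-824109 \
        --member="serviceAccount:service-1456309104734317@dataproc-accounts.iam.gserviceaccount.com" \
        --role="roles/storage.objectViewer"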

2
There might be a typo in your Stack Overflow question where you deleted part of your --initialization-actions flag and mixed it with your --metadata flag, though I think that's only a typo in the question posted here rather than in the command you actually ran, since the error message references the correct init action path. – Dennis Huo
We have resolved the permissions issues with the regional buckets; please retry your command. – Igor Dvorzhak
Thanks @IgorDvorzhak! That did it. The file now has public access and the command runs fine. – Carol

2 Answers

2
votes

There appears to be a temporary issue with the permissions settings on Dataproc's regionally hosted copies of the initialization actions. Long term, these regional copies are indeed what you should be using: they better isolate the regional reliability of the init actions and avoid cross-region copying. In the meantime, you can use the shared "global" copy of the init action instead:

gcloud dataproc clusters create hive-cluster \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
...
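As a quick sanity check before re-running the create command, you can confirm that the global copy is readable with your own credentials; gsutil stat prints the object's metadata if (and only if) you can read it:

    # verify read access to the global copy of the init action
    gsutil stat gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh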
1
votes

The problem may come from the scopes you provided when creating the cluster. You restricted your cluster to accessing only the sql-admin API (https://www.googleapis.com/auth/sqlservice.admin).

You may need to add the storage-ro scope (or https://www.googleapis.com/auth/devstorage.read_only):

gcloud dataproc clusters create hive-cluster \
    --scopes sql-admin,storage-ro \
    [...]

Without the storage-ro scope, even if the bucket goog-dataproc-initialization-actions-us-east1 is public, I think that the Dataproc cluster will not be able to retrieve the file from GCS.
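
If you want to verify which scopes the cluster VMs actually ended up with, you can inspect the master instance after creation. A sketch, assuming Dataproc's default naming of the master VM as <cluster-name>-m and the zone from your question:

    # print the OAuth scopes attached to the cluster's master VM
    gcloud compute instances describe hive-cluster-m \
        --zone us-east1-b \
        --format="value(serviceAccounts[0].scopes)"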