
I have a BigQuery table and a Dataproc cluster (with Datalab), and I am following this guide: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

from datetime import datetime

bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
project = spark._jsc.hadoopConfiguration().get("fs.gs.project.id")

# Set an input directory for reading data from BigQuery.
todays_date = datetime.strftime(datetime.today(), "%Y-%m-%d-%H-%M-%S")
input_directory = "gs://{}/tmp/bigquery-{}".format(bucket, todays_date)

# Set the configuration for importing data from BigQuery.
# Specifically, make sure to set the project ID and bucket for Cloud Dataproc,
# and the project ID, dataset, and table names for BigQuery.

conf = {
    # Input Parameters
    "mapred.bq.project.id": project,
    "mapred.bq.gcs.bucket": bucket,
    "mapred.bq.temp.gcs.path": input_directory,
    "mapred.bq.input.project.id": project,
    "mapred.bq.input.dataset.id": "my-test-dataset",
    "mapred.bq.input.table.id": "test-table"
}

# Read the data from BigQuery into Spark as an RDD.
table_data = spark.sparkContext.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

The script works fine when I connect to public datasets. However, when I try to connect to my private dataset, I receive the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:210)
    at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:75)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:82)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:102)
    at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:90)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getBigQueryHelper(AbstractBigQueryInputFormat.java:357)
    at com.google.cloud.hadoop.io.bigquery.AbstractBigQueryInputFormat.getSplits(AbstractBigQueryInputFormat.java:108)
    at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1333)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
    at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:203)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:587)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: metadata
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
    at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
    at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
    at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:159)
    at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
    at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:208)
    ... 35 more

Some additional info:

  1. I am using Python (PySpark) via Datalab (which was set up with the initialization action at https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/datalab).
  2. The BigQuery data is in the US, while the Dataproc cluster is in the EU.
  3. The Dataproc image is the latest version (1.2).
  4. The Dataproc cluster was configured to have Google-wide API access.

1 Answer


As per the error message you are receiving (Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token [...] Caused by: java.net.UnknownHostException: metadata), it looks like the Service Account is not able to retrieve an access token correctly, since the metadata server cannot even be reached.

In order to simplify your scenario, I would propose narrowing down the products you are using, because the failure can happen at different steps. To do so, run your PySpark code directly on the Dataproc cluster you already have running, as explained in the documentation:

  1. Go to the Dataproc > Clusters menu in the GCP Console.
  2. Open the cluster you are using and go to its "VM Instances" tab.
  3. SSH into the Master node by clicking on the "SSH" button next to its name.
  4. Create a script words.py containing the PySpark code you want to run (a minimal sketch is shown after this list).
  5. Run the script with the command spark-submit words.py.
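
For reference, this is a minimal sketch of what words.py could look like, reusing the connector configuration from your question; the dataset and table names are the placeholders from your snippet, and printing a couple of records at the end is just there to confirm that the read works:

# words.py - minimal sketch reusing the configuration from the question.
from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-connector-test").getOrCreate()

# The Dataproc-provided staging bucket and project ID.
bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
project = spark._jsc.hadoopConfiguration().get("fs.gs.project.id")

# Temporary GCS path used by the BigQuery connector for the export.
todays_date = datetime.strftime(datetime.today(), "%Y-%m-%d-%H-%M-%S")
input_directory = "gs://{}/tmp/bigquery-{}".format(bucket, todays_date)

conf = {
    "mapred.bq.project.id": project,
    "mapred.bq.gcs.bucket": bucket,
    "mapred.bq.temp.gcs.path": input_directory,
    "mapred.bq.input.project.id": project,
    "mapred.bq.input.dataset.id": "my-test-dataset",  # placeholder from the question
    "mapred.bq.input.table.id": "test-table"          # placeholder from the question
}

# Read the BigQuery table into Spark as an RDD.
table_data = spark.sparkContext.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

# Print a couple of records to confirm the read actually worked.
print(table_data.take(2))

The sketch assumes it is run on the master node itself, so the GCS connector properties (fs.gs.system.bucket, fs.gs.project.id) are already populated by Dataproc.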

Once you do so, check whether you get the same error message. If you do, the issue should be on the Dataproc / BigQuery side; if you do not, it is most probably located in Datalab instead. My guess is that you will get the same error message, as it looks like a credentials issue.

Once you have (possibly) identified where the issue lies, see which Service Account you are using by running the following command in the terminal when you have SSHed into the master node in your cluster:

gcloud auth list

Also make sure that the environment variable GOOGLE_APPLICATION_CREDENTIALS is empty by running the command below. If it is empty, the VM instance where the node is running will use the default GCE service account (which should be the one you obtained when running gcloud auth list, as Dataproc runs on top of GCE instances). If it is not empty, the credentials file that this variable points to will be used instead. Whether to use the default credentials or a custom key file is an implementation choice.

echo $GOOGLE_APPLICATION_CREDENTIALS
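
If the variable turns out to be non-empty and you actually want the default GCE service account to be used, you can clear it for the current session, or point it at a valid key file instead; the path below is only a placeholder:

# Clear the variable so the default GCE service account is picked up (current shell only).
unset GOOGLE_APPLICATION_CREDENTIALS

# Or point it explicitly at a service account key file you have downloaded
# (replace the placeholder path with the real location of your JSON key).
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-key.json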

Once you know which Service Account is being used, move to the IAM tab in the Console, and check whether this Service Account has the right roles and permissions to access BigQuery.
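
One quick way to list the roles granted to a specific Service Account from the command line (the project ID and service account e-mail below are placeholders you would replace with your own) is something like:

gcloud projects get-iam-policy YOUR_PROJECT_ID \
    --flatten="bindings[].members" \
    --format="table(bindings.role)" \
    --filter="bindings.members:serviceAccount:YOUR_SA_EMAIL"

For reading BigQuery data through the connector, the account needs a role that includes BigQuery read permissions (for example, the BigQuery Data Viewer and BigQuery Job User roles, or an equivalent).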

My guess is that the issue is related to the Service Account in use, and that GOOGLE_APPLICATION_CREDENTIALS is probably pointing to the wrong location, so you should start by making sure that your authentication configuration is correct. To do so, I would run the code directly from inside the master node, in order to simplify the use case and reduce the number of components involved.