
Just over a week ago, I was able to read a BigQuery table into an RDD for a Spark job running on a Dataproc cluster, using the guide at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example as a template. Since then, I have been running into missing-class errors, despite no changes on my side.

I have attempted to track down the missing class, com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList, but I cannot find any information on whether this class is now excluded from gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar.
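
As a rough check of the jar contents (this is not part of the guide; it assumes the jar has first been copied down locally, e.g. with gsutil cp), the jar's entries can be listed with Python's zipfile module to see whether the class file is actually shipped:

import zipfile

# Hypothetical local copy of the connector jar, e.g. fetched with:
#   gsutil cp gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar .
JAR_PATH = 'bigquery-connector-hadoop2-latest.jar'
CLASS_ENTRY = ('com/google/cloud/hadoop/repackaged/bigquery/'
               'com/google/common/collect/ImmutableList.class')

with zipfile.ZipFile(JAR_PATH) as jar:
    print(CLASS_ENTRY in jar.namelist())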

The job submission request is as follows:

gcloud dataproc jobs submit pyspark \
--cluster $CLUSTER_NAME \
--jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
--bucket gs://$BUCKET_NAME \
--region europe-west2 \
--py-files $PYSPARK_PATH/main.py

The PySpark code breaks at the following point:

bq_table_rdd = spark_context.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

where conf is a Python dict structured as follows:

conf = {
    'mapred.bq.project.id': project_id,
    'mapred.bq.gcs.bucket': gcs_staging_bucket,
    'mapred.bq.temp.gcs.path': input_staging_path,
    'mapred.bq.input.project.id': bq_input_project_id,
    'mapred.bq.input.dataset.id': bq_input_dataset_id,
    'mapred.bq.input.table.id': bq_input_table_id,
}
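
For context, downstream of this call the guide drops the LongWritable keys and parses each JSON value into a Python dict; a minimal sketch along those lines, reusing bq_table_rdd from above (the values come back as JSON strings per the guide):

import json

# Each element of bq_table_rdd is a (LongWritable key, JSON string value)
# pair; keep only the values and parse them, as the guide does.
records = bq_table_rdd.map(lambda record: json.loads(record[1]))
print(records.take(5))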

When the job reaches the spark_context.newAPIHadoopRDD call above, the following is printed to stdout:

class com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.DefaultPlatform: cannot cast result of calling 'com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' to 'com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory': java.lang.ClassCastException: Cannot cast com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory

Traceback (most recent call last):
  File "/tmp/0af805a2dd104e46b087037f0790691f/main.py", line 31, in <module>
    sc)
  File "/tmp/0af805a2dd104e46b087037f0790691f/extract.py", line 65, in bq_table_to_rdd
    conf=conf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 749, in newAPIHadoopRDD
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList

This was not an issue as recently as last week, which makes me concerned that even the hello-world example on the GCP website is not stable in the short term. If anyone could shed some light on this issue, it would be greatly appreciated. Thanks.

Which Dataproc version do you use? Could you add the command you used to create the cluster? – Dagang
The Dataproc version is the latest by default:

gcloud dataproc clusters create $CLUSTER_NAME \
    --region $LOCATION \
    --bucket $BUCKET_NAME \
    --master-boot-disk-size 15GB \
    --master-boot-disk-type pd-standard \
    --worker-boot-disk-size 10GB \
    --worker-boot-disk-type pd-standard \
    --num-master-local-ssds 1 \
    --num-worker-local-ssds 1 \
    --single-node \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2

– james miles

1 Answer


I reproduced the problem:

$ gcloud dataproc clusters create test-cluster --image-version=1.4

$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
    --cluster test-cluster \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar

and got exactly the same error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList

I noticed there was a new release 1.0.0 on Aug 23:

$ gsutil ls -l gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-**
   ...
   4038762  2018-10-03T20:59:35Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.8.jar
   4040566  2018-10-19T23:32:19Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
  14104522  2019-06-28T21:08:57Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC1.jar
  14104520  2019-07-01T20:38:18Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC2.jar
  14149215  2019-08-23T21:08:03Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0.jar
  14149215  2019-08-24T00:27:49Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar

Then I tried version 0.13.9, and it worked:

$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
    --cluster test-cluster \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar

It is a problem with 1.0.0; there is already an issue filed on GitHub. We'll fix it and improve the tests.
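
As an aside, a rough way to fail fast on this kind of packaging regression is to probe for the repackaged class from the PySpark driver before calling newAPIHadoopRDD. This is only a sketch: it assumes the usual sc._jvm py4j gateway and that the driver classloader can see the jar passed via --jars.

from py4j.protocol import Py4JJavaError

# Class the connector's input format needs at runtime (the one missing above).
REQUIRED_CLASS = ('com.google.cloud.hadoop.repackaged.bigquery.'
                  'com.google.common.collect.ImmutableList')

def connector_class_loadable(sc, class_name=REQUIRED_CLASS):
    """Return True if the driver JVM can load class_name via py4j."""
    try:
        sc._jvm.java.lang.Class.forName(class_name)
        return True
    except Py4JJavaError:
        return False

# Example usage, right after creating the SparkContext:
# if not connector_class_loadable(sc):
#     raise RuntimeError('BigQuery connector jar is missing ' + REQUIRED_CLASS)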