A little over a week ago, I was able to read a BigQuery table into an RDD for a Spark job running on a Dataproc cluster, using the guide at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example as a template. Since then, I have been hitting missing-class errors, despite having made no changes to the code based on that guide.
I have tried to track down the missing class, com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList, but I cannot find any information on whether it has now been removed from gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar.
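For anyone who wants to verify the jar contents themselves, the archive can be downloaded and searched directly (this assumes gsutil and unzip are available locally; the grep pattern is just the class path from the error):

gsutil cp gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar /tmp/
unzip -l /tmp/bigquery-connector-hadoop2-latest.jar \
    | grep 'repackaged/bigquery/com/google/common/collect/ImmutableList'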
The job submission request is as follows:
gcloud dataproc jobs submit pyspark $PYSPARK_PATH/main.py \
    --cluster $CLUSTER_NAME \
    --jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
    --bucket $BUCKET_NAME \
    --region europe-west2 \
    --py-files $PYSPARK_PATH/extract.py
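One mitigation I am considering is pinning the connector to an explicit version rather than the -latest alias, so the jar cannot change underneath the job. I have not confirmed which versioned jars are actually published under gs://hadoop-lib/bigquery/, so the version below is only illustrative:

gcloud dataproc jobs submit pyspark $PYSPARK_PATH/main.py \
    --cluster $CLUSTER_NAME \
    --jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0.jar \
    --bucket $BUCKET_NAME \
    --region europe-west2 \
    --py-files $PYSPARK_PATH/extract.py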
The PySpark code breaks at the following point:
bq_table_rdd = spark_context.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
where conf is a Python dict structured as follows:
conf = {
    'mapred.bq.project.id': project_id,
    'mapred.bq.gcs.bucket': gcs_staging_bucket,
    'mapred.bq.temp.gcs.path': input_staging_path,
    'mapred.bq.input.project.id': bq_input_project_id,
    'mapred.bq.input.dataset.id': bq_input_dataset_id,
    'mapred.bq.input.table.id': bq_input_table_id,
}
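For reference, here is a trimmed-down, self-contained version of what I am running, closely following the linked guide. The public shakespeare sample table stands in for my real input, and the staging path mirrors the guide's convention; my actual values come from job arguments:

import json

from pyspark import SparkContext

sc = SparkContext()

# Derive the staging location from the cluster configuration, as in the guide.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# This is the call that now fails with NoClassDefFoundError.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

# Each record is (row key, JSON string); parse the values into dicts.
rows = table_data.map(lambda record: json.loads(record[1]))
print(rows.take(5))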
When the output indicates that execution has reached the spark_context.newAPIHadoopRDD call above, the following is printed to stdout:
class com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.DefaultPlatform: cannot cast result of calling 'com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' to 'com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory': java.lang.ClassCastException: Cannot cast com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory
Traceback (most recent call last):
  File "/tmp/0af805a2dd104e46b087037f0790691f/main.py", line 31, in <module>
    sc)
  File "/tmp/0af805a2dd104e46b087037f0790691f/extract.py", line 65, in bq_table_to_rdd
    conf=conf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 749, in newAPIHadoopRDD
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
This was working as recently as last week, so I am concerned that even the hello-world example in the GCP documentation is not stable from week to week. If anyone can shed some light on this issue, it would be greatly appreciated. Thanks.