
I am new to Dataproc clusters and PySpark. While looking for code to load a table from BigQuery into the cluster, I came across the snippet below, and I was unable to figure out what I am supposed to change for my use case, and what we are providing as the input in the input directory.

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import subprocess


sc = SparkContext()
spark = SparkSession(sc)


bucket = spark._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = spark._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)


conf = {
    # Project used by the connector for the BigQuery export job.
    'mapred.bq.project.id': project,
    # GCS bucket and path where the connector stages the exported table.
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    # Source table to read, identified as project / dataset / table.
    'mapred.bq.input.project.id': 'dataset_new',
    'mapred.bq.input.dataset.id': 'retail',
    'mapred.bq.input.table.id': 'market',
}
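For context, the tutorial this snippet seems to come from continues by passing conf to newAPIHadoopRDD; that call exports the table as JSON files into input_directory on GCS and reads them back as an RDD, which is what the input directory is for. A sketch of that continuation, with class names as in the Hadoop BigQuery connector example:

# Assumed continuation from the Hadoop BigQuery connector example:
# the connector exports the table to input_directory as JSON and
# returns an RDD of (LongWritable key, JSON string) pairs.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)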

1 Answer


You are trying to use the Hadoop BigQuery connector; for Spark you should use the Spark BigQuery connector.

To read data from BigQuery, you can follow this example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('bigquery-wordcount').getOrCreate()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the connector.
bucket = "[bucket]"
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
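Applied to the table from your question, the read might look like the sketch below. Note that 'your-project' is a placeholder: in your config, 'dataset_new' sits in a project field but looks like a dataset name, so substitute the GCP project that actually owns the retail dataset.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('read-retail-market').getOrCreate()

# Assumption: 'your-project' is a placeholder for the GCP project that
# owns the 'retail' dataset; replace it with your real project id.
df = spark.read.format('bigquery') \
  .option('table', 'your-project:retail.market') \
  .load()
df.show()

Also make sure the connector jar is available to the job, for example by submitting with --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar (the public location documented for the Spark BigQuery connector).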