2 votes

I have a table in BigQuery that I want to query and then run the FP-growth algorithm on. I want to try it first in the pyspark shell on a VM instance of a Dataproc cluster.

I am looking for a way to query the table in BQ directly from pyspark. I then want to use the query results to run FPGrowth (which I am already familiar with).

Hello and welcome to Stack Overflow. Please read through How to Ask so we can better understand what your real problem is. Have you already queried from BQ into Dataproc? Is the issue related to the algorithm itself? - Willian Fuks
Edited my question. Thanks for your time. - Nivi

2 Answers

4 votes

Dataproc already has the necessary connectors available to query BigQuery, as you can see in the docs.

Code sample from docs:

import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()

# Use the Google Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Google Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters, for writing results back to BigQuery (not used in this snippet).
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
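
Once table_data is loaded, one way to feed it into FPGrowth is to parse the JSON rows into a DataFrame and aggregate them into per-transaction item arrays. The sketch below is only an illustration: it assumes Spark 2.2+ (for pyspark.ml.fpm.FPGrowth), and the column names transaction_id and item, as well as the minSupport/minConfidence values, are placeholders you would adapt to your own table.

import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# The connector yields (key, json_string) pairs; keep only the JSON payload.
rows = table_data.map(lambda kv: json.loads(kv[1]))

# Assumed schema: each BigQuery row carries a transaction id and one item.
# Replace 'transaction_id' and 'item' with the columns of your own table.
df = spark.createDataFrame(
    rows.map(lambda r: (r['transaction_id'], r['item'])),
    ['transaction_id', 'item'])

# FPGrowth expects a single array-of-items column, one row per transaction.
baskets = df.groupBy('transaction_id').agg(F.collect_set('item').alias('items'))

fp = FPGrowth(itemsCol='items', minSupport=0.01, minConfidence=0.2)
model = fp.fit(baskets)
model.freqItemsets.show()
model.associationRules.show()

On older Spark versions you would instead use pyspark.mllib.fpm.FPGrowth on an RDD of item lists, but the overall flow is the same.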

I also recommend creating a Dataproc cluster with the Jupyter service installed. That gives you the possibility of testing on the fly how to implement FPGrowth, or any other idea you eventually want to try.

In fact, before writing this answer I used my current Jupyter notebook to query BQ to see how it would work:

[screenshot: Jupyter notebook on Dataproc querying BigQuery]

0 votes

To add to Willian Fuks's answer (I don't have enough rep to comment directly, apologies for that):

One of the issues I have with Dataproc's integration between BigQuery and PySpark is that you can't actually take advantage of BigQuery's query capabilities to, for example, pre-filter the data you want to use in Spark.

When you use the example Willian provided, what actually happens is that all the data in the referenced table is copied to a temporary location so Spark can access it.

One option you may want to use, if it makes sense for your use case, is creating a temporary table in BigQuery with the subset of data you're interested in (you can run the query in BigQuery and save the results into a new table). You then point the connector at that table instead and do the rest of the work in PySpark, as in the sketch below.
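
For illustration, a minimal sketch of that approach, reusing the variables from Willian's example; my_project, my_dataset and filtered_table are placeholders for wherever you saved the pre-filtered results.

# Point the connector at the pre-filtered table instead of the full one.
# 'my_project', 'my_dataset' and 'filtered_table' are placeholders for the
# destination you chose when saving the BigQuery query results.
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'my_project',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'filtered_table',
}

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

This way only the subset is exported to the temporary GCS location and read by Spark, rather than the whole table.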