2 votes

I have a table in BigQuery that I want to query and then run the FP-growth algorithm on. I want to try it first in the pyspark shell on a VM instance of a Dataproc cluster.

I am looking for a way to query the table in BQ directly from pyspark. I then want to use the query results to run FPGrowth (which I am already familiar with).

Hello and welcome to Stack Overflow. Please read through How to Ask so we can better understand what your real problem is. Have you already queried from BQ into Dataproc? Is the issue related to the algorithm itself? - Willian Fuks
Edited my question. Thanks for your time. - Nivi

2 Answers

4 votes

Dataproc already has the necessary connectors available to query BigQuery, as you can see in the docs.

Code sample from docs:

import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()

# Use the Google Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Google Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters, for writing results back to BigQuery (not used in this snippet).
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
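
Once table_data is loaded, one way to feed it into FPGrowth is to parse the JSON rows into a DataFrame and aggregate them into per-transaction item arrays. The sketch below is only an illustration: it assumes Spark 2.2+ (for pyspark.ml.fpm.FPGrowth), and the column names transaction_id and item, as well as the minSupport/minConfidence values, are placeholders you would adapt to your own table.

import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# The connector yields (key, json_string) pairs; keep only the JSON payload.
rows = table_data.map(lambda kv: json.loads(kv[1]))

# Assumed schema: each BigQuery row carries a transaction id and one item.
# Replace 'transaction_id' and 'item' with the columns of your own table.
df = spark.createDataFrame(
    rows.map(lambda r: (r['transaction_id'], r['item'])),
    ['transaction_id', 'item'])

# FPGrowth expects a single array-of-items column, one row per transaction.
baskets = df.groupBy('transaction_id').agg(F.collect_set('item').alias('items'))

fp = FPGrowth(itemsCol='items', minSupport=0.01, minConfidence=0.2)
model = fp.fit(baskets)
model.freqItemsets.show()
model.associationRules.show()

On older Spark versions you would instead use pyspark.mllib.fpm.FPGrowth on an RDD of item lists, but the overall flow is the same.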

I also recommend creating a Dataproc cluster with the Jupyter service installed. That gives you the possibility of testing on the fly how to implement FPGrowth, or any other idea you eventually want to try.

In fact, before writing this answer I used my current Jupyter notebook to query BQ to see how it would work:

[screenshot: Jupyter notebook on Dataproc querying BigQuery]

0 votes

To add to Willian Fuks's answer (I don't have enough rep to comment directly, apologies for that):

One of the issues I have with Dataproc's integration between BigQuery and PySpark is that you can't actually take advantage of BigQuery's query capabilities to, for example, pre-filter the data you want to use in Spark.

When you use the example Willian provided, what actually happens is that all the data in the referenced table is copied to a temporary location so Spark can access it.

One option you may want to use, if it makes sense for your use case, is creating a temporary table in BigQuery with the subset of data you're interested in (you can run the query in BigQuery and save the results into a new table). You then point the connector at that table instead and do the rest of the work in PySpark, as in the sketch below.
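
For illustration, a minimal sketch of that approach, reusing the variables from Willian's example; my_project, my_dataset and filtered_table are placeholders for wherever you saved the pre-filtered results.

# Point the connector at the pre-filtered table instead of the full one.
# 'my_project', 'my_dataset' and 'filtered_table' are placeholders for the
# destination you chose when saving the BigQuery query results.
conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'my_project',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'filtered_table',
}

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

This way only the subset is exported to the temporary GCS location and read by Spark, rather than the whole table.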