
I have data streaming into a topic in Google PubSub. I can see that data using simple Python code:

from datetime import datetime
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_name = "projects/<project-id>/subscriptions/<subscription>"

def callback(message):
    # message.data is bytes, so decode it before concatenating with str
    print(datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
          + ": message = '" + message.data.decode("utf-8") + "'")
    message.ack()

future = subscriber.subscribe(subscription_name, callback)
future.result()

The above Python code receives data from the Google Pub/Sub topic (via the subscription subscription_name) and writes it to the terminal, as expected. I would like to stream the same data from the topic into PySpark (as an RDD or DataFrame), so I can do other streaming transformations such as windowing and aggregations in PySpark, as described here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.

That link has documentation for reading from other streaming sources (e.g. Kafka), but not from Google Pub/Sub. Is there a way to stream from Google Pub/Sub into PySpark?

2 Answers


You can use Apache Bahir, which provides extensions for Apache Spark, including a connector for Google Cloud Pub/Sub (spark-streaming-pubsub).
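Note that Bahir's Pub/Sub connector exposes a Scala/Java DStream API rather than a PySpark one. As a rough sketch (assuming the spark-streaming-pubsub package is on the classpath, and with placeholder project and subscription names), reading from an existing subscription looks something like:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.pubsub.{PubsubUtils, SparkGCPCredentials}

object PubSubStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PubSubStream")
    val ssc = new StreamingContext(conf, Seconds(10))

    val messages = PubsubUtils.createStream(
      ssc,
      "my-project",         // placeholder GCP project id
      None,                 // no topic: attach to an existing subscription
      "my-subscription",    // placeholder subscription name
      SparkGCPCredentials.builder.build(), // application-default credentials
      StorageLevel.MEMORY_AND_DISK_SER_2)

    // Each element is a SparkPubsubMessage; decode its payload
    messages.map(msg => new String(msg.getData())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Launched, for example, with `spark-submit --packages org.apache.bahir:spark-streaming-pubsub_2.11:2.4.0 ...` (adjust the Scala/Spark version to match your cluster). The exact signature of `PubsubUtils.createStream` may differ between Bahir releases, so check the documentation for the version you use.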

You can find an example from Google Cloud Platform that uses Spark on Kubernetes to compute word counts from a data stream received from a Google Cloud Pub/Sub topic and writes the result to a Google Cloud Storage (GCS) bucket.

There's another example that uses DStreams to deploy an Apache Spark streaming application on Cloud Dataproc and process messages from Cloud Pub/Sub.