2
votes

I am running a few batch Spark pipelines that consume Avro data on Google Cloud Storage. I need to make some pipelines more real-time, and am wondering whether Spark Structured Streaming can directly consume files from GCS in a streaming way, i.e. whether sparkContext.readStream.from(...) can be applied to Avro files that are being generated continuously under a bucket by external sources.

Apache Beam already has primitives like FileIO.matchAll().continuously(), Watch, and watchForNewFiles that allow Beam pipelines to monitor for new files and read them in a streaming way (thus obviating the need for Pub/Sub or a notification system). Is there something similar for Spark Structured Streaming as well?


1 Answer

1
vote

As the GCS connector exposes a Hadoop-Compatible FileSystem (HCFS), "gs://" URIs should be valid load paths for SparkSession.readStream, the same as any other Hadoop-compatible storage.
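A minimal sketch of what that could look like (the bucket name and schema here are hypothetical, and the GCS connector JAR is assumed to be on the classpath, as it is by default on Dataproc). Structured Streaming's file source lists the directory on each trigger and only processes files it has not seen before:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder()
  .appName("gcs-structured-streaming")
  .getOrCreate()

// File-based streaming sources require the schema to be declared up front.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val events = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "100") // cap how many new files each micro-batch picks up
  .format("parquet")                   // any supported file format works the same way
  .load("gs://my-bucket/events/")      // hypothetical bucket; new files are discovered automatically
```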

Avro file handling is implemented by the spark-avro module. Using it with readStream works the same way as a static read: specify the Avro format on the DataStreamReader. On Spark 2.4+, spark-avro ships with Spark as .format("avro") (the module is external, so it still has to be added to the job, e.g. via --packages); on older versions, use the external Databricks package and .format("com.databricks.spark.avro").
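Continuing the sketch above (same hypothetical bucket and schema), a streaming Avro read plus a console sink to verify that newly landed files are picked up:

```scala
import org.apache.spark.sql.streaming.Trigger

// Assumes Spark 2.4+ with the spark-avro module added, e.g.:
//   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.1 ...
val avroEvents = spark.readStream
  .schema(schema)                        // reuse the schema declared above
  .format("avro")                        // "com.databricks.spark.avro" on Spark < 2.4
  .load("gs://my-bucket/avro-events/")   // hypothetical bucket with continuously arriving files

val query = avroEvents.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("30 seconds")) // re-list the bucket every 30s
  .start()

query.awaitTermination()
```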