2
votes

I am running a few batch Spark pipelines that consume Avro data on Google Cloud Storage. I need to make some pipelines more real-time, and am wondering whether Spark Structured Streaming can directly consume files from GCS in a streaming way, i.e. whether sparkContext.readStream.from(...) can be applied to Avro files that are being generated continuously under a bucket by external sources.

Apache Beam already has primitives like FileIO.matchAll().continuously(), Watch, and watchForNewFiles that allow Beam pipelines to monitor for new files and read them in a streaming way (thus obviating the need for Pub/Sub or a notification system). Is there something similar for Spark Structured Streaming as well?


1 Answer

1
vote

As the GCS connector exposes a Hadoop-Compatible FileSystem (HCFS), "gs://" URIs should be valid load paths for SparkSession.readStream, the same as any other Hadoop-compatible storage.
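A minimal sketch of what that could look like (the bucket name and schema here are hypothetical, and the GCS connector JAR is assumed to be on the classpath, as it is by default on Dataproc). Structured Streaming's file source lists the directory on each trigger and only processes files it has not seen before:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder()
  .appName("gcs-structured-streaming")
  .getOrCreate()

// File-based streaming sources require the schema to be declared up front.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val events = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "100") // cap how many new files each micro-batch picks up
  .format("parquet")                   // any supported file format works the same way
  .load("gs://my-bucket/events/")      // hypothetical bucket; new files are discovered automatically
```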

Avro file handling is implemented by the spark-avro module. Using it with readStream works the same way as a static read: specify the Avro format on the DataStreamReader. On Spark 2.4+, spark-avro ships with Spark as .format("avro") (the module is external, so it still has to be added to the job, e.g. via --packages); on older versions, use the external Databricks package and .format("com.databricks.spark.avro").
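Continuing the sketch above (same hypothetical bucket and schema), a streaming Avro read plus a console sink to verify that newly landed files are picked up:

```scala
import org.apache.spark.sql.streaming.Trigger

// Assumes Spark 2.4+ with the spark-avro module added, e.g.:
//   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.1 ...
val avroEvents = spark.readStream
  .schema(schema)                        // reuse the schema declared above
  .format("avro")                        // "com.databricks.spark.avro" on Spark < 2.4
  .load("gs://my-bucket/avro-events/")   // hypothetical bucket with continuously arriving files

val query = avroEvents.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("30 seconds")) // re-list the bucket every 30s
  .start()

query.awaitTermination()
```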