I am running a few batch Spark pipelines that consume Avro data from Google Cloud Storage. I need to make some of these pipelines more real-time, and I am wondering whether Spark Structured Streaming can directly consume files from GCS in a streaming way, i.e. whether something like `spark.readStream.format(...).load(...)` can be applied to Avro files that are being generated continuously under a bucket by external sources.
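
Roughly, this is what I am hoping for (a sketch only; the bucket path, schema, and app name are made up, and I am assuming the `org.apache.spark:spark-avro` package and the GCS Hadoop connector are on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("gcs-avro-stream").getOrCreate()

// Streaming file sources require an explicit schema up front;
// this toy schema stands in for the real Avro schema.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val events = spark.readStream
  .format("avro")                    // provided by the spark-avro package (Spark 2.4+)
  .schema(schema)
  .load("gs://my-bucket/incoming/")  // directory that external sources write to

// Print new records to the console as files arrive, just to prove the point.
events.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
```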
Apache Beam already has `FileIO.matchAll().continuously()`, `Watch`, and `watchForNewFiles`, which allow Beam pipelines to monitor for new files and read them in a streaming way (thus obviating the need for Pub/Sub or a notification system). Is there something similar for Spark Structured Streaming as well?
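
For reference, the Beam behaviour I mean looks roughly like this (a sketch in Scala against Beam's Java SDK; the file pattern, poll interval, and toy schema are placeholders):

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.AvroIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Watch
import org.joda.time.Duration

val pipeline = Pipeline.create(PipelineOptionsFactory.create())

// Toy schema standing in for the real Avro schema of the incoming files.
val schemaJson =
  """{"type":"record","name":"Event","fields":[{"name":"id","type":"long"}]}"""

// Read the files matching the pattern, then keep polling every 30s for new
// ones; Watch.Growth.never() means "never stop watching the pattern".
val records = pipeline.apply(
  AvroIO.readGenericRecords(schemaJson)
    .from("gs://my-bucket/incoming/*.avro")
    .watchForNewFiles(Duration.standardSeconds(30), Watch.Growth.never[String]())
)

pipeline.run()
```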