I want to build a streaming Apache Beam pipeline in GCP which reads data from Google Pub/Sub and pushes it to GCS. I have the part where I read the data from Pub/Sub working. My current code looks like this (adapted from one of the GCP Apache Beam templates):
pipeline
    .apply(
        "Read PubSub Events",
        PubsubIO.readMessagesWithAttributes().fromTopic(options.getInputTopic()))
    .apply("Map to Archive", ParDo.of(new PubsubMessageToArchiveDoFn()))
    .apply(
        options.getWindowDuration() + " Window",
        Window.into(FixedWindows.of(DurationUtils.parseDuration(options.getWindowDuration()))))
    .apply(
        "Write File(s)",
        AvroIO.write(AdEvent.class)
            .to(
                new WindowedFilenamePolicy(
                    options.getOutputDirectory(),
                    options.getOutputFilenamePrefix(),
                    options.getOutputShardTemplate(),
                    options.getOutputFilenameSuffix()))
            .withTempDirectory(
                NestedValueProvider.of(
                    options.getAvroTempDirectory(),
                    (SerializableFunction<String, ResourceId>)
                        input -> FileBasedSink.convertToFileResourceIfPossible(input)))
            .withWindowedWrites()
            .withNumShards(options.getNumShards()));
It produces files that look like this:
windowed-file2020-04-28T09:00:00.000Z-2020-04-28T09:02:00.000Z-pane-0-last-00-of-01.avro
I want to store the data in GCS in dynamically created directories, e.g. 2020-04-28/01, 2020-04-28/02, etc., where 01 and 02 are subdirectories denoting the hour of the day in which the data was processed by the Dataflow streaming pipeline.
Example:
gs://data/2020-04-28/01/0000000.avro
gs://data/2020-04-28/01/0000001.avro
gs://data/2020-04-28/01/....
gs://data/2020-04-28/02/0000000.avro
gs://data/2020-04-28/02/0000001.avro
gs://data/2020-04-28/02/....
gs://data/2020-04-28/03/0000000.avro
gs://data/2020-04-28/03/0000001.avro
gs://data/2020-04-28/03/....
...
The 0000000, 0000001, etc. are simple file names used for illustration; I do not expect the files to be named sequentially. Do you think this is possible in a GCP Dataflow streaming setup?
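For what it's worth, here is a rough sketch of what I was imagining: a custom FileBasedSink.FilenamePolicy that derives the date/hour path from the start of each IntervalWindow. The class name HourlyPartitionedFilenamePolicy and the filename format are placeholders of my own, and I have not tested this:

import org.apache.beam.sdk.io.FileBasedSink;
import org.apache.beam.sdk.io.FileBasedSink.FilenamePolicy;
import org.apache.beam.sdk.io.FileBasedSink.OutputFileHints;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.joda.time.Instant;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

// Hypothetical policy: routes each window's shards under <baseDir>/yyyy-MM-dd/HH/
// derived from the window start.
class HourlyPartitionedFilenamePolicy extends FilenamePolicy {

  private static final DateTimeFormatter DAY_FORMAT =
      DateTimeFormat.forPattern("yyyy-MM-dd").withZoneUTC();
  private static final DateTimeFormatter HOUR_FORMAT =
      DateTimeFormat.forPattern("HH").withZoneUTC();

  private final ResourceId baseDir;

  // baseDir should end with a trailing slash (e.g. "gs://data/") so it resolves as a directory.
  HourlyPartitionedFilenamePolicy(String baseDir) {
    this.baseDir = FileBasedSink.convertToFileResourceIfPossible(baseDir);
  }

  @Override
  public ResourceId windowedFilename(
      int shardNumber,
      int numShards,
      BoundedWindow window,
      PaneInfo paneInfo,
      OutputFileHints outputFileHints) {
    // Fixed windows arrive here as IntervalWindows; the window start gives the date and hour.
    Instant start = ((IntervalWindow) window).start();
    ResourceId hourDir =
        baseDir
            .resolve(DAY_FORMAT.print(start), StandardResolveOptions.RESOLVE_DIRECTORY)
            .resolve(HOUR_FORMAT.print(start), StandardResolveOptions.RESOLVE_DIRECTORY);
    // Shard number/count keep filenames unique within a window; the suffix comes from the sink.
    String filename =
        String.format(
            "output-%d-%02d-of-%02d%s",
            start.getMillis(), shardNumber, numShards,
            outputFileHints.getSuggestedFilenameSuffix());
    return hourDir.resolve(filename, StandardResolveOptions.RESOLVE_FILE);
  }

  @Override
  public ResourceId unwindowedFilename(
      int shardNumber, int numShards, OutputFileHints outputFileHints) {
    throw new UnsupportedOperationException("This policy only supports windowed writes");
  }
}

My plan would be to plug it in with AvroIO.write(AdEvent.class).to(new HourlyPartitionedFilenamePolicy("gs://data/")), keeping the existing .withTempDirectory(...), .withWindowedWrites() and .withNumShards(...) calls. One thing I am not sure about: this keys the directory off the window's event time rather than the wall-clock hour in which Dataflow actually processed the elements, so I would appreciate confirmation of whether this is the right approach.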