We are currently developing an Apache Beam pipeline that reads data from GCP Pub/Sub and writes the received data to a bucket in AWS S3.
We are using TextIO.write() together with the S3 filesystem support from beam-sdks-java-io-amazon-web-services to write to S3:
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards)
.withTempDirectory(FileBasedSink.convertToFileResourceIfPossible(options.getTempLocation))
.to(FileBasedSink.convertToFileResourceIfPossible(options.getOutputDirectory))
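For context, a simplified sketch of how this write is wired into the streaming pipeline (the Pub/Sub read, the 5-minute fixed window, and the getInputTopic option are illustrative placeholders, not our exact code):

// Simplified sketch: read from Pub/Sub, window the stream, then hand off to the TextIO.write shown above.
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.{FileBasedSink, TextIO}
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO
import org.apache.beam.sdk.transforms.windowing.{FixedWindows, Window}
import org.joda.time.Duration

val pipeline = Pipeline.create(options)

pipeline
  .apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic))
  .apply("Window", Window.into[String](FixedWindows.of(Duration.standardMinutes(5))))
  .apply("WriteToS3", TextIO.write()
    .withWindowedWrites()
    .withNumShards(options.getNumShards)
    .withTempDirectory(FileBasedSink.convertToFileResourceIfPossible(options.getTempLocation))
    .to(FileBasedSink.convertToFileResourceIfPossible(options.getOutputDirectory)))

pipeline.run()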
We first tested this pipeline locally using the DirectRunner, and that worked fine (data incoming from Pub/Sub was received by the pipeline and written to S3):
options.setRunner(classOf[DirectRunner])
options.setStagingLocation("./outputFolder/staging")
options.setTempLocation("s3://my-s3-bucket/temp")
options.setOutputDirectory("s3://my-s3-bucket/output")
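For completeness, the AWS side (region and credentials) for the S3 filesystem is configured through the AwsOptions interface from the same module. A minimal sketch of that plumbing, with placeholder region and credentials provider values, looks like this:

// Sketch of the AWS options the S3 filesystem reads; the region and provider below are placeholders.
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import org.apache.beam.sdk.io.aws.options.AwsOptions

val awsOptions = options.as(classOf[AwsOptions])
awsOptions.setAwsRegion("eu-west-1")
awsOptions.setAwsCredentialsProvider(new DefaultAWSCredentialsProviderChain())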
As the last step, we wanted to run this pipeline with the DataflowRunner without any code changes, so we only switched the runner and adjusted the options:
options.setRunner(classOf[DataflowRunner])
options.setStagingLocation("gs://my-gcs-bucket/binaries")
options.setGcpTempLocation("gs://my-gcs-bucket/temp")
options.setTempLocation("s3://my-s3-bucket/temp")
options.setOutputDirectory("s3://my-s3-bucket/output")
With these settings, data is received by the pipeline from Pub/Sub, but nothing is written to S3. There are also no errors in the Dataflow logs in Stackdriver.
Does anyone know what the issue could be? Is the pipeline options configuration incorrect, or is the write to S3 failing silently?
Does anyone have a suggestion on how to configure the logging for beam-sdks-java-io-amazon-web-services so that it outputs DEBUG-level logs?
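The closest thing we have found is raising the worker log level per package through DataflowWorkerLoggingOptions. A minimal sketch of what we think should do this (assuming the S3 filesystem code logs under the org.apache.beam.sdk.io.aws package) is below; does that look right?

// Attempt to raise the Dataflow worker log level for the AWS IO package only.
// Assumes the S3 filesystem code logs under org.apache.beam.sdk.io.aws.
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions
import org.apache.beam.runners.dataflow.options.DataflowWorkerLoggingOptions.{Level, WorkerLogLevelOverrides}

val loggingOptions = options.as(classOf[DataflowWorkerLoggingOptions])
loggingOptions.setWorkerLogLevelOverrides(
  new WorkerLogLevelOverrides().addOverrideForName("org.apache.beam.sdk.io.aws", Level.DEBUG))

Presumably the same can be passed on the command line via the --workerLogLevelOverrides flag, if that is the preferred way.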
Thanks!