
I tried running a Dataflow pipeline that reads from my local machine (Windows) and writes to Google Cloud Storage, using DirectPipelineRunner. The job failed with the FileNotFoundException below, so I believe the Dataflow job is unable to read from my local path. I am launching the job from my local machine against the GCP-based template that I created. The job appears in the GCP Dataflow dashboard, but it fails with the error below. I also tried using the IP or hostname of my local machine together with the local path, but got the same FileNotFoundException. Please help.

Error:

java.io.FileNotFoundException: No files matched spec: C:/data/sampleinput.txt
    at org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:172)
    at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:158)
    at org.apache.beam.sdk.io.FileBasedSource.split(FileBasedSource.java:261)
    at com.google.cloud.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:275)

COMMAND TO RUN THE TEMPLATE:

gcloud dataflow jobs run jobname --gcs-location gs://<somebucketname of template>/<templatename> --parameters inputFilePattern=C:/data/sampleinput.txt,outputLocation=gs://<bucketname>/output/outputfile,runner=DirectPipelineRunner

CODE:

PCollection<String> textData = pipeline.apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()));
textData.apply("Write Text Data", TextIO.write().to(options.getOutputLocation()));
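Since the pipeline is run as a template with `--parameters`, the options interface backing `getInputFilePattern()` and `getOutputLocation()` matters: for classic templates, parameters supplied at run time must be declared as `ValueProvider<String>`, or their values are baked in when the template is created. A minimal sketch of such an interface is below; the interface and option names are assumptions modeled on the parameters shown in the command above.

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

// Hypothetical options interface for the pipeline above. ValueProvider
// defers resolution of the value until the template is executed, which is
// what allows --parameters to work with `gcloud dataflow jobs run`.
public interface MyPipelineOptions extends PipelineOptions {
  @Description("File pattern to read, e.g. gs://<bucketname>/sampleinput.txt")
  ValueProvider<String> getInputFilePattern();
  void setInputFilePattern(ValueProvider<String> value);

  @Description("Output prefix, e.g. gs://<bucketname>/output/outputfile")
  ValueProvider<String> getOutputLocation();
  void setOutputLocation(ValueProvider<String> value);
}
```

`TextIO.read().from(...)` and `TextIO.write().to(...)` both accept a `ValueProvider<String>` directly, so the pipeline code shown above works unchanged with this interface.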

1 Answer


The gcloud dataflow jobs run command runs your job on Cloud Dataflow. That means the Dataflow workers will try to find C:/data/sampleinput.txt, a local Windows path that naturally does not exist on those workers.

You can fix this by uploading sampleinput.txt to a bucket and supplying the URI gs://<bucketname>/sampleinput.txt as inputFilePattern. The Dataflow workers will then be able to find your input file, and the job should succeed.
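The fix can be sketched as two commands, run from the local Windows machine with the Cloud SDK installed. The bucket and template placeholders are carried over from the question; substitute your own values.

```shell
# 1. Upload the local input file to a GCS bucket so the Dataflow
#    workers can reach it.
gsutil cp C:/data/sampleinput.txt gs://<bucketname>/sampleinput.txt

# 2. Re-run the template, pointing inputFilePattern at the GCS object
#    instead of the local path.
gcloud dataflow jobs run jobname \
    --gcs-location gs://<somebucketname of template>/<templatename> \
    --parameters inputFilePattern=gs://<bucketname>/sampleinput.txt,outputLocation=gs://<bucketname>/output/outputfile
```

Note that the `runner=DirectPipelineRunner` parameter from the original command is dropped here: a job submitted with `gcloud dataflow jobs run` always executes on the Dataflow service, so that parameter has no effect in this context.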