
I tried running a Dataflow pipeline that reads from my local machine (Windows) and writes to Google Cloud Storage, using DirectPipelineRunner. The job failed with the FileNotFoundException below, so I believe the Dataflow job is unable to read from my local path. I am launching the job from my local machine against the GCP-based template that I created. The job appears in the GCP Dataflow dashboard, but it fails with the error below. I also tried using the IP or hostname of my local machine together with the local path, but got the same FileNotFoundException. Please help.

Error:

java.io.FileNotFoundException: No files matched spec: C:/data/sampleinput.txt
    at org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:172)
    at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:158)
    at org.apache.beam.sdk.io.FileBasedSource.split(FileBasedSource.java:261)
    at com.google.cloud.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:275)

COMMAND TO RUN THE TEMPLATE:

gcloud dataflow jobs run jobname --gcs-location gs://<somebucketname of template>/<templatename> --parameters inputFilePattern=C:/data/sampleinput.txt,outputLocation=gs://<bucketname>/output/outputfile,runner=DirectPipelineRunner

CODE:

PCollection<String> textData = pipeline.apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()));
textData.apply("Write Text Data", TextIO.write().to(options.getOutputLocation()));
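Since the pipeline is run as a template with `--parameters`, the options interface backing `getInputFilePattern()` and `getOutputLocation()` matters: for classic templates, parameters supplied at run time must be declared as `ValueProvider<String>`, or their values are baked in when the template is created. A minimal sketch of such an interface is below; the interface and option names are assumptions modeled on the parameters shown in the command above.

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

// Hypothetical options interface for the pipeline above. ValueProvider
// defers resolution of the value until the template is executed, which is
// what allows --parameters to work with `gcloud dataflow jobs run`.
public interface MyPipelineOptions extends PipelineOptions {
  @Description("File pattern to read, e.g. gs://<bucketname>/sampleinput.txt")
  ValueProvider<String> getInputFilePattern();
  void setInputFilePattern(ValueProvider<String> value);

  @Description("Output prefix, e.g. gs://<bucketname>/output/outputfile")
  ValueProvider<String> getOutputLocation();
  void setOutputLocation(ValueProvider<String> value);
}
```

`TextIO.read().from(...)` and `TextIO.write().to(...)` both accept a `ValueProvider<String>` directly, so the pipeline code shown above works unchanged with this interface.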

1 Answer


The gcloud dataflow jobs run command runs your job on Cloud Dataflow. That means the Dataflow workers will try to find C:/data/sampleinput.txt, a local Windows path that naturally does not exist on those workers.

You can fix this by uploading sampleinput.txt to a bucket and supplying the URI gs://<bucketname>/sampleinput.txt as inputFilePattern. The Dataflow workers will then be able to find your input file, and the job should succeed.
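The fix can be sketched as two commands, run from the local Windows machine with the Cloud SDK installed. The bucket and template placeholders are carried over from the question; substitute your own values.

```shell
# 1. Upload the local input file to a GCS bucket so the Dataflow
#    workers can reach it.
gsutil cp C:/data/sampleinput.txt gs://<bucketname>/sampleinput.txt

# 2. Re-run the template, pointing inputFilePattern at the GCS object
#    instead of the local path.
gcloud dataflow jobs run jobname \
    --gcs-location gs://<somebucketname of template>/<templatename> \
    --parameters inputFilePattern=gs://<bucketname>/sampleinput.txt,outputLocation=gs://<bucketname>/output/outputfile
```

Note that the `runner=DirectPipelineRunner` parameter from the original command is dropped here: a job submitted with `gcloud dataflow jobs run` always executes on the Dataflow service, so that parameter has no effect in this context.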