
I tried a Dataflow job that reads from Google Cloud Storage and writes to my local machine. I used the DirectPipelineRunner. The job completed successfully, but I don't see the files written on my local machine. Should I specify an IP/hostname along with my local location in the output location parameter? How do I specify a location on my local machine?

Command below:

gcloud dataflow jobs run sampleJobname1 --gcs-location gs://bucket/templatename1 --parameters inputFilePattern=gs://samplegcsbucket/abc/*,outputLocation=C:\data\gcp\outer,runner=DirectPipelineRunner

CODE:

PCollection<String> textData = pipeline.apply("Read Text Data", TextIO.read().from(options.getInputFilePattern()));
textData.apply("Write Text Data", TextIO.write().to(options.getOutputLocation()));

1 Answer


The reason this is not working is that a Dataflow job is intended to read its input from, and write its output to, cloud services such as Google Cloud Storage, not the local filesystem.

If you want to write to your local machine, you can use a SimpleFunction that takes a String input and returns Void, applied to the PCollection with MapElements.via. Inside it you can write custom Java code to save the files on your local machine. You have to run this pipeline with the DirectRunner.

@SuppressWarnings("serial")
public static class SaveFileToLocal extends SimpleFunction<String, Void> {

    @Override
    public Void apply(String input) {

        String fileContents = input;

        // CODE TO WRITE THE TEXT TO LOCAL PATH
        return null;
    }
}
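The elided write step itself can be plain java.nio, independent of Beam. A minimal, self-contained sketch — the directory and file names below are placeholders I chose for illustration, not part of the original answer:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveToLocalDemo {

    // Hypothetical helper: writes one element's text to a file under outputDir.
    static Path saveText(Path outputDir, String fileName, String contents) throws IOException {
        Files.createDirectories(outputDir);               // create the target directory if missing
        Path target = outputDir.resolve(fileName);
        Files.write(target, contents.getBytes("UTF-8"));  // create (or overwrite) the file
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("local-out"); // stand-in for C:\data\gcp\outer
        Path written = saveText(dir, "part-00000.txt", "hello from dataflow");
        System.out.println(new String(Files.readAllBytes(written), "UTF-8"));
    }
}
```

In the SimpleFunction above, a call like saveText(...) would replace the placeholder comment; because the function runs once per element, you would typically derive the file name from the element or a counter to avoid overwrites.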

If you still cannot achieve this with the above approach, I would suggest using the Cloud Storage client API directly and performing the same download with Python or PHP code.
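Staying in Java, the same download can also be done with the google-cloud-storage client library. A sketch, assuming that dependency is on the classpath and application-default credentials are configured — the bucket, object, and local path are placeholders:

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.file.Paths;

public class DownloadFromGcs {
    public static void main(String[] args) {
        // Authenticated client using application-default credentials.
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Placeholder bucket and object names -- substitute your own.
        Blob blob = storage.get(BlobId.of("samplegcsbucket", "abc/output-00000-of-00001"));

        // Copy the object's contents to a file on the local machine.
        blob.downloadTo(Paths.get("C:\\data\\gcp\\outer\\output-00000-of-00001"));
    }
}
```

This bypasses Dataflow entirely: the pipeline writes its output to GCS as usual, and the download to the local machine happens as a separate step.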