
I'm using Apache Beam and Maven to create a pipeline and run Dataflow jobs. After writing the pipeline logic, I run the following command to upload the job/template to Google Cloud:

mvn compile exec:java -Dexec.mainClass=com.package.MyMainClass -Dexec.args="--runner=DataflowRunner --autoscalingAlgorithm=NONE --numWorkers=25 --project=<PROJECT> --subnetwork=regions/us-east1/subnetworks/default --zone=us-east1-b --network=default --stagingLocation=gs://<TBD> --templateLocation=gs://<TBD> --otherCustomOptions"

After that, I've seen two ways of getting the job to start running:

  1. I go to the Dataflow UI page, click to create a new job from my own template, fill in the details, and then the job starts running
  2. The job starts running right away, without any steps in the UI

I wonder how option 2 is done. I basically want to get rid of the hassle of going into the UI; I want to submit and start the job right from my laptop. Any insights will be appreciated!


2 Answers

Answer 1 (0 votes)

Once the template is staged, in addition to the UI you can start it using:

The REST API

The gcloud command-line tool

Answer 2 (0 votes)

It's important to make a distinction between traditional and templated Dataflow job execution:

If you use Dataflow templates (as in your case), staging and execution are separate steps. This separation gives you additional flexibility to decide who can run jobs and where the jobs are run from.

However, once your template is staged, you need to explicitly run your job from that template. To automate this process, you can make use of:

The API:

    POST https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://YOUR_BUCKET_NAME/templates/TemplateName
    {
        "jobName": "JOB_NAME",
        "parameters": {
            "inputFile" : "gs://YOUR_BUCKET_NAME/input/my_input.txt",
            "outputFile": "gs://YOUR_BUCKET_NAME/output/my_output"
        },
        "environment": {
            "tempLocation": "gs://YOUR_BUCKET_NAME/temp",
            "zone": "us-central1-f"
        }
    }
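
For example (an untested sketch that assumes you have the gcloud SDK installed and are authenticated; the project, bucket and job names are the same placeholders as above), you could call that endpoint straight from your laptop with curl:

    curl -X POST \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        -H "Content-Type: application/json" \
        -d '{
              "jobName": "JOB_NAME",
              "parameters": {
                  "inputFile": "gs://YOUR_BUCKET_NAME/input/my_input.txt",
                  "outputFile": "gs://YOUR_BUCKET_NAME/output/my_output"
              },
              "environment": {
                  "tempLocation": "gs://YOUR_BUCKET_NAME/temp",
                  "zone": "us-central1-f"
              }
            }' \
        "https://dataflow.googleapis.com/v1b3/projects/YOUR_PROJECT_ID/templates:launch?gcsPath=gs://YOUR_BUCKET_NAME/templates/TemplateName"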

The gcloud command-line tool:

    gcloud dataflow jobs run JOB_NAME \
        --gcs-location gs://YOUR_BUCKET_NAME/templates/MyTemplate \
        --parameters inputFile=gs://YOUR_BUCKET_NAME/input/my_input.txt,outputFile=gs://YOUR_BUCKET_NAME/output/my_output

Or any of the client libraries.
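
For instance, with the generated Java API client for Dataflow (the google-api-services-dataflow artifact), launching a staged template could look roughly like the sketch below. Treat it as an illustration rather than a drop-in solution: the project ID, bucket, job name and template path are placeholders, and it assumes Application Default Credentials are available on your machine.

    import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
    import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
    import com.google.api.client.json.jackson2.JacksonFactory;
    import com.google.api.services.dataflow.Dataflow;
    import com.google.api.services.dataflow.model.LaunchTemplateParameters;
    import com.google.api.services.dataflow.model.LaunchTemplateResponse;
    import com.google.api.services.dataflow.model.RuntimeEnvironment;

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class LaunchTemplate {
        public static void main(String[] args) throws Exception {
            // Uses Application Default Credentials (e.g. from `gcloud auth application-default login`).
            GoogleCredential credential = GoogleCredential.getApplicationDefault()
                    .createScoped(Collections.singleton("https://www.googleapis.com/auth/cloud-platform"));

            Dataflow dataflow = new Dataflow.Builder(
                            GoogleNetHttpTransport.newTrustedTransport(),
                            JacksonFactory.getDefaultInstance(),
                            credential)
                    .setApplicationName("template-launcher")
                    .build();

            // Runtime parameters for the templated pipeline (same fields as the REST body above).
            Map<String, String> parameters = new HashMap<>();
            parameters.put("inputFile", "gs://YOUR_BUCKET_NAME/input/my_input.txt");
            parameters.put("outputFile", "gs://YOUR_BUCKET_NAME/output/my_output");

            LaunchTemplateParameters launchParameters = new LaunchTemplateParameters()
                    .setJobName("JOB_NAME")
                    .setParameters(parameters)
                    .setEnvironment(new RuntimeEnvironment()
                            .setTempLocation("gs://YOUR_BUCKET_NAME/temp")
                            .setZone("us-central1-f"));

            // Points at the template staged via --templateLocation and starts a job from it.
            LaunchTemplateResponse response = dataflow.projects().templates()
                    .launch("YOUR_PROJECT_ID", launchParameters)
                    .setGcsPath("gs://YOUR_BUCKET_NAME/templates/MyTemplate")
                    .execute();

            System.out.println("Launched Dataflow job: " + response.getJob().getId());
        }
    }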

Alternatively, if you don't want to create a Dataflow template and you just want to deploy and run the job directly (which is probably what you're referring to in point 2), you can just remove the --templateLocation parameter. If you get any errors when doing this, make sure that your pipeline code can be executed for a non-templated job as well; for reference, take a look at this question.
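
For example, your original command from the question with the --templateLocation flag removed (untested, placeholders kept as-is) should submit the job and start it running straight from your laptop:

    mvn compile exec:java -Dexec.mainClass=com.package.MyMainClass -Dexec.args="--runner=DataflowRunner --autoscalingAlgorithm=NONE --numWorkers=25 --project=<PROJECT> --subnetwork=regions/us-east1/subnetworks/default --zone=us-east1-b --network=default --stagingLocation=gs://<TBD> --otherCustomOptions"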