I've exported a Cloud Dataflow template from Dataprep as outlined here:
https://cloud.google.com/dataprep/docs/html/Export-Basics_57344556
In Dataprep, the flow pulls in text files via wildcard from Google Cloud Storage, transforms the data, and appends it to an existing BigQuery table. All works as intended.
However, when trying to start a Dataflow job from the exported template, I can't seem to get the startup parameters right. The error messages aren't overly specific but it's clear that for one thing, I'm not getting the locations (input and output) right.
The only Google-provided template for this use case (found at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloud-storage-text-to-bigquery) doesn't apply as it uses a UDF and also runs in Batch mode, overwriting any existing BigQuery table rather than append.
Inspecting the original Dataflow job details from Dataprep shows a number of parameters (found in the metadata file) but I haven't been able to get those to work within my code. Here's an example of one such failed configuration:
import time
from google.cloud import storage
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
def dummy(event, context):
pass
def process_data(event, context):
credentials = GoogleCredentials.get_application_default()
service = build('dataflow', 'v1b3', credentials=credentials)
data = event
gsclient = storage.Client()
file_name = data['name']
time_stamp = time.time()
GCSPATH="gs://[path to template]
BODY = {
"jobName": "GCS2BigQuery_{tstamp}".format(tstamp=time_stamp),
"parameters": {
"inputLocations" : '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name),
"outputLocations": '{{\"location1\":\"[project]:[dataset].[table]\", [... other locations]"}}',
"customGcsTempLocation": "gs://[my bucket]/dataflow"
},
"environment": {
"zone": "us-east1-b"
}
}
print(BODY["parameters"])
request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()
print(response)
The above example indicates invalid field ("location1", which I pulled from a completed Dataflow job. I know I need to specify the GCS location, the template location, and the BigQuery table but haven't found the correct syntax anywhere. As mentioned above, I found the field names and sample values in the job's generated metadata file.
I realize that this specific use case may not ring any bells but in general if anyone has had success determining and using the correct startup parameters for a Dataflow job exported from Dataprep, I'd be most grateful to learn more about that. Thx.
"inputLocations" : '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name),? Dataflow args are just like command line args (--input=example.csv). Nothing complicated. Show your code that processes the command line so that we are not guessing. - John Hanley