I am trying to run a PySpark job on Dataproc from a Jupyter notebook, and I need to write a function that submits the job. The job depends on a jar file, and I am trying to figure out how to pass it along. I did find some documentation on it:
https://cloud.google.com/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.SubmitJobRequest
https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/HadoopJob
But I cannot figure out exactly how to add the jar URI to the job definition. My function currently looks something like this:
from google.cloud import dataproc_v1


def submit_pyspark_job(dataproc_cluster_client, project, region, cluster_name,
                       bucket_name, filename):
    """Submit the PySpark job to the cluster (assumes `filename` was uploaded
    to `bucket_name`)."""
    job_details = {
        'placement': {
            'cluster_name': cluster_name
        },
        'pyspark_job': {
            'jar_file_uris': 'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar',  # PROBLEM HERE!
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename)
        }
    }
    # Submit the job and report the ID assigned by the service.
    result = dataproc_cluster_client.submit_job(
        project_id=project, region=region, job=job_details)
    job_id = result.reference.job_id
    print('Submitted job ID {}.'.format(job_id))
    return job_id
The problem is with the jar_file_uris entry in the job details. With it set as shown above, submitting the job raises an error.
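For reference, the PySparkJob message in the Dataproc API declares jar_file_uris as a repeated string field, so one likely cause (an assumption, since the error text is not shown here) is that a bare string is being passed where a list of strings is expected. Below is a minimal sketch of building the job dictionary with the jar URI wrapped in a list. The helper name build_pyspark_job and the 'my-cluster', 'my-bucket' and 'job.py' values are hypothetical placeholders, not part of the original function or of any library.

```python
def build_pyspark_job(cluster_name, bucket_name, filename, jar_uris):
    """Build the job dict for a PySpark job (hypothetical helper).

    `jar_uris` may be a single URI string or an iterable of URI strings.
    """
    # Accept a single string for convenience, but always emit a list,
    # because jar_file_uris is a repeated string field.
    if isinstance(jar_uris, str):
        jar_uris = [jar_uris]
    return {
        'placement': {
            'cluster_name': cluster_name
        },
        'pyspark_job': {
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename),
            'jar_file_uris': list(jar_uris)
        }
    }


# Placeholder cluster, bucket and file names, for illustration only.
job_details = build_pyspark_job(
    'my-cluster', 'my-bucket', 'job.py',
    'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar')
```

The resulting job_details could then be passed as the job argument of the submit_job call in the function above. If the error persists with a list, the cause is probably something else, and the full error message would help narrow it down.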