0
votes

I am fairly new to AWS and I am currently exploring it. I was hoping to get an insight or suggestion on how to work with AWS Glue. Say I want to run a script that uses SELECT for an input and outputs it to Redshift. However, I want to to use a parameter for the WHERE clause particularly dates. In Python and it goes something like:

df = "SELECT * FROM table WHERE date between {0}-{1}-{3} AND {4}-{5}-{6}".format(day,month,year,day2,month2,year2)

After I get this values, I then use something like a TEMP table in Redshift for staging and LEFT JOIN in the Fact Table. But I can't seem to figure out how Glue can do this.

I've read some articles/posts that suggests to use a flat file for the variables but I'm not so sure how to do this.

Any thoughts?

1

1 Answers

1
votes

I've figured it out.

First, create a Python Shell Job via AWS Glue then use this code:

import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['fromDate', 'toDate'])


print ("VAL1 is : ", args['fromDate'])
print ("VAL2: ", args['fromDate'])

query = "SELECT * FROM table WHERE date BETWEEN '{}' AND '{}'".format(args['fromDate'], args['toDate')


print(query)

The above code is for getting the parameters that are passed in the Job.

Note also that you have to add the JOB PARAMETERS in the console as you create your job.

Then you will need a code for passing the parameters to the job. I used boto3 for this.

import boto3
import time

job_name = 'example_job'
client = boto3.client(service_name='glue', region_name='ap-southeast-1',
          endpoint_url='https://glue.ap-southeast-1.amazonaws.com') 

response = client.start_job_run(JobName=job_name, Arguments={"--fromDate":"1-1-2019",
                                                             "--toDate":"31-12-2019"
                                                            })
status = client.get_job_run(JobName=job_name, RunId=response['JobRunId'])

if status:
    state = status['JobRun']['JobRunState']
    while state not in ['SUCCEEDED']:
        time.sleep(30)
        status = client.get_job_run(JobName=job_name, RunId=response['JobRunId'])
        state = status['JobRun']['JobRunState']
        if state in ['STOPPED', 'FAILED', 'TIMEOUT']:
            raise Exception('Failed to execute glue job: ' + status['JobRun']['ErrorMessage'] + '. State is : ' + state)