
I am working on AWS Glue scheduled transformation scripts that read from Redshift tables (or views) and write to Redshift tables.

Some of my tables are huge and I do not want to load the full table into S3 files; for some of them I would like an incremental solution. So I am looking for a way to pre-filter the data in AWS Glue before it is unloaded from Redshift into S3.

I know that if you work directly on an S3 bucket you can use predicate pushdown to avoid loading the data you don't care about, but it seems this is not supported for Redshift tables.
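For comparison, this is roughly what predicate pushdown looks like on an S3-backed catalog table. The database and table names are hypothetical, and the helper that assembles the predicate string is my own addition, not a Glue API:

```python
def build_push_down_predicate(filters):
    """Build a Glue push_down_predicate string from
    partition-column -> value pairs (hypothetical helper)."""
    return " and ".join(
        f"{col} == '{val}'" for col, val in sorted(filters.items())
    )

predicate = build_push_down_predicate({"year": "2019", "month": "06"})
print(predicate)  # month == '06' and year == '2019'

# Against an S3-backed catalog table, Glue can prune partitions before
# reading any data (hypothetical database/table names):
# dyf = glueContext.create_dynamic_frame.from_catalog(
#     database="my_database",
#     table_name="my_s3_table",
#     push_down_predicate=predicate,
# )
```

Note that `push_down_predicate` only prunes S3 partitions, which is why it does not help with JDBC sources such as Redshift.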

Do you know of a similar solution for Redshift? Or a workaround?

Thank you


1 Answer


I'm new to AWS Glue, but the solution I've found for now is to push the filter down to Redshift as a subquery through glueContext.read:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [TempDir, JOB_NAME]
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

qry_for_df = """
    (SELECT *
       FROM table1
      WHERE field1 = 'something' ) as subq
"""

df = (
    glueContext.read.format("jdbc")
    .option("driver", "com.amazon.redshift.jdbc.Driver")
    .option("url", "jdbc:redshift://…")
    .option("dbtable", qry_for_df)
    .option("user", "username")
    .option("password", "XXXXXXXX")
    .load()
)

Some considerations: I've not found a way to reuse the predefined AWS Glue JDBC connection for this, so I add the driver JAR path (s3://your-bucket/path/RedshiftJDBC42XXX.jar) to the job's dependent JARs.

I've not included a way to store the password and URL securely, but here is a link to Amazon best practices as a suggestion: https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/

You can take the connection parameters from your catalog metadata, but the database name is missing there, so the connection name has to be hard-coded or passed as a job parameter: cnx_parameters = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)
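Building on the subquery trick above, an incremental load can be expressed the same way by filtering on a watermark column, so Redshift only returns new rows. The helper function and the column name below are hypothetical:

```python
def build_incremental_subquery(table, ts_column, last_value):
    """Wrap an incremental filter in a derived table that Redshift
    executes, so only rows newer than the watermark are unloaded
    through JDBC (hypothetical helper)."""
    return (
        f"(SELECT * FROM {table} "
        f"WHERE {ts_column} > '{last_value}') as subq"
    )

qry_for_df = build_incremental_subquery("table1", "updated_at",
                                        "2019-01-01 00:00:00")
print(qry_for_df)
# (SELECT * FROM table1 WHERE updated_at > '2019-01-01 00:00:00') as subq

# The string is passed as the "dbtable" option exactly like the
# static query above; the watermark value would come from job
# parameters or a bookmark table you maintain yourself.
```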