I'm new to AWS Glue; the solution I've found so far is to read over JDBC with glueContext.read:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
qry_for_df = """
(SELECT *
 FROM table1
 WHERE field1 = 'something') as subq
"""
df = (glueContext.read.format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", "jdbc:redshift://…")
      .option("dbtable", qry_for_df)
      .option("user", "username")
      .option("password", "XXXXXXXX")
      .load())
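If you need the result as a Glue DynamicFrame for the built-in transforms, a minimal follow-up sketch (assuming the df from the read above and an arbitrary frame name) is:
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame returned by the JDBC read into a Glue DynamicFrame
# so it can be used with Glue transforms and sinks.
dyf = DynamicFrame.fromDF(df, glueContext, "redshift_subquery_dyf")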
Some considerations:
I haven't found a way to use the predefined AWS Glue JDBC connection for this, so I'm adding the driver JAR path s3://your-bucket/path/RedshiftJDBC42XXX.jar to the job (the Dependent JARs path, i.e. the --extra-jars job parameter).
I haven't included a way to store the password and URL securely, but here is a link to Amazon's best practices (as a suggestion):
https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/
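For example, one option (just a sketch, not taken from the linked post) is to keep the credentials in AWS Secrets Manager and fetch them at runtime with boto3; the secret name "redshift/glue-job-creds" and the JSON keys "jdbc_url", "username" and "password" below are placeholders you would define yourself:
import json
import boto3

# Sketch: read the Redshift URL and credentials from a Secrets Manager secret
# instead of hard-coding them in the script.
secrets_client = boto3.client("secretsmanager")
secret = json.loads(
    secrets_client.get_secret_value(SecretId="redshift/glue-job-creds")["SecretString"]
)
redshift_url = secret["jdbc_url"]        # e.g. jdbc:redshift://host:5439/dbname
redshift_user = secret["username"]
redshift_password = secret["password"]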
You can take the connection parameters from your catalog metadata, but the database name is not included, and the connection name has to be hard-coded or passed as a job parameter:
cnx_parameters = glueContext.extract_jdbc_conf("Your_Cnx_Name", catalog_id=None)
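As a rough sketch of how that could plug into the same read (the dict keys "url", "user" and "password", and the extra db_name job parameter, are assumptions to verify for your Glue version):
# Sketch: reuse the catalog connection's JDBC settings; the database name still
# has to come from somewhere else, here an assumed 'db_name' job parameter.
args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME', 'db_name'])

df = (glueContext.read.format("jdbc")
      .option("driver", "com.amazon.redshift.jdbc.Driver")
      .option("url", cnx_parameters["url"] + "/" + args["db_name"])
      .option("dbtable", qry_for_df)
      .option("user", cnx_parameters["user"])
      .option("password", cnx_parameters["password"])
      .load())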