1
votes

I have an RDS database that is sitting in a VPC. My ultimate goal is to run a nightly job that takes the data from RDS and stores it in Redshift. I am currently doing this using Glue and Glue connections. I am able to write to RDS/Redshift using connections with the following line:

datasource2 = DynamicFrame.fromDF(dfFinal, glueContext, "scans")

output = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource2, catalog_connection = "MPtest", connection_options = {"database" : "app", "dbtable" : "scans"})

Where dfFinal is my final data frame after a bunch of transformations that are not essential to this post. That code works fine, however I would like to modify it so I could read a table from RDS into a data frame.

Since the RDS database is in a VPC, I would like to use the catalog_connection parameter, but the DynamicFrameReader class has no from_jdbc_conf method and thus no obvious way to use my glue connection.

I have seen posts that say you could use a method like this:

url = "jdbc:postgresql://host/dbName"
properties = {
"user" : "user",
"password" : "password"
}
df = spark.read.jdbc(url=url, table="table", properties=properties)

But when I try that it times out because it's not a publicly accessible database. Any suggestions?

1

1 Answers

0
votes

You are on the right track with using Glue connections.

  1. Define Glue connection of Type JDBC for your Postgres instance

    Type    JDBC
    
    JDBC URL    jdbc:postgresql://<RDS ip>:<RDS port>/<database_name>
    
    VPC Id  <VPC of RDS instance>
    
    Subnet  <subnet of RDS instance>
    
    Security groups <Security Group allowed to connect to RDS>
    
    
  2. Edit Glue Job, and select the Glue Connection so it appears under "Required Connections"

  3. Create connections options dictionary as


options = {'url': connection.jdbc_url,

           'user': connection.username,

           'password': connection.password,

           'dbtable': table
}

  1. Use the options dictionary in the below to create a Dynamic frame to read from the table

table_ddf = glueContext.create_dynamic_frame.from_options(

        connection_type='postgresql',

        connection_options=options,

        transformation_ctx=transformation_ctx
)