I have an RDS database that is sitting in a VPC. My ultimate goal is to run a nightly job that takes the data from RDS and stores it in Redshift. I am currently doing this using Glue and Glue connections. I am able to write to RDS/Redshift using connections with the following line:
datasource2 = DynamicFrame.fromDF(dfFinal, glueContext, "scans")
output = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource2, catalog_connection = "MPtest", connection_options = {"database" : "app", "dbtable" : "scans"})
Where dfFinal is my final data frame after a bunch of transformations that are not essential to this post. That code works fine, however I would like to modify it so I could read a table from RDS into a data frame.
Since the RDS database is in a VPC, I would like to use the catalog_connection parameter, but the DynamicFrameReader class has no from_jdbc_conf method and thus no obvious way to use my glue connection.
I have seen posts that say you could use a method like this:
url = "jdbc:postgresql://host/dbName"
properties = {
"user" : "user",
"password" : "password"
}
df = spark.read.jdbc(url=url, table="table", properties=properties)
But when I try that it times out because it's not a publicly accessible database. Any suggestions?