There are two ways to access data from RDS in a Glue ETL (Spark) job:
1st Option:
- Create a Glue connection pointing to the RDS instance.
- Create a Glue crawler that uses the connection created in the first step.
- Run the crawler to populate the Glue Data Catalog with a database and tables pointing to the RDS tables (these setup steps can also be scripted with boto3; see the sketch after this list).
- Create a DynamicFrame in the Glue ETL job using the newly created Catalog database and table.
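The console is the usual way to do steps 1-3, but they can also be scripted with boto3. This is only a rough sketch: the connection name, crawler name, IAM role, and JDBC path are placeholders you would replace with your own values.

import boto3

glue = boto3.client("glue")

# Step 1: a JDBC connection to the RDS instance (all values are placeholders).
# For RDS inside a VPC you would normally also set PhysicalConnectionRequirements
# (subnet and security groups).
glue.create_connection(
    ConnectionInput={
        "Name": "my-rds-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://<rds_host_name>:5432/<database_name>",
            "USERNAME": "<username>",
            "PASSWORD": "<password>",
        },
    }
)

# Step 2: a crawler that reads through that connection
glue.create_crawler(
    Name="my-rds-crawler",
    Role="<glue_service_role_arn>",
    DatabaseName="<catalog_database>",
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "my-rds-connection", "Path": "<database_name>/<schema>/%"}
        ]
    },
)

# Step 3: run the crawler to populate the Catalog
glue.start_crawler(Name="my-rds-crawler")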
Code Sample (step 4, inside the Glue ETL job):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Reuse the job's SparkContext and wrap it in a GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())

# Read the RDS table through the Catalog entries the crawler created
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")
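A DynamicFrame supports Glue transforms directly, but if you want plain Spark operations you can convert it; a small usage sketch:

# Convert to a regular Spark DataFrame for SQL-style work
df = DyF.toDF()
df.printSchema()
df.show(5)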
2nd Option:
Create a DataFrame directly over JDBC using the Spark reader:
url = "jdbc:postgresql://<rds_host_name>/<database_name>"
properties = {
"user" : "<username>",
"password" : "<password>"
}
df = spark.read.jdbc(url=url, table="<schema.table>", properties=properties)
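One handy variation if you only need a subset of the table: Spark's JDBC reader accepts a parenthesized subquery in place of a table name, so the filtering runs in Postgres rather than in Spark. The query below is just an illustrative placeholder:

# Push the filter down to Postgres by reading from a subquery alias
query = "(SELECT id, name FROM <schema.table> WHERE created_at > '2020-01-01') AS t"
df_small = spark.read.jdbc(url=url, table=query, properties=properties)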
Note:
- You will need to supply the PostgreSQL JDBC driver jar on the classpath when creating the DataFrame this way (see the sketch after this list).
- I have tried the first method in a Glue ETL job and the second from the PySpark shell on a dev endpoint.
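On a dev endpoint the jar is normally attached when the endpoint is created, but if you are launching your own SparkSession (e.g., locally), one way to attach the driver is via the spark.jars config. The jar path below is a placeholder:

from pyspark.sql import SparkSession

# spark.jars must be set before the session is created; the path is a placeholder
spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/postgresql-42.2.18.jar")
    .getOrCreate()
)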