
I have a self-authored Glue script and a JDBC Connection stored in the Glue catalog. I cannot figure out how to use PySpark to run a SELECT statement against the MySQL database in RDS that my JDBC Connection points to. I have also used a Glue Crawler to infer the schema of the RDS table that I want to query. How do I query the RDS database using a WHERE clause?

I have looked through the documentation for DynamicFrameReader and the GlueContext class, but neither seems to point me in the direction I need.


1 Answer


It depends on what you want to do. For example, if you want the equivalent of SELECT * FROM table WHERE <conditions>, there are two options.

Both assume that you created a crawler and added the source to your AWS Glue job like this:

# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
  • AWS Glue
# Select the needed fields (SelectFields and Filter come from awsglue.transforms)
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
# Keep only the rows whose org_id is in org_ids (assumed here to be a Python list or set of IDs)
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx = "filter2")
  • PySpark + AWS Glue
# Change DynamicFrame to Spark DataFrame
dataframe = datasource0.toDF()
# Create a view
dataframe.createOrReplaceTempView("students")
# Use SparkSQL to select the fields; the FROM clause must name the view created above
# spark is the job's SparkSession (glueContext.spark_session)
# org_ids is assumed here to be a comma-separated string of IDs, e.g. "1,2,3"
dataframe_sql_df_dim = spark.sql("SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id FROM students WHERE org_id IN (" + org_ids + ")")
# Change back to DynamicFrame (DynamicFrame comes from awsglue.dynamicframe)
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")
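
Both options above apply the WHERE condition after the table has been read into Spark. If you would rather have MySQL evaluate the condition, one possible approach is to hand Spark's JDBC reader a subquery instead of a table name. The following is only a sketch under assumptions: the helper names, the host, the credentials, and the idea that org_id is numeric are all hypothetical and not taken from the question. It builds the query text and the reader options in plain Python; the actual read is left as a comment because it needs a running Glue job.

```python
def build_pushdown_query(table, columns, org_ids):
    """Build SELECT ... WHERE org_id IN (...) for numeric ids (hypothetical helper)."""
    if not org_ids:
        raise ValueError("org_ids must not be empty")
    # int() raises on non-numeric text, so no raw strings are pasted into the SQL
    id_list = ", ".join(str(int(i)) for i in org_ids)
    return "SELECT {} FROM {} WHERE org_id IN ({})".format(
        ", ".join(columns), table, id_list
    )


def jdbc_options(url, user, password, query):
    """Wrap the query as an aliased subquery for the JDBC 'dbtable' option."""
    return {
        "url": url,
        "user": user,
        "password": password,
        "dbtable": "({}) AS filtered".format(query),
    }


# Placeholder values; in a real job these would come from your JDBC Connection
query = build_pushdown_query("students", ["user_id", "full_name", "org_id"], [1, 2, 3])
options = jdbc_options("jdbc:mysql://example-host:3306/db", "user", "password", query)

# Inside the Glue job (not executed in this sketch):
# dataframe = spark.read.format("jdbc").options(**options).load()
# selectfields = DynamicFrame.fromDF(dataframe, glueContext, "selectfields3")
```

With this approach only the matching rows travel from RDS to the job, which may matter for large tables. The crawler's catalog table is not used by this read, so the column names have to be spelled out in the query.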