
I want to create a DynamicFrame in my Glue job from an Aurora RDS MySQL table. Can I create the DynamicFrame from my RDS table using a custom query that has a WHERE clause? I don't want to read the entire table into the DynamicFrame every time and then filter afterwards. I looked at the documentation but didn't find any such option there or elsewhere: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html

Construct JDBC connection options

connection_mysql5_options = {
    "url": "jdbc:mysql://<host>:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"
}

Read DynamicFrame from MySQL 5

df_mysql5 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql5_options)

Is there any way to supply a WHERE clause and select, say, only the first 100 rows from the test table? Say it has a column named "id" and I want to fetch rows using this query:

select * from test where id<100;

Appreciate any help. Thank you!

I found this today, and it is probably what you are looking for too: stackoverflow.com/questions/51388993/… – tubadc

2 Answers


The way I was able to provide a custom query was by creating a Spark DataFrame and specifying it with options: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options

Then convert that DataFrame into a DynamicFrame using the DynamicFrame class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html

from awsglue.dynamicframe import DynamicFrame

tmp_data_frame = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("user", username) \
    .option("password", password) \
    .option("query", "select * from test where id<100") \
    .load()

# Convert the filtered Spark DataFrame into a Glue DynamicFrame
dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, glueContext, "dynamic_frame")
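
A quick sanity check, assuming the objects above are defined, is to inspect the schema and row count of the result:

dynamic_frame.printSchema()
print(dynamic_frame.count())  # should reflect only rows with id < 100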

Apologies, I would have made a comment, but I do not have sufficient reputation. I was able to make the solution that Guillermo AMS provided work within AWS Glue, but it required two changes:

  • The short format name was unrecognized; note the original answer spells it "jbdc", a typo for "jdbc" (the error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o79.load. : java.lang.ClassNotFoundException: Failed to find data source: jbdc. Please find packages at http://spark.apache.org/third-party-projects.html"). I had to use the full name instead: "org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"
  • The "query" option was not working for me (the error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.sql.SQLSyntaxErrorException: ORA-00911: invalid character"). Fortunately, the "dbtable" option accepts either a table name or a subquery, that is, a query wrapped in parentheses, as shown below.

In my solution below I have also added a bit of context around the needed objects and imports. It ended up looking like this:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# jdbc_url, username, and password are assumed to be defined earlier in the job
tmp_data_frame = glue_context.spark_session.read\
  .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")\
  .option("url", jdbc_url)\
  .option("user", username)\
  .option("password", password)\
  .option("dbtable", "(select * from test where id<100)")\
  .load()
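
One caveat: Oracle accepts the bare parenthesized subquery above, but MySQL (the engine in the original question) requires an alias on derived tables, so there the dbtable value would need to look like "(select * from test where id<100) as t". To finish the job, the DataFrame can be wrapped back into a DynamicFrame; a minimal sketch, assuming the glue_context and tmp_data_frame objects from above (the frame name "filtered_test" is arbitrary):

from awsglue.dynamicframe import DynamicFrame

# Wrap the filtered DataFrame so downstream Glue transforms can use it
dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, glue_context, "filtered_test")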