6
votes

I have a MySQL source from which I am creating a Glue Dynamic Frame with predicate push down condition as follows

datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db, 
    table_name = source_catalog_tbl, 
    push_down_predicate = "id > 1531812324", 
    transformation_ctx = "datasource")

I am always getting all the records in 'datasource' whatever the condition I put in 'push_down_predicate'. What am I missing?

2

2 Answers

15
votes

Pushdown predicate works for partitioning columns only. In other words, your data files should be placed in hierarchically structured folders. For example, if data is located in s3://bucket/dataset/ and partitioned by year, month and day then the structure should be following:

s3://bucket/dataset/year=2018/month=7/day=18/<data-files-here>

In such case pushdown predicate would work for columns year, month and day only:

datasource = glueContext.create_dynamic_frame_from_catalog(
    database = source_catalog_db, 
    table_name = source_catalog_tbl, 
    push_down_predicate = "year = 2017 and month > 6 and day between 3 and 10", 
    transformation_ctx = "datasource")

Besides that you have to keep in mind that pushdown predicates work with s3 data sources only.

Here is a nice blog post written by AWS Glue devs about data partitioning.

1
votes

This is great! I was able to use it to obtain the last 30 days of data using my "dt" partition column:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "my_db",
    table_name = "my_table",
    push_down_predicate = "to_date(dt) >= date_sub(current_date, 30)", 
    transformation_ctx = "datasource0"
)

I'm using Glue 1.0 - Spark 2.4 - Python 2.