orig_dyf = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": [
            's3://bucket/sample_data/'
        ],
        "recurse" : True,
        "exclusions" :  "[\"temp/**\"]"
    },
    "json",
    transformation_ctx = "orig_dyf")

I want to exclude the files in the temp folder, but this isn't working. According to https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3, "exclusions" should be a string containing a JSON list of Unix-style glob patterns. Oddly, when I use

"[\"**.csv\"]"

or any other file-suffix pattern, the exclusion works. When I try to exclude a folder, it has no effect and the files in that folder are still included.

According to https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude, the pattern

myfolder/**

should match objects in all subfolders of myfolder, such as /myfolder/mysource/mydata and /myfolder/mysource/data.


1 Answer


Give the full S3 path in the exclusions pattern, as shown below:

orig_dyf = glueContext.create_dynamic_frame.from_options(
    's3',
    {
        "paths": [
            's3://bucket/sample_data/'
        ],
        "recurse" : True,
        "exclusions" :  "[\"s3://bucket/sample_data/temp/**\"]"
    },
    "json",
    transformation_ctx = "orig_dyf")
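
If hand-escaping the JSON string gets error-prone, one option is to build it with json.dumps. The following is only a sketch: build_exclusions is a hypothetical helper (not part of the Glue API), and it assumes, as the snippet above suggests, that each pattern has to start with the same S3 prefix that is listed in "paths".

```python
import json


def build_exclusions(base_path, relative_patterns):
    """Hypothetical helper: prefix each relative glob with the base S3 path
    and return the list encoded as a JSON string."""
    base = base_path.rstrip("/")
    full_patterns = [f"{base}/{p.lstrip('/')}" for p in relative_patterns]
    return json.dumps(full_patterns)


# Same bucket and folder names as in the question.
exclusions = build_exclusions("s3://bucket/sample_data/", ["temp/**"])

# Options dict in the shape passed as the second argument to from_options.
connection_options = {
    "paths": ["s3://bucket/sample_data/"],
    "recurse": True,
    "exclusions": exclusions,
}
```

The resulting connection_options dict can then be passed in place of the literal dict in the call above. Using json.dumps avoids the manual \" escaping and makes it easier to add more patterns later.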