
I have already set up a Glue crawler successfully in the AWS console. Now I have a CloudFormation template that mimics the whole process, except that I cannot get the Exclusions field to work from the template. Background: per the AWS Glue API, the Exclusions field holds glob patterns that exclude matching files or folders within the data store, in my case an S3 data store.

Despite much effort, I cannot get the glob patterns to populate in the Glue crawler console. Every other value from the template populates in the crawler configuration: the S3 target, crawler name, IAM role, and grouping behavior. Only the Exclusions field, shown as "exclude patterns" in the Glue console, stays empty. My CFN template passes validation, and I have run the crawler hoping the exclude globs, albeit hidden, would somehow still have an effect, but unfortunately the field remains unpopulated. How can I populate the Exclusions field?

Here's the S3Target Exclusion AWS Glue API guide

Here's an AWS sample YAML CFN for a Glue Crawler

Here's a helpful YAML string array guide

YAML

 CFNCrawlerSecDeraNUM:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleSecDERA.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl SecDERA data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          - Exclusions:
              - "*/readme.htm"
              - "*/sub.txt"
              - "*/pre.txt"
              - "*/tag.txt"
          - Path: "s3://sec-input"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
        # Added single schema grouping Glue API option
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"

JSON

"CFNCrawlerSecDeraNUM": {
    "Type": "AWS::Glue::Crawler",
    "Properties": {
        "Name": {
            "Ref": "CFNCrawlerName"
        },
        "Role": {
            "Fn::GetAtt": [
                "CFNRoleSecDERA",
                "Arn"
            ]
        },
        "Description": "AWS Glue crawler to crawl SecDERA data",
        "DatabaseName": {
            "Ref": "CFNDatabaseName"
        },
        "Targets": {
            "S3Targets": [
                {
                    "Exclusions": [
                        "*/readme.htm",
                        "*/sub.txt",
                        "*/pre.txt",
                        "*/tag.txt"
                    ]
                },
                {
                    "Path": "s3://sec-input"
                }
            ]
        },
        "TablePrefix": {
            "Ref": "CFNTablePrefixName"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG"
        },
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
    }
}

1 Answer


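You are passing Exclusions as a separate S3Target object in the S3Targets list, so the exclusions are never attached to the target that has the Path. Path and Exclusions need to be keys of the same list item.

Try changing this: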

  Targets:
    S3Targets:
      - Exclusions:
          - "*/readme.htm"
          - "*/sub.txt"
          - "*/pre.txt"
          - "*/tag.txt"
      - Path: "s3://sec-input"

To this:

  Targets:
    S3Targets:
      - Path: "s3://sec-input"
        Exclusions:
          - "*/readme.htm"
          - "*/sub.txt"
          - "*/pre.txt"
          - "*/tag.txt"
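
Since the question also includes a JSON version of the template, here is a sketch of the same fix in JSON. It is a direct translation of the YAML above and has not been deployed separately: Path and Exclusions sit in a single object inside the S3Targets array, replacing the two separate objects in the original. Like the fragment in the question, it goes inside the crawler's "Properties" object.

```json
"Targets": {
    "S3Targets": [
        {
            "Path": "s3://sec-input",
            "Exclusions": [
                "*/readme.htm",
                "*/sub.txt",
                "*/pre.txt",
                "*/tag.txt"
            ]
        }
    ]
}
```

Each object in S3Targets describes one target, so any exclusions have to live in the same object as the path they apply to.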