I have already set up a Glue crawler successfully in the AWS console.
Now I have a CloudFormation template that mimics the whole process, EXCEPT that I cannot add the Exclusions: field to the template. Background: in the AWS Glue API, the Exclusions: field holds glob patterns that exclude files or folders matching a specific pattern within the data store, in my case an S3 data store.
Despite much effort, I cannot get the glob patterns to populate in the Glue crawler console. Every other value from the template populates in the crawler configuration: the S3Target, crawler name, IAM role, and grouping behavior all come through from the CFN template. Only the Exclusions field, shown as "exclude patterns" in the Glue console, stays empty. My CFN template passes validation, and I have run the crawler hoping the exclude globs, albeit hidden, would somehow still have an effect, but unfortunately no luck. How can I get the Exclusions field to populate?
Here's the S3Target Exclusion AWS Glue API guide
Here's an AWS sample YAML CFN for a Glue Crawler
Here's a helpful YAML string array guide
YAML
  CFNCrawlerSecDeraNUM:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: !GetAtt CFNRoleSecDERA.Arn
      #Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl SecDERA data
      #Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          - Exclusions:
            - "*/readme.htm"
            - "*/sub.txt"
            - "*/pre.txt"
            - "*/tag.txt"
          - Path: "s3://sec-input"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      # Added single schema grouping Glue API option
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
JSON
"CFNCrawlerSecDeraNUM": {
  "Type": "AWS::Glue::Crawler",
  "Properties": {
    "Name": {
      "Ref": "CFNCrawlerName"
    },
    "Role": {
      "Fn::GetAtt": [
        "CFNRoleSecDERA",
        "Arn"
      ]
    },
    "Description": "AWS Glue crawler to crawl SecDERA data",
    "DatabaseName": {
      "Ref": "CFNDatabaseName"
    },
    "Targets": {
      "S3Targets": [
        {
          "Exclusions": [
            "*/readme.htm",
            "*/sub.txt",
            "*/pre.txt",
            "*/tag.txt"
          ]
        },
        {
          "Path": "s3://sec-input"
        }
      ]
    },
    "TablePrefix": {
      "Ref": "CFNTablePrefixName"
    },
    "SchemaChangePolicy": {
      "UpdateBehavior": "UPDATE_IN_DATABASE",
      "DeleteBehavior": "LOG"
    },
    "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
  }
}
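One thing that may be worth checking: in both versions above, Exclusions and Path are separate entries of the S3Targets list (each has its own leading dash in the YAML, and its own object in the JSON). Read that way, the template describes two S3 targets, one with exclusions but no path and one with a path but no exclusions. The sketch below is an untested variant of the Targets block only, assuming the exclusions are meant to apply to s3://sec-input; it puts Path and Exclusions under the same list item so they describe a single target. The property names are the ones already used above, and nothing else in the template changes.

```yaml
Targets:
  S3Targets:
    # Single target: Path and Exclusions are keys of the same list item,
    # so the exclude patterns are attached to this path.
    - Path: "s3://sec-input"
      Exclusions:
        - "*/readme.htm"
        - "*/sub.txt"
        - "*/pre.txt"
        - "*/tag.txt"
```

In the JSON form, the equivalent would be one object in the S3Targets array holding both the "Path" and "Exclusions" keys, rather than two objects.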