We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines have been created manually via the console, and I am now moving to Terraform for future pipelines, as I believe IaC is the way to go.
I have been trying to work on a module (or modules) that I can reuse, as I know we will be making several more pipelines for various projects. The difficulty I am having is in finding a good level of abstraction for the module. AWS Glue has several components/resources to it, including a Glue connection, databases, crawlers, jobs, job triggers and workflows. The problem is that the number of databases, jobs, crawlers and/or triggers and their interactions (i.e. some triggers might be conditional while others might simply be scheduled) can vary from project to project, and I am having a hard time abstracting this complexity via modules.
I am having to create a lot of for_each "loops" and dynamic blocks within resources to try to make the module as generic as possible (e.g. so that I can create N jobs and/or triggers from the root module and define their interactions).
I understand that modules should actually be quite opinionated and specific, and be good at one task so to speak, which means my problem might simply be conceptual: the fact that these pipelines vary significantly from project to project makes them a poor use case for modules.
On a side note, I have not been able to find any robust examples of AWS Glue modules online, so this might be another indicator that it is indeed not the best use case.
Any thoughts here would be greatly appreciated.
EDIT: As requested, here is some of my code from my root module:
module "glue_data_catalog" {
source = "../../modules/aws-glue/data-catalog"
# Connection
create_connection = true
conn_name = "SAMPLE"
conn_description = "SAMPLE."
conn_type = "JDBC"
conn_url = "jdbc:sqlserver:"
conn_sg_ids = ["sampleid"]
conn_subnet_id = "sampleid"
conn_az = "eu-west-1a"
conn_user = var.conn_user
conn_pass = var.conn_pass
# Databases
db_names = [
"raw",
"cleaned",
"consumption"
]
# Crawlers
crawler_settings = {
Crawler_raw = {
database_name = "raw"
s3_path = "bucket-path"
jdbc_paths = []
},
Crawler_cleaned = {
database_name = "cleaned"
s3_path = "bucket-path"
jdbc_paths = []
}
}
crawl_role = "SampleRole"
}
Glue data catalog module:
#############################
# Glue Connection
#############################
resource "aws_glue_connection" "this" {
  count = var.create_connection ? 1 : 0

  name            = var.conn_name
  description     = var.conn_description
  connection_type = var.conn_type

  connection_properties = {
    JDBC_CONNECTION_URL = var.conn_url
    USERNAME            = var.conn_user
    PASSWORD            = var.conn_pass
  }

  catalog_id     = var.conn_catalog_id
  match_criteria = var.conn_criteria

  physical_connection_requirements {
    security_group_id_list = var.conn_sg_ids
    subnet_id              = var.conn_subnet_id
    availability_zone      = var.conn_az
  }
}
#############################
# Glue Catalog Databases
#############################
resource "aws_glue_catalog_database" "this" {
  # one database per name; db_names is a set so it can drive for_each directly
  for_each = var.db_names

  name         = each.key
  description  = var.db_description
  catalog_id   = var.db_catalog_id
  location_uri = var.db_location_uri
  parameters   = var.db_params
}
#############################
# Glue Crawlers
#############################
resource "aws_glue_crawler" "this" {
  for_each = var.crawler_settings

  name          = each.key
  database_name = each.value.database_name
  description   = var.crawl_description
  role          = var.crawl_role
  configuration = var.crawl_configuration

  s3_target {
    connection_name = var.crawl_s3_connection
    path            = each.value.s3_path
    exclusions      = var.crawl_s3_exclusions
  }

  # one jdbc_target block per JDBC path; none are rendered when jdbc_paths is empty
  dynamic "jdbc_target" {
    for_each = each.value.jdbc_paths
    content {
      connection_name = var.crawl_jdbc_connection
      path            = jdbc_target.value
      exclusions      = var.crawl_jdbc_exclusions
    }
  }

  recrawl_policy {
    recrawl_behavior = var.crawl_recrawl_behavior
  }

  schedule     = var.crawl_schedule
  table_prefix = var.crawl_table_prefix
  tags         = var.crawl_tags
}
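For completeness, the variable declarations backing this module are roughly the following shape (trimmed down; most of them are straight pass-throughs):

variable "create_connection" {
  type    = bool
  default = false
}

variable "db_names" {
  # set of catalog database names, so it can drive for_each directly
  type = set(string)
}

variable "crawler_settings" {
  # one crawler per key; the value configures its database and targets
  type = map(object({
    database_name = string
    s3_path       = string
    jdbc_paths    = list(string)
  }))
}

plus a long tail of conn_* and crawl_* variables exposing every other attribute.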
It seems to me that I'm not actually providing any abstraction in this way but simply overcomplicating things.