
We are working with AWS Glue as a pipeline tool for ETL at my company. So far, the pipelines have been created manually via the console, and I am now moving to Terraform for future pipelines, as I believe IaC is the way to go.

I have been trying to work on a module (or modules) that I can reuse, as I know we will be making several more pipelines for various projects. The difficulty I am having is in creating a good level of abstraction with the module. AWS Glue involves several components/resources, including a Glue connection, databases, crawlers, jobs, job triggers and workflows. The problem is that the number of databases, jobs, crawlers and/or triggers and their interactions (e.g. some triggers might be conditional while others might simply be scheduled) can vary depending on the project, and I am having a hard time abstracting this complexity via modules.

I am having to create a lot of for_each "loops" and dynamic blocks within resources to try to make the module as generic as possible (e.g. so that I can create N jobs and/or triggers from the root module and define their interactions).
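
For illustration, the trigger handling ends up looking roughly like this (a simplified sketch with placeholder variable names, not my actual module):

variable "triggers" {
  type = map(object({
    type         = string       # "SCHEDULED" or "CONDITIONAL"
    schedule     = string       # cron expression, or null for conditional triggers
    job_names    = list(string) # jobs started by this trigger
    watched_jobs = list(string) # jobs whose success fires a conditional trigger
  }))
  default = {}
}

resource "aws_glue_trigger" "this" {
  for_each = var.triggers

  name     = each.key
  type     = each.value.type
  schedule = each.value.schedule

  # one action block per job to start
  dynamic "actions" {
    for_each = each.value.job_names
    content {
      job_name = actions.value
    }
  }

  # only conditional triggers get a predicate block
  dynamic "predicate" {
    for_each = each.value.type == "CONDITIONAL" ? [1] : []
    content {
      dynamic "conditions" {
        for_each = each.value.watched_jobs
        content {
          job_name = conditions.value
          state    = "SUCCEEDED"
        }
      }
    }
  }
}

Multiply that by jobs, crawlers and databases and the list of input variables grows quickly, which is what I mean by the module becoming hard to abstract.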

I understand that modules should actually be quite opinionated and specific, and be good at one task so to speak, which means my problem might simply be conceptual: the fact that these pipelines vary significantly from project to project makes them a poor use case for modules.

On a side note, I have not been able to find any robust examples of AWS Glue modules online, which might be another indicator that this is indeed not the best use case for them.

Any thoughts here would be greatly appreciated.

EDIT: As requested, here is some of my code from my root module:

module "glue_data_catalog" {
  source = "../../modules/aws-glue/data-catalog"

  # Connection
  create_connection = true

  conn_name        = "SAMPLE"
  conn_description = "SAMPLE."
  conn_type        = "JDBC"
  conn_url         = "jdbc:sqlserver:"
  conn_sg_ids      = ["sampleid"]
  conn_subnet_id   = "sampleid"
  conn_az          = "eu-west-1a"

  conn_user = var.conn_user
  conn_pass = var.conn_pass

  # Databases
  db_names = [
    "raw",
    "cleaned",
    "consumption"
  ]

  # Crawlers
  crawler_settings = {
    Crawler_raw = {
      database_name = "raw"
      s3_path       = "bucket-path"
      jdbc_paths    = []
    },
    Crawler_cleaned = {
      database_name = "cleaned"
      s3_path       = "bucket-path"
      jdbc_paths    = []
    }
  }

  crawl_role = "SampleRole"
}

Glue data catalog module:

#############################
# Glue Connection
#############################
resource "aws_glue_connection" "this" {
  count = var.create_connection ? 1 : 0

  name            = var.conn_name
  description     = var.conn_description
  connection_type = var.conn_type

  connection_properties = {
    JDBC_CONNECTION_URL = var.conn_url
    USERNAME            = var.conn_user
    PASSWORD            = var.conn_pass
  }

  catalog_id     = var.conn_catalog_id
  match_criteria = var.conn_criteria

  physical_connection_requirements {
    security_group_id_list = var.conn_sg_ids
    subnet_id              = var.conn_subnet_id
    availability_zone      = var.conn_az
  }
}

#############################
# Glue Database Catalog
#############################
resource "aws_glue_catalog_database" "this" {
  # toset() so the plain list passed from the root module satisfies for_each, which needs a set or map
  for_each = toset(var.db_names)

  name         = each.key
  description  = var.db_description
  catalog_id   = var.db_catalog_id
  location_uri = var.db_location_uri
  parameters   = var.db_params
}

#############################
# Glue Crawlers
#############################
resource "aws_glue_crawler" "this" {
  for_each = var.crawler_settings

  name          = each.key
  database_name = each.value.database_name

  description   = var.crawl_description
  role          = var.crawl_role
  configuration = var.crawl_configuration

  s3_target {
    connection_name = var.crawl_s3_connection
    path            = each.value.s3_path
    exclusions      = var.crawl_s3_exclusions
  }

  dynamic "jdbc_target" {
    for_each = each.value.jdbc_paths
    content {
      connection_name = var.crawl_jdbc_connection
      path            = jdbc_target.value
      exclusions      = var.crawl_jdbc_exclusions
    }
  }

  recrawl_policy {
    recrawl_behavior = var.crawl_recrawl_behavior
  }

  schedule     = var.crawl_schedule
  table_prefix = var.crawl_table_prefix
  tags         = var.crawl_tags
}

It seems to me that I'm not actually providing any abstraction in this way but simply overcomplicating things.

Comments:

  • If the pipelines are so different between projects, why not just have them managed by separate TF codebases? If you try to cram everything into one codebase, it will get complex very quickly and it will be difficult to debug and modify. – Marcin
  • Every project would have its own TF code, that goes without saying. But to avoid repetition I thought I might create a module that I can use every time for the various projects. – LazyEval
  • Can you edit your question to share an example of what you've written so far and then explain what issues you're having with it, please? – ydaetskcoR

1 Answer


I think I found a good solution to the problem, though it happened "by accident". We decided to divide the pipelines into two distinct projects:

  • ETL on source data
  • BI jobs to compute various KPIs

I then noticed that I could group resources together for both projects and standardize the way they interact (e.g. one connection, n tables, n crawlers, n ETL jobs, one trigger). I was then able to create a module for the ETL process and a module for the BI/KPI process, which provided enough abstraction to actually be useful.
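
As a rough sketch of what the ETL module's interface ended up looking like (names here are illustrative rather than copied from the real module), each project calls it the same way: one connection, n databases, n crawlers, n jobs, one trigger:

module "etl_pipeline" {
  source = "../../modules/aws-glue/etl-pipeline"

  # one connection per pipeline
  conn_name = "project-x-jdbc"
  conn_url  = "jdbc:sqlserver:"

  # n databases
  db_names = ["raw", "cleaned", "consumption"]

  # n crawlers, keyed by name
  crawlers = {
    crawler_raw     = { database_name = "raw", s3_path = "bucket-path" }
    crawler_cleaned = { database_name = "cleaned", s3_path = "bucket-path" }
  }

  # n ETL jobs, keyed by name
  jobs = {
    raw_to_cleaned      = { script_location = "s3://bucket-path/raw_to_cleaned.py" }
    cleaned_to_consumed = { script_location = "s3://bucket-path/cleaned_to_consumed.py" }
  }

  # one trigger that kicks the pipeline off
  trigger_schedule = "cron(0 6 * * ? *)"
}

The BI/KPI module follows the same shape with its own job definitions, so the opinionated part is the fixed structure (one connection, one trigger) rather than the number of jobs or crawlers.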