9
votes

I'm confused as to how I should use terraform to connect Athena to my Glue Catalog database.

I use

resource "aws_glue_catalog_database" "catalog_database" {
    name = "${var.glue_db_name}"
}

resource "aws_glue_crawler" "datalake_crawler" {
    database_name = "${var.glue_db_name}"
    name          = "${var.crawler_name}"
    role          = "${aws_iam_role.crawler_iam_role.name}"
    description   = "${var.crawler_description}"
    table_prefix  = "${var.table_prefix}"
    schedule      = "${var.schedule}" 

    s3_target {
      path = "s3://${var.data_bucket_name[0]}"
  }
    s3_target {
      path = "s3://${var.data_bucket_name[1]}"
  }
 }

to create a Glue DB and the crawler to crawl an s3 bucket (here only two), but I don't know how I link the Athena query service to the Glue DB. In the terraform documentation for Athena, there doesn't appear to be a way to connect Athena to a Glue catalog but only to an S3 Bucket. Clearly, however, Athena can be integrated with Glue.

How can I terraform an Athena database to use my Glue catalog as its data source rather than an S3 bucket?

2
Did you run the crawler? Did it create AWS Glue tables? If you do not define aws_glue_catalog_table resources with terraform that point to their respective S3 locations, the crawler will need to run at least once to create the tables. Once they are created your Glue DB and the tables should become visible in Athena, even without defining a terraform aws_athena_database resource. - Martin
I've run the crawler. There are no tables created in the Glue db. To be honest, I don't even know if I've given the crawler the correct role/policies: I've created a gist that includes the code I'm using: gist.github.com/stevenranney/3f5545e1e736266807b1f337e0be58e0 - Steven
I suggest to also add tag "terraform" (or "terraform-aws") to the question. - Martin

2 Answers

5
votes

Our current basic setup for having Glue crawl one S3 bucket and create/update a table in a Glue DB, which can then be queried in Athena, looks like this:

Crawler role and role policy:

  • The assume_role_policy of the IAM role needs only Glue as principal
  • The IAM role policy allows actions for Glue, S3, and logs
  • The Glue actions and resources can probably be narrowed down to the ones really needed
  • The S3 actions are limited to those needed by the crawler
resource "aws_iam_role" "glue_crawler_role" {
  name = "analytics_glue_crawler_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "glue_crawler_role_policy" {
  name = "analytics_glue_crawler_role_policy"
  role = "${aws_iam_role.glue_crawler_role.id}"
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*",
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketAcl",
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::analytics-product-data",
        "arn:aws:s3:::analytics-product-data/*",
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:/aws-glue/*"
      ]
    }
  ]
}
EOF
}

S3 Bucket, Glue Database and Crawler:

resource "aws_s3_bucket" "product_bucket" {
  bucket = "analytics-product-data"
  acl = "private"
}

resource "aws_glue_catalog_database" "analytics_db" {
  name = "inventory-analytics-db"
}

resource "aws_glue_crawler" "product_crawler" {
  database_name = "${aws_glue_catalog_database.analytics_db.name}"
  name = "analytics-product-crawler"
  role = "${aws_iam_role.glue_crawler_role.arn}"

  schedule = "cron(0 0 * * ? *)"

  configuration = "{\"Version\": 1.0, \"CrawlerOutput\": { \"Partitions\": { \"AddOrUpdateBehavior\": \"InheritFromTable\" }, \"Tables\": {\"AddOrUpdateBehavior\": \"MergeNewColumns\" } } }"

  schema_change_policy {
    delete_behavior = "DELETE_FROM_DATABASE"
  }

  s3_target {
    path = "s3://${aws_s3_bucket.product_bucket.bucket}/products"
  }
}
0
votes

I had many things wrong in my Terraform code. To start with:

  1. The S3 bucket argument in the aws_athena_database code refers to the bucket for query output not the data the table should be built from.
  2. I had set up my aws_glue_crawler to write to a Glue database rather than an Athena db. Indeed, as Martin suggested above, once correctly set up, Athena was able to see the tables in the Glue db.
  3. I did not have the correct policies attached to my crawler. Initially, the only policy attached to the crawler role was

    resource "aws_iam_role_policy_attachment" "crawler_attach" {
        policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
        role = "${aws_iam_role.crawler_iam_role.name}"
    } 
    

    after setting a second policy that explicitly allowed all S3 access to all of the buckets I wanted to crawl and attaching that policy to the same crawler role, the crawler ran and updated tables successfully.

The second policy:

resource "aws_iam_policy" "crawler_bucket_policy" {
    name = "crawler_bucket_policy"
    path = "/"
    description = "Gives crawler access to buckets"
    policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1553807998309",
      "Action": "*",
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "Stmt1553808056033",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket0"
    },
    {
      "Sid": "Stmt1553808078743",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket1"
    },
    {
      "Sid": "Stmt1553808099644",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket2"
    },
    {
      "Sid": "Stmt1553808114975",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket3"
    },
    {
      "Sid": "Stmt1553808128211",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::bucket4"
    }
  ]
}
EOF
}

I'm confident that I can get away from hardcoding the bucket names in this policy but I don't yet know how to do that.