2
votes

I'm using AWS Glue to copy data from DynamoDB to S3. I wrote the code below to copy a DynamoDB table to S3 in the same account. It works fine: it copies my table with 600 million records without any issues, in about 20 minutes.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from datetime import datetime

# inputs
dataset_date = datetime.strftime(datetime.now(), '%Y%m%d')
table_name = "table-name"
read_percentage = "0.5"
output_location = 's3://' + dataset_date
fmt = "json"

# glue setup
sc = SparkContext()
glueContext = GlueContext(sc)

# scan the DDB table
table = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percentage,
        "dynamodb.splits": "100",
    },
)

# write to S3
glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": output_location},
    format=fmt,
    transformation_ctx="datasink",
)

Now I want to do a cross-account S3 dump using the script above. The DynamoDB tables are in Account A (the prod account), while the Glue job that reads them and the S3 bucket that receives the data are in Account B (the DW account). Is it possible to keep my script and give Glue cross-account access so that it can read the DynamoDB tables in Account A?

2 Answers

0
votes

Create an IAM role in Account A (the DynamoDB table owner account) that the Glue job can assume to read the tables.

Attach a permissions policy to that role in Account A that allows reading data from the tables. A sample you can build from:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListAndDescribe",
            "Effect": "Allow",
            "Action": [
                "dynamodb:List*",
                "dynamodb:DescribeReservedCapacity*",
                "dynamodb:DescribeLimits",
                "dynamodb:DescribeTimeToLive"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllTables",
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGet*",
                "dynamodb:DescribeStream",
                "dynamodb:DescribeTable",
                "dynamodb:Get*",
                "dynamodb:Query",
                "dynamodb:Scan"
            ],
            "Resource": [
                "arn:aws:dynamodb:*:*:table/table-1",
                "arn:aws:dynamodb:*:*:table/table-2"
            ]
        }
    ]
}

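Configure the trust policy on that role in Account A (the DynamoDB table owner account) so that the Glue job's role in Account B can assume it. Because the caller is an IAM role in another account, the principal has to be that role (or Account B) rather than the Glue service alone. The account ID and role name below are placeholders.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-B-id>:role/<glue-job-role-name>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}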

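In the IAM role configured for the Glue job in Account B (which does not own the tables), add a permissions policy statement that lets it assume the IAM role in Account A (the DynamoDB table owner account). In the statement below, replace the `dynamo-db-table-owner-role-arn` placeholder with Account A's account ID, and ideally narrow `role/*` to the specific role.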

    {
        "Sid": "DelegateDynamoDBTablesOwnerRoleArn",
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::dynamo-db-table-owner-role-arn:role/*"
    }
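
With the roles in place, the script from the question would still need to tell Glue which role to assume. A minimal sketch, assuming the `dynamodb.sts.roleArn` and `dynamodb.sts.roleSessionName` connection options described in the AWS Glue DynamoDB connection documentation (Glue 1.0 or later). The account ID, role name, session name, and helper function names are hypothetical and not part of the original script:

```python
def build_ddb_options(table_name, role_arn, read_percent="0.5", splits="100",
                      session_name="glue-ddb-cross-account"):
    """Connection options for a cross-account DynamoDB scan (all values are strings)."""
    return {
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percent,
        "dynamodb.splits": splits,
        # Role in Account A that the Glue job's role in Account B is allowed to assume.
        "dynamodb.sts.roleArn": role_arn,
        "dynamodb.sts.roleSessionName": session_name,
    }


def read_table(glue_context, options):
    """Scan the table through the assumed role; needs a real GlueContext to run."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options=options,
    )


# Hypothetical ARN: replace the account ID and role name with your own.
ROLE_ARN = "arn:aws:iam::111111111111:role/DynamoDBCrossAccountReadRole"
options = build_ddb_options("table-name", ROLE_ARN)
```

In the job itself, `table = read_table(glueContext, options)` would take the place of the original `from_options` call; the S3 write step stays as it is.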

References

  1. https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html#cross-account-calling-etl
0
votes

I don't think this is going to work with the Glue script running in the DW account.

The GlueContext connects to DynamoDB using the table name:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb

What you should explore is running the Glue script in the prod account and giving it cross-account access to your DW buckets, so that Glue running in the prod account can assume a role to write to the target S3 buckets in the DW account.
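
One way the DW side of this could look, sketched under assumptions: instead of having the job assume a role, the DW account attaches a bucket policy that lets the prod Glue job's role write directly. That is a substitute for the role-assumption route mentioned above, not the same mechanism. The bucket name, account ID, role name, and function name are hypothetical:

```python
import json


def dw_bucket_policy(bucket, writer_role_arn):
    """Bucket policy letting a role from another account write objects to `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowProdGlueWrite",
                "Effect": "Allow",
                # The Glue job's role in the prod account (hypothetical ARN below).
                "Principal": {"AWS": writer_role_arn},
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


policy = dw_bucket_policy(
    "dw-dump-bucket",
    "arn:aws:iam::222222222222:role/ProdGlueJobRole",
)
print(json.dumps(policy, indent=2))
```

The prod role also needs an identity policy allowing the same action on that bucket, since cross-account access has to be granted on both sides. Depending on how the job writes its output, it may need more actions than `s3:PutObject`.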