How to read AWS Glue Data Catalog table schemas programmatically

Question

I have a set of daily CSV files of uniform structure which I will upload to S3. There is a downstream job which loads the CSV data into a Redshift database table. The number of columns in the CSV may increase and from that point onwards the new files will come with the new columns in them. When this happens, I would like to detect the change and add the column to the target Redshift table automatically.

My plan is to run a Glue Crawler on the source CSV files. Any change in schema would generate a new version of the table in the Glue Data Catalog. I would then like to programmatically read the table structure (columns and their datatypes) of the latest version of the Table in the Glue Data Catalog using Java, .NET or other languages and compare it with the schema of the Redshift table. In case new columns are found, I will generate a DDL statement to alter the Redshift table to add the columns.

Can someone point me to any examples of reading Glue Data Catalog tables using Java, .NET or other languages? Are there any better ideas to automatically add new columns to Redshift tables?

hoaxz hoaxz · Accepted Answer · 2018-01-15T15:59:13

If you want to use Java, use the dependency:

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk-glue</artifactId>
  <version>{VERSION}</version>
</dependency>

And here's a code snippet to get your table versions and the list of columns:

AWSGlue client = AWSGlueClientBuilder.defaultClient();
GetTableVersionsRequest tableVersionsRequest = new GetTableVersionsRequest()
    .withDatabaseName("glue_catalog_database_name")
    .withCatalogId("table_name_generated_by_crawler");
GetTableVersionsResult results = client.getTableVersions(tableVersionsRequest);
// Here you have all the table versions, at this point you can check for new ones
List<TableVersion> versions = results.getTableVersions();
// Here's how to get to the table columns
List<Column> tableColumns = versions.get(0).getTable().getStorageDescriptor().getColumns();

Here you can see AWS Doc for the TableVersion and the StorageDescriptor objects.

You could also use the boto3 library for Python.

Hope this helps.

How to read AWS Glue Data Catalog table schemas programmatically

1 Answers