How to convert headerless, compressed, pipe-delimited files stored in S3 into parquet using AWS Glue

Question

Currently, I have several thousand headerless, pipe-delimited, GZIP-compressed files in S3, totaling ~10 TB, all with the same schema. What is the best way, in AWS Glue, to (1) add a header, (2) convert the files to Parquet partitioned by week using a "date" field in the files, and (3) add the results to the Glue Data Catalog so they can be queried in AWS Athena?

Harsh Bafna · Accepted Answer · 2019-05-18T05:07:27

1) Create an Athena table pointing to your data on S3:

Create an external table in Athena
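
As a rough sketch of what that table definition could look like for headerless, pipe-delimited text files: the column names declared in the DDL play the role of the missing header, and Athena picks up GZIP compression from the `.gz` file extension. The helper `build_raw_table_ddl`, the database and table names, the columns, and the bucket path below are all hypothetical placeholders, not taken from the question.

```python
def build_raw_table_ddl(database, table, columns, s3_location):
    """Build Athena DDL for headerless, pipe-delimited text files.

    `columns` is a list of (name, type) pairs in file order; these names
    stand in for the header row the files do not have.
    """
    column_lines = ",\n".join(f"  `{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT DELIMITED\n"
        "FIELDS TERMINATED BY '|'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical schema and location, for illustration only
ddl = build_raw_table_ddl(
    "my_database",
    "raw_events",
    [("id", "string"), ("date", "string"), ("amount", "double")],
    "s3://my-bucket/raw/",
)
print(ddl)
```

The printed statement can be pasted into the Athena query editor; adjust the column list to match the real file layout.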

2) Create a dynamic frame from the Glue Data Catalog, using the table you created in the step above.

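from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Read the raw table registered in step 1 from the Glue Data Catalog
glueContext = GlueContext(SparkContext.getOrCreate())
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")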

3) Write the data back to a new S3 location in whatever format you like (Parquet here):

glueContext.write_dynamic_frame.from_options(
   frame = DyF,
   connection_type = "s3",
   connection_options = {"path": "path to new s3 location"},
   format = "parquet")
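
The write above does not partition the output. For the weekly partitioning asked about in the question, one possible approach is to derive year and week fields from each record's "date" field before writing, then list those fields under `partitionKeys` in `connection_options`. The sketch below assumes the "date" values look like `YYYY-MM-DD` (the question does not say), and the function name `add_week_fields` and the output field names `year` and `week` are my own choices.

```python
from datetime import datetime


def add_week_fields(record):
    """Add ISO year and ISO week fields derived from the record's "date" field.

    Assumes "date" is a string formatted as YYYY-MM-DD.
    """
    parsed = datetime.strptime(record["date"], "%Y-%m-%d").date()
    iso_year, iso_week, _ = parsed.isocalendar()
    record["year"] = iso_year
    record["week"] = iso_week
    return record


# Local check with a plain dict standing in for a Glue record
print(add_week_fields({"date": "2019-05-18"}))

# Inside a Glue job this could be applied to the dynamic frame, e.g.:
#   from awsglue.transforms import Map
#   with_weeks = Map.apply(frame=DyF, f=add_week_fields)
# and the write step would then use with_weeks as the frame and
#   connection_options = {"path": "...", "partitionKeys": ["year", "week"]}
```

The ISO year is used rather than the calendar year so that a week spanning New Year lands in a single partition (for example, 2018-12-31 belongs to week 1 of 2019).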

4) Create an Athena table pointing to your Parquet data on S3:

Create an external table in Athena

Note: Instead of creating the Athena table manually, you can also use a Glue crawler to create it for you. However, running a crawler incurs additional charges.
