Currently, I have several thousand headerless, pipe-delimited, GZIP-compressed files in S3, totaling ~10TB, all with the same schema. What is the best way, in AWS Glue, to (1) add a header file, (2) convert to Parquet format partitioned by week using a "date" field in the files, and (3) add the files to the Glue Data Catalog so they can be queried in AWS Athena?
1 Answer
1) Create an Athena table pointing to your data on S3:
Create an external table in Athena
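As a rough sketch of what that table definition might look like for headerless, pipe-delimited files: the column names in the DDL effectively serve as the header, and Athena reads gzip-compressed text files based on their .gz extension. The helper name source_table_ddl and the database, table, column, and bucket names are made up for illustration; the DDL itself assumes the default delimited-text SerDe is adequate for your data.

```python
def source_table_ddl(database, table, columns, location):
    """Build Athena DDL for headerless, pipe-delimited text files.

    columns: list of (name, athena_type) pairs. Because the files have
    no header row, these names are what give the fields their names.
    """
    # Backticks guard against reserved words in DDL, such as `date`.
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and location, for illustration only.
ddl = source_table_ddl(
    "my_database",
    "raw_events",
    [("id", "string"), ("date", "string"), ("amount", "double")],
    "s3://my-bucket/raw/",
)
print(ddl)
```

The resulting statement can be pasted into the Athena query editor. Declaring "date" as a string here is a cautious choice; it can be converted to a proper date type later in the Glue job.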
2) Create a DynamicFrame from the Glue Data Catalog, using the table you created in the step above.
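from pyspark.context import SparkContext
from awsglue.context import GlueContext
# Reuse the job's SparkContext, or create one if none exists yet
glueContext = GlueContext(SparkContext.getOrCreate())
# Read the table registered in step 1; replace the {{...}} placeholders
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")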
3) Write the data back to a new S3 location in whatever format you like:
glueContext.write_dynamic_frame.from_options(
frame = DyF,
connection_type = "s3",
connection_options = {"path": "path to new s3 location"},
format = "parquet")
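The write above produces unpartitioned Parquet. For the weekly partitioning asked about in the question, one possible approach (a sketch, not tested against this dataset) is to derive a week column in Spark and pass it as a partition key. The names week_start and write_partitioned_by_week are hypothetical, and the code assumes the "date" field is either a date type or a string in yyyy-MM-dd format. The small pure-Python helper only illustrates which value each row would get: the Monday of its week, which is how Spark's date_trunc('week', ...) behaves.

```python
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Monday of the week containing d (same key date_trunc('week') yields)."""
    return d - timedelta(days=d.weekday())


def write_partitioned_by_week(glue_context, dyf, output_path, date_col="date"):
    """Add a 'week' column and write Parquet partitioned by it."""
    # Imported lazily so this file can be loaded outside a Glue job.
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import functions as F

    # Assumption: date_col parses with to_date's default yyyy-MM-dd format.
    df = dyf.toDF().withColumn(
        "week",
        F.date_trunc("week", F.to_date(F.col(date_col))).cast("date"),
    )
    partitioned = DynamicFrame.fromDF(df, glue_context, "partitioned")
    glue_context.write_dynamic_frame.from_options(
        frame=partitioned,
        connection_type="s3",
        # partitionKeys makes Glue write Hive-style folders: week=YYYY-MM-DD/
        connection_options={"path": output_path, "partitionKeys": ["week"]},
        format="parquet",
    )


# Which partition would a given day fall into?
print(week_start(date(2021, 1, 6)))
```

Inside the job, this would be called as write_partitioned_by_week(glueContext, DyF, "path to new s3 location") in place of the plain write.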
4) Create an Athena table pointing to your Parquet data on S3:
Create an external table in Athena
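A hedged sketch of the Parquet table definition follows. If the output was written with Hive-style partition folders (for example week=2021-01-04/), the table needs a PARTITIONED BY clause and the partitions must be loaded afterwards, for which MSCK REPAIR TABLE is one option. The helper names parquet_table_ddl and load_partitions_statement, and all database, table, column, and bucket names, are illustrative assumptions.

```python
def parquet_table_ddl(database, table, columns, location, partition_col=None):
    """Build Athena DDL for a Parquet table, optionally partitioned.

    columns: list of (name, athena_type) pairs. The partition column
    must not be repeated in this list.
    """
    cols = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns)
    parts = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (",
        f"  {cols}",
        ")",
    ]
    if partition_col is not None:
        # Declared as string here; the folder value is read as text.
        parts.append(f"PARTITIONED BY (`{partition_col}` string)")
    parts.append("STORED AS PARQUET")
    parts.append(f"LOCATION '{location}'")
    return "\n".join(parts)


def load_partitions_statement(table):
    """Statement that registers Hive-style partition folders with the table."""
    return f"MSCK REPAIR TABLE `{table}`"


# Hypothetical names, for illustration only.
print(parquet_table_ddl(
    "my_database", "events_parquet",
    [("id", "string"), ("date", "string"), ("amount", "double")],
    "s3://my-bucket/parquet/", partition_col="week",
))
print(load_partitions_statement("events_parquet"))
```

Run the CREATE statement first, then the repair statement with the same database selected. With many partitions the repair can be slow, so it is worth timing on a subset first.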
Note: Instead of creating the Athena table manually, you can also use a Glue crawler to create one for you. However, that will incur some charges.
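If you go the crawler route, it can be set up from the console or scripted. Below is a minimal sketch using boto3's Glue client (create_crawler and start_crawler). The helper names crawler_request and create_and_start_crawler, the crawler name, the role ARN, and the S3 path are placeholders, and the IAM role is assumed to already have access to the bucket.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start one run (needs AWS credentials)."""
    import boto3  # imported lazily; only needed when actually calling AWS

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Placeholder values, for illustration only.
request = crawler_request(
    "parquet-crawler",
    "arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    "my_database",
    "s3://my-bucket/parquet/",
)
print(request)
```

Calling create_and_start_crawler(request) would then register the table, including any Hive-style partitions it finds, in the given database.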