2 votes

I want to run a Glue job to do an ETL process for many CSV files from S3 to a Postgres DB. New files are written to the S3 source bucket every day. When I run a crawler over those files to generate a table with a schema, instead of just one table in the Glue Data Catalog I get many tables, which means the crawler doesn't recognize the schemas of those files as the same. Maybe that's because many of the files contain only a header and no content.

So when I create a Glue job with the wizard and am asked which table to use, I select just one of the tables from the Glue Data Catalog (the one created from the largest CSV file). As a result, the DB contains data from that largest file only, not from all the CSV files. I guess this happens because the crawler, while creating those tables in the Glue Data Catalog, also saves the list of files that correspond to each table. I found those lists on S3 under s3://aws-glue-temporary-000xxxxx-us-east-2/admin/partitionlisting/script_name/xxxxx/ : for each Glue job there is a datasource0.input-files.json file with content like [{"path":"s3://bucket1/customer/dt=2020-02-03/","files":["s3://bucket1/customer/dt=2020-02-03/file1.csv"]}]
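For context, the read step in a wizard-generated Glue (PySpark) script looks roughly like the sketch below; "my_database" and "my_table" are placeholders, not my real names. The transformation_ctx value "datasource0" is what the datasource0.input-files.json listing above corresponds to.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# The crawler-created table selected in the wizard; the transformation_ctx
# "datasource0" is what the partition listing file is named after.
datasource0 = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",        # placeholder
    table_name="my_table",         # placeholder
    transformation_ctx="datasource0",
)

job.commit()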

When I try to create a schema table in the Glue Data Catalog manually and assign it to the Glue job script, hoping that all the files in the S3 path will be processed, it doesn't read any of the files, and in the log I see

Skipping Partition {} as no new files detected @ s3://bucket1/customer/ / or path does not exist

and when I check the corresponding datasource0.input-files.json, it doesn't contain any files: [{"path":"s3://bucket1/customer/","files":[]}]
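For completeness, here is a sketch (illustrative only; "my_database" and "my_table" are placeholders) of how the manually created table and its registered partitions can be inspected with boto3:

import boto3

glue = boto3.client("glue", region_name="us-east-2")

# Confirm the table's location is the expected S3 prefix.
table = glue.get_table(DatabaseName="my_database", Name="my_table")
print(table["Table"]["StorageDescriptor"]["Location"])   # e.g. s3://bucket1/customer/

# A manually created table may have no partitions registered at all.
partitions = glue.get_partitions(DatabaseName="my_database", TableName="my_table")
print(len(partitions["Partitions"]))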

What am I doing wrong? How do I make a Glue job script with a manually created schema table read all the files in the chosen S3 path? Or is it possible to use just one of the many automatically created schema tables for all the files (not only the one the schema was based on)?


1 Answer

1 vote

It might be that you are running the Glue job with a job bookmark enabled. You are better off not using bookmarks, and also not specifying a transformation context when extracting your data, since you are setting everything up manually rather than through the crawler. Additionally, if your data is partitioned, you should manually add the partition definitions to the table as well.
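A minimal sketch of what that could look like, assuming placeholder names ("my_database", "my_table", "postgres_connection"): the read from the catalog carries no transformation_ctx, and the job is run with bookmarks disabled (--job-bookmark-option job-bookmark-disable), so nothing gets skipped as already processed.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# No transformation_ctx on the read, and the job itself is started with
# --job-bookmark-option job-bookmark-disable.
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# Write to Postgres through a Glue JDBC connection; the connection name,
# target table, and database are placeholders.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="postgres_connection",
    connection_options={"dbtable": "customer", "database": "mydb"},
)

job.commit()

And if the manually created table is partitioned (e.g. by dt, as in s3://bucket1/customer/dt=2020-02-03/), each partition has to exist in the catalog before the job can read its files. One way to register a partition, again only as an illustrative sketch, is boto3's create_partition, reusing the table's storage descriptor:

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# Register the dt=2020-02-03 partition by copying the table's storage
# descriptor and pointing it at the partition's S3 prefix.
glue.create_partition(
    DatabaseName="my_database",
    TableName="my_table",
    PartitionInput={
        "Values": ["2020-02-03"],
        "StorageDescriptor": {
            **table["StorageDescriptor"],
            "Location": "s3://bucket1/customer/dt=2020-02-03/",
        },
    },
)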