I want to run a Glue job that does ETL for many CSV files from S3 into a Postgres DB. New files are written to the S3 source bucket every day. When I run a crawler over those files to generate a table with a schema, instead of just one table in the Glue Data Catalog I get many tables, which means the crawler doesn't recognize the schemas of those files as the same. Maybe that's because many of the files contain only a header and no content.
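As far as I understand, a crawler can be configured to combine compatible schemas into a single table instead of creating one table per file or prefix. A minimal sketch with boto3 (the crawler name is a placeholder), though I'm not sure whether this alone fixes my case:

```python
import json

import boto3

glue = boto3.client("glue")

# Ask the crawler to merge compatible schemas into one table
# instead of creating a separate table per file/prefix.
glue.update_crawler(
    Name="my-csv-crawler",  # placeholder crawler name
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)
```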
So when I create the Glue job with the wizard and am asked which table to use, I select just one of the tables from the Glue Data Catalog (the one created from the largest CSV file). As a result, the DB contains data from that largest file only, not from all the CSV files. I guess this happens because the crawler, while creating those tables in the Glue Data Catalog, also saves the list of files that correspond to each table. I found those lists on S3 under s3://aws-glue-temporary-000xxxxx-us-east-2/admin/partitionlisting/script_name/xxxxx/; for each Glue job there is a datasource0.input-files.json file with content like
[{"path":"s3://bucket1/customer/dt=2020-02-03/","files":["s3://bucket1/customer/dt=2020-02-03/file1.csv"]}]
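For context, the wizard-generated script reads from that one catalog table roughly like this (a simplified sketch; the database and table names are placeholders for mine):

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Bound to one catalog table, so the job only sees the files the crawler
# associated with that table (the ones listed in datasource0.input-files.json).
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",     # placeholder
    table_name="customer_csv",  # placeholder: the table built from the largest file
    transformation_ctx="datasource0",
)
```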
When I try to create the schema table in the Glue Data Catalog manually and assign it to the Glue job script, hoping that all the files in the S3 path will be processed, it doesn't read any of the files, and in the log I see:
- Skipping Partition {} as no new files detected @ s3://bucket1/customer/ or path does not exist
and when I check the corresponding datasource0.input-files.json, it lists no files at all: [{"path":"s3://bucket1/customer/","files":[]}]
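Could job bookmarks be involved? The "no new files detected" wording makes me wonder whether the bookmark considers everything already processed. If so, I assume it could be reset roughly like this (a sketch; the job name is a placeholder):

```python
import boto3

glue = boto3.client("glue")

# Clear the job bookmark so the next run re-reads files it has already seen.
glue.reset_job_bookmark(JobName="my-etl-job")  # placeholder job name
```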
What am I doing wrong? How can I make a Glue job script with a manually created schema table read all the files in the chosen S3 path? Or is it possible to use just one of the many automatically created schema tables and have it cover all the files (not only the one the schema was based on)?
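Or should I skip the catalog table entirely and read the S3 path directly with from_options? A sketch of what I have in mind (the path and CSV options are assumptions based on my layout):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read every CSV under the prefix directly, bypassing the catalog table.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://bucket1/customer/"],
        "recurse": True,  # descend into the dt=... prefixes
    },
    format="csv",
    format_options={"withHeader": True},  # assumption: files have a header row
)
```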