I have set up a crawler in AWS Glue that crawls compressed CSV files (GZIP format) from an S3 bucket. I also have an ETL job that converts the CSV to Parquet, and another crawler that reads the Parquet files and populates a Parquet table.
The first crawler, which reads the compressed CSV file, appears to be picking up the compressed file's header information as data. The CSV file has five fields. Below is a sample row:
3456789,1,200,20190118,9040
However, after the crawler populates the table, a row looks like this:
xyz.csv0000644000175200017530113404730513420142427014701 0ustar wlsadmin3456789 1 200 20190118 9040
The first column has extra data prepended to it: the file name, some octal fields, the string `ustar`, and the user name of the machine where the GZIP file was created.
Any idea how I can avoid this and read the correct values?
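The `ustar` marker and the owner name in the prefix make me suspect the object is really a tar archive that was then gzipped (a `.tar.gz`), not a plain gzipped CSV, so decompressing the GZIP layer alone leaves the 512-byte tar header in front of the data. A minimal Python sketch that reproduces the symptom (the file name `xyz.csv` and user `wlsadmin` are taken from the sample row above; this is my local repro, not anything Glue-specific):

```python
import gzip
import io
import tarfile

# Simulate what likely happened: a CSV packed with tar, then gzipped,
# producing a .tar.gz rather than a plain .csv.gz.
csv_bytes = b"3456789,1,200,20190118,9040\n"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as t:
    info = tarfile.TarInfo(name="xyz.csv")
    info.size = len(csv_bytes)
    info.uname = "wlsadmin"  # shows up as the stray user name
    t.addfile(info, io.BytesIO(csv_bytes))

# Stripping only the GZIP layer (which is all the crawler does) leaves
# the tar header block in front of the CSV data:
raw = gzip.decompress(buf.getvalue())
print(raw[257:262])              # b'ustar' <- tar magic at offset 257
print(b"wlsadmin" in raw[:512])  # True: owner name from the tar header

# Reading the object as a tar archive instead recovers the clean CSV:
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as t:
    print(t.extractfile("xyz.csv").read().decode())
```

If that is the case here, re-uploading the file as a plain `file.csv.gz` (gzip only, no tar step) should let the crawler read the correct values.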