
I have set up a crawler in Glue which crawls compressed CSV files (GZIP format) from an S3 bucket. I have an ETL job which converts the CSV into Parquet, and another crawler which reads the Parquet files and populates a Parquet table.
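For context, a minimal version of such a Glue ETL job might look like the following PySpark sketch (the database, table, and S3 path names are assumptions):

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table populated by the first crawler (names are assumptions).
source = glueContext.create_dynamic_frame.from_catalog(
    database="example_db", table_name="gzip_test"
)

# Write Parquet to S3 for the second crawler to pick up (path is an assumption).
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example/parquet/"},
    format="parquet",
)

job.commit()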

The first crawler, which reads the compressed CSV file (GZIP format), seems to be reading the GZIP file header information. The CSV file has five fields. Below is sample data:

3456789,1,200,20190118,9040

However, after the crawler populates the table, a row looks like this:

xyz.csv0000644000175200017530113404730513420142427014701 0ustar wlsadmin3456789 1 200 20190118 9040

The first column has some additional data, which contains the file name and the user name of the machine where the GZIP file was created.
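A quick way to confirm what the crawler actually sees is to decompress a copy of the object and print the first lines, for example with Python's gzip module (the local file name here is hypothetical):

import gzip

# Decompress a local copy of the object and look at the first lines;
# any extra header bytes will show up before the CSV values.
with gzip.open("xyz.csv.gz", "rt", errors="replace") as f:
    for _ in range(2):
        print(f.readline().rstrip())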

Any idea how I can avoid this situation and read the correct values?

share sample gzip file - Sandeep Fatangare

1 Answer


In my opinion, you will not be able to create this table by using an AWS Glue crawler in your case.

https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html

Of course, you can try to implement a Glue Grok custom classifier, but it can be difficult to do.

https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-grok
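For illustration, registering such a classifier via boto3 might look like the sketch below; the classifier name and the Grok pattern are placeholders, and finding a pattern that cleanly strips the embedded header is exactly the difficult part:

import boto3

glue = boto3.client("glue")

# Register a custom Grok classifier. The name and pattern are placeholders;
# %{DATA:prefix} is only a rough attempt to swallow the header bytes.
glue.create_classifier(
    GrokClassifier={
        "Name": "gzip-header-csv",
        "Classification": "csv",
        "GrokPattern": "%{DATA:prefix}%{INT:col1:int},%{INT:col2:int},"
                       "%{INT:col3:int},%{INT:col4:int},%{INT:col5:int}",
    }
)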

I suggest creating the table manually, using a CREATE command in Athena or in the Glue console. Because you need to skip the first row, you have to set the 'skip.header.line.count'='1' option in TBLPROPERTIES. The definition of this table may look like the following example:

CREATE EXTERNAL TABLE `gzip_test`(
  `col1` bigint, 
  `col2` bigint, 
  `col3` bigint, 
  `col4` bigint, 
  `col5` bigint)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://example/example'
TBLPROPERTIES (
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='gzip', 
  'delimiter'=',', 
  'skip.header.line.count'='1', 
  'typeOfData'='file')
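
If you prefer to run the DDL programmatically instead of pasting it into the Athena console, a boto3 sketch like this would work (the database name and results location are assumptions):

import boto3

# Paste the CREATE EXTERNAL TABLE statement from above into this string.
ddl = """CREATE EXTERNAL TABLE `gzip_test`( ... )"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://example/athena-results/"},  # assumed results bucket
)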