11
votes

Part One:

I tried running a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I view the table in Athena and query it, it shows "Zero records returned".

But the ELB demo data in Athena works fine.

Part Two (Scenario):

Suppose I have an Excel file and a data dictionary describing how and in what format the data is stored in that file, and I want that data dumped into AWS Redshift. What would be the best way to achieve this?

8
Update the question with some sample data from the CSV file which was read by the crawler, and also mention the schema structure. Don't just plainly mention "Zero records returned". – Madhukar Mohanraju

8 Answers

22
votes

I experienced the same issue. You need to give the crawler the folder path instead of the actual file name, and then run it. I fed the folder name to the crawler and it worked. Hope this helps. Let me know. Thanks,
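
If you are setting the crawler up from code, here is a minimal boto3 sketch of that advice (the bucket, IAM role and database names below are placeholders, not taken from the question):

    import boto3

    glue = boto3.client("glue")

    # Hypothetical names - substitute your own bucket, IAM role and database.
    glue.create_crawler(
        Name="dummy-csv-crawler",
        Role="AWSGlueServiceRole-demo",   # role the crawler assumes
        DatabaseName="demo_db",           # Glue Data Catalog database
        Targets={
            "S3Targets": [
                # Point at the folder, not at s3://my-bucket/data/dummy.csv
                {"Path": "s3://my-bucket/data/"}
            ]
        },
    )
    glue.start_crawler(Name="dummy-csv-crawler")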

10
votes

I experienced the same issue. Try creating a separate folder for each single table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.
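
A rough boto3 sketch of that layout, with made-up bucket and key names: move the lone CSV into its own folder so the crawler creates one table per folder, named after the folder.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket/keys: give the CSV its own folder so the crawler
    # creates a table whose name matches the folder name.
    s3.copy_object(
        Bucket="my-bucket",
        CopySource={"Bucket": "my-bucket", "Key": "dummy.csv"},
        Key="dummy_table/dummy.csv",
    )
    s3.delete_object(Bucket="my-bucket", Key="dummy.csv")
    # Then point the crawler at s3://my-bucket/dummy_table/ (or its parent).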

4
votes

Delete the crawler and create it once again (only one CSV file should be available in S3, not more, and then run the crawler). Important note: with only one CSV file, after running the crawler we can view the records in Athena.

3
votes

I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").

Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with an underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Athena (like Hive) treats files whose names start with an underscore or a dot as hidden and ignores them.
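
If you hit the same thing, one workaround is to copy the objects to names without the leading underscore. A rough boto3 sketch, with a made-up bucket name and prefix:

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-log-bucket"   # hypothetical bucket name

    # Athena/Hive ignores objects whose file names start with "_" or ".",
    # so copy each one to a name without the underscore and remove the original.
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix="logs/_").get("Contents", []):
        key = obj["Key"]
        folder, name = key.rsplit("/", 1)
        new_key = f"{folder}/{name.lstrip('_')}"
        s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)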

1
vote

Solution: Select the path of the folder even if the folder contains many files. This will generate one table and the data will be displayed.

1
vote

In many such cases, using an exclude pattern in the Glue crawler has helped me.

It is true that instead of pointing the crawler directly at a file we should point it at the directory, but when even doing so returns no records, an exclude pattern comes to the rescue. You will have to devise a pattern by which only the files you want get crawled and the rest are excluded. (I suggest this instead of creating a separate directory for each file, because most of the time making such changes in a production bucket is not feasible.)

I had data in an S3 bucket. There were multiple directories, and inside each directory there was a snappy Parquet file and a JSON file. The JSON file was causing the issue.

So I ran the crawler on the master directory that contained the many subdirectories, and as the exclude pattern I gave */*.json.

And this time it did not create any table for the JSON files, and I was able to see the records of the table using Athena.

For reference: https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
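
A minimal boto3 sketch of such a crawler, assuming made-up bucket, role and database names; the relevant part is the "Exclusions" glob on the S3 target:

    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="parquet-only-crawler",
        Role="AWSGlueServiceRole-demo",
        DatabaseName="demo_db",
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://my-bucket/master-directory/",
                    "Exclusions": ["*/*.json"],   # skip the JSON files
                }
            ]
        },
    )
    glue.start_crawler(Name="parquet-only-crawler")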

1
vote

The structure of the S3 bucket/folder is very important:

s3://<bucketname>/<data-folder>/
    /<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
    /<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
    ...
    /<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>

and specify in the "include path" of the Glue Crawler:

s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
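
Once that crawl finishes you should end up with one table per <type-N> folder; a quick boto3 check (the database name is a placeholder):

    import boto3

    glue = boto3.client("glue")

    # Expect one table per <type-N> folder, each pointing at its S3 location.
    for table in glue.get_tables(DatabaseName="demo_db")["TableList"]:
        print(table["Name"], table["StorageDescriptor"]["Location"])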

0
votes

Pointing the Glue crawler to the S3 folder and not the actual file did the trick.