1
votes

I have a bunch of gzipped CSV files sitting in an S3 bucket, which I would like to import into BigQuery. The docs suggest that BQ can import gzipped data from various sources. However, when I configure a data transfer job, it only offers the CSV, JSON, AVRO, PARQUET, and ORC file formats.

After configuring the transfer job for the CSV file format and running it, the log shows the following:

Moving data from Amazon S3 to Google Cloud complete: Moved 64 object(s).    
Summary: succeeded 0 jobs, failed 0 jobs.

So it seems it does not look into the .gz archives at all.

Is there a handy way to import gzipped CSV files from S3 into a BQ table?

Hi, were you able to fix it? As per the documentation, they support gzipped files. - Asish
Yes. There was an issue with job settings. Eventually I managed to import gzipped CSVs. - Vadim Tikanov
@VadimTikanov It would be nice if you posted your solution as an answer here, in accordance with the community spirit. - itroulli
Frankly, I did not catch what exactly solved the issue. I tried recreating the job and eventually it worked. It looks like I may have messed up some checkbox setting initially. So the answer is "Yes, BQ supports importing gzipped CSVs, but make sure to pay attention to the job settings". - Vadim Tikanov
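
For reference, an equivalent S3 transfer job can also be created from the command line with bq mk --transfer_config. The sketch below is only illustrative: the bucket path, dataset, table name, and credential values are placeholders, and the parameter keys should be checked against the current Amazon S3 transfer documentation:

 # Create a BigQuery Data Transfer Service job for Amazon S3 (placeholder values)
 bq mk --transfer_config \
     --data_source=amazon_s3 \
     --display_name="s3-gzipped-csv-import" \
     --target_dataset=my_dataset \
     --params='{
         "data_path": "s3://my-bucket/path/*.gz",
         "destination_table_name_template": "my_table",
         "file_format": "CSV",
         "skip_leading_rows": "1",
         "access_key_id": "AWS_ACCESS_KEY_ID",
         "secret_access_key": "AWS_SECRET_ACCESS_KEY"
     }'

As far as I can tell, gzipped CSVs picked up this way are recognized by their .gz extension; there is no separate "gzip" option among the job's file formats.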

1 Answer

0
votes

If you are looking for an alternative to the BigQuery S3 data transfer offered in the UI, you can do it through the CLI: use gsutil with -o options to pass the AWS credentials and copy the files to Google Cloud Storage, then use the bq CLI to load the files into BigQuery with a JSON schema file.

 echo "source s3 bucket:"$source_bucket""
 echo "target gcs bucket:"$dest_bucket""
 SCHEMA="/path/to/schema/file"
 dir_data=$(gsutil -o "Credentials:aws_access_key_id=${aws_access_key_id}" \
     -o "Credentials:aws_secret_access_key=${aws_secret_access_key}" \
     ls "$source_bucket")
 echo "files to transfer "$dir_data""
 echo "transfering data into google storage $dest_bucket"
 tr_data=$(gsutil -m -o "Credentials:aws_access_key_id=${aws_access_key_id}" \
     -o "Credentials:aws_secret_access_key=${aws_secret_access_key}" \
     cp "$source_bucket"* "$dest_bucket")
 echo "bq transfer - $target_dataset"
 bq --debug_mode=true load --source_format=CSV \
     --encoding=UTF-8 --allow_jagged_rows --allow_quoted_newlines \
     --skip_leading_rows=1 "$target_table" "$dest_bucket"part-*.gz "$SCHEMA"
 echo "transfer to bq complete"