
Good morning everyone. I have a GCS bucket containing files that were transferred from our Amazon S3 bucket. These files are in .gz.parquet format. I am trying to set up a transfer from the GCS bucket to BigQuery with the transfer feature, but I am running into issues with the Parquet file format.

When I create a transfer and specify the file format as Parquet, I receive an error stating that the data is not in Parquet format. When I tried specifying the file format as CSV, weird values appeared in my table, as shown in the linked image: Results 2

I have tried the following URIs:

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.

  • bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
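For context, this is roughly the transfer configuration I am creating, expressed with the bq CLI (a sketch; my_dataset, my_table and the display name are placeholders I made up):

# Create a scheduled GCS -> BigQuery transfer (same settings as in the UI)
bq mk --transfer_config \
  --data_source=google_cloud_storage \
  --target_dataset=my_dataset \
  --display_name="GCS to BQ transfer" \
  --params='{
    "data_path_template": "gs://bucket-name/folder-1/folder-2/dt={run_time|\"%Y-%m-%d\"}/b=1/geo/*.parquet",
    "destination_table_name_template": "my_table",
    "file_format": "PARQUET"
  }'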

Does anyone have any idea on how I should proceed? Thank you in advance!

Maybe the issue comes from the gz compression? Have you tried uncompressing the files before transferring them? - Cylldby
Hello, thank you for your response. I was thinking maybe that could be it. I am trying to use the transfer feature from GCS to BQ because it's easier, but perhaps I need to use Cloud Composer/Python instead... - Victoria
Ok, I was not looking in the right place... Actually, BQ supports gzipped Parquet files! - Cylldby

2 Answers


There is dedicated documentation explaining how to load Parquet data from a Cloud Storage bucket into BigQuery, linked below. Could you please go through it and let us know if it still does not solve your problem?

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
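For example, a one-off load of a single day's files following that page would look something like this (the date and dataset.table below are placeholders; as noted in the comments above, BigQuery handles the gzip compression inside the Parquet files by itself):

# Load one partition's gzipped Parquet files directly into a table
bq load --source_format=PARQUET \
  dataset.table \
  "gs://bucket-name/folder-1/folder-2/dt=2021-01-01/b=1/geo/*.gz.parquet"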

Regards, Anbu.


Judging by the looks of your URIs, the page you are looking for is this one, about loading hive-partitioned Parquet files into BigQuery.

You can try something like below in Cloud Shell:

# The wildcard URI must extend the hive partitioning source URI prefix
bq load --source_format=PARQUET --autodetect \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://bucket-name/folder-1/folder-2/ \
  dataset.table \
  "gs://bucket-name/folder-1/folder-2/*"
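If the load succeeds, the partition keys (dt and b) will show up as STRING columns because of --hive_partitioning_mode=STRINGS; you can check the detected schema with:

bq show --schema --format=prettyjson dataset.table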