0
votes

I have an 8 GB CSV file of 104 million rows sitting on my local hard drive. I need to get it into BigQuery, either by loading it directly as a table or by uploading it to Google Cloud Storage first and pointing BigQuery at it from there. What's the quickest way to accomplish this? I've tried both the web console upload and the Google Cloud SDK, and both are quite slow (moving at about 1% progress every few minutes).

Thanks in advance!

4

4 Answers

1
votes

All three existing answers are right, but if you have low bandwidth, none of them will help you; you will be physically limited by your connection.

My recommendation is to gzip your file before sending it. Text files have a high compression ratio (up to 100x), and you can ingest gzip files directly into BigQuery without unzipping them.
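For example, a minimal sketch of that workflow (the bucket, dataset and table names are placeholders, and the file name assumes your CSV is called data.csv):

# compress locally, keeping the original
gzip -k data.csv

# upload the compressed file to Cloud Storage
gsutil cp data.csv.gz gs://[BUCKET_NAME]/

# load the gzipped CSV straight into BigQuery
bq load --autodetect --source-format=CSV [DATASET].[TABLE] gs://[BUCKET_NAME]/data.csv.gz

Note that BigQuery generally loads compressed CSVs more slowly than uncompressed ones, but that trade-off is usually worth it when your upload bandwidth is the bottleneck.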

1
votes

Using the gsutil tool is going to be much faster and more fault-tolerant than the web console (which will probably time out before finishing anyway). You can find detailed instructions here (https://cloud.google.com/storage/docs/uploading-objects#gsutil), but essentially, once you have the gcloud tools installed on your computer, you'll run:

gsutil cp [OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/

From there, you can upload the file into BigQuery (https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv) which will all happen on Google's network.
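For example, a load job from Cloud Storage might look something like this (the dataset and table names are placeholders):

bq load --autodetect --source-format=CSV [DATASET].[TABLE] gs://[DESTINATION_BUCKET_NAME]/data.csv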

The bottleneck you're going to face is your internet upload speed during the initial upload. What we've done in the past to bypass this is spin up a compute box, run whatever process generated the file, and have it write its output onto that box. Then we use the built-in gsutil tool to upload the file to Cloud Storage. This has the benefit of running entirely on Google's network and is pretty quick.
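If you go that route, one way to kick off the copy from your workstation is via gcloud compute ssh (the instance name and path here are placeholders):

gcloud compute ssh [INSTANCE_NAME] --command="gsutil cp /path/to/data.csv gs://[DESTINATION_BUCKET_NAME]/"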

1
votes

I would recommend you take a look at this article, where there are several points to take into consideration.

Basically, the best option is to upload your object using the parallel upload feature of gsutil. In the article you can find this command:

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp ./localbigfile gs://your-bucket

There you will also find several tips to improve your upload, such as adjusting the chunk size of the objects being uploaded.
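As an illustration, the component (chunk) size can be set alongside the threshold; the 50M value below is just an example, not a recommendation:

gsutil -o GSUtil:parallel_composite_upload_threshold=150M -o GSUtil:parallel_composite_upload_component_size=50M cp ./localbigfile gs://your-bucket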

Once uploaded, I'd go with the option that dweling has provided for the BigQuery part, looking further at this document.

1
votes

Have you considered using the BigQuery command-line tool, as per the example provided below?

bq load --autodetect --source-format=CSV PROJECT_ID:DATASET.TABLE ./path/to/local/file/data.csv

The above command will directly load the contents of the local CSV file data.csv into the specified table, with the schema detected automatically. Alternatively, details on how you can customise the load job to your requirements by passing additional flags can be found here: https://cloud.google.com/bigquery/docs/loading-data-local#bq
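For instance, a sketch combining a few of those flags (the values are only illustrative):

bq load --source-format=CSV --skip_leading_rows=1 --allow_quoted_newlines --max_bad_records=10 PROJECT_ID:DATASET.TABLE ./path/to/local/file/data.csv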