3 votes

I get this error when trying to load a table in Google BQ:

Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 56659381010. Max allowed size is: 4294967296.

Is there a way to split the file using gsutil or something like that without having to upload everything again?

2
Instead of splitting the compressed file, uncompress it, and BigQuery will be able to ingest it easily. To uncompress, go through GCE. - Felipe Hoffa

2 Answers

2 votes

The largest compressed CSV file you can load into BigQuery is 4 gigabytes. GCS unfortunately does not provide a way to decompress a compressed file, nor does it provide a way to split a compressed file. Gzipped files can't be arbitrarily split up and reassembled the way a tar file can.
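You can see this locally: a plain gzip stream has a single header at the start, so a byte range taken from the middle is not a valid gzip file on its own (the sample file below is made up for illustration):

```shell
# Compress a small sample file
printf 'hello world\n' | gzip > sample.gz

# Drop the first 4 bytes (including the 1f 8b magic header) to
# simulate taking an arbitrary byte range out of the stream
tail -c +5 sample.gz > fragment.gz

# The intact file passes gzip's integrity test; the fragment fails it
gunzip -t sample.gz && echo "intact file: ok"
gunzip -t fragment.gz 2>/dev/null || echo "fragment: not a valid gzip stream"
```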

I imagine your best bet is likely to spin up a GCE instance in the same region as your GCS bucket, download the object to that instance (which should be pretty fast, given that it's only a few dozen gigabytes), decompress it (which will be slower), split the CSV into a number of smaller files (the Linux split command is useful for this), and then upload the pieces back to GCS.
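As a minimal local sketch of the split step (file names here are made up), split -C keeps every piece on a line boundary, which is what you want for CSV:

```shell
# Build a small stand-in for the decompressed CSV export
printf 'id,value\n' > big.csv
seq 1 1000 | awk '{print $1","$1*2}' >> big.csv

# Split into ~4 KB pieces without breaking lines (-C), with
# numeric suffixes (-d): big-part-00, big-part-01, ...
split -C 4K -d big.csv big-part-

# Concatenating the pieces reproduces the original file exactly
cat big-part-* | cmp - big.csv && echo "pieces match original"
```

On a real multi-gigabyte file you would use a larger chunk size, e.g. `split -C 1G`. Note that only the first piece keeps the CSV header row, so either strip the header before splitting or account for it when loading.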

0 votes

I ran into the same issue and this is how I dealt with it:

First, spin up a Google Compute Engine VM instance. https://console.cloud.google.com/compute/instances

Then install the Cloud SDK, which provides the gcloud, gsutil, and bq commands, and go through the authentication process. https://cloud.google.com/storage/docs/gsutil_install

Once you have verified that the gcloud, gsutil, and bq commands are working, save a snapshot of the disk as snapshot-1 and then delete this VM.


On your local machine, run this command to create a new disk. This disk will be the VM's boot disk, sized so that you have enough space to download and unzip the large file:

gcloud compute disks create disk-2017-11-30 --source-snapshot snapshot-1 --size=100GB

Again on your local machine, run this command to create a new VM instance that uses this disk. I use the --preemptible flag to save some cost:

gcloud compute instances create loader-2017-11-30 --disk name=disk-2017-11-30,boot=yes --preemptible

Now you can SSH into your instance and then run these commands on the remote machine.

First, copy the file from Cloud Storage to the VM:

gsutil cp gs://my-bucket/2017/11/20171130.gz .

Then unzip the file. In my case, for a ~4 GB compressed file, this step took about 17 minutes:

gunzip 20171130.gz

Once unzipped, you could run the bq load command to load the file into BigQuery directly, but I found that for my file size (~70 GB unzipped) that operation would take about 4 hours. Instead, I uploaded the unzipped file back to Cloud Storage:

gsutil cp 20171130 gs://am-alphahat-regional/unzipped/20171130.csv

Now that the file is back on Cloud Storage, you can run this command to delete the VM:

gcloud compute instances delete loader-2017-11-30

Theoretically, the associated disk should have been deleted as well, but I found that it was still there and had to delete it with an additional command:

gcloud compute disks delete disk-2017-11-30

Finally, you should be able to run the bq load command, or you can load the data from the console.
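For reference, a bq load invocation for the uploaded file might look like the sketch below. The dataset and table names are placeholders, and `--autodetect` can be replaced with an explicit schema:

```shell
# Hypothetical dataset/table names; adjust to your own project.
# Requires authenticated gcloud/bq credentials, so this is a sketch only.
bq load \
    --source_format=CSV \
    --skip_leading_rows=1 \
    --autodetect \
    mydataset.mytable \
    gs://am-alphahat-regional/unzipped/20171130.csv
```

Because the file is uncompressed, BigQuery can split it internally, so the 4 GB limit on compressed files no longer applies.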