0 votes

I have moved a public dataset, available in a public Google Cloud Storage bucket, into my own bucket. The file size is about 10 GB. When the data was moved, the file was split into about 47 compressed shards, and I am unable to combine them into one file. How can I combine them?

The information given at the following link does not help much:

https://cloud.google.com/storage/docs/gsutil/commands/compose
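
As I understand it, compose can concatenate up to 32 already-uploaded objects from the same bucket into a single new object, along these lines (bucket and object names here are placeholders):

gsutil compose gs://my-bucket/shard-00 gs://my-bucket/shard-01 gs://my-bucket/merged.csv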

My bucket looks like this:

[screenshot of the bucket's listing of compressed shards]

Any help will be appreciated.

What format is each of the file parts in? You say they are compressed ... are they compressed individually, or do all the parts concatenated together form one single compressed file? - Kolban
What public dataset did you move to Google Cloud Storage? What was the exact command used to move the files? - Daniel Ocando
What is the uncompressed size? What is the uncompressed format? Is it an appendable format? - guillaume blaquiere
@Kolban: They are CSV files. They were compressed when extracted from the NYC public bucket. - Kaustubh Mulay
@DanielOcando: It is the NYC 311 Public Data Set <console.cloud.google.com/…> - Kaustubh Mulay

2 Answers

2 votes

I propose using Cloud Build. It's not the most obvious solution, but it's serverless and cheap, which is perfect for your one-time use case. Here is what I propose:

steps:
- name: 'gcr.io/cloud-builders/gsutil'
  entrypoint: "bash"
  args: 
    - -c
    - |
       # copy all your files locally
       gsutil -m cp gs://311_nyc/311* .

       # Uncompress your files
       # (I don't know your compression method; gunzip?)
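       # Sketch, assuming gzip-compressed shards named 311*.gz:
       gunzip 311*.gz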

       # Append each shard to a merged file and delete it after the merge.
       for file in 311*; do cat "$file" >> merged; rm "$file"; done

       # Copy the file to the destination bucket
       gsutil cp merged gs://myDestinationBucket/myName.csv

options:
  # Use 1 TB of disk so that all the files fit on the same build machine at the same time.
  # It isn't clear whether the 10 GB is per uncompressed file or the total size;
  # if it's the total size, this option is probably unnecessary.
  diskSizeGb: 1000

# Optionally extend the default 10-minute build timeout if the merge takes longer.
timeout: 660s
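
To run this as a one-off job, save the configuration above as cloudbuild.yaml and submit it with something like the following (there is no source to upload, hence --no-source):

gcloud builds submit --no-source --config=cloudbuild.yaml
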
0 votes

Combine using Node.js (the Cloud Storage client's combine() wraps the same compose API):

const { Storage } = require('@google-cloud/storage');
const storage = new Storage();
await storage.bucket(bucketName).combine(sourceFilenameList, destFilename);
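
A runnable sketch of the above, with a hypothetical bucket name and shard names; combine() calls the compose API, which accepts at most 32 source objects per call, so 47 shards would need two passes:

const { Storage } = require('@google-cloud/storage');

async function mergeShards() {
  const storage = new Storage();
  const bucket = storage.bucket('my-bucket'); // hypothetical bucket name

  // Hypothetical shard names; list at most 32 per combine() call
  const sources = ['311-shard-00', '311-shard-01']; // ...add the remaining shards

  // combine() issues a compose request and resolves with the new File object
  const [merged] = await bucket.combine(sources, 'merged.csv');
  console.log(`Created ${merged.name}`);
}

mergeShards().catch(console.error);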