I want to set up Google Cloud Storage as my data lake, and I'm using Pub/Sub + Dataflow to save interactions into it. Dataflow creates a new file every 5 minutes and stores it in a GCS folder. This will eventually lead to a lot of files inside that folder. Is there any limit on the number of files that can be saved inside a GCS folder?
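For a sense of scale, here's my rough back-of-the-envelope, assuming a single output file per 5-minute window (Dataflow can also shard each window into several files, which multiplies these numbers):

```python
# File accumulation estimate: one output file per 5-minute window,
# single shard per window (real pipelines may write several shards).
WINDOW_MINUTES = 5

files_per_day = 24 * 60 // WINDOW_MINUTES   # 288
files_per_year = files_per_day * 365        # 105,120

print(f"{files_per_day} files/day, {files_per_year} files/year")
```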
2 Answers
The practical limit is roughly 9.2 quintillion objects, which would take many years to even create. (GCS "folders" are just name prefixes over a flat namespace, so the question really comes down to objects per bucket.)
We store some of our services as zero-compute JSON files with sub-folders in GCP buckets. I wanted to confirm we could store more than 4.2 billion folders in a bucket so we could access our files by ID, just as we would in a database (currently we are up to over 100k files per folder; we basically use GCP buckets as a kind of database with a read:write ratio well beyond 1m:1).
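To give an idea of the access pattern, a key-by-ID read against a bucket looks roughly like this (a minimal sketch using the google-cloud-storage Python client; the bucket name and object layout here are made up for illustration):

```python
import json
from google.cloud import storage

# Hypothetical layout: gs://my-json-store/records/<id>.json
client = storage.Client()
bucket = client.bucket("my-json-store")

def fetch_record(record_id: str) -> dict:
    # Treat the bucket as a key-value store: the object name is the primary key.
    blob = bucket.blob(f"records/{record_id}.json")
    return json.loads(blob.download_as_text())

print(fetch_record("12345"))
```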
I asked our engineering team to open a ticket to confirm our usage was practical and that passing 4.2 billion items was possible. Google Cloud support confirmed that there are customers using Cloud Storage today who go well beyond the 4.2 billion (32-bit) mark, into the trillions, and that the main index currently uses a 64-bit pointer, which may be the only limit.
A signed 64-bit integer tops out at roughly 9.2 quintillion, or 9,223,372,036,854,775,807 to be exact.
They do have other, related limits, such as roughly 1k object writes / 5k object reads per second per bucket, which can auto-scale but has nuances, so if you think you may hit that limit, you may want to read about it here: https://cloud.google.com/storage/docs/request-rate.
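Putting the two numbers together, here's a quick sanity check of how long it would take to exhaust a signed 64-bit index at the documented starting rate of ~1k object writes per second per bucket (auto-scaling raises that, but not by enough to change the conclusion):

```python
# How long to write 2**63 - 1 objects into a single bucket?
MAX_OBJECTS = 2**63 - 1        # 9,223,372,036,854,775,807
WRITES_PER_SECOND = 1_000      # documented starting write rate per bucket

years = MAX_OBJECTS / WRITES_PER_SECOND / (60 * 60 * 24 * 365)
print(f"{years:,.0f} years")   # roughly 290 million years
```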
For reference, here are their general storage quotas and limits: https://cloud.google.com/storage/quotas
...it does not describe the 64-bit / 9.2 quintillion item limitation, possibly because that limit is practically impossible to reach: even at aggressive write rates it would take millions of years just to create that many objects, and long before then they would presumably have engineered beyond 64-bit :)