15
votes

I want to get all the folders inside a given Google Cloud Storage bucket or folder using the Google Cloud Storage API.

For example, if gs://abc/xyz contains three folders gs://abc/xyz/x1, gs://abc/xyz/x2 and gs://abc/xyz/x3, the API should return all three folders in gs://abc/xyz.

It can easily be done using gsutil:

gsutil ls gs://abc/xyz

But I need to do it using Python and the Google Cloud Storage API.

10
You say you want to get the folders inside xyz, but the command gsutil ls gs://abc/xyz returns all objects in xyz, including non-folder items. So, which are you asking for? All folders, or all items, including folders? - Robino

10 Answers

15
votes

This question is about listing the folders inside a bucket/folder. None of the suggestions worked for me, and after experimenting with the google.cloud.storage SDK, I suspect it is not possible (as of November 2019) to list the sub-directories of any path in a bucket. It is possible with the REST API, so I wrote this little wrapper...

from google.api_core import page_iterator
from google.cloud import storage

def _item_to_value(iterator, item):
    # return each listed prefix string unchanged
    return item

def list_directories(bucket_name, prefix):
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    extra_params = {
        "projection": "noAcl",
        "prefix": prefix,
        "delimiter": '/'
    }

    gcs = storage.Client()

    path = "/b/" + bucket_name + "/o"

    iterator = page_iterator.HTTPIterator(
        client=gcs,
        api_request=gcs._connection.api_request,
        path=path,
        items_key='prefixes',
        item_to_value=_item_to_value,
        extra_params=extra_params,
    )

    return [x for x in iterator]

For example, if you have my-bucket containing:

  • dog-bark
    • datasets
      • v1
      • v2

Then calling list_directories('my-bucket', 'dog-bark/datasets') will return:

['dog-bark/datasets/v1', 'dog-bark/datasets/v2']

11
votes

You can use the Python GCS API Client Library. See the Samples and Libraries for Google Cloud Storage documentation page for relevant links to documentation and downloads.

In your case, first I want to point out that you're confusing the term "bucket". I recommend reading the Key Terms page of the documentation. What you're talking about are object name prefixes.

You can start with the list-objects.py sample on GitHub. Looking at the list reference page, you'll want to pass bucket=abc, prefix=xyz/ and delimiter=/.
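
For reference, here is a rough sketch of that objects.list call using the discovery-based client (the bucket name and prefix just follow the question's gs://abc/xyz example; pagination and error handling are omitted):

import googleapiclient.discovery

service = googleapiclient.discovery.build('storage', 'v1')

# delimiter='/' makes the API group the objects under xyz/ and report the
# immediate "sub-folders" in the 'prefixes' field of the response
resp = service.objects().list(bucket='abc', prefix='xyz/', delimiter='/').execute()
print(resp.get('prefixes', []))  # e.g. ['xyz/x1/', 'xyz/x2/', 'xyz/x3/']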

9
votes

Here's an update to this answer thread:

from google.cloud import storage

# Instantiates a client
storage_client = storage.Client()

# Get GCS bucket
bucket = storage_client.get_bucket(bucket_name)

# Get blobs in bucket (including all subdirectories)
blobs_all = list(bucket.list_blobs())

# Get blobs in specific subdirectory
blobs_specific = list(bucket.list_blobs(prefix='path/to/subfolder/'))

6
votes

I also need to simply list the contents of a bucket. Ideally I would like something similar to what tf.gfile provides. tf.gfile has support for determining if an entry is a file or a directory.
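
For context, this is the kind of tf.gfile behaviour I mean (just a sketch, assuming TensorFlow with GCS support; in newer versions the same calls live under tf.io.gfile):

import tensorflow as tf

# tf.gfile understands gs:// paths and can tell files and directories apart
for entry in tf.gfile.ListDirectory('gs://bucket-sounds/UrbanSound'):
    path = 'gs://bucket-sounds/UrbanSound/' + entry
    print(entry, tf.gfile.IsDirectory(path))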

I tried the various links provided by @jterrace above, but my results were not optimal. With that said, it's worth showing the results.

Given a bucket which has a mix of "directories" and "files", it's hard to navigate the "filesystem" to find items of interest. I've added some comments in the code about how the referenced calls behave.

In either case, I am using a datalab notebook with credentials included by the notebook. Given the results, I will need to use string parsing to determine which files are in a particular directory (see the sketch after Method Two's output below). If anyone knows how to expand these methods, or knows an alternative way to parse the directories similar to tf.gfile, please reply.

Method One

import sys
import json
import argparse
import googleapiclient.discovery

BUCKET = 'bucket-sounds' 

def create_service():
    return googleapiclient.discovery.build('storage', 'v1')


def list_bucket(bucket):
    """Returns a list of metadata of the objects within the given bucket."""
    service = create_service()

    # Create a request to objects.list to retrieve a list of objects.
    fields_to_return = 'nextPageToken,items(name,size,contentType,metadata(my-key))'
    #req = service.objects().list(bucket=bucket, fields=fields_to_return)  # returns everything
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound')  # returns everything. UrbanSound is top dir in bucket
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREE') # returns the file FREESOUNDCREDITS.TXT
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/FREESOUNDCREDITS.txt', delimiter='/') # same as above
    #req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark', delimiter='/') # returns nothing
    req = service.objects().list(bucket=bucket, fields=fields_to_return, prefix='UrbanSound/data/dog_bark/', delimiter='/') # returns files in dog_bark dir

    all_objects = []
    # If you have too many items to list in one request, list_next() will
    # automatically handle paging with the pageToken.
    while req:
        resp = req.execute()
        all_objects.extend(resp.get('items', []))
        req = service.objects().list_next(req, resp)
    return all_objects

# usage
print(json.dumps(list_bucket(BUCKET), indent=2))

This generates results like this:

[
  {
    "contentType": "text/csv", 
    "name": "UrbanSound/data/dog_bark/100032.csv", 
    "size": "29"
  }, 
  {
    "contentType": "application/json", 
    "name": "UrbanSound/data/dog_bark/100032.json", 
    "size": "1858"
  },
  ... (remaining items snipped)
]

Method Two

import re
import sys
from google.cloud import storage

BUCKET = 'bucket-sounds'

# Create a Cloud Storage client.
gcs = storage.Client()

# Get the bucket that the file will be uploaded to.
bucket = gcs.get_bucket(BUCKET)

def my_list_bucket(bucket_name, limit=sys.maxsize):
  a_bucket = gcs.lookup_bucket(bucket_name)
  bucket_iterator = a_bucket.list_blobs()
  for resource in bucket_iterator:
    print(resource.name)
    limit = limit - 1
    if limit <= 0:
      break

my_list_bucket(BUCKET, limit=5)

This generates output like this.

UrbanSound/FREESOUNDCREDITS.txt
UrbanSound/UrbanSound_README.txt
UrbanSound/data/air_conditioner/100852.csv
UrbanSound/data/air_conditioner/100852.json
UrbanSound/data/air_conditioner/100852.mp3
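
As mentioned above, string parsing is still needed on top of this flat listing. Here is a minimal sketch of one way to do it, reusing the bucket object from Method Two and assuming the UrbanSound/data/ prefix shown in the output above:

import os

# collect the unique "directory" parts of the flat object listing
blob_names = [b.name for b in bucket.list_blobs(prefix='UrbanSound/data/')]
directories = sorted({os.path.dirname(name) for name in blob_names})
print(directories)
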
5
votes

To get a list of folders in a bucket, you can use the code snippet below:

import googleapiclient.discovery


def list_sub_directories(bucket_name, prefix):
    """Returns a list of sub-directories within the given bucket."""
    service = googleapiclient.discovery.build('storage', 'v1')

    req = service.objects().list(bucket=bucket_name, prefix=prefix, delimiter='/')
    res = req.execute()
    return res.get('prefixes', [])  # 'prefixes' is absent when there are no sub-directories

# For the example (gs://abc/xyz), bucket_name is 'abc' and the prefix would be 'xyz/'
print(list_sub_directories(bucket_name='abc', prefix='xyz/'))

3
votes

I faced the same issue and managed to get it done by using the standard list_blobs described here:

from google.cloud import storage

storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(
    bucket_name, prefix=prefix, delimiter=delimiter
)

print("Blobs:")
for blob in blobs:
    print(blob.name)

if delimiter:
    print("Prefixes:")
    for prefix in blobs.prefixes:
        print(prefix)

However, this only worked for me after I read AntPhitlok's answer and understood that my prefix has to end with / and that I also have to use / as the delimiter.

Because of that, the 'Blobs:' section will only list file names (not folders), if any exist under the prefix folder. All the sub-directories will be listed under the 'Prefixes:' section.

It's important to note that blobs is in fact an iterator, so in order to get the sub-directories we must first consume it. Leaving the 'Blobs:' loop out of the code entirely will result in an empty set() inside blobs.prefixes.
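
In other words (a minimal sketch reusing the variables from the snippet above), the iterator can be consumed without printing the blobs at all:

blobs = storage_client.list_blobs(bucket_name, prefix=prefix, delimiter=delimiter)
list(blobs)            # consume the iterator; prefixes is only populated afterwards
print(blobs.prefixes)  # the sub-directories, e.g. {'dir/subdir/'}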

Edit: An example of usage - say I have a bucket named buck and inside it a directory named dir. Inside dir there is another directory named subdir.

In order to list the directories inside dir, I can use:

from google.cloud import storage

storage_client = storage.Client()
blobs = storage_client.list_blobs('buck', prefix='dir/', delimiter='/')

print("Blobs:")
for blob in blobs:
    print(blob.name)

print("Prefixes:")
for prefix in blobs.prefixes:
    print(prefix)

*Notice the / at the end of the prefix and as the delimiter.

This call will print the following:

Prefixes:
subdir/

2
votes

# sudo pip3 install --upgrade google-cloud-storage
import os

from google.cloud import storage

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./key.json"
storage_client = storage.Client()
bucket = storage_client.get_bucket("my-bucket")
blobs = list(bucket.list_blobs(prefix='dir/'))
print(blobs)

1
votes

# python notebook
ret_folders = !gsutil ls $path_possible_with_regex | grep -e "/$"
ret_folders_no_subdir = [x for x in ret_folders if x.split("/")[-2] != "SUBDIR"]

You can edit the condition to whatever works for you. In my case, I wanted only the deeper "folders". For the same-level folders, you can replace the condition with

 x.split("/")[-2] == "SUBDIR"

1
votes

Here's a simple way to get all subfolders:

from google.cloud import storage


def get_subdirs(bucket_name, dir_name=None):
    """
    List all subdirectories for a bucket or
    a specific folder in a bucket. If `dir_name`
    is left blank, it will list all directories in the bucket.
    """
    client = storage.Client()
    bucket = client.lookup_bucket(bucket_name)

    all_folders = []
    for resource in bucket.list_blobs(prefix=dir_name):

        # filter for directories only
        n = resource.name
        if n.endswith("/"):
            all_folders.append(n)

    return all_folders

# Use as follows:
all_folders = get_subdirs("my-bucket")

1
votes

Here is a simple solution:

from google.cloud import storage # !pip install --upgrade google-cloud-storage
import os

# set up your bucket
client = storage.Client()
# or, to authenticate explicitly with a service-account key file:
# client = storage.Client.from_service_account_json('XXXXXXXX')
bucket = client.get_bucket('XXXXXXXX')

# get all the folders inside the folder "base_folder"
base_folder = 'model_testing'
blobs = list(bucket.list_blobs(prefix=base_folder))
folders = list(set([os.path.dirname(k.name) for k in blobs]))
print(*folders, sep='\n')

If you want only the folders directly inside the selected folder:

base_folder = base_folder.rstrip('/')  # remove any trailing slash (GCS object names always use '/')
one_out = list(set([base_folder + '/'.join(k.split(base_folder)[-1].split('/')[:2]) for k in folders]))
print(*one_out, sep='\n')

Of course, instead of using

list(set())

you could use NumPy's np.unique:

import numpy as np
folders = list(np.unique([os.path.dirname(k.name) for k in blobs]))