
I need to list all of the blobs in an Azure Blob Storage container. The container holds roughly 200,000 blobs, and I want to obtain each blob's name, last modified date, and size.

Following the documentation for the Azure Java SDK V12, the following code should work:

BlobServiceClient blobServiceClient = new BlobServiceClientBuilder().connectionString(AzureBlobConnectionString).buildClient();
String containerName = "container1";
BlobContainerClient containerClient = blobServiceClient.getBlobContainerClient(containerName);
System.out.println("\nListing blobs...");

// List the blob(s) in the container.
for (BlobItem blobItem : containerClient.listBlobs()) {
  System.out.println("\t" + blobItem.getName());
}

However, when executed, this application just seems to hang indefinitely. If I open PowerShell and run the following command:

Get-AzStorageBlob -Container container1 -Context $ctx

I get the expected result within about 3 minutes.

I've given the code example upwards of an hour to execute, yet nothing comes of it. I attempted to restrict the data being requested as per the documentation, along with setting a 5-minute timeout:

BlobServiceClient blobServiceClient = new BlobServiceClientBuilder().connectionString(AzureBlobConnectionString).buildClient();
String containerName = "container1";
BlobContainerClient containerClient = blobServiceClient.getBlobContainerClient(containerName);
System.out.println("\nListing blobs...");

ListBlobsOptions options = new ListBlobsOptions()
        .setMaxResultsPerPage(10)
        .setDetails(new BlobListDetails()
                .setRetrieveDeletedBlobs(false)
                .setRetrieveSnapshots(true));
Duration duration = Duration.ofMinutes(5);
containerClient.listBlobs(options, duration).forEach(blob ->
        System.out.printf("Name: %s, Directory? %b, Deleted? %b, Snapshot ID: %s%n",
                blob.getName(),
                blob.isPrefix(),
                blob.isDeleted(),
                blob.getSnapshot()));

However, this timed out with the following exception:

Exception in thread "main" reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 300000ms in 'flatMap' (and no fallback has been configured)
at reactor.core.Exceptions.propagate(Exceptions.java:366)
at reactor.core.publisher.BlockingIterable$SubscriberIterator.hasNext(BlockingIterable.java:168)
at java.lang.Iterable.forEach(Iterable.java:74)
at AzureManagement.AzureControl.listAllBlobs(AzureControl.java:42)
at Main.main(Main.java:8)

I understand there used to be a method called "listBlobsSegmented", but it does not appear to exist in V12 of the Azure SDK for Java.

If anybody has any ideas on how to list the blobs in the container effectively and efficiently, I would very much appreciate it!

Thanks.

Comment from Jim Xu: Please enable storage logging and check the log for the error message: docs.microsoft.com/en-us/azure/storage/common/…

1 Answer


I faced exactly the same problem, with every operation hanging forever. There is actually nothing wrong with the way you list blobs.

It turned out to be a dependency conflict, so make sure that none of your dependencies conflict with the Azure SDK. It seems odd, but we discovered this when we downgraded the Azure SDK from version 12 to an older version: instead of hanging, it threw an exception like "method not found in class ...".

In my case, the conflict came from hadoop-hdfs, which forces an old version of Netty, while the Azure SDK needs a newer one.

When I removed the HDFS dependency (group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '3.2.0'), I could list files and blobs without the hang.
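
If you want to check for this kind of conflict before removing anything, one rough way is to print which jar each Netty class is actually loaded from at runtime. The sketch below uses only the JDK. The class name ClasspathCheck and the list of probed classes are my own choices for illustration, not part of the Azure SDK; adjust the list to whatever your stack trace or build report points at.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    /** Returns the jar or directory a class was loaded from, or a note if it cannot be located. */
    public static String locate(String className) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // JDK classes loaded by the bootstrap loader have no code source
                return className + " -> (JDK/bootstrap class, no jar location)";
            }
            return className + " -> " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Hypothetical probe list: a few classes from Netty and Reactor Netty.
        String[] probes = {
            "io.netty.util.Version",
            "io.netty.channel.Channel",
            "io.netty.handler.codec.http.HttpClientCodec",
            "reactor.netty.http.client.HttpClient"
        };
        for (String probe : probes) {
            System.out.println(locate(probe));
        }
    }
}
```

Run it with the same classpath as the hanging application. If the Netty classes resolve to jars with an older version number in the file name than the one the Azure SDK declares, that points to the same kind of conflict I hit. With Gradle, running "gradle dependencies" should also show which dependency pulls in the old version.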