4
votes

I need to copy one storage account into another. I have created a Runbook and scheduled it to run daily. This is an incremental copy.

What I am doing is

  1. List the blobs in the source storage container
  2. Check the blobs in the destination storage container
  3. If a blob doesn't exist in the destination container, copy it with Start-AzureStorageBlobCopy

While this works for small containers, it takes a very long time and is certainly cost-ineffective for containers with, say, 10 million block blobs, because every time I run the task I have to go through all those 10 million blobs.
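Roughly, what the runbook does today looks like this (simplified sketch; $sourceContext and $destContext are created earlier with New-AzureStorageContext and $container holds the container name):

    # 1. list every blob in the source container
    $srcBlobs = Get-AzureStorageBlob -Context $sourceContext -Container $container

    # 2. build a lookup of the blob names already present in the destination
    $destNames = @{}
    Get-AzureStorageBlob -Context $destContext -Container $container |
        ForEach-Object { $destNames[$_.Name] = $true }

    # 3. copy only the blobs that are missing from the destination
    foreach ($blob in $srcBlobs) {
        if (-not $destNames.ContainsKey($blob.Name)) {
            Start-AzureStorageBlobCopy -SrcContainer $container -SrcBlob $blob.Name `
                -DestContainer $container -DestBlob $blob.Name `
                -SrcContext $sourceContext -DestContext $destContext | Out-Null
        }
    }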

I don't see it in the documentation, but is there any way I can use conditional headers like DateModifiedSince, something like Get-AzureStorageBlob -DateModifiedSince <date>, in PowerShell?

I have not tried it, but I can see that it is possible to use DateModifiedSince in the Node.js library.

Is there any way I can do this with PowerShell so that I can use Runbooks?

EDIT:

Using AzCopy, I made a copy of a storage account that contains 7 million blobs, uploaded a few new blobs, and started AzCopy again. It still took a significant amount of time just to copy the few newly uploaded files.

AzCopy /Source:$sourceUri /Dest:$destUri /SourceKey:$sourceStorageKey /DestKey:$destStorageAccountKey /S /XO /XN /Y
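(If I read the AzCopy documentation right, /S copies recursively, /XO excludes source blobs that are older than the destination copy, /XN excludes source blobs that are newer than the destination copy, and /Y suppresses confirmation prompts.)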

It is possible to filter for a blob by blob name in no time.

For example, Get-AzureStorageBlob -Blob will return the blob immediately out of 7 million records.
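Something like this (the container and blob name are just placeholders):

    # exact-name lookup: resolved server-side, so it comes back almost instantly
    Get-AzureStorageBlob -Context $sourceContext -Container $container -Blob "reports/2016-03-31.csv"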

It should have been possible to filter blobs by other properties too.

You can use AzCopy. See this doc. – Jack Zeng
@JackZeng thanks, I am testing AzCopy with the /XO /XN options. Since it is my first copy it's taking time; my container is > 100 GB. Once that is done I will test whether AzCopy avoids taking the same amount of time just to copy one new blob. Still, if AzCopy works (fingers crossed) I will have to move out of Automation (probably use a VM instead). – Sami
Please try AzCopy, I think it can achieve better performance than what you're doing. :) – Zhaoxing Lu

2 Answers

4
votes

I am not sure if this is the actual correct answer, but I have resorted to this solution for now.

AzCopy is a bit faster, but since it is an executable I have no option to use it in Azure Automation.

I wrote my own runbook (it can be modified into a workflow) which implements the behaviour of the following AzCopy command:

AzCopy /Source:$sourceUri /Dest:$destUri /SourceKey:$sourceStorageKey /DestKey:$destStorageAccountKey /S /XO /Y

  1. Looking at List Blobs, we can only filter blobs by blob prefix, so I cannot pull blobs filtered by modified date. This leaves me to pull the whole blob list.
  2. I pull 20,000 blobs at a time from source and destination with Get-AzureStorageBlob and a ContinuationToken
  3. Loop through the 20,000 pulled source blobs and check whether each one does not exist in the destination or has been modified in the source
  4. If 3 is true, I write those blobs to the destination
  5. It takes around 3-4 hours to go through 7 million blobs. The task takes longer depending on how many blobs have to be written to the destination.

A code snippet

    # loop through the source container blobs
    # and copy to the destination any blob that is not already there (or is newer in the source)
    $MaxReturn = 20000
    $Total = 0
    $Token = $null
    $FilesTransferred = 0;
    $FilesTransferSuccess = 0;
    $FilesTransferFail = 0;
    $sw = [Diagnostics.Stopwatch]::StartNew();
    DO
    {
        $SrcBlobs = Get-AzureStorageBlob -Context $sourceContext -Container $container -MaxCount $MaxReturn  -ContinuationToken $Token | 
            Select-Object -Property Name, LastModified, ContinuationToken

        $DestBlobsHash = @{}
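        # fill it with name -> LastModified for this page of destination blobs
        # (note: this listing reuses the source listing's continuation token)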
        Get-AzureStorageBlob -Context $destContext -Container $container -MaxCount $MaxReturn  -ContinuationToken $Token  | 
            Select-Object -Property Name, LastModified, ContinuationToken  | 
                ForEach { $DestBlobsHash[$_.Name] = $_.LastModified.UtcDateTime }


        $Total += $SrcBlobs.Count

        if($SrcBlobs.Length -le 0) { 
            Break;
        }
        $Token = $SrcBlobs[$SrcBlobs.Count -1].ContinuationToken;

        ForEach ($SrcBlob in $SrcBlobs){
            # copy the source blob if it is missing from the destination or is newer than the destination copy
            $CopyThisBlob = $false

            if($DestBlobsHash.Count -eq 0){
                $CopyThisBlob = $true
            } elseif(!$DestBlobsHash.ContainsKey($SrcBlob.Name)){
                $CopyThisBlob = $true
            } elseif($SrcBlob.LastModified.UtcDateTime -gt $DestBlobsHash.Item($SrcBlob.Name)){
                $CopyThisBlob = $true
            }

            if($CopyThisBlob){
                #Start copying the blobs to container
                $blobToCopy = $SrcBlob.Name
                "Copying blob: $blobToCopy to destination"
                $FilesTransferred++
                try {
                    $c = Start-AzureStorageBlobCopy -SrcContainer $container -SrcBlob $blobToCopy -DestContainer $container -DestBlob $blobToCopy -SrcContext $sourceContext -DestContext $destContext -Force -ErrorAction Stop
                    $FilesTransferSuccess++
                } catch {
                    Write-Error "$blobToCopy transfer failed"
                    $FilesTransferFail++
                }   
            }           
        }
    }
    While ($Token -ne $Null)
    $sw.Stop()
    "Total blobs in container $container : $Total"
    "Total files transferred: $FilesTransferred"
    "Transfer successfully: $FilesTransferSuccess"
    "Transfer failed: $FilesTransferFail"
    "Elapsed time: $($sw.Elapsed) `n"
0
votes

The last modified time is stored in the ICloudBlob object; you can access it with PowerShell like this:

$blob = Get-AzureStorageBlob -Context $Context  -Container $container
$blob[1].ICloudBlob.Properties.LastModified

Which will give you

DateTime : 31/03/2016 17:03:07
UtcDateTime : 31/03/2016 17:03:07
LocalDateTime : 31/03/2016 18:03:07
Date : 31/03/2016 00:00:00
Day : 31
DayOfWeek : Thursday
DayOfYear : 91
Hour : 17
Millisecond : 0
Minute : 3
Month : 3
Offset : 00:00:00
Second : 7
Ticks : 635950405870000000
UtcTicks : 635950405870000000
TimeOfDay : 17:03:07
Year : 2016

Having read through the API, I don't think it is possible to perform a search on the container with any parameters other than name. I can only imagine that the Node.js library still retrieves all blobs and then filters them client-side.
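A client-side filter along those lines might look like this (just a sketch; the one-day cutoff is an arbitrary example):

    # list everything, then keep only the blobs modified in the last day
    $cutoff = (Get-Date).ToUniversalTime().AddDays(-1)
    Get-AzureStorageBlob -Context $Context -Container $container |
        Where-Object { $_.ICloudBlob.Properties.LastModified.UtcDateTime -gt $cutoff }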

I will dig into it a little bit more, though.