6
votes

I have an Azure Search service that searches blobs (which are images) based on blob metadata.

The index the search is based on is refreshed hourly by an indexer.

However, the search results still include blobs that no longer exist.

The output of the Get Indexer Status API (below) shows that the indexer has run successfully since the blobs were deleted.

"status": "running",
"lastResult": {
    "status": "success",
    "errorMessage": null,
    "startTime": "2018-02-05T16:00:03.29Z",
    "endTime": "2018-02-05T16:00:03.416Z",
    "errors": [],
    "warnings": [],
    "itemsProcessed": 0,
    "itemsFailed": 0,
    "initialTrackingState": "{\r\n  \"lastFullEnumerationStartTime\": \"2018-02-05T14:59:31.966Z\",\r\n  \"lastAttemptedEnumerationStartTime\": \"2018-02-05T14:59:31.966Z\",\r\n  \"nameHighWaterMark\": null\r\n}",
    "finalTrackingState": "{\"LastFullEnumerationStartTime\":\"2018-02-05T15:59:33.2900956+00:00\",\"LastAttemptedEnumerationStartTime\":\"2018-02-05T15:59:33.2900956+00:00\",\"NameHighWaterMark\":null}"
},

In case it's relevant, the blobs were deleted using Azure Storage Explorer.

As a result, the web page that outputs these images currently shows them as missing images, and the index is bigger than it needs to be.


3 Answers

4
votes

After some reading, I found that the only deletion detection policy Azure Search currently supports is soft delete.

To use it with blob storage, you have to add a metadata value to each blob (e.g. IsDeleted) and set it to the marker value when the blob should be removed, so that the deletion detection policy on the data source picks it up:

PUT https://[service name].search.windows.net/datasources/blob-datasource?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]

{
    "name" : "blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "<your storage connection string>" },
    "container" : { "name" : "my-container", "query" : "my-folder" },
    "dataDeletionDetectionPolicy" : {
        "@odata.type" : "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName" : "IsDeleted",
        "softDeleteMarkerValue" : "true"
    }
}

Full details here

I'll need to do some testing to ensure that it is safe to update the metadata and then immediately delete the BLOB.
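
Setting the marker could look roughly like the sketch below. It assumes the WindowsAzure.Storage .NET client library (CloudBlockBlob); the class and method names (BlobSoftDelete, MarkDeletedAsync) are made up for illustration, and the IsDeleted / true pair has to match softDeleteColumnName / softDeleteMarkerValue in the data source definition above.

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

// Hypothetical helper, not part of any SDK.
public static class BlobSoftDelete
{
    // Must match softDeleteColumnName / softDeleteMarkerValue on the data source.
    private const string MarkerName = "IsDeleted";
    private const string MarkerValue = "true";

    public static async Task MarkDeletedAsync(CloudBlockBlob blob)
    {
        // Load the blob's current metadata first: SetMetadataAsync writes the
        // whole collection, so values that were not fetched would be lost.
        await blob.FetchAttributesAsync();

        blob.Metadata[MarkerName] = MarkerValue;
        await blob.SetMetadataAsync();
    }
}
```

Since the indexer can only see the marker on a blob that still exists, the physical delete would most likely have to wait until after the next indexer run.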

4
votes

While soft delete is an option, you can also modify the index targeted by the indexer directly if you so choose.

You can use the POST to index API detailed on this page to delete documents directly by their "key" field. An example is below:

POST https://[service name].search.windows.net/indexes/[index name]/docs/index?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]

{
  "value": [
    {
      "@search.action": "delete",
      "key_field_name": "value"
    }
  ]
}

Assuming you didn't use field mappings to modify the default "key" behavior of blob indexers, the documentation on this page says the key field will be the base64-encoded value of the metadata_storage_path property (again, refer to the previous link for details). Therefore, whenever you delete a blob, you can have a trigger POST the appropriate payload to every search index the document should be removed from.
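
As a rough sketch of what such a trigger could send, the C# below builds the delete payload and posts it with HttpClient. Everything named here (SearchIndexCleanup, BuildDeletePayload, DeleteDocumentsAsync and their parameters) is hypothetical, and it assumes you already have the documents' key values exactly as they are stored in the index (for example, from a search query), rather than computing the encoding yourself.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper, not part of any SDK.
public static class SearchIndexCleanup
{
    // Builds the JSON body: one "delete" action per document key.
    public static string BuildDeletePayload(string keyFieldName, IEnumerable<string> keys)
    {
        var actions = keys
            .Select(key => new Dictionary<string, string>
            {
                ["@search.action"] = "delete",
                [keyFieldName] = key
            })
            .ToList();

        return JsonSerializer.Serialize(new Dictionary<string, object> { ["value"] = actions });
    }

    // Posts the batch to the index. serviceName, indexName, apiVersion and
    // adminKey are placeholders for your own values.
    public static async Task<HttpResponseMessage> DeleteDocumentsAsync(
        HttpClient http, string serviceName, string indexName, string apiVersion,
        string adminKey, string keyFieldName, IEnumerable<string> keys)
    {
        var url = $"https://{serviceName}.search.windows.net/indexes/{indexName}" +
                  $"/docs/index?api-version={apiVersion}";

        using var request = new HttpRequestMessage(HttpMethod.Post, url)
        {
            Content = new StringContent(
                BuildDeletePayload(keyFieldName, keys), Encoding.UTF8, "application/json")
        };
        request.Headers.Add("api-key", adminKey);

        return await http.SendAsync(request);
    }
}
```

Keeping the payload builder separate from the HTTP call makes it easy to check the JSON shape without touching the service.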

1
votes

Here is a solution I implemented for removing blobs from an Azure Search data source.

  • Step 1: remove the document from blob storage
  • Step 2: remove the document from Azure Search

In the dictionary, each key is a container name and each value is the list of files in that container.

Here is the code sample:

    // Needs: System, System.Collections.Generic, System.Linq, System.Threading.Tasks,
    // Microsoft.Azure.Search, Microsoft.Azure.Search.Models, Microsoft.WindowsAzure.Storage.Blob
    public async Task<bool> RemoveFilesAsync(Dictionary<string, List<string>> listOfFiles)
    {
        CloudBlobClient cloudBlobClient = searchConfig.CloudBlobClient;
        foreach (var container in listOfFiles)
        {
            List<string> fileIds = new List<string>();
            CloudBlobContainer stagingBlobContainer = cloudBlobClient.GetContainerReference(container.Key);

            foreach (var file in container.Value)
            {
                CloudBlockBlob stagingBlob = stagingBlobContainer.GetBlockBlobReference(file);

                // Look up the index document for this file so its key ("id") can be deleted later.
                var parameters = new SearchParameters()
                {
                    Select = new[] { "id", "fileName" }
                };
                var results = await searchConfig.IndexClient.Documents.SearchAsync<Document>(file, parameters);

                var fileDetails = results.Results.FirstOrDefault(p =>
                    string.Equals(p?.Document["fileName"]?.ToString(), file, StringComparison.OrdinalIgnoreCase));
                if (fileDetails != null)
                    fileIds.Add(fileDetails.Document["id"]?.ToString());

                // Step 1: delete from blob storage
                await stagingBlob.DeleteAsync();
            }

            // Step 2: delete from the search index (skipped when no documents matched)
            if (fileIds.Count > 0)
            {
                var batch = IndexBatch.Delete("id", fileIds);
                await searchConfig.IndexClient.Documents.IndexWithHttpMessagesAsync(batch);
            }
        }

        return true;
    }