0
votes

Is it possible to merge multiple blobs into a single Azure Search record?

Complete scenario: We have a list of companies stored as JSON in Cosmos DB and their related documents (.docx/PDF) in blob storage. A company can have multiple documents of varying size, up to 20 MB each, and there is no upper limit on the number of documents. How can we merge the content of all documents and push it into the 'content' field of an Azure Search index, so that we can perform full-text search across company data coming from both Cosmos DB and blob storage?

I've looked into https://www.lytzen.name/2017/01/30/combine-documents-with-other-data-in.html. The scenario discussed in that tutorial has a one-to-one relationship between candidate data and CV. In our case there is a one-to-many relationship between a company and its documents.

Any help / direction would be appreciated.

Thanks


3 Answers

0
votes

The Azure Search blob indexer maps each blob to exactly one document in the search index (1:1). At the moment, there isn't a way to merge the content of multiple blobs into a single document automatically. However, you can always write a client application that does this and pushes the aggregated content to the Azure Search index using our SDK or REST API.
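A minimal sketch of such a client, assuming the text of each blob has already been extracted (for example with a .docx/PDF library of your choice) and that the index has `id`, `name`, and `content` fields. The function names, field names, and separator are hypothetical. It only builds the `mergeOrUpload` batch body that the REST API's documents endpoint (`POST /indexes/{index}/docs/index`) accepts; downloading blobs and sending the request are left out.

```python
import json


def build_company_document(company, document_texts, separator="\n\n"):
    """Combine one Cosmos DB company record with the text of all its blobs."""
    # Skip empty extractions so the merged field has no stray separators.
    cleaned = [text.strip() for text in document_texts if text and text.strip()]
    return {
        "@search.action": "mergeOrUpload",  # create the document or update it
        "id": str(company["id"]),           # key fields must be strings
        "name": company.get("name", ""),
        "content": separator.join(cleaned),
    }


def build_index_batch(companies, texts_by_company_id):
    """Return the JSON request body for one indexing batch."""
    actions = [
        build_company_document(c, texts_by_company_id.get(c["id"], []))
        for c in companies
    ]
    return json.dumps({"value": actions})


# Hypothetical sample data standing in for Cosmos DB records and blob text.
companies = [{"id": "42", "name": "Contoso"}]
texts = {"42": ["Annual report text. ", "", "Contract text."]}
batch = build_index_batch(companies, texts)
```

With one search document per company, re-running the batch whenever a blob is added or changed keeps `content` in sync, at the cost of re-sending the full merged text each time.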

I'm curious to learn more about your scenario. With a single document in the index per company, you won't be able to search for individual documents from blob storage. Is that what you want?

0
votes

It is possible to merge data from different datasources into a single document in a search index, as long as you're trying to "assemble" a document from multiple fields and not merging into a single field.
Note that:

  1. All the datasources must agree on what the document key is. By default, the key is the blob path. Since paths are unique across blobs, the need to agree on keys means that you need to set a metadata property on your "secondary" blobs that correlates them with the "primary" blob.

  2. You can't use indexers to merge multiple source documents into a single index field such as content. Likely, this is not what you need anyway for JSON metadata stored in Cosmos DB, since you probably want to capture that metadata into its own set of fields. For merging into the content field, you would need to write your own merging logic as noted in the previous response.

It seems that the fundamental primitive that would make your scenario "just work" is collection merge - you would model content not as a string, but as a collection of strings, where each element is extracted from one of your blobs. Please feel free to add a suggestion for collection merge functionality to our UserVoice.
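Until something like that exists in indexers, the same model can be approximated from a client application. The sketch below assumes a hypothetical index named `companies` whose `content` field is declared as `Collection(Edm.String)`, with one element per blob; the field names and the helper function are illustrative only. As far as I know, a merge action replaces a collection field rather than appending to it, so the client sends the full list each time.

```python
import json

# Hypothetical index definition: "content" holds one string per blob.
index_definition = {
    "name": "companies",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "name", "type": "Edm.String", "searchable": True},
        {"name": "content", "type": "Collection(Edm.String)", "searchable": True},
    ],
}


def company_action(company_id, name, blob_texts):
    """Build one indexing action carrying the complete list of blob texts."""
    return {
        "@search.action": "mergeOrUpload",
        "id": company_id,
        "name": name,
        "content": list(blob_texts),  # the whole collection, not a delta
    }


action = company_action("42", "Contoso", ["Annual report text.", "Contract text."])
body = json.dumps({"value": [action]})
```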

0
votes

One solution that I found is to compress the documents into a ZIP file and pass that ZIP file to the Azure Search indexer. The only problem with this solution is that it adds another processing step for ZIP creation and an additional storage cost for keeping the ZIP.
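The ZIP creation step can be fairly small. This is a rough sketch using only Python's standard library, assuming the company's files are already available as bytes; the function name and sample file contents are made up, and uploading the resulting archive to blob storage is not shown.

```python
import io
import zipfile


def zip_documents(files):
    """Bundle a mapping of file name -> bytes into one in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    # The archive is finalized when the "with" block exits.
    return buffer.getvalue()


# Placeholder bytes standing in for a company's real documents.
archive_bytes = zip_documents({
    "report.pdf": b"%PDF-1.4 placeholder",
    "contract.docx": b"placeholder docx bytes",
})
```

The archive has to be rebuilt and re-uploaded whenever any of the company's documents changes, which is the extra processing step mentioned above.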