0
votes

I have the following use case for building a Data Lake (e.g. in Azure):

My organization deals with companies that go into bankruptcy. Once a company goes bankrupt, it needs to hand over all of its data to us, including structured data (e.g. CSVs) as well as semi-structured and unstructured data (e.g. PDFs, Word documents, images, JSON, .txt files etc.). A data lake would help here because the volumes of data can be large and unpredictable, and Azure Data Lake seems like a relatively low-cost and scalable storage solution.

However, apart from storing all of that data we also need to give business users a tool that will enable them to search through all of that data. I can imagine two search types:

  • searching for specific files (using file names or part of file names as the search criteria)
  • searching through all text files (word documents, .txt and PDFs) and identifying those files that meet the search criteria (e.g. a specific phrase being searched for)

Are there any out-of-the-box tools that can use Azure Data Lake as a data source and would enable users to perform such searches?

2
Hi RobW, if my answer is helpful for you, could you please mark it as the answer? This can be beneficial to other community members. Thank you. - Leon Yue

2 Answers

0
votes

Unfortunately, there isn't currently a tool that can help you filter the files directly in Data Lake.

Even Azure Storage Explorer only supports search by prefix.

Data Factory lets us filter files, but it is usually used for copying and transferring data. Reference: Data Factory supports wildcard file filters for Copy Activity
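
To make the difference between a prefix search and a wildcard filter concrete, here is a minimal sketch in plain Python. It does not call any Azure service: the file paths are hypothetical, and `by_prefix` / `by_wildcard` are illustrative helper names, not part of Storage Explorer or Data Factory.

```python
from fnmatch import fnmatch

# Hypothetical file paths standing in for a Data Lake listing.
paths = [
    "acme/invoices/invoice_2019.pdf",
    "acme/contracts/lease.docx",
    "globex/invoices/2020_invoice.pdf",
]


def by_prefix(paths, prefix):
    """Prefix search: only finds paths that start with the given text."""
    return [p for p in paths if p.startswith(prefix)]


def by_wildcard(paths, pattern):
    """Wildcard filter: '*' may match any run of characters in the path."""
    return [p for p in paths if fnmatch(p, pattern)]


# The prefix search misses the globex file even though its name contains "invoice".
prefix_hits = by_prefix(paths, "acme/inv")

# The wildcard filter finds both invoice PDFs regardless of folder.
wildcard_hits = by_wildcard(paths, "*invoice*.pdf")
```

A prefix search only helps when users already know where a file lives, which is why it falls short of the "part of the file name" requirement in the question.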

Update:

Azure Cognitive Search seems to be a good choice.

Cognitive Search supports importing data from Data Lake, and it provides filters to help us search the files.

A filter provides criteria for selecting documents used in an Azure Cognitive Search query. Unfiltered search includes all documents in the index. A filter scopes a search query to a subset of documents.

For reference, see Filters in Azure Cognitive Search.
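
As a rough sketch of how the two search types from the question could map onto Cognitive Search queries, the snippet below only builds JSON request bodies for the "Search Documents" REST call; it does not send anything. It assumes the index was populated by the blob indexer and that the fields `metadata_storage_name`, `metadata_storage_path`, `metadata_storage_file_extension` and `content` exist in the index with the searchable/filterable attributes needed. The function names are hypothetical, and the phrase is assumed to contain no double quotes.

```python
import json


def file_name_query(fragment):
    """Body for a search on the start of a word in the file name (prefix match)."""
    return {
        "search": f"{fragment}*",  # trailing * = prefix search in simple syntax
        "searchFields": "metadata_storage_name",
        "select": "metadata_storage_name,metadata_storage_path",
    }


def phrase_query(phrase, extension=None):
    """Body for an exact-phrase search in the extracted text of documents."""
    body = {
        "search": f'"{phrase}"',  # double quotes request a phrase match
        "searchFields": "content",
        "select": "metadata_storage_name,metadata_storage_path",
    }
    if extension:
        # OData filter scoping the query to one file type;
        # single quotes inside the value are escaped by doubling them.
        safe = extension.replace("'", "''")
        body["filter"] = f"metadata_storage_file_extension eq '{safe}'"
    return body


# Example: find PDFs that contain a specific phrase.
payload = json.dumps(phrase_query("insolvency notice", ".pdf"), indent=2)
```

The resulting body would be POSTed to the index's `docs/search` endpoint with an API key. How well partial file names match depends on the analyzer configured for the name field, so treat the prefix query as a starting point rather than a guarantee.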

Hope this helps.

0
votes

Cognitive Search with Azure Data Lake is definitely an option, and it is what Microsoft recommends. There are several factors to consider:

  1. Price. https://azure.microsoft.com/en-us/pricing/details/search/. Not a cheap option.
  2. Size of your source data and of the index you need.
  3. Your familiarity with other open-source services. ELK is a popular open-source stack for full-text search.