
I have an Azure Durable Functions project where I'm working with relatively big CSV files. I have two trigger functions:


  • One trigger function needs to get a file (over 100 MB) from Azure Blob Storage, split it into smaller chunks, and put them back into Azure Blob Storage (in a different blob directory).

    This is the orchestration trigger function:

[FunctionName(nameof(FileChunkerOrchestration))]
public async Task Run(
  [OrchestrationTrigger] IDurableOrchestrationContext context, 
  ILogger log)
{
  var blobName = context.GetInput<string>();  
  var chunkCounter = await context.CallActivityAsync<int>(nameof(FileChunkerActivity), blobName);
}

This is the activity function:

[StorageAccount("AzureWebJobsStorage")]
[FunctionName(nameof(FileChunkerActivity))]
public async Task<int> Run(
  [ActivityTrigger] IDurableActivityContext context,
  string fileName,
  [Blob("vessel-container/csv-files/{fileName}.csv", FileAccess.Read)]TextReader blob,
  [Blob("vessel-container/csv-chunks", FileAccess.Write)] CloudBlobContainer container)
{
  // Uses TextReader blob to create chunk files 
  // Then stores chunk by chunk (as soon as one chunk is created then it's being uploaded to Blob storage) in CloudBlobContainer container
}
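
For illustration only, the body of the chunking activity above is roughly along these lines (a simplified sketch, not the actual work code). It assumes a fixed number of data lines per chunk, a header row repeated in every chunk, System.Text for StringBuilder, and a chunk blob naming scheme that is just a placeholder:

// Simplified sketch of the chunking loop; chunk size and blob names are assumptions.
const int linesPerChunk = 10000;            // assumed chunk size
var header = await blob.ReadLineAsync();    // first line is the CSV header
var buffer = new StringBuilder();
int linesInChunk = 0, chunkCounter = 0;
string line;
while ((line = await blob.ReadLineAsync()) != null)
{
  if (linesInChunk == 0)
    buffer.AppendLine(header);              // repeat the header in every chunk
  buffer.AppendLine(line);
  if (++linesInChunk == linesPerChunk)
  {
    // Upload the finished chunk immediately, then start filling the next one
    var chunk = container.GetBlockBlobReference($"{fileName}-{chunkCounter++}.csv");
    await chunk.UploadTextAsync(buffer.ToString());
    buffer.Clear();
    linesInChunk = 0;
  }
}
if (linesInChunk > 0)
{
  // Upload the last, partially filled chunk
  var chunk = container.GetBlockBlobReference($"{fileName}-{chunkCounter++}.csv");
  await chunk.UploadTextAsync(buffer.ToString());
}
return chunkCounter;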

  • The second trigger function gets triggered for each chunk that has been created:
[StorageAccount("AzureWebJobsStorage")]
[FunctionName(nameof(FileTrigger))]
public async Task Run(
  [DurableClient] IDurableClient starter,  
  [BlobTrigger("vessel-container/csv-chunks/{name}.csv")] Stream blob,
  string name,
  ILogger log)
{
  // Here the processing chunk files start            
}

The problem I'm having is that the second function triggers on each and every CSV chunk file and runs in parallel, which causes my project to use too much of the available RAM.

I need to fix this so that my second function (it's an orchestration) processes the chunks one file at a time.

Please share any ideas on how to overcome this problem. Thanks in advance.

Could you share your code a bit? I'm thinking you could split the chunks into groups and await Task.WhenAll() on each group in succession (a sketch of that idea follows these comments). – juunas
I can't share the source code because it's work-related, but I could write pseudo-code for those functions if that would help. Basically, one trigger function does the splitting and uploads chunk by chunk at the same time, while the second function triggers on each chunk. The problem is that the first function does its job very quickly, so in the end the second function gets triggered on all the chunks at once. Ideally, the second function should take one chunk, complete its processing, and only then pick up the second chunk, the third chunk, and so on. – Petar Kovac
Pseudocode would definitely help, I think. So are these functions blob triggered, or is it some other trigger? – juunas
Yes, they are all blob triggered, but on different blob directories. I've edited the question with some examples. – Petar Kovac
IIRC there is a JSON config file where you can set these parameters. – leppie
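
A rough sketch of the batching idea mentioned in the comments, purely for illustration; chunkBlobNames and ProcessChunkAsync are hypothetical placeholders (requires System.Linq and System.Threading.Tasks):

// Process the chunk blobs in fixed-size groups, awaiting each group before starting the next.
const int batchSize = 4;
for (int i = 0; i < chunkBlobNames.Count; i += batchSize)
{
  var batch = chunkBlobNames
    .Skip(i)
    .Take(batchSize)
    .Select(name => ProcessChunkAsync(name));
  await Task.WhenAll(batch);   // at most batchSize chunks are in flight at once
}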

1 Answer


A possible solution to this problem is to decouple the processing: catch the events generated when the CSV chunk files are created in Azure Storage via Event Grid and write them to a storage queue as the target (minimal setup and pretty easy). The storage queue is then consumed by an Azure Function whose concurrency is limited via settings in host.json (this may collide with other function scale targets in the same function app and may require a second function app with dedicated settings).
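
To limit the concurrency of that queue-triggered consumer, a minimal host.json sketch could look like this (assuming the Azure Storage Queues extension; property names can vary between extension versions):

{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 1,
      "newBatchThreshold": 0
    }
  }
}

With batchSize set to 1 and newBatchThreshold set to 0, each host instance works on a single queue message at a time. If the function app runs on a Consumption plan, you may additionally want to cap scale-out (for example via the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT app setting) so extra instances don't pick up chunks in parallel.

The consumer itself can then be an ordinary queue-triggered function; the queue name and payload handling below are assumptions for illustration only:

[StorageAccount("AzureWebJobsStorage")]
[FunctionName("ChunkQueueProcessor")]
public void Run(
  [QueueTrigger("csv-chunk-events")] string eventPayload,
  ILogger log)
{
  // eventPayload is the Event Grid event for the newly created chunk blob.
  // Parse the blob URL out of the event, then download and process that single chunk here.
  // With batchSize = 1 in host.json, only one chunk is processed at a time per instance.
  log.LogInformation("Processing chunk event: {payload}", eventPayload);
}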