3 votes

I've created my first big Durable Function in Azure; it runs about 12 activity functions for each page of a document. The other day I processed at most 5,000 pages. I understand each activity puts an item on the work-items Q, so in theory I wrote 60K queue messages, which also needed to be read, making 120K Q actions.

North Europe LRS GPv2 storage is £0.003/10K Class 2 Q actions

So that's 12 * £0.003 = 3.6p (120K actions at £0.003 per 10K). However, my subscription shows over 90p worth of Class 2 Q actions, which equates to 3M Q operations, or about 600 Q actions per page. That was substantially more than my compute cost on the consumption plan over the same period, which was 29p for the day in question. I appreciate there are other messages that need to go onto the Q, but I wasn't expecting this many!

Am I missing something in how Durable Functions use the Q? Is there anything I can look out for or monitor to count, for a single page, all the Q (cost) actions it generates, so I can maybe work out how to reduce them?

I'd appreciate any insights, as I'm loving Durable Functions, but the storage costs are starting to get prohibitive! I'm going to move to a v1 storage account to cut costs, but I'd still like to understand whether this is just the cost of Durable Functions or whether I'm doing something inefficient in my functions.

UPDATE 3 (PROBABLY THE MOST USEFUL)

I created a new version that just takes a file: one activity splits it into single-page PDFs, and then another activity processes each page, converting it to a PNG. I created a new storage account and turned on every logging option, and the Storage Analytics logging proved the most useful. I uploaded a 100-page PDF, then downloaded and analysed the logs for the period the function ran, and saw the following:

Processing ran approximately from 10:31 to 10:43, during which time the log file showed:

203 PutMessage (which makes sense: presumably 101 activity calls at two messages each, plus one message to start the orchestration)
203 DeleteMessage

7868 GetMessage - all from the workitems queue

26445 GetMessages - all from the control queues; the breakdown was:
control-00 - 6217
control-01 - 6375
control-02 - 5134
control-03 - 8719

Outside of this, I left the app idle (on the consumption plan), and each hour I see approximately:

3500 - GetQueueMetadata - 700 against each Q, control 00-03 and workitems
3500 - PeekMessage - 700 against each Q, control 00-03 and workitems

(There may have been a missing final log here. I think it's polling each of the 5 Qs with each type of request every 10s, which would be 3,600 Q actions an hour. I'm not sure why it needs to poll the control and workitem Qs if nothing has been submitted for processing?)
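One avenue that looks worth investigating for this idle polling: the durableTask section of host.json exposes settings that should reduce it, though I haven't measured the saving myself. A sketch, assuming a reasonably recent version of the Durable Functions extension; partitionCount reduces the number of control queues, and maxQueuePollingInterval caps how often the queues are polled:

{
  "durableTask": {
    "partitionCount": 2,
    "maxQueuePollingInterval": "00:00:30"
  }
}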

It seems to me like there is an inordinate number of GetMessage and GetMessages calls being fired while the function is actually doing work. I can provide the logs if that would be of help diagnosing the issue.

(If it's relevant, the instance count went from 0 to 8 by the time it was finished. I'm wondering if there is a correlation between instances and the abundance of Q requests, since presumably each instance polls each of the queues independently.)

UPDATE 1

So I ran the following query, which I hope gives me all the function invocations for the day in question:

let start = datetime(2018-06-12T00:00:00);
traces
| where timestamp > start and timestamp < start + 24h
| where message startswith "function started" 
| summarize count()

This gave me a total of 19,939, which is about right: while there are multiple functions, only some of them are called for every page, and I actually processed about 3,500 pages.
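To break that total down and confirm the expected mix of activities per page, a variation of the same query grouped by function name should work; in the Azure Functions trace schema, operation_Name carries the function name:

let start = datetime(2018-06-12T00:00:00);
traces
| where timestamp > start and timestamp < start + 24h
| where message startswith "function started"
| summarize count() by operation_Name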

I then worked out how to query the queue metrics for that storage account, and once I had the right filter I found the following, which is where all the Q transactions went:

[Chart: Q transactions]

That seems like a ridiculous amount of Q reading? Just to check, I also looked at the volume of messages put on the Q, and 25K seems approximately right:

[Chart: messages put on the Q]

I've since moved to a v1 storage account and tried another run today of about 8K pages, so I'll see how the metrics look for that once they've all filtered through to the storage account metrics.

NB: Looking at the charts, I note the data for the 13th seems more sensible. I don't believe I changed anything significant, other than possibly turning on sampling for Application Insights, as it was generating a lot of data!
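For reference, turning on sampling on the v1 Functions runtime is a host.json change along these lines (the maxTelemetryItemsPerSecond value here is illustrative):

{
  "applicationInsights": {
    "sampling": {
      "isEnabled": true,
      "maxTelemetryItemsPerSecond": 5
    }
  }
}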

UPDATE 2

I'm seeing similarly huge numbers for the roughly 8K-page processing run yesterday on the new v1 storage account.



2 Answers

0 votes

That order of magnitude for just queues seems a little odd.

Generally speaking, each activity function call will result in two queue messages, one for the request and one for the response. There is also the cost of table storage operations and queue polling operations, but I assume you're just focusing on queue messages for this question?

One thing that would help diagnose this is to check whether everything is executing correctly. Do you have Application Insights set up? Take a look at our Diagnostics documentation for how you can use Application Insights to query the orchestration tracking data. Specifically, check whether you are actually executing 12 activity functions per document, or whether something else is going on that's causing these extremely high costs.
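For example, the tracking data lands in the traces table; a sketch of the kind of query that documentation describes, assuming the standard tracking schema (category Host.Triggers.DurableTask, with the function name and state in customDimensions), would count completed executions per function:

traces
| where customDimensions.Category == "Host.Triggers.DurableTask"
| extend functionName = tostring(customDimensions["prop__functionName"])
| extend state = tostring(customDimensions["prop__state"])
| where state == "Completed"
| summarize count() by functionName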

0 votes

Not sure if you've moved on, but a bug was fixed in the Durable Functions extension that seems to explain similar behaviour.

Maybe try updating to v1.7 if you haven't yet, and see if that resolves the issue.