
I am in the process of building a solution which processes data through a pipeline of Azure Functions. In total there are more than ten functions, and data can fork off in different directions.

In development we have been using App Insights, and the built-in correlation has been invaluable; being able to see how one item of data travels through the system is amazing.

Up until this point, we have been using ingestion sampling to limit the App Insights cost, which has worked fine, and it has preserved related events because it is handled on the App Insights service side (from what I understand).

We are considering Adaptive Sampling instead, to give us more control over how the sampling occurs, but my concern is that because it happens client-side (inside the Azure Function), it won't respect correlation and we won't be able to see the full journey of a request. I have looked through the docs and can't confirm this; does anyone know the answer?

thanks!


1 Answer


Ingestion-side sampling works the same way as described below; the only difference is that all items are uploaded first and the discarding happens on the service side. In that respect it is similar to fixed sampling.

Here is how sampling (fixed or adaptive) works within the boundary of a single application:

  1. The incoming request has an operation id (either a new one, or one received from upstream).
  2. Every telemetry item (request, dependency, exception, etc.) is stamped with this operation id.
  3. For every telemetry item, the AI SDK hashes the operation id and transforms it to the [0, 1] range.
  4. This sampling ratio is compared to the current sampling threshold (either fixed or adaptive).
  5. If the value is less than the threshold, the item is sampled in.

This guarantees that all items within one application are either all sampled in or all sampled out (with a corner case when the adaptive sampling threshold changes during a transaction whose sampling ratio sits right at the borderline). A minimal sketch of this decision is below.
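To make the mechanics concrete, here is a minimal Python sketch of that decision. The hash is a simple stand-in (the real AI SDK has its own hashing algorithm), and the operation id and threshold values are made up:

```python
import hashlib

def sampling_score(operation_id: str) -> float:
    """Map an operation id to a deterministic value in [0, 1).

    The real AI SDK uses its own hashing algorithm; SHA-256 here is just a
    stand-in to show that the score depends only on the operation id.
    """
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer and normalise to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64

def is_sampled_in(operation_id: str, threshold: float) -> bool:
    """An item is kept when its score falls below the current threshold."""
    return sampling_score(operation_id) < threshold

# Every telemetry item stamped with the same operation id gets the same
# score, so within one application they are all kept or all dropped.
op_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for item in ("request", "dependency", "exception"):
    print(item, is_sampled_in(op_id, threshold=0.25))
```

Because the score is derived only from the operation id, every item in the same transaction lands on the same side of the threshold.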

Now, how this extends to distributed applications:

  1. The operation id is propagated not only to other telemetry items but also as part of outgoing calls (HTTP(S) requests, queue items, etc.), following the W3C standard.
  2. The same sampling logic is applied in every application independently, but because the operation id is the same, the sampling ratio for a particular transaction is the same everywhere.
  3. If we have a distributed system with fixed sampling, and the ratio is exactly the same for every application, then the transaction is guaranteed to be either fully sampled in or fully sampled out.
  4. If the distributed system uses adaptive sampling and the applications are under different load, the thresholds diverge and transactions might be only partially sampled in (see the sketch after this list).
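The partial-sampling case can be seen by reusing the same scoring idea with a different threshold per application. The application names and threshold values below are hypothetical, purely to illustrate what adaptive sampling might settle on under different load:

```python
import hashlib

def sampling_score(operation_id: str) -> float:
    # Same stand-in hash as in the previous sketch.
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

# The operation id travels with the outgoing call (in the correlation
# headers), so each application scores the same id independently against
# its own, possibly different, adaptive threshold.
op_id = "4bf92f3577b34da6a3ce929d0e0e4736"
score = sampling_score(op_id)

# Hypothetical thresholds under different load (illustrative only).
thresholds = {"ingest-function": 0.60, "busy-worker-function": 0.05}

for app, threshold in thresholds.items():
    decision = "sampled in" if score < threshold else "sampled out"
    print(f"{app}: score={score:.3f}, threshold={threshold:.2f} -> {decision}")

# If the score lands between the two thresholds, the transaction is only
# partially sampled in: one application keeps its telemetry, the other drops it.
```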

How the Application Insights UX helps find a full transaction:

  1. In the two main troubleshooting experiences, Failures and Performance, after slicing and dicing (narrowing with various filters), the offered items are sorted by "Relevance".
  2. Since all items satisfy the filter criteria, they are essentially sorted by sampling ratio (the UX uses exactly the same algorithm).
  3. Now a little bit of math =) With, let's say, 100 telemetry items to pick from, a sample will be found with 99.5% probability which would be sampled in for an application with an adaptive sampling ratio of 1% (100 requests/sec/VM); see the quick calculation below.
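The calculation behind that last point is the usual "at least one success" probability. Treating each of the n offered items as an independent candidate with probability p of also having been sampled in by the other application, the chance of finding a fully retained transaction is 1 - (1 - p)^n; both values below are illustrative assumptions, since the real p depends on the downstream application's threshold at the time:

```python
def chance_of_full_transaction(p: float, n: int) -> float:
    """Probability that at least one of n independent candidates was also
    sampled in downstream, given a per-candidate probability p."""
    return 1 - (1 - p) ** n

# Illustrative per-candidate probabilities; the real value depends on the
# downstream application's adaptive sampling threshold at the time.
for p in (0.01, 0.02, 0.05):
    print(f"p={p:.0%}, n=100 -> {chance_of_full_transaction(p, n=100):.1%}")
```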
