Still using Jaeger/Sentry? Uptrace is a distributed tracing tool that monitors performance, errors, and logs using OpenTelemetry.

Head-based and tail-based sampling, rate-limiting

Introduction

Sampling reduces the cost and verbosity of tracing by reducing the number of created (sampled) spans. Sampling may happen in different stages of spans processing:

Sampling probability

Sampling provides a sampling probability which enables accurate statistical counting of all spans using only a portion of sampled spans. For example, if the sampling probability is 50% and the number of sampled spans is 10, then the adjusted (total) number of spans is 10 / 50% = 20.

NameSideAdjusted countAccuracy
Head-based samplingClient-sideYes100%
Rate-limiting samplingServer-sideYes<90%
Tail-based samplingServer-sideYes<90%

Head-based sampling

Head-based sampling makes the sampling decision as early as possible and propagates it to other participants using the context. This allows saving a lot of resources by not collecting any telemetry data for dropped spans. It is the simplest, most accurate, and most reliable sampling method which you should prefer over all other methods.

OpenTelemetry has 2 span properties responsible for client sampling:

  • IsRecording - when false, span discards attributes, events, links etc.
  • Sampled - when false, OpenTelemetry drops the span.

You should check IsRecording property to avoid collecting expensive telemetry data.

if span.IsRecording() {
    // collect expensive data
}

Sampler is a function that accepts a root span about to be created. The function returns a sampling decision which must be one of:

  • Drop - trace is dropped. IsRecording = false, Sampled = false.
  • RecordOnly - trace is recorded but not sampled. IsRecording = true, Sampled = false.
  • RecordAndSample - trace is recorded and sampled. IsRecording = true, Sampled = true.

By default, OpenTelemetry samples all traces, but you can configure it to sample a portion of traces. In that case, backends use the sampling probability to adjust the number of spans.

Head-based sampling is efficient and accurate but not very flexible. It does not account for traffic spikes and may collect more data than desired. This is where rate-limiting sampling becomes handy. It ensures that backends do not exceed certain limits when receiving spans from clients.

Rate-limiting sampling

Rate-limiting sampling happens on the server side and ensures that you don't exceed certain limits, for example, it allows to sample 10 or less traces per seconds.

Rate-limiting sampling supports adjusted counts but the accuracy is rather low. To achieve better results and improve performance, you should use rate-limiting sampling together with head-based sampling which is more efficient and accurate.

Most backends (including Uptrace) automatically apply rate-limiting sampling when necessary.

Tail-based sampling

With head-based sampling the sampling decision is made upfront and usually at random. Head-based sampling can't sample failed or unusually long operations, because that information is only available at the end of a trace.

With tail-based sampling we delay the sampling decision until all spans of a trace are available which enables better sampling decisions based on all data from the trace. For example, we can sample failed or unusually long traces.

Most backends (including Uptrace) automatically apply tail-based sampling when necessary, but you can also use OpenTelemetry Collector with tailsamplingprocessoropen in new window to configure sampling according to your needs.

Last Updated: