Head-based and tail-based sampling, rate-limiting
Introduction
Sampling reduces the cost and verbosity of tracing by reducing the number of spans that are created (sampled). Sampling can happen at different stages of span processing:
- when a span is created (head-based sampling);
- when a span is received by a backend (rate-limiting sampling);
- when a complete trace is fully assembled (tail-based sampling).
Sampling probability
Each sampling method provides a sampling probability, which makes it possible to estimate the total number of spans from only the sampled portion. For example, if the sampling probability is 50% and the number of sampled spans is 10, then the adjusted (total) number of spans is 10 / 50% = 20.
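The arithmetic above can be written as a small helper. This is only a sketch: the name AdjustedCount and its error handling are illustrative assumptions, not part of OpenTelemetry or any backend.

```go
package main

import "errors"

// AdjustedCount estimates the total number of spans represented by
// `sampled` spans that were kept with the given sampling probability.
// AdjustedCount is a hypothetical helper name used for illustration.
func AdjustedCount(sampled int, probability float64) (float64, error) {
	// A probability outside (0, 1] cannot be used to scale counts.
	if probability <= 0 || probability > 1 {
		return 0, errors.New("probability must be in (0, 1]")
	}
	return float64(sampled) / probability, nil
}
```

Calling AdjustedCount(10, 0.5) reproduces the example from the text: 10 sampled spans at 50% stand for 20 spans in total.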
Name | Side | Adjusted count | Accuracy |
---|---|---|---|
Head-based sampling | Client-side | Yes | 100% |
Rate-limiting sampling | Server-side | Yes | <90% |
Tail-based sampling | Server-side | Yes | <90% |
Head-based sampling
Head-based sampling makes the sampling decision as early as possible and propagates it to other participants using the context. This saves a lot of resources, because no telemetry data is collected for dropped spans. It is the simplest, most accurate, and most reliable sampling method, and you should prefer it over the other methods.
OpenTelemetry has 2 span properties responsible for client sampling:
- IsRecording - when false, span discards attributes, events, links etc.
- Sampled - when false, OpenTelemetry drops the span.
You should check the IsRecording property to avoid collecting expensive telemetry data:
if span.IsRecording() {
	// collect expensive data
}
A sampler is a function that accepts a root span that is about to be created. It returns a sampling decision, which must be one of:
- Drop - trace is dropped. IsRecording = false, Sampled = false.
- RecordOnly - trace is recorded but not sampled. IsRecording = true, Sampled = false.
- RecordAndSample - trace is recorded and sampled. IsRecording = true, Sampled = true.
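The mapping from the three decisions to the two span properties can be sketched in Go as follows. The Decision type and its Flags method are hypothetical names chosen for this illustration; they mirror the list above and are not the OpenTelemetry SDK types.

```go
package main

// Decision is a sampling decision returned by a sampler.
// This type is a hypothetical stand-in used only for illustration.
type Decision int

const (
	Drop Decision = iota
	RecordOnly
	RecordAndSample
)

// Flags returns the (IsRecording, Sampled) pair implied by a decision.
func (d Decision) Flags() (isRecording, sampled bool) {
	switch d {
	case RecordOnly:
		return true, false
	case RecordAndSample:
		return true, true
	default: // Drop, and any unknown value, records and samples nothing
		return false, false
	}
}
```

Treating unknown values like Drop is a design choice of this sketch: when in doubt, it collects nothing.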
By default, OpenTelemetry samples all traces, but you can configure it to sample a portion of traces. In that case, backends use the sampling probability to adjust the number of spans.
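One common way to sample a portion of traces is to derive the decision from the trace ID, so that every participant reaches the same decision for the same trace. The sketch below illustrates the idea under the assumption that trace IDs are uniformly random. RatioSampler, NewRatioSampler, and TraceID are hypothetical names defined here, not the SDK's own sampler.

```go
package main

import "encoding/binary"

// TraceID is a 16-byte trace identifier, assumed to be uniformly random.
type TraceID [16]byte

// RatioSampler keeps roughly `fraction` of all traces.
// It is an illustrative sketch, not the OpenTelemetry SDK sampler.
type RatioSampler struct {
	fraction float64
	bound    uint64 // trace IDs that map below this value are sampled
}

// NewRatioSampler clamps fraction to [0, 1]; NaN is treated as 0.
func NewRatioSampler(fraction float64) RatioSampler {
	if !(fraction > 0) {
		fraction = 0
	} else if fraction > 1 {
		fraction = 1
	}
	return RatioSampler{fraction: fraction, bound: uint64(fraction * (1 << 63))}
}

// ShouldSample is deterministic: the same trace ID always gets the same answer.
func (s RatioSampler) ShouldSample(id TraceID) bool {
	// Map the last 8 bytes of the ID into the range [0, 2^63).
	x := binary.BigEndian.Uint64(id[8:16]) >> 1
	return x < s.bound
}

// Probability is the value a backend would need to compute adjusted counts.
func (s RatioSampler) Probability() float64 { return s.fraction }
```

Because the decision depends only on the trace ID and the configured fraction, the sampling probability is known exactly, which is what lets a backend adjust span counts accurately.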
Head-based sampling is efficient and accurate but not very flexible. It does not account for traffic spikes and may collect more data than desired. This is where rate-limiting sampling comes in handy: it ensures that backends do not exceed certain limits when receiving spans from clients.
Rate-limiting sampling
Rate-limiting sampling happens on the server side and ensures that you don't exceed certain limits. For example, it can be configured to sample 10 or fewer traces per second.
Rate-limiting sampling supports adjusted counts, but the accuracy is rather low. To achieve better results and improve performance, use rate-limiting sampling together with head-based sampling, which is more efficient and accurate.
Most backends (including Uptrace) automatically apply rate-limiting sampling when necessary.
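To make the idea concrete, here is a minimal sketch of a rate limiter that keeps at most a fixed number of traces per one-second window. It is an assumption-laden illustration, not how Uptrace or any other backend implements it: RateLimiter, Allow, and Probability are hypothetical names, the window is a simple fixed window, and the type is not safe for concurrent use.

```go
package main

// RateLimiter keeps at most `limit` traces in each one-second window.
// It is a hypothetical sketch and is not safe for concurrent use.
type RateLimiter struct {
	limit  int
	window int64 // Unix second of the current window
	seen   int   // traces offered in the current window
	kept   int   // traces kept in the current window
}

func NewRateLimiter(limit int) *RateLimiter {
	return &RateLimiter{limit: limit}
}

// Allow reports whether a trace arriving at the given Unix second is kept.
func (r *RateLimiter) Allow(unixSecond int64) bool {
	if unixSecond != r.window {
		// A new window starts: reset the counters.
		r.window, r.seen, r.kept = unixSecond, 0, 0
	}
	r.seen++
	if r.kept >= r.limit {
		return false
	}
	r.kept++
	return true
}

// Probability is the fraction of traces kept so far in the current window.
// It is only known after the fact, which is one reason adjusted counts
// from rate limiting are less accurate than those from head-based sampling.
func (r *RateLimiter) Probability() float64 {
	if r.seen == 0 {
		return 1
	}
	return float64(r.kept) / float64(r.seen)
}
```

In a real receiver, the caller would pass time.Now().Unix() to Allow. This sketch keeps the first traces of each window and drops the rest, so the kept traces are not a random sample; a production limiter would likely spread its decisions more evenly.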
Tail-based sampling
With head-based sampling, the sampling decision is made upfront and usually at random. Head-based sampling can't specifically target failed or unusually long operations, because that information is only available at the end of a trace.
With tail-based sampling, we delay the sampling decision until all spans of a trace are available, which enables better sampling decisions based on all the data from the trace. For example, we can sample failed or unusually long traces.
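A tail-based decision of this kind can be sketched as a function over the complete set of spans in a trace. The Span struct and KeepTrace function below are hypothetical and deliberately minimal; they assume each finished span exposes a duration and a failure flag.

```go
package main

import "time"

// Span is a hypothetical, minimal view of a finished span.
type Span struct {
	Name     string
	Duration time.Duration
	Failed   bool
}

// KeepTrace makes a tail-based decision once all spans of a trace are
// available: the trace is kept if any span failed or took longer than `slow`.
func KeepTrace(spans []Span, slow time.Duration) bool {
	for _, s := range spans {
		if s.Failed || s.Duration > slow {
			return true
		}
	}
	return false
}
```

A realistic policy would probably also keep a small random share of ordinary traces so that normal behavior stays visible. The sketch omits that to focus on the part that head-based sampling cannot do.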
Most backends (including Uptrace) automatically apply tail-based sampling when necessary, but you can also use the OpenTelemetry Collector with tailsamplingprocessor to configure sampling according to your needs.