
feat(trace-sampler): add shared datadog-agent-trace-sampler crate#141

Draft
lucaspimentel wants to merge 1 commit into main from lpimentel/datadog-agent-trace-sampler

@lucaspimentel lucaspimentel commented Jul 2, 2026

Member

TL;DR

New dependency-free crate datadog-agent-trace-sampler: a Rust port of the Go trace agent's error sampler, 1:1 apart from one documented narrowing. It rescues error traces that would otherwise be dropped when the agent computes stats and drops P0s, keeping error visibility under aggressive sampling. Pure leaf crate (no new deps, no protobuf types in the API), shared by bottlecap now and SCL later. This PR adds the crate only; consumer wiring is separate.


What does this PR do?

Adds a new dependency-free leaf crate, datadog-agent-trace-sampler, that ports the Go trace agent's error sampler (ScoreSampler targeting ErrorTPS) from DataDog/datadog-agent.

The error sampler is a rescue sampler: after the agent decides to drop a trace, the sampler gets a second look at it. If the trace contains an error, it is kept, up to a budget of target_tps (default 10) error traces per second, distributed fairly across distinct trace signatures. This preserves error visibility even under aggressive sampling.
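For reviewers unfamiliar with how a TPS budget can be shared across signatures, here is a minimal sketch of an unused-budget cascade. It is not the crate's code: the function name `tps_per_signature` is hypothetical, and the logic is only modeled on my understanding of the Go agent's computeTPSPerSig. Signatures that use less than their even share hand the remainder to busier ones.

```rust
/// Hypothetical helper (not the crate's API): given the observed TPS of each
/// signature, return the per-signature TPS target so that budget left unused
/// by quiet signatures cascades to the busier ones.
fn tps_per_signature(target_tps: f64, seen_tps: &[f64]) -> f64 {
    // Assumption: with no signatures seen, the whole budget is available.
    if seen_tps.is_empty() {
        return target_tps;
    }

    // Visit the quietest signatures first.
    let mut sorted = seen_tps.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mut remaining = target_tps;
    let mut per_sig = remaining / sorted.len() as f64;

    for (i, &seen) in sorted.iter().enumerate() {
        // Stop once a signature needs at least its share, or at the last one.
        if seen >= per_sig || i == sorted.len() - 1 {
            break;
        }
        // This signature uses less than its share: give the rest to the others.
        remaining -= seen;
        per_sig = remaining / (sorted.len() - i - 1) as f64;
    }
    per_sig
}
```

With a budget of 10 TPS and observed rates of 1, 100, and 100 TPS, the quiet signature keeps everything it sends and the two busy signatures split the remaining 9 TPS, so the function returns 4.5.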

The public API takes primitives in (SpanView / TraceView) and returns a SampleDecision out. It never exposes a protobuf Span type, so consumers pinning different libdatadog revisions can share it without compiling incompatible pb::Span types into their build graphs.

The port includes:

  • FNV-1a 32-bit trace signatures (signature.rs)
  • deterministic sample-by-rate (integer comparison, matching the Go agent's SampleByRate)
  • the 6×5s rolling-bucket TPS budget with unused-budget cascade, 20% rate-increase cap, moving-max decay, and cardinality shrink (score_sampler.rs)
  • unit tests mirroring the Go table tests (signatures, computeTPSPerSig, zeroAndGetMax, rate increase/eviction, default rate, disable, shrink, target-TPS effectiveness)
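To make the first two items concrete, here is a hedged sketch of FNV-1a 32-bit hashing and an integer-comparison sample-by-rate. The FNV constants are the standard published ones. The `KNUTH_FACTOR` multiplier and the exact comparison are assumptions based on my reading of the Go agent's SampleByRate; the names and signatures are illustrative and may differ from what signature.rs actually contains.

```rust
// Standard FNV-1a 32-bit parameters.
const FNV_OFFSET_BASIS_32: u32 = 0x811c_9dc5;
const FNV_PRIME_32: u32 = 0x0100_0193;

/// FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a_32(bytes: &[u8]) -> u32 {
    bytes.iter().fold(FNV_OFFSET_BASIS_32, |hash, &b| {
        (hash ^ u32::from(b)).wrapping_mul(FNV_PRIME_32)
    })
}

// Assumption: multiplier believed to match the Go agent; verify against the source.
const KNUTH_FACTOR: u64 = 1_111_111_111_111_111_111;

/// Deterministic sampling: the same trace ID and rate always give the same answer,
/// because the decision is a pure integer comparison with no random number generator.
fn sample_by_rate(trace_id: u64, rate: f64) -> bool {
    if rate >= 1.0 {
        return true;
    }
    // Float-to-int casts saturate in Rust, so negative or NaN rates map to 0.
    let threshold = (rate * u64::MAX as f64) as u64;
    trace_id.wrapping_mul(KNUTH_FACTOR) < threshold
}
```

Because the decision depends only on the trace ID and the rate, every component that sees the same trace reaches the same keep/drop verdict without coordination.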

This crate is added up front so both serverless agents can consume it:

  • bottlecap (APMSVLS-469) consumes it via the existing serverless-components git dependency and wires the rescue into its lambda_extension_compute_stats P0-drop path (separate PR in datadog-lambda-extension).
  • SCL (APMSVLS-472) becomes wiring-only once its agent-side P0-drop step lands.

No new third-party dependencies, so LICENSE-3rdparty.csv is unchanged.

Motivation

Closes the "Error sampler (ScoreSampler)" agent-side sampling parity gap for the serverless agents: when the agent computes stats and drops P0 chunks, error chunks are currently lost. See APMSVLS-469 (bottlecap) and APMSVLS-472 (SCL).

The rare-sampler half of those tickets is intentionally not ported: it is deprecated in the Go trace agent and no longer enabled by default.

Additional Notes

  • The crate is dependency-free by design, rather than being folded into datadog-trace-agent (which pulls in the full hyper/reqwest HTTP stack) or libdatadog. This also sidesteps the libdatadog rev drift between consumers (bottlecap pins 48da0d82, SCL pins a8206994), which would break a shared pb::Span API.
  • now_unix_secs is passed into sample() rather than read from a clock, keeping the crate dependency-free and the rolling-window logic deterministically testable.
  • One documented, intentional narrowing from the Go source: weight_root uses only the root's global sample rate (there is no agent pre-sampler in serverless, so the pre-sampler rate is always 1.0).
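The benefit of passing now_unix_secs in can be illustrated with a small sketch of a 6×5s rolling window whose only notion of time is the caller-supplied timestamp. The `RollingCounter` type is hypothetical and deliberately simplified: it only sums weights per bucket and does not reproduce the crate's moving-max decay or rate computation.

```rust
const NUM_BUCKETS: usize = 6;
const BUCKET_SECS: u64 = 5;

/// Hypothetical rolling counter (not the crate's type): six 5-second buckets
/// stored in a ring, advanced purely by the timestamp the caller passes in.
struct RollingCounter {
    buckets: [f64; NUM_BUCKETS],
    last_bucket_id: u64,
}

impl RollingCounter {
    fn new() -> Self {
        Self { buckets: [0.0; NUM_BUCKETS], last_bucket_id: 0 }
    }

    /// Record `weight` at `now_unix_secs`. No clock is read here, so tests
    /// can replay any sequence of timestamps deterministically.
    fn count(&mut self, now_unix_secs: u64, weight: f64) {
        let bucket_id = now_unix_secs / BUCKET_SECS;
        if bucket_id > self.last_bucket_id {
            // Clear every ring slot we moved past (at most the whole ring).
            let skipped = (bucket_id - self.last_bucket_id).min(NUM_BUCKETS as u64);
            for k in 1..=skipped {
                let idx = ((self.last_bucket_id + k) % NUM_BUCKETS as u64) as usize;
                self.buckets[idx] = 0.0;
            }
            self.last_bucket_id = bucket_id;
        }
        // Simplification: late timestamps land in whatever slot they map to.
        let idx = (bucket_id % NUM_BUCKETS as u64) as usize;
        self.buckets[idx] += weight;
    }

    /// Total weight currently held in the window.
    fn total(&self) -> f64 {
        self.buckets.iter().sum()
    }
}
```

A test can then drive the window through bucket rotation by choosing timestamps such as 0, 5, and 30, with no sleeping and no mock clock.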

Describe how to test/QA your changes

Automated unit + doc tests in the new crate:

cargo test -p datadog-agent-trace-sampler
cargo clippy -p datadog-agent-trace-sampler --all-targets

20 tests pass (19 unit + 1 doctest); clippy and rustfmt are clean. The unit tests are 1:1 ports of the Go trace agent's sampler table tests, validating signature stability/collisions, the TPS budget cascade, bucket rotation/decay, rate-increase capping, cardinality shrink, and that survivors are stamped with _dd.errors_sr.

🤖

Add a dependency-free leaf crate that ports the Go trace agent error
sampler (ScoreSampler targeting ErrorTPS) so serverless agents can rescue
error traces from an agent-side P0 drop, keeping error visibility under
aggressive sampling.

The crate takes primitives in (SpanView/TraceView) and returns a
SampleDecision out, exposing no protobuf Span type, so consumers pinning
different libdatadog revisions can share it. bottlecap consumes it via the
existing serverless-components git dependency; SCL wiring follows later.

Ports FNV-1a signatures, the 6x5s rolling-bucket TPS budget with cascade
and 20% rate-increase cap, deterministic sample-by-rate, and cardinality
shrink, with unit tests mirroring the Go table tests.

APMSVLS-469
🤖
@lucaspimentel lucaspimentel changed the title feat(trace-sampler): add shared datadog-agent-trace-sampler crate feat(trace-sampler): add shared datadog-agent-trace-sampler crate Jul 2, 2026
@lucaspimentel lucaspimentel requested a review from Copilot July 2, 2026 20:33

Copilot AI left a comment

Contributor


Pull request overview

Adds a new dependency-free leaf crate, datadog-agent-trace-sampler, intended to provide a shared Rust port of the Datadog Agent’s error “rescue” sampler so serverless agents can retain error visibility under aggressive sampling while keeping the public API protobuf-free.

Changes:

  • Introduces the new crate with a minimal public API (SpanView, TraceView, ErrorsSampler, SampleDecision, ErrorSamplerConfig).
  • Implements trace signature computation + deterministic sample_by_rate, and ports the rolling-bucket TPS-driven ErrorsSampler logic with unit tests.
  • Wires the crate into the workspace (via crates/*) and updates Cargo.lock.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| crates/datadog-agent-trace-sampler/src/lib.rs | Defines the public API and crate-level documentation/doctest. |
| crates/datadog-agent-trace-sampler/src/signature.rs | Implements trace signature hashing and deterministic sample-by-rate helper + tests. |
| crates/datadog-agent-trace-sampler/src/score_sampler.rs | Implements the rolling TPS budget sampler and ErrorsSampler + tests. |
| crates/datadog-agent-trace-sampler/README.md | Documents crate purpose and basic usage. |
| crates/datadog-agent-trace-sampler/Cargo.toml | Adds the new crate manifest (no dependencies). |
| Cargo.lock | Adds the new crate entry to the lockfile. |

Comment on lines +256 to +261
    // A malformed chunk (empty, or root_index past the end) cannot be scored;
    // do not rescue it. Guards the slice indexing in the signature computation.
    if self.disabled || trace.root_index >= trace.spans.len() {
        return SampleDecision::Drop;
    }

Comment on lines +273 to +280
    fn apply_sample_rate(&self, trace: &TraceView, rate: f64) -> SampleDecision {
        let new_rate = trace.root_global_sample_rate * rate;
        if sample_by_rate(trace.trace_id, new_rate) {
            SampleDecision::Keep { errors_sr: rate }
        } else {
            SampleDecision::Drop
        }
    }
Comment on lines +127 to +132
    fn get_signature_sample_rate(&self, sig: Signature) -> f64 {
        match self.rates.get(&sig) {
            Some(&rate) => rate * self.extra_rate,
            None => self.default_rate(),
        }
    }
Comment on lines +290 to +293
    let sampler = &self.sampler;
    let allow_list = self
        .shrink_allow_list
        .get_or_insert_with(|| sampler.rates.keys().copied().collect());