Add _dd.p.ksr propagated tag for Knuth sampling rate #10802
gh-worker-dd-mergequeue-cf854d[bot] merged 16 commits into master
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Remove unused imports in KnuthSamplingRateTest (CodeNarc violations)
- Update OT31ApiTest and OT33ApiTest to expect _dd.p.ksr in x-datadog-tags when agent-rate sampler runs (UNSET priority)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The _dd.p.ksr propagated tag also appears in W3C tracestate as t.ksr. Update OT31 and OT33 test expectations for the UNSET context priority case where the agent-rate sampler runs. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replace String#replaceAll (a forbidden API in this codebase) with manual character-based trailing-zero stripping logic that has the same semantics but avoids the regex-based method. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
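The character-based trailing-zero stripping described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `TrailingZeros` and the method `strip` are hypothetical, and the input is assumed to be a plain decimal string such as the output of a fixed-precision formatter.

```java
final class TrailingZeros {
  private TrailingZeros() {}

  /** Strips trailing zeros after a decimal point, and the point itself if nothing follows it. */
  public static String strip(String s) {
    if (s.indexOf('.') < 0) {
      return s; // no fractional part, so "100" must stay "100"
    }
    int end = s.length();
    while (end > 0 && s.charAt(end - 1) == '0') {
      end--; // walk back over trailing zeros
    }
    if (end > 0 && s.charAt(end - 1) == '.') {
      end--; // drop a dangling decimal point, e.g. "1." becomes "1"
    }
    return s.substring(0, end);
  }

  public static void main(String[] args) {
    System.out.println(strip("0.500000")); // 0.5
    System.out.println(strip("1.00000"));  // 1
    System.out.println(strip("0.123456")); // 0.123456
    System.out.println(strip("100"));      // 100
  }
}
```

The loop only touches the tail of the string and allocates a single substring, which is the property that makes it a drop-in for a regex-based `replaceAll`.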
The PTagsCodec headerValue method outputs tags in order: dm, tid, ksr. The test datadogTags list had ksr before tid, causing a comparison failure. Reorder to match the actual output: dm, tid, ksr. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The ksr implementation now adds _dd.p.ksr tag to spans with agent sampling rate, so the msgpack serialization test expectations need to include it. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- OpenTelemetryTest: fix x-datadog-tags ordering (tid before ksr)
- DatadogPropagatorTest: add ksr to expected tags when UNSET priority
- OpenTracing32Test: add ksr and tid handling for UNSET priority case

All follow PTagsCodec ordering: dm → tid → ksr

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Update the test inject extract expectations to include _dd.p.ksr=1 in x-datadog-tags and t.ksr:1 in tracestate for the UNSET sampling case, following PTagsCodec ordering: dm -> tid -> ksr. Co-Authored-By: Claude Opus 4.6 <[email protected]>
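To make the expected header values in these test updates concrete, here is a small sketch of how the two encodings relate. It assumes `x-datadog-tags` is a comma-separated list of `key=value` pairs and that a `_dd.p.<name>` tag appears in tracestate as `t.<name>:<value>`, which matches the `_dd.p.ksr=1` / `t.ksr:1` pair quoted above. The class and method names are hypothetical, the `dm` and `tid` values are placeholders, and the sketch does not model which tags a real codec includes in or omits from tracestate.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class PropagatedTagEncodingSketch {
  private static final String PREFIX = "_dd.p.";

  /** Renders tags as an x-datadog-tags style value, preserving the map's insertion order. */
  public static String toDatadogTags(Map<String, String> tags) {
    StringJoiner joiner = new StringJoiner(",");
    for (Map.Entry<String, String> e : tags.entrySet()) {
      joiner.add(e.getKey() + "=" + e.getValue());
    }
    return joiner.toString();
  }

  /** Renders one propagated tag as a tracestate style member: _dd.p.ksr=1 becomes t.ksr:1. */
  public static String toTracestateMember(String key, String value) {
    if (!key.startsWith(PREFIX)) {
      throw new IllegalArgumentException("not a propagated tag: " + key);
    }
    return "t." + key.substring(PREFIX.length()) + ":" + value;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, mirroring the dm -> tid -> ksr ordering.
    Map<String, String> tags = new LinkedHashMap<>();
    tags.put("_dd.p.dm", "-1");                // placeholder decision-maker value
    tags.put("_dd.p.tid", "0123456789abcdef"); // placeholder trace-id high bits
    tags.put("_dd.p.ksr", "1");
    System.out.println(toDatadogTags(tags));
    System.out.println(toTracestateMember("_dd.p.ksr", "1")); // t.ksr:1
  }
}
```

Because the tests compare header strings, the emission order matters as much as the content, which is why several commits in this thread only reorder expected values.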
Benchmarks
Startup
Summary
Found 0 performance improvements and 0 performance regressions! Performance is the same for 60 metrics, 11 unstable metrics.
Startup time reports for petclinic
gantt
title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.056 s) : 0, 1056468
Total [baseline] (11.131 s) : 0, 11130887
Agent [candidate] (1.065 s) : 0, 1064650
Total [candidate] (11.083 s) : 0, 11083200
section appsec
Agent [baseline] (1.255 s) : 0, 1254982
Total [baseline] (11.194 s) : 0, 11193627
Agent [candidate] (1.259 s) : 0, 1258865
Total [candidate] (11.231 s) : 0, 11231163
section iast
Agent [baseline] (1.239 s) : 0, 1239033
Total [baseline] (11.333 s) : 0, 11332757
Agent [candidate] (1.239 s) : 0, 1239337
Total [candidate] (11.336 s) : 0, 11336233
section profiling
Agent [baseline] (1.193 s) : 0, 1192814
Total [baseline] (11.116 s) : 0, 11116312
Agent [candidate] (1.19 s) : 0, 1190387
Total [candidate] (11.041 s) : 0, 11041461
gantt
title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.19 ms) : 0, 1190
crashtracking [candidate] (1.203 ms) : 0, 1203
BytebuddyAgent [baseline] (628.481 ms) : 0, 628481
BytebuddyAgent [candidate] (633.418 ms) : 0, 633418
AgentMeter [baseline] (29.134 ms) : 0, 29134
AgentMeter [candidate] (29.318 ms) : 0, 29318
GlobalTracer [baseline] (256.993 ms) : 0, 256993
GlobalTracer [candidate] (258.85 ms) : 0, 258850
AppSec [baseline] (31.639 ms) : 0, 31639
AppSec [candidate] (31.961 ms) : 0, 31961
Debugger [baseline] (60.165 ms) : 0, 60165
Debugger [candidate] (60.786 ms) : 0, 60786
Remote Config [baseline] (606.811 µs) : 0, 607
Remote Config [candidate] (591.091 µs) : 0, 591
Telemetry [baseline] (7.998 ms) : 0, 7998
Telemetry [candidate] (8.827 ms) : 0, 8827
Flare Poller [baseline] (4.233 ms) : 0, 4233
Flare Poller [candidate] (3.547 ms) : 0, 3547
section appsec
crashtracking [baseline] (1.202 ms) : 0, 1202
crashtracking [candidate] (1.219 ms) : 0, 1219
BytebuddyAgent [baseline] (663.341 ms) : 0, 663341
BytebuddyAgent [candidate] (666.203 ms) : 0, 666203
AgentMeter [baseline] (12.15 ms) : 0, 12150
AgentMeter [candidate] (12.118 ms) : 0, 12118
GlobalTracer [baseline] (259.709 ms) : 0, 259709
GlobalTracer [candidate] (260.078 ms) : 0, 260078
IAST [baseline] (24.372 ms) : 0, 24372
IAST [candidate] (24.372 ms) : 0, 24372
AppSec [baseline] (178.177 ms) : 0, 178177
AppSec [candidate] (178.356 ms) : 0, 178356
Debugger [baseline] (66.797 ms) : 0, 66797
Debugger [candidate] (67.006 ms) : 0, 67006
Remote Config [baseline] (626.999 µs) : 0, 627
Remote Config [candidate] (636.107 µs) : 0, 636
Telemetry [baseline] (8.439 ms) : 0, 8439
Telemetry [candidate] (8.721 ms) : 0, 8721
Flare Poller [baseline] (3.643 ms) : 0, 3643
Flare Poller [candidate] (3.673 ms) : 0, 3673
section iast
crashtracking [baseline] (1.199 ms) : 0, 1199
crashtracking [candidate] (1.196 ms) : 0, 1196
BytebuddyAgent [baseline] (804.686 ms) : 0, 804686
BytebuddyAgent [candidate] (806.171 ms) : 0, 806171
AgentMeter [baseline] (11.614 ms) : 0, 11614
AgentMeter [candidate] (11.496 ms) : 0, 11496
GlobalTracer [baseline] (249.25 ms) : 0, 249250
GlobalTracer [candidate] (248.94 ms) : 0, 248940
IAST [baseline] (25.502 ms) : 0, 25502
IAST [candidate] (25.487 ms) : 0, 25487
AppSec [baseline] (26.829 ms) : 0, 26829
AppSec [candidate] (26.728 ms) : 0, 26728
Debugger [baseline] (70.021 ms) : 0, 70021
Debugger [candidate] (70.061 ms) : 0, 70061
Remote Config [baseline] (527.66 µs) : 0, 528
Remote Config [candidate] (524.188 µs) : 0, 524
Telemetry [baseline] (9.738 ms) : 0, 9738
Telemetry [candidate] (9.125 ms) : 0, 9125
Flare Poller [baseline] (3.497 ms) : 0, 3497
Flare Poller [candidate] (3.353 ms) : 0, 3353
section profiling
crashtracking [baseline] (1.186 ms) : 0, 1186
crashtracking [candidate] (1.174 ms) : 0, 1174
BytebuddyAgent [baseline] (688.848 ms) : 0, 688848
BytebuddyAgent [candidate] (688.963 ms) : 0, 688963
AgentMeter [baseline] (8.666 ms) : 0, 8666
AgentMeter [candidate] (8.64 ms) : 0, 8640
GlobalTracer [baseline] (216.798 ms) : 0, 216798
GlobalTracer [candidate] (216.59 ms) : 0, 216590
AppSec [baseline] (32.565 ms) : 0, 32565
AppSec [candidate] (32.422 ms) : 0, 32422
Debugger [baseline] (65.501 ms) : 0, 65501
Debugger [candidate] (65.992 ms) : 0, 65992
Remote Config [baseline] (572.529 µs) : 0, 573
Remote Config [candidate] (555.434 µs) : 0, 555
Telemetry [baseline] (8.584 ms) : 0, 8584
Telemetry [candidate] (7.694 ms) : 0, 7694
Flare Poller [baseline] (3.578 ms) : 0, 3578
Flare Poller [candidate] (3.464 ms) : 0, 3464
ProfilingAgent [baseline] (95.16 ms) : 0, 95160
ProfilingAgent [candidate] (93.747 ms) : 0, 93747
Profiling [baseline] (95.736 ms) : 0, 95736
Profiling [candidate] (94.302 ms) : 0, 94302
Startup time reports for insecure-bank
gantt
title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.061 s) : 0, 1060558
Total [baseline] (8.869 s) : 0, 8869449
Agent [candidate] (1.06 s) : 0, 1059646
Total [candidate] (8.833 s) : 0, 8832892
section iast
Agent [baseline] (1.229 s) : 0, 1228933
Total [baseline] (9.581 s) : 0, 9580810
Agent [candidate] (1.237 s) : 0, 1236963
Total [candidate] (9.555 s) : 0, 9555218
gantt
title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.203 ms) : 0, 1203
crashtracking [candidate] (1.21 ms) : 0, 1210
BytebuddyAgent [baseline] (630.533 ms) : 0, 630533
BytebuddyAgent [candidate] (632.007 ms) : 0, 632007
AgentMeter [baseline] (29.279 ms) : 0, 29279
AgentMeter [candidate] (29.065 ms) : 0, 29065
GlobalTracer [baseline] (257.777 ms) : 0, 257777
GlobalTracer [candidate] (257.116 ms) : 0, 257116
AppSec [baseline] (31.774 ms) : 0, 31774
AppSec [candidate] (31.637 ms) : 0, 31637
Debugger [baseline] (59.61 ms) : 0, 59610
Debugger [candidate] (59.655 ms) : 0, 59655
Remote Config [baseline] (601.083 µs) : 0, 601
Remote Config [candidate] (581.823 µs) : 0, 582
Telemetry [baseline] (7.943 ms) : 0, 7943
Telemetry [candidate] (7.995 ms) : 0, 7995
Flare Poller [baseline] (5.724 ms) : 0, 5724
Flare Poller [candidate] (4.253 ms) : 0, 4253
section iast
crashtracking [baseline] (1.194 ms) : 0, 1194
crashtracking [candidate] (1.201 ms) : 0, 1201
BytebuddyAgent [baseline] (797.944 ms) : 0, 797944
BytebuddyAgent [candidate] (803.663 ms) : 0, 803663
AgentMeter [baseline] (11.328 ms) : 0, 11328
AgentMeter [candidate] (11.586 ms) : 0, 11586
GlobalTracer [baseline] (247.522 ms) : 0, 247522
GlobalTracer [candidate] (248.891 ms) : 0, 248891
AppSec [baseline] (26.476 ms) : 0, 26476
AppSec [candidate] (26.732 ms) : 0, 26732
Debugger [baseline] (69.418 ms) : 0, 69418
Debugger [candidate] (68.269 ms) : 0, 68269
Remote Config [baseline] (524.505 µs) : 0, 525
Remote Config [candidate] (518.334 µs) : 0, 518
Telemetry [baseline] (9.641 ms) : 0, 9641
Telemetry [candidate] (10.597 ms) : 0, 10597
Flare Poller [baseline] (3.477 ms) : 0, 3477
Flare Poller [candidate] (3.759 ms) : 0, 3759
IAST [baseline] (25.311 ms) : 0, 25311
IAST [candidate] (25.523 ms) : 0, 25523
Load
Summary
Found 0 performance improvements and 4 performance regressions! Performance is the same for 17 metrics, 15 unstable metrics.
Request duration reports for petclinic
gantt
title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section baseline
no_agent (19.134 ms) : 18939, 19330
. : milestone, 19134,
appsec (18.724 ms) : 18534, 18914
. : milestone, 18724,
code_origins (17.549 ms) : 17376, 17722
. : milestone, 17549,
iast (18.004 ms) : 17823, 18185
. : milestone, 18004,
profiling (18.523 ms) : 18341, 18706
. : milestone, 18523,
tracing (17.801 ms) : 17621, 17982
. : milestone, 17801,
section candidate
no_agent (18.279 ms) : 18090, 18468
. : milestone, 18279,
appsec (19.806 ms) : 19597, 20016
. : milestone, 19806,
code_origins (17.529 ms) : 17355, 17703
. : milestone, 17529,
iast (17.642 ms) : 17469, 17815
. : milestone, 17642,
profiling (18.444 ms) : 18258, 18630
. : milestone, 18444,
tracing (17.724 ms) : 17547, 17900
. : milestone, 17724,
Request duration reports for insecure-bank
gantt
title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section baseline
no_agent (1.189 ms) : 1177, 1201
. : milestone, 1189,
iast (3.235 ms) : 3189, 3280
. : milestone, 3235,
iast_FULL (5.785 ms) : 5727, 5842
. : milestone, 5785,
iast_GLOBAL (3.421 ms) : 3374, 3468
. : milestone, 3421,
profiling (2.202 ms) : 2183, 2222
. : milestone, 2202,
tracing (1.805 ms) : 1790, 1820
. : milestone, 1805,
section candidate
no_agent (1.183 ms) : 1172, 1195
. : milestone, 1183,
iast (3.215 ms) : 3173, 3257
. : milestone, 3215,
iast_FULL (5.778 ms) : 5721, 5835
. : milestone, 5778,
iast_GLOBAL (3.611 ms) : 3563, 3658
. : milestone, 3611,
profiling (2.178 ms) : 2157, 2198
. : milestone, 2178,
tracing (1.781 ms) : 1767, 1796
. : milestone, 1781,
Dacapo
Summary
Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.
Execution time for biojava
gantt
title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section baseline
no_agent (15.595 s) : 15595000, 15595000
. : milestone, 15595000,
appsec (14.758 s) : 14758000, 14758000
. : milestone, 14758000,
iast (18.707 s) : 18707000, 18707000
. : milestone, 18707000,
iast_GLOBAL (17.712 s) : 17712000, 17712000
. : milestone, 17712000,
profiling (15.483 s) : 15483000, 15483000
. : milestone, 15483000,
tracing (14.931 s) : 14931000, 14931000
. : milestone, 14931000,
section candidate
no_agent (15.457 s) : 15457000, 15457000
. : milestone, 15457000,
appsec (14.621 s) : 14621000, 14621000
. : milestone, 14621000,
iast (18.365 s) : 18365000, 18365000
. : milestone, 18365000,
iast_GLOBAL (17.989 s) : 17989000, 17989000
. : milestone, 17989000,
profiling (15.343 s) : 15343000, 15343000
. : milestone, 15343000,
tracing (14.776 s) : 14776000, 14776000
. : milestone, 14776000,
Execution time for tomcat
gantt
title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~bcc0957a74, baseline=1.61.0-SNAPSHOT~a953f33c70
dateFormat X
axisFormat %s
section baseline
no_agent (1.479 ms) : 1468, 1491
. : milestone, 1479,
appsec (2.521 ms) : 2466, 2576
. : milestone, 2521,
iast (2.261 ms) : 2191, 2331
. : milestone, 2261,
iast_GLOBAL (2.312 ms) : 2242, 2382
. : milestone, 2312,
profiling (2.121 ms) : 2064, 2179
. : milestone, 2121,
tracing (2.074 ms) : 2020, 2128
. : milestone, 2074,
section candidate
no_agent (1.479 ms) : 1468, 1491
. : milestone, 1479,
appsec (3.762 ms) : 3546, 3979
. : milestone, 3762,
iast (2.274 ms) : 2204, 2343
. : milestone, 2274,
iast_GLOBAL (2.317 ms) : 2247, 2386
. : milestone, 2317,
profiling (2.094 ms) : 2039, 2150
. : milestone, 2094,
tracing (2.072 ms) : 2019, 2126
. : milestone, 2072,
The following files add Groovy tests to modules that are candidates for migration to Java / JUnit 5:
Consider writing these tests in Java / JUnit 5 instead to help with the ongoing migration effort.
# What does this PR do?

Adds `_dd.p.ksr` (Knuth Sampling Rate) as a propagated tag set when agent-based or rule-based sampling decisions are made. The tag is stored in span `meta` (string type) with up to 6 significant digits and no trailing zeros.

`format_sampling_rate` now returns `Option<String>` and guards against invalid inputs (negative, >1.0, NaN, infinity), returning `None` instead of producing garbage output.

# Motivation

To enable consistent sampling across tracers and backend retention filters, the backend needs to know the sampling rate applied by the tracer. Without transmitting the tracer's rate via `_dd.p.ksr`, backend resampling cannot correctly compute effective rates in multi-stage sampling scenarios.

See RFC: "Transmit Knuth sampling rate to backend"

# Additional Notes

Key files changed:
- `datadog-opentelemetry/src/core/constants.rs`: Added `SAMPLING_KNUTH_RATE_TAG_KEY` constant
- `datadog-opentelemetry/src/sampling/datadog_sampler.rs`: Added `format_sampling_rate()` helper (returns `Option<String>`, defensive against invalid rates) and set ksr in agent/rule sampling paths
- Updated 2 snapshot JSON files

Related PRs across tracers:
- Java: DataDog/dd-trace-java#10802
- .NET: DataDog/dd-trace-dotnet#8287
- Ruby: DataDog/dd-trace-rb#5436
- Node.js: DataDog/dd-trace-js#7741
- PHP: DataDog/dd-trace-php#3701
- C++: DataDog/dd-trace-cpp#288
- System tests: DataDog/system-tests#6466

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
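For readers following the Java side of this thread, the defensive behaviour described for the Rust helper (reject negative, >1.0, NaN, and infinite rates instead of formatting them) could look roughly like this in Java. It is a sketch under assumptions: the class and method names are hypothetical, `Optional<String>` stands in for Rust's `Option<String>`, and `BigDecimal` rounding to 6 significant digits with plain notation is one possible reading of "up to 6 significant digits and no trailing zeros", not the Rust implementation.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.util.Optional;

final class SamplingRateValidationSketch {
  private SamplingRateValidationSketch() {}

  /** Returns the formatted rate, or empty when the input is not a valid rate in [0, 1]. */
  public static Optional<String> formatSamplingRate(double rate) {
    // The negated range check also rejects NaN, because every comparison with NaN is false.
    if (!(rate >= 0.0 && rate <= 1.0)) {
      return Optional.empty();
    }
    // Round to 6 significant digits (MathContext(6) uses HALF_UP), then drop trailing zeros.
    BigDecimal rounded = new BigDecimal(rate).round(new MathContext(6));
    return Optional.of(rounded.stripTrailingZeros().toPlainString());
  }

  public static void main(String[] args) {
    System.out.println(formatSamplingRate(0.5));                      // Optional[0.5]
    System.out.println(formatSamplingRate(1.0));                      // Optional[1]
    System.out.println(formatSamplingRate(Double.NaN));               // Optional.empty
    System.out.println(formatSamplingRate(-0.1));                     // Optional.empty
    System.out.println(formatSamplingRate(Double.POSITIVE_INFINITY)); // Optional.empty
  }
}
```

Returning an empty value lets the caller skip setting the tag entirely, rather than propagating a malformed rate downstream.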
...ent/instrumentation/opentelemetry/opentelemetry-0.3/src/test/groovy/OpenTelemetryTest.groovy
dd-trace-core/src/main/java/datadog/trace/core/propagation/ptags/PTagsFactory.java
…r as primitive double Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Unfortunately, I think we need to replace the String.format call before this can be merged.
As is, this change more than doubles the memory allocated in a span creation stress test.
In my local test, 16 threads x 10_000_000 traces x 2 spans, current head allocates 241 GiB. After this change, the same test is allocating 584 GiB. That corresponds to a 15 sec increase in execution time from ~65 secs to ~80 secs. Presumably mostly from GC, but I haven't verified that yet.
### What does this PR do?

Fixes `_dd.p.ksr` (Knuth Sampling Rate) to only be set on spans when the agent has provided sampling rates via `readRatesJSON()`. Previously, ksr was unconditionally set in `prioritySampler.apply()`, including when the rate was the initial client-side default (1.0) before any agent response arrived.

Also refactors `prioritySampler` to consolidate lock acquisitions: extracts `getRateLocked()` so `apply()` acquires `ps.mu.RLock` only once to read both the rate and `agentRatesLoaded`.

### Motivation

Cross-language consistency: Python, Java, PHP, and other tracers only set ksr when actual agent rates or sampling rules are applied, not for the default fallback. This aligns Go with that behavior.

See RFC: "Transmit Knuth sampling rate to backend"

### Additional Notes

- Added `agentRatesLoaded` bool field to `prioritySampler`, set to `true` in `readRatesJSON()`
- `apply()` now gates ksr behind `agentRatesLoaded` check
- Extracted `getRateLocked()` to avoid double lock acquisition in `apply()`
- Rule-based sampling path (`applyTraceRuleSampling` in span.go) unchanged; it correctly always sets ksr
- Tests added: `ksr-not-set-without-agent-rates` and `ksr-set-after-agent-rates-received`

Related PRs across tracers:
- Java: DataDog/dd-trace-java#10802
- .NET: DataDog/dd-trace-dotnet#8287
- Ruby: DataDog/dd-trace-rb#5436
- Node.js: DataDog/dd-trace-js#7741
- PHP: DataDog/dd-trace-php#3701
- Rust: DataDog/dd-trace-rs#180
- C++: DataDog/dd-trace-cpp#288
- System tests: DataDog/system-tests#6466

### Reviewer's Checklist

- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [x] [System-Tests](https://github.com/DataDog/system-tests/) covering this feature have been added and enabled with the va.b.c-dev version tag.
- [ ] There is a benchmark for any new code, or changes to existing code.
- [x] If this interacts with the agent in a new way, a system test has been added.
- [x] New code is free of linting errors. You can check this by running `make lint` locally.
- [x] New code doesn't break existing tests. You can check this by running `make test` locally.
- [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [ ] All generated files are up to date. You can check this by running `make generate` locally.
- [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running `make fix-modules` locally.

Unsure? Have a question? Request a review!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Co-authored-by: Dario Castañé <[email protected]>
Co-authored-by: Mikayla Toffler <[email protected]>
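The gating idea in the Go change (read the rate and the "agent rates loaded" flag under a single read-lock acquisition, and report no ksr while only the default rate is known) translates to Java roughly as below. This is an illustrative sketch only: the class and method names are hypothetical, and it tracks a single rate, whereas a real priority sampler would typically keep rates per service and environment.

```java
import java.util.OptionalDouble;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class AgentRateGateSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private double rate = 1.0;                // client-side default until the agent responds
  private boolean agentRatesLoaded = false; // flips to true on the first agent response

  /** Called when agent-provided rates arrive (the analogue of readRatesJSON). */
  public void onAgentRates(double newRate) {
    lock.writeLock().lock();
    try {
      rate = newRate;
      agentRatesLoaded = true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  /** Reads flag and rate under one read lock; empty means "do not set ksr yet". */
  public OptionalDouble knuthSamplingRate() {
    lock.readLock().lock();
    try {
      return agentRatesLoaded ? OptionalDouble.of(rate) : OptionalDouble.empty();
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    AgentRateGateSketch sampler = new AgentRateGateSketch();
    System.out.println(sampler.knuthSamplingRate()); // OptionalDouble.empty
    sampler.onAgentRates(0.25);
    System.out.println(sampler.knuthSamplingRate()); // OptionalDouble[0.25]
  }
}
```

Reading both fields inside one critical section guarantees the caller never sees a loaded flag paired with a stale default rate.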
…ay arithmetic Replace String.format(Locale.ROOT, "%.6g", rate) with manual char-array formatting that avoids Formatter/stream/boxing allocations. Benchmarks show ~30x improvement (320ns -> 11ns). Additionally, cache the TagValue in updateKnuthSamplingRate() so that getKnuthSamplingRateTagValue() is a simple volatile read (~1.2ns) instead of re-formatting and re-looking up the TagValue on every header injection. Add JMH benchmark (KnuthSamplingRateFormatBenchmark) comparing all three approaches and expand test coverage to all magnitude buckets. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Implement scientific notation (rates < 1e-4) with manual char-array formatting, removing the last String.format dependency. The method now uses zero Formatter/Locale allocations for all rate values. Add test coverage for scientific notation: 1e-05, 5e-05, 1.23457e-05, 1e-07, 5.5e-10. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
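As a rough illustration of the technique these two commits describe (formatting a rate to 6 significant digits in a `char[]`, with scientific notation below 1e-4 and no `String.format`), here is a simplified sketch. It is not the code from `PTagsFactory`: the class name, the buffer size, and the rounding strategy (`Math.round` on a scaled double) are assumptions, and it only accepts rates in [0, 1].

```java
final class KsrFormatSketch {
  private KsrFormatSketch() {}

  /** Formats a rate in [0, 1] with up to 6 significant digits and no trailing zeros. */
  public static String format(double rate) {
    if (!(rate >= 0.0 && rate <= 1.0)) {
      throw new IllegalArgumentException("rate must be in [0, 1]");
    }
    if (rate == 0.0) return "0";
    if (rate == 1.0) return "1";

    // Scale into [1e5, 1e6) so the integer part holds exactly 6 significant digits.
    double scaled = rate * 1e6;
    int exp = -1; // decimal exponent of the first significant digit
    while (scaled < 1e5) {
      scaled *= 10.0;
      exp--;
    }
    long digits = Math.round(scaled);
    if (digits == 1_000_000L) { // rounding carried into a 7th digit
      digits = 100_000L;
      exp++;
    }
    if (exp == 0) return "1"; // e.g. 0.9999999 rounds up to 1

    int count = 6;
    while (digits % 10 == 0) { // drop trailing zeros
      digits /= 10;
      count--;
    }

    char[] buf = new char[16];
    int pos = 0;
    if (exp >= -4) {
      // Fixed notation: "0." + leading zeros + significant digits.
      buf[pos++] = '0';
      buf[pos++] = '.';
      for (int i = 0; i < -exp - 1; i++) buf[pos++] = '0';
      writeDigits(buf, pos, digits, count);
      pos += count;
    } else {
      // Scientific notation: d[.ddddd]e-XX with at least two exponent digits.
      writeDigits(buf, 1, digits, count);
      buf[0] = buf[1];
      pos = 1;
      if (count > 1) {
        buf[1] = '.';
        pos = count + 1;
      }
      buf[pos++] = 'e';
      buf[pos++] = '-';
      int absExp = -exp;
      if (absExp >= 100) buf[pos++] = (char) ('0' + absExp / 100);
      buf[pos++] = (char) ('0' + (absExp / 10) % 10);
      buf[pos++] = (char) ('0' + absExp % 10);
    }
    return new String(buf, 0, pos);
  }

  /** Writes the lowest `count` decimal digits of `digits` into buf[start .. start+count-1]. */
  private static void writeDigits(char[] buf, int start, long digits, int count) {
    for (int i = start + count - 1; i >= start; i--) {
      buf[i] = (char) ('0' + (digits % 10));
      digits /= 10;
    }
  }

  public static void main(String[] args) {
    System.out.println(format(0.5));
    System.out.println(format(0.123456789));
    System.out.println(format(0.00001));
  }
}
```

The only allocations are the scratch `char[]` and the resulting `String`, which is the kind of cost the review comment above was asking for. Repeated multiplication by 10 can accumulate floating-point error on exact rounding ties, so a production version would need more careful digit extraction.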
...core/src/jmh/java/datadog/trace/core/propagation/ptags/KnuthSamplingRateFormatBenchmark.java
Two changes:

1. Benchmark (address dougqh feedback):
   - Switch to Throughput mode + `@Threads(8)` to surface GC pressure
   - `@State(Scope.Thread)`: each thread gets its own PTags, models real traces
   - Add updateRateFreshTrace: resets instance cache each iteration to model per-trace cost (the actual hot path Doug was concerned about)
   - Update run instructions to include `-t 8 -prof gc`
2. Static cache for KSR TagValue (Option A):
   - Add static volatile cachedKsrRate + cachedKsrTagValue to PTags
   - On updateKnuthSamplingRate, check static cache before formatting
   - Eliminates char[]+String allocation on the per-trace path in steady state (every new PTags starts with NaN; without the cache, each trace root re-formats even when the rate is constant)
   - Race is benign: two threads computing the same rate produce equal TagValues
   - gc.alloc.rate.norm: updateRateFreshTrace goes from 80 B/op → ≈ 0 B/op structurally (not JIT-dependent)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
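The static-cache idea above can be sketched as follows. Note one deliberate difference from the commit: instead of two separate `static volatile` fields, this sketch publishes an immutable (rate, formatted value) pair through a single volatile reference, so a reader can never observe a rate paired with a value from a different update. All names are hypothetical, `String` stands in for `TagValue`, and `expensiveFormat` is a placeholder for the real formatter.

```java
import java.math.BigDecimal;

final class KsrCacheSketch {
  private KsrCacheSketch() {}

  /** Immutable pair; final fields make it safe to publish through a volatile reference. */
  private static final class Entry {
    final double rate;
    final String formatted;

    Entry(double rate, String formatted) {
      this.rate = rate;
      this.formatted = formatted;
    }
  }

  private static volatile Entry cached; // null until the first rate is formatted

  /** Returns the formatted rate, reusing the cached value while the rate is unchanged. */
  public static String formattedFor(double rate) {
    Entry e = cached; // single volatile read on the hot path
    if (e != null && Double.compare(e.rate, rate) == 0) {
      return e.formatted;
    }
    String formatted = expensiveFormat(rate);
    // Benign race: threads that format the same rate store equal entries.
    cached = new Entry(rate, formatted);
    return formatted;
  }

  /** Placeholder formatter for the sketch; not the PR's formatting logic. */
  static String expensiveFormat(double rate) {
    return BigDecimal.valueOf(rate).stripTrailingZeros().toPlainString();
  }

  public static void main(String[] args) {
    String first = formattedFor(0.5);
    String second = formattedFor(0.5);
    System.out.println(first + " reused=" + (first == second)); // 0.5 reused=true
  }
}
```

A single-entry cache fits this use case because the agent-provided rate tends to stay constant between updates, so nearly every trace root hits the cached entry.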
/merge
Following up with overhead numbers, the final change reduced the allocation in my stress test by 23 GiB. Hopefully with later changes to lazily construct PropagationTags, we can reduce this further.
The tibco-testing pipeline has hardcoded expected span meta maps that break whenever new propagation tags are added (e.g. _dd.p.ksr from #10802). Remove the trigger to unblock master. See also: #10906 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
## Summary of changes

Add `_dd.p.ksr` (Knuth Sampling Rate) propagated tag to spans when sampling is applied via agent rates or trace sampling rules, per the [Transmit Knuth Sampling Rate to Backend RFC](https://docs.google.com/document/d/1Po3qtJb6PGheFeKFSUMv2pVY_y-HFAxTzNLuacCbCXY/edit).

## Reason for change

The backend needs to know the exact sampling rate applied by the tracer to correctly compute effective rates during resampling (e.g., tracer 0.5 × backend 0.5 = effective 0.25). This tag enables that by propagating the rate via `x-datadog-tags` and W3C `tracestate`.

## Implementation details

- Set `_dd.p.ksr` in `TraceContext.SetSamplingPriority()` for `AgentRate`, `LocalTraceSamplingRule`, `RemoteAdaptiveSamplingRule`, and `RemoteUserSamplingRule` mechanisms
- Use `TryAddTag` to preserve the original rate (consistent with `AppliedSamplingRate ??= rate` semantics)
- Format with `"0.######"` (up to 6 decimal digits, no trailing zeros, no scientific notation) per RFC spec
- Added `.IsOptional("_dd.p.ksr")` to `SpanTagAssertion.cs` so integration test tag validators accept the new tag

## Test coverage

- Unit tests in `TraceContextTests_KnuthSamplingRate.cs`:
  - KSR set for agent rate sampling
  - KSR set for trace sampling rules (local, remote adaptive, remote user)
  - KSR NOT set for manual, AppSec, rate limiter, or single span mechanisms
  - KSR preserved on subsequent sampling calls (TryAddTag semantics)
  - Formatting with up to 6 decimal digits (boundary values including small rates like 0.00001)
- System tests in [system-tests #6466](DataDog/system-tests#6466)

## Other details

Related PRs across tracers:
- Java: DataDog/dd-trace-java#10802
- Ruby: DataDog/dd-trace-rb#5436
- Node.js: DataDog/dd-trace-js#7741
- PHP: DataDog/dd-trace-php#3701
- Rust: DataDog/dd-trace-rs#180
- C++: DataDog/dd-trace-cpp#288

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
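The multi-stage arithmetic in "Reason for change" and the `"0.######"` pattern can be tried out with a short Java sketch. `java.text.DecimalFormat` accepts the same pattern string and is used here as an analogue of the .NET format specifier; that equivalence is an assumption of this sketch, and the class and method names are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class EffectiveRateSketch {
  private EffectiveRateSketch() {}

  /** Two independent sampling stages keep a trace with the product of their rates. */
  public static double effectiveRate(double tracerRate, double backendRate) {
    return tracerRate * backendRate;
  }

  /** Up to 6 decimal digits, no trailing zeros, no scientific notation. */
  public static String formatRate(double rate) {
    // Locale.ROOT symbols keep '.' as the decimal separator regardless of the default locale.
    DecimalFormat format =
        new DecimalFormat("0.######", DecimalFormatSymbols.getInstance(Locale.ROOT));
    return format.format(rate);
  }

  public static void main(String[] args) {
    double effective = effectiveRate(0.5, 0.5);
    System.out.println(formatRate(effective)); // 0.25
    System.out.println(formatRate(1.0));       // 1
    System.out.println(formatRate(0.00001));   // 0.00001
  }
}
```

Without the tracer's rate on the span, the backend would only know its own 0.5 and would under-count the traffic that the kept traces represent.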
# What Does This Do

Adds `_dd.p.ksr` (Knuth Sampling Rate) as a propagated tag set when agent-based or rule-based sampling decisions are made. The tag is stored in span `meta` (string type) with up to 6 significant digits and no trailing zeros. It propagates via the `x-datadog-tags` header and W3C tracestate (`t.ksr`).

# Motivation

To enable consistent sampling across tracers and backend retention filters, the backend needs to know the sampling rate applied by the tracer. Without transmitting the tracer's rate via `_dd.p.ksr`, backend resampling cannot correctly compute effective rates in multi-stage sampling scenarios.

See RFC: "Transmit Knuth sampling rate to backend"

# Additional Notes

Key files changed:
- `DDSpan.java`: `setSamplingPriority()` integration (decides when to set ksr)
- `PTagsFactory.java`: `formatKnuthSamplingRate()`, `updateKnuthSamplingRate(double)`, and static KSR cache
- `PropagationTags.java`: Abstract API accepting raw `double` rate
- `PTagsCodec.java`: Encoding/decoding ksr in `x-datadog-tags` header
- `KnuthSamplingRateTest.groovy`: Unit tests for formatting, agent/rule sampling, propagation
- `KnuthSamplingRateFormatBenchmark.java`: JMH benchmark comparing formatting approaches

Design decisions:
- `formatKnuthSamplingRate` uses manual char-array arithmetic instead of `String.format` to avoid Formatter/stream/boxing allocations (~130x faster on formatting, and 10x less allocation: 80 B/op → 0 B/op on the per-trace hot path)
- `updateKnuthSamplingRate` checks a `static volatile (double, TagValue)` pair before formatting. Every new `PTags` instance starts with `NaN`, so without the cache each trace root would format even when the rate is constant. With the cache, the per-trace path is just 2 volatile reads after warmup, allocation-free regardless of JIT escape analysis behavior.
- `knuthSamplingRateTagValue` is cached on `PTags` so `getKnuthSamplingRateTagValue()` on the header-injection hot path is a single volatile read (~0 allocation)
- … `PTagsFactory`), not in `DDSpan`, since `_dd.p.ksr` is a propagated tag

Benchmark results (8 threads, `-prof gc`):
- `stringFormat` (old `String.format`)
- `customFormat` (char-array, no cache)
- `updateRateFreshTrace` without static cache
- `updateRateFreshTrace` with static cache
- `cachedTagValue` (injection hot path)

The `updateRateFreshTrace` benchmark simulates per-trace cost: resets the instance cache (like a new `PTags`) then calls `updateKnuthSamplingRate`. Without the static cache, escape analysis may or may not eliminate the 80 B/op allocation depending on JIT mood. With the static cache, it's structurally zero after warmup.

Related PRs across tracers:

tag: no release note
tag: ai generated