Skip to content

[DBM] Add container tags hash to queries (if enabled)#8061

Merged
vandonr merged 45 commits intomasterfrom
vandonr/process2
Mar 25, 2026
Merged

[DBM] Add container tags hash to queries (if enabled)#8061
vandonr merged 45 commits intomasterfrom
vandonr/process2

Conversation

@vandonr
Copy link
Copy Markdown
Contributor

@vandonr vandonr commented Jan 14, 2026

Summary of changes

Add the ability to write the container tags hash to DBM queries + to the related span.
The goal is that DBM would then query the spans bearing that hash, and then use the container tags on this (those) spans(s) to enrich the queries with it.
This is controlled by a setting that is disabled by default, and would be enabled if propagation mode is "service" or greater

see RFC: https://docs.google.com/document/d/15GtNOKGBCt6Dc-HsDNnMmCdZwhewFQx8yUlI9in5n3M
related PR in python: DataDog/dd-trace-py#15293

Reason for change

DBM and DSM propagate service context in outbound communications (SQL comments, message headers), but neither product has awareness of the container environment (e.g., kube_cluster, namespace, pod_name). Propagating full container tags is not feasible due to cardinality constraints (query cache invalidation in OracleDB/SQLServer, exponential pathway growth in DSM) and size limitations (64–128 bytes for DBM non-comment methods).

This is needed for the service renaming initiative (defining services based on container names) and APM primary tags (container-based dimensions like Kubernetes cluster).

The solution: the agent computes a hash of low-cardinality container tags and back-propagates it to the tracer, which includes it in outbound DBM/DSM communications. DBM then resolves the hash by correlating with APM spans that carry the same hash as a span tag.

Implementation details

  • Add BaseHash static class that computes an FNV-64 hash of ProcessTags.SerializedTags combined with the container tags hash from the agent, encoded as base64
  • Read the container tags hash from the Datadog Agent via DiscoveryService, stored in ContainerMetadata.ContainerTagsHash
  • ContainerMetadata converted from static to instance class (singleton via ContainerMetadata.Instance) to improve testability
  • DatabaseMonitoringPropagator injects the base hash into SQL comments (as ddsh) when DD_DBM_INJECT_SQL_BASEHASH is true
  • Add _dd.dbm_container_tags_hash span tag on SqlTags so DBM can correlate the hash back to the span's container tags
  • New config key DD_DBM_INJECT_SQL_BASEHASH (disabled by default), intended to be enabled when DBM propagation mode is service or higher
  • Add container ID header to MinimalAgentHeaderHelper for agent communication

Test coverage

Adding a test in DbScopeFactoryTests.cs forced me to inject the value from pretty high, which I find a bit "dirty", but at least we don't have to rely on global static instance in tests.

Other details

vandonr and others added 10 commits December 2, 2025 18:03
## Summary of changes
Replaced custom mutex guard with `std::lock_guard`, using
`std::recursive_mutex` instead of `CRITICAL_SECTION` in windows and
`std::mutex` with railings in Linux

## Reason for change
Some locks have been spotted in smoke test wich could be cause by the
lack of thread recursive lock in the `std::mutex`

## Implementation details

## Test coverage

## Other details
<!-- Fixes #{issue} -->


<!--  ⚠️ Note:

Where possible, please obtain 2 approvals prior to merging. Unless
CODEOWNERS specifies otherwise, for external teams it is typically best
to have one review from a team member, and one review from apm-dotnet.
Trivial changes do not require 2 reviews.

MergeQueue is NOT enabled in this repository. If you have write access
to the repo, the PR has 1-2 approvals (see above), and all of the
required checks have passed, you can use the Squash and Merge button to
merge the PR. If you don't have write access, or you need help, reach
out in the #apm-dotnet channel in Slack.
-->
@vandonr vandonr requested review from a team as code owners January 14, 2026 15:03
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fd01fab6f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@pr-commenter
Copy link
Copy Markdown

pr-commenter bot commented Jan 14, 2026

Benchmarks

Benchmark execution time: 2026-03-25 14:05:49

Comparing candidate commit 8fed285 in PR branch vandonr/process2 with baseline commit 6aadb93 in branch master.

Found 11 performance improvements and 11 performance regressions! Performance is the same for 253 metrics, 13 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:Benchmarks.Trace.ActivityBenchmark.StartStopWithChild net472

  • 🟥 throughput [-56888.098op/s; -55126.021op/s] or [-64.340%; -62.347%]
  • 🟩 allocated_mem [-3.899KB; -3.899KB] or [-100.006%; -99.994%]
  • 🟩 execution_time [-200.436ms; -199.789ms] or [-100.146%; -99.823%]

scenario:Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces net6.0

  • 🟩 execution_time [-82.246ms; -82.145ms] or [-39.936%; -39.887%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.AllCycleSimpleBody netcoreapp3.1

  • 🟥 execution_time [+15.990ms; +22.293ms] or [+8.212%; +11.449%]

scenario:Benchmarks.Trace.AspNetCoreBenchmark.SendRequest net6.0

  • 🟩 execution_time [-108.789ms; -106.892ms] or [-54.857%; -53.900%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces net472

  • 🟩 execution_time [-50.803ms; -44.254ms] or [-22.574%; -19.664%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces net6.0

  • 🟥 execution_time [+12.600ms; +16.745ms] or [+7.079%; +9.408%]
  • 🟥 throughput [-122.911op/s; -93.907op/s] or [-8.545%; -6.529%]

scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSlice net6.0

  • 🟥 execution_time [+80.255µs; +84.638µs] or [+5.696%; +6.007%]
  • 🟥 throughput [-40.288op/s; -38.193op/s] or [-5.676%; -5.381%]

scenario:Benchmarks.Trace.CharSliceBenchmark.OriginalCharSlice net6.0

  • 🟥 execution_time [+104.717µs; +184.616µs] or [+5.016%; +8.844%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearchAsync net472

  • 🟩 throughput [+15478.535op/s; +16997.591op/s] or [+5.219%; +5.731%]

scenario:Benchmarks.Trace.ILoggerBenchmark.EnrichedLog netcoreapp3.1

  • 🟥 execution_time [+10.659ms; +15.199ms] or [+5.470%; +7.800%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark netcoreapp3.1

  • 🟩 allocated_mem [-14.599KB; -14.572KB] or [-5.311%; -5.301%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatBenchmark net472

  • 🟥 allocated_mem [+8.189KB; +8.196KB] or [+16.661%; +16.675%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatBenchmark netcoreapp3.1

  • 🟩 throughput [+1127.918op/s; +2892.354op/s] or [+6.335%; +16.245%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore netcoreapp3.1

  • 🟩 throughput [+15394494.744op/s; +16296614.419op/s] or [+6.821%; +7.221%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishScope net6.0

  • 🟩 execution_time [-19.952ms; -14.236ms] or [-9.237%; -6.591%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishTwoScopes net6.0

  • 🟩 execution_time [-16.609ms; -12.588ms] or [-7.776%; -5.893%]

scenario:Benchmarks.Trace.SpanBenchmark.StartFinishTwoScopes netcoreapp3.1

  • 🟥 execution_time [+10.552ms; +13.668ms] or [+5.216%; +6.756%]

scenario:Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin netcoreapp3.1

  • 🟥 execution_time [+11.378ms; +16.621ms] or [+5.801%; +8.475%]

@vandonr
Copy link
Copy Markdown
Contributor Author

vandonr commented Jan 15, 2026

I just realized I need to put process tags in there too

@vandonr vandonr marked this pull request as draft January 15, 2026 14:53
Copy link
Copy Markdown
Collaborator

@bouwkast bouwkast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main question that I have is that it appears this is correctly following the RFC in how we propagate the hash, but the merged Python implementation recomputes the hash.

But the RFC isn't precise enough in describing the hash and expected behavior / requirements for me to know which is correct really

Base automatically changed from vandonr/process3 to master February 3, 2026 18:49
getDiscoveryServiceFunc: static s => DiscoveryService.CreateUnmanaged(
s.TracerSettings.Manager.InitialExporterSettings,
ContainerMetadata.Instance,
new ServiceRemappingHash(null),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one I'm not 100% sure, but since it's only used for DBM for now, I don't think it'd play any role in that code path, so it should be safe to hardcode a disabled instance

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this seems fine to me, I can't imagine we support DBM or DSM with CI Visibility today 🤔

if (!_warnedOnSet)
{
_warnedOnSet = true;
Log.Error("The code is trying to set the value '{Value}' to {Prop}, but this has no effect in .NET Framework.", value, nameof(ContainerTagsHash));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log is now gone in the new version. Intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes 🤷
I mean, I could add an #if NETFRAMEWORK in the set body, but I'm not sure it's really worth the extra code...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the conversation in #8363 (closed), it's ok to remove this.

@lucaspimentel
Copy link
Copy Markdown
Member

related: #8363

bouwkast added a commit that referenced this pull request Mar 24, 2026
Instead of guarding the caller with #if !NETFRAMEWORK, make the
setter a silent no-op. This avoids conflict with #8061 which
replaces the caller entirely.
Copy link
Copy Markdown
Member

@andrewlock andrewlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice Job 👍

I added a bunch of comments with suggestions/questions/nitpics, but there's nothing really blocking (other than potentially setting names etc and hash versions)

Comment on lines +55 to +60
ContainerTagsHash = containerTagsHash;

if (_serializedProcessTags != null)
{
Base64Value = Compute(_serializedProcessTags, containerTagsHash);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just to confirm, we only have a Base64Value if we have both process tags and a container hash. And we don't care about the race condition/consistency here? e.g. for outside observers it's possible for there to be a ContainerTagsHash but no Base64Value, and vice versa. As long as we're fine with those limitations, this is fine. Otherwise, we can create a "holder" type to hold both values, and you can swap that atomically.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes there can be (for a brief period) a ContainerTagsHash but a stale Base64Value, it's fine, no precise timing needed here.


private static string Compute(string processTags, string? containerTagsHash)
{
var hash = FnvHash64.GenerateHash(processTags, FnvHash64.Version.V1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we definitely want to use v1 here, not V1A? IIRC, most places use V1A? 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python and Java both use V1 and not V1A, but what really matters is consistency within a tracer between the sql comment and the tag on the span, so either would be fine

#endif

BinaryPrimitives.WriteUInt64LittleEndian(buf, hash); // write 8 bytes into a 12-byte buffer
Base64.EncodeToUtf8InPlace(buf, dataLength: 8, out var bytesWritten);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically we should check the return value for correct encoding - I'm kinda surprised the new analyzers don't flag it. But also, this is fine, and we can make other assumptions later, so big meh 🤷‍♂️

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah, but the return values are either Done or DestinationTooSmall. DestinationTooSmall should never happen because we computed the size pretty deterministically...
I can like, throw an exception if this happens so that it's catched in tests if this ever happens because... we put more data in there

telemetryProxyEndpoint: telemetryProxyEndpoint,
tracerFlareEndpoint: tracerFlareEndpoint,
containerTagsHash: _containerMetadata.ContainerTagsHash, // either the value just received, or the one we stored before (prevents overriding with null)
containerTagsHash: _serviceRemappingHash.ContainerTagsHash, // either the value just received, or the one we stored before (prevents overriding with null)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that correct though? 🤔 If the agent starts sending us null, is that not because something changed about the instance? Should we be second guessing the values we're sent here? (I haven't checked if the RFC spells this out)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see a practical case where we'd like to reset our container tags hash... Like, either they change or they stay the same, but they cannot really disappear, it'd mean that the app exited the container while still running ^^

At the same time, I don't see a real reason why the agent would reply with null, BUT it could happen if some code path don't request the value in the headers, which could happen for good reasons (we could, for performance reasons, say that we don't need the hash on every single request if we already have one). I feel like it's safer to write it that way, especialy considering how far in the code the code "requesting" the header is

getDiscoveryServiceFunc: static s => DiscoveryService.CreateUnmanaged(
s.TracerSettings.Manager.InitialExporterSettings,
ContainerMetadata.Instance,
new ServiceRemappingHash(null),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this seems fine to me, I can't imagine we support DBM or DSM with CI Visibility today 🤔

Configuration key for setting DBM propagation mode
Default value is disabled, expected values are either: disabled, service or full
<seealso cref="Datadog.Trace.Configuration.TracerSettings.DbmPropagationMode"/>
DD_DBM_INJECT_SQL_BASEHASH:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh this configuration key should end with _ENABLED so it should be something like DD_DBM_SQL_BASEHASH_INJECTION_ENABLED. If this ship hasn't already sailed in other languages, can we please fix it 🙏 🥺


namespace Datadog.Trace.Tests;

public class ServiceRemappingHashTests
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any example values from the RFC that we can use to validate the actual expected hash given a set of process tags and container tags hash? Would be useful for validating that the calculation is correct across languages

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no expected hash, as I said in an other reply above, we just need to be consistent with ourselves

vandonr and others added 3 commits March 25, 2026 13:41
@vandonr vandonr merged commit 0c457bf into master Mar 25, 2026
139 checks passed
@vandonr vandonr deleted the vandonr/process2 branch March 25, 2026 15:17
@github-actions github-actions bot added this to the vNext-v3 milestone Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants