system-probe: Add support for building and testing Rust binary#44635
system-probe: Add support for building and testing Rust binary#44635dd-mergequeue[bot] merged 6 commits intomainfrom
Conversation
Static quality checks✅ Please find below the results from static quality gates 31 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 0e29828 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -1.52 | [-4.50, +1.46] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | ddot_metrics | memory utilization | +1.04 | [+0.81, +1.26] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +0.98 | [-0.50, +2.46] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | +0.41 | [+0.26, +0.56] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.37 | [+0.14, +0.60] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.21 | [+0.15, +0.27] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.20 | [+0.15, +0.24] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_memory | memory utilization | +0.19 | [+0.12, +0.27] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.14 | [+0.07, +0.21] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.11 | [+0.07, +0.15] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.10 | [-0.05, +0.26] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.08, +0.09] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.14, +0.12] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.14, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.03 | [-0.46, +0.39] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.03 | [-0.42, +0.35] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.05 | [-0.10, -0.01] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.06 | [-0.43, +0.31] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.09 | [-0.14, -0.03] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.16 | [-0.25, -0.06] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.43 | [-0.64, -0.22] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -1.52 | [-4.50, +1.46] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -1.79 | [-1.99, -1.60] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -2.28 | [-2.40, -2.17] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | links |
|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | lost_bytes | 10/10 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
cb095bd to
2012a5b
Compare
usamasaqib
left a comment
There was a problem hiding this comment.
Looks good to me. Just added a suggestion for resolving the glibc issue.
| platformVersion, err := kernel.PlatformVersion() | ||
| require.NoError(t, err) | ||
|
|
||
| if platform == "centos" && strings.HasPrefix(platformVersion, "7") { |
There was a problem hiding this comment.
When building test artifacts we generally use a custom gcc toolchain which has an appropriately old version of glibc. The path for this is passed through DD_CC env var in the gitlab job. I think you should be able to pass this through to cargo somehow to workaround this problem.
There was a problem hiding this comment.
DD_CC env var: we (@DataDog/agent-build) advise against using non-hermetic dependencies whenever possible, even for testing purposes. The available cc toolchains should allow you to reference their hermetic components (compiler, glibc, etc.)
That being said, we acknowledge the transitional state of the ongoing migration to bazel and can revisit at a later stage should you decide to go for a host-provided compiler at this stage.
There was a problem hiding this comment.
I think one part of the problem is that the C/C++ toolchain we use for Bazel doesn't support -static-pie, otherwise we could try to build the test binaries statically. That toolchain is built by us (in some other repo) so we could maybe change it to be built with the appropriate support.
vagrant@agent-dev-ubuntu-24:~$ echo 'int main(){}' | gcc -static-pie -x c -
[0.067s] vagrant@agent-dev-ubuntu-24:~$ echo 'int main(){}' | .cache/bazel/_bazel_vagrant/8a1be3ba35926e170a8cb57e37b5b3bd/external/gcc_toolchain++gcc_toolchains+gcc_toolchain_aarch64/xbin/gcc -static-pie -x c -
/home/vagrant/.cache/bazel/_bazel_vagrant/8a1be3ba35926e170a8cb57e37b5b3bd/external/gcc_toolchain++gcc_toolchains+gcc_toolchain_aarch64/bin/../lib/gcc/aarch64-unknown-linux-gnu/12.3.0/../../../../aarch64-unknown-linux-gnu/bin/ld.bfd: cannot find rcrt1.o: No such file or directory
collect2: error: ld returned 1 exit status
[0.046s] 1 vagrant@agent-dev-ubuntu-24:~$ echo 'int main(){}' | .cache/bazel/_bazel_vagrant/8a1be3ba35926e170a8cb57e37b5b3bd/external/gcc_toolchain++gcc_toolchains+gcc_toolchain_aarch64/xbin/gcc -static -x c -
[0.081s] vagrant@agent-dev-ubuntu-24:~$
In CI, enable rustfmt and clippy checks for Rust targets during the build phase. This ensures that the rules are enforced by CI while still allowing developers to temporarily ignore them locally during development.
Conflicts: MODULE.bazel MODULE.bazel.lock
|
/merge |
|
View all feedbacks in Devflow UI.
The expected merge time in
|
### What does this PR do? Import the Rust code for the privileged discovery agent and hook it up to the system-probe build/test system, replacing the sample binary which was added previously in #44635. This binary is not currently shipped. The contents of pkg/discovery/module/rust are taken from DataDog/sd-agent@d1255b6. No modifications have been done in this PR except removing these files which are now unnecessary/unused: ``` delete mode 100644 pkg/discovery/module/rust/.bazelignore delete mode 100644 pkg/discovery/module/rust/.bazelrc delete mode 100644 pkg/discovery/module/rust/.bazelversion delete mode 100644 pkg/discovery/module/rust/.gitattributes delete mode 100644 pkg/discovery/module/rust/.github/CODEOWNERS delete mode 100644 pkg/discovery/module/rust/.github/chainguard/self.gitlab.read.sts.yaml delete mode 100644 pkg/discovery/module/rust/.github/workflows/bazel-ci.yaml delete mode 100644 pkg/discovery/module/rust/.github/workflows/ci.yaml delete mode 100644 pkg/discovery/module/rust/.github/workflows/release.yaml delete mode 100644 pkg/discovery/module/rust/.gitignore delete mode 100644 pkg/discovery/module/rust/MODULE.bazel delete mode 100644 pkg/discovery/module/rust/MODULE.bazel.lock ``` Any comments on the code itself will preferably be addressed in follow-up PRs, to keep the import clean. Note also that there is also future work needed to ensure that the resulting binary is built with the exact size optimization options, by applying the build options from the Cargo.toml (or the removed .bazelrc) to a more suitable place. This is also postponed to a later point since the binary is not being shipped yet. ### Motivation https://datadoghq.atlassian.net/browse/DSCVR-313 ### Describe how you validated your changes CI. ### Additional Notes For the handling of external crates (dependencies), I've added them via `crate_universe`'s [direct dependencies](https://bazelbuild.github.io/rules_rust/crate_universe_bzlmod.html#direct-dependencies). The alternative would have been to point to a Cargo.toml. Pointing to pkg/discovery/module/rust's from the main MODULE.bazel didn't seem correct, and adding a top-level Cargo.toml didn't seem like the way to go since we'll only be supporting Bazel for (top-level) builds. The organization of the crates.MODULE.bazel file will likely have to be revisited in the future if we get different Rust packages with conflicting requirements for crate features or versions. Co-authored-by: vincent.whitchurch <[email protected]>
### What does this PR do? Import the Rust code for the privileged discovery agent and hook it up to the system-probe build/test system, replacing the sample binary which was added previously in #44635. This binary is not currently shipped. The contents of pkg/discovery/module/rust are taken from https://github.com/DataDog/sd-agent/commit/d1255b66edd221f07f7fa30b6372d310c9e97a1a. No modifications have been done in this PR except removing these files which are now unnecessary/unused: ``` delete mode 100644 pkg/discovery/module/rust/.bazelignore delete mode 100644 pkg/discovery/module/rust/.bazelrc delete mode 100644 pkg/discovery/module/rust/.bazelversion delete mode 100644 pkg/discovery/module/rust/.gitattributes delete mode 100644 pkg/discovery/module/rust/.github/CODEOWNERS delete mode 100644 pkg/discovery/module/rust/.github/chainguard/self.gitlab.read.sts.yaml delete mode 100644 pkg/discovery/module/rust/.github/workflows/bazel-ci.yaml delete mode 100644 pkg/discovery/module/rust/.github/workflows/ci.yaml delete mode 100644 pkg/discovery/module/rust/.github/workflows/release.yaml delete mode 100644 pkg/discovery/module/rust/.gitignore delete mode 100644 pkg/discovery/module/rust/MODULE.bazel delete mode 100644 pkg/discovery/module/rust/MODULE.bazel.lock ``` Any comments on the code itself will preferably be addressed in follow-up PRs, to keep the import clean. Note also that there is also future work needed to ensure that the resulting binary is built with the exact size optimization options, by applying the build options from the Cargo.toml (or the removed .bazelrc) to a more suitable place. This is also postponed to a later point since the binary is not being shipped yet. ### Motivation https://datadoghq.atlassian.net/browse/DSCVR-313 ### Describe how you validated your changes CI. ### Additional Notes For the handling of external crates (dependencies), I've added them via `crate_universe`'s [direct dependencies](https://bazelbuild.github.io/rules_rust/crate_universe_bzlmod.html#direct-dependencies). The alternative would have been to point to a Cargo.toml. Pointing to pkg/discovery/module/rust's from the main MODULE.bazel didn't seem correct, and adding a top-level Cargo.toml didn't seem like the way to go since we'll only be supporting Bazel for (top-level) builds. The organization of the crates.MODULE.bazel file will likely have to be revisited in the future if we get different Rust packages with conflicting requirements for crate features or versions. Co-authored-by: vincent.whitchurch <[email protected]>
What does this PR do?
As a first step towards integrating the minimal Rust privileged agent for
discovery, add basic support for building a Rust binary and invoking it from
the go tests, including in KMT. The binary is not shipped.
Since this is the first use of the Rust toolchain in the Datadog agent build,
this is done using a dummy program in order to separate any build, toolchain
and CI related issues and concerns from any concerns about the actual new
privileged agent (which will be imported in a separate, later step).
Note that this also does not deal with the question of invoking Rust unit tests
in the agent build and in KMT, that will also be dealt with separately.
Motivation
https://datadoghq.atlassian.net/browse/DSCVR-313
Describe how you validated your changes
CI
Additional Notes
The Rust binary is not built statically and due to this has issues running on
Centos 7.9 which likely has an older glibc (I wasn't able to debug this fully
since I had issues setting up that distro in KMT locally).
Building the binary statically with -C target-feature=+crt-static did not work
due to rcrt1.o apparently missing in the toolchain.
A fix could have been to include and use the musl toolchain which is the one
recommended for statically compliing Rust, but instead of complicating things
further I just skipped the tests on Centos 7.x for now.