ebpf: replace `binary.Read` with `binary.Decode` in `sysenc.Unmarshal` by kolesnikovae · Pull Request #1713 · cilium/ebpf

kolesnikovae · 2025-03-07T06:58:19Z

Hello team,

First off, thanks for the amazing work on this project! We use it extensively at Grafana, and we truly appreciate the effort that goes into maintaining it.

This PR removes unnecessary allocations in sysenc.Unmarshal caused by the binary.Read call. We propose using binary.Decode (introduced in Go 1.23) as an alternative. While the change may appear minor, its impact can be substantial depending on the specific use case.

In the flame graphs below, you can observe one of our services experiencing significant GC pressure, with around 30% of the CPU time spent in runtime.scanobject. The heap allocations profile further highlights the issue, tracing it to the sysenc.Unmarshal call within the ebpf.(*Map).Lookup -> ebpf.unmarshalPerCPUValue call chain.

CPU profile:

Heap allocs profile (alloc_objects):

The profiles were collected over the course of an hour from a fleet of observability agents using Pyroscope and the standard Go pprof endpoints (we also use cilium/ebpf for profiling, though not in this particular case).

Nearly half of all the allocations made by the program occur due to binary.Read in sysenc.Unmarshal. While sysenc.Unmarshal is generally well-optimized and only resorts to binary.Read as a last step, in our specific case, we miss out on most of these optimizations.

The exact locations where allocations occur:

https://github.com/cilium/ebpf/blob/v0.17.3/internal/sysenc/marshal.go#L108
- A negligible number of allocations attributed to sync.Pool
https://cs.opensource.google/go/go/+/refs/tags/go1.23.2:src/encoding/binary/binary.go;l=267
- d := &decoder is stack-allocated in binary.Decode

Benchmarks confirm the presence of an allocation in sysenc.Unmarshal. After applying the optimization from this PR, the benchmark results show zero allocations.

Benchmark Results

goos: darwin
goarch: arm64
pkg: github.com/cilium/ebpf/internal/sysenc
cpu: Apple M1 Pro
                                     │  before.txt  │              after.txt               │
                                     │    sec/op    │    sec/op     vs base                │
Unmarshal/*[1]uint64-10                 41.16n ± 0%   42.85n ±  1%   +4.11% (p=0.000 n=10)
Unmarshal/*int16-10                     19.30n ± 2%   19.58n ± 10%   +1.50% (p=0.009 n=10)
Unmarshal/*uint16-10                    19.27n ± 0%   19.35n ±  1%   +0.42% (p=0.019 n=10)
Unmarshal/*int32-10                     19.86n ± 1%   19.95n ±  1%        ~ (p=0.060 n=10)
Unmarshal/*uint32-10                    18.95n ± 0%   19.04n ±  1%        ~ (p=0.072 n=10)
Unmarshal/*int64-10                     19.58n ± 3%   19.65n ±  2%        ~ (p=0.643 n=10)
Unmarshal/*uint64-10                    19.64n ± 1%   19.58n ±  2%        ~ (p=0.255 n=10)
Unmarshal/[]uint8-10                    19.30n ± 0%   19.36n ±  1%        ~ (p=0.811 n=10)
Unmarshal/*sysenc.explicitPad-10        44.42n ± 1%   43.69n ±  2%   -1.65% (p=0.001 n=10)
Unmarshal/[]sysenc.explicitPad-10       43.81n ± 0%   42.86n ±  0%   -2.16% (p=0.000 n=10)
Unmarshal/[]sysenc.explicitPad#01-10    44.81n ± 1%   43.83n ±  1%   -2.21% (p=0.000 n=10)
Unmarshal/[]sysenc.explicitPad#02-10    44.53n ± 2%   44.77n ±  1%        ~ (p=0.127 n=10)
Unmarshal/*sysenc.struc-10              86.26n ± 0%   62.59n ±  0%  -27.44% (p=0.000 n=10)
Unmarshal/[]sysenc.struc-10             46.00n ± 1%   45.12n ±  0%   -1.91% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#01-10          95.92n ± 1%   70.00n ±  0%  -27.03% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#02-10         131.80n ± 0%   91.78n ±  0%  -30.37% (p=0.000 n=10)
geomean                                 36.10n        33.91n         -6.06%

                                     │  before.txt  │                after.txt                │
                                     │     B/op     │    B/op     vs base                     │
Unmarshal/*[1]uint64-10                0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int16-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint16-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int32-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint32-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int64-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint64-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]uint8-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.explicitPad-10       0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad-10      0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#01-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#02-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.struc-10             16.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc-10            0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.struc#01-10         16.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#02-10         24.00 ± 0%      0.00 ± 0%  -100.00% (p=0.000 n=10)


                                     │  before.txt  │                after.txt                │
                                     │  allocs/op   │ allocs/op   vs base                     │
Unmarshal/*[1]uint64-10                0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int16-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint16-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int32-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint32-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*int64-10                    0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*uint64-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]uint8-10                   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.explicitPad-10       0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad-10      0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#01-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.explicitPad#02-10   0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/*sysenc.struc-10             1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc-10            0.000 ± 0%     0.000 ± 0%         ~ (p=1.000 n=10) ¹
Unmarshal/[]sysenc.struc#01-10         1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)
Unmarshal/[]sysenc.struc#02-10         1.000 ± 0%     0.000 ± 0%  -100.00% (p=0.000 n=10)

Thank you for considering this change. Please let me know if you need any additional information or clarifications.

internal/sysenc/marshal.go

lmb · 2025-03-10T08:43:40Z

Looks good to me, thanks for your contribution! Can you add a test using testing.AllocsPerRun which asserts that the slow path doesn't have to allocate?

kolesnikovae · 2025-03-11T05:11:28Z

@lmb Thanks for the review! I've added TestUnmarshalSlowPathZeroAllocations with a subset of test cases that previously caused an allocation in the slow path. Alternatively, we could remove the !test.zeroAllocs condition in TestUnmarshalAllocations, and the test would still pass – let me know if you'd like me to do that.

ti-mo · 2025-03-12T12:52:45Z

@lmb Thanks for the review! I've added TestUnmarshalSlowPathZeroAllocations with a subset of test cases that previously caused an allocation in the slow path. Alternatively, we could remove the !test.zeroAllocs condition in TestUnmarshalAllocations, and the test would still pass – let me know if you'd like me to do that.

Yes, I think that would be ideal! No need for a separate test, let's update the existing one to reflect the new situation. Thanks!

kolesnikovae · 2025-03-14T10:07:12Z

Thank you for the review @ti-mo! I've applied the requested changes. Could you please take another look?

Remove allocations when unmarshaling values by using the new binary.Decode API. Signed-off-by: Anton Kolesnikov <[email protected]> Signed-off-by: Anton Kolesnikov <[email protected]>

lmb

Thanks! Added a comment to the tests and squashed the commits.

kolesnikovae requested a review from a team as a code owner March 7, 2025 06:58

kolesnikovae force-pushed the perf-sysenc-unmarshal-binary-decode branch from 1730e57 to fefaf4a Compare March 7, 2025 07:01

kolesnikovae commented Mar 7, 2025

View reviewed changes

internal/sysenc/marshal.go Show resolved Hide resolved

brycekahle approved these changes Mar 7, 2025

View reviewed changes

kolesnikovae force-pushed the perf-sysenc-unmarshal-binary-decode branch from 343b82e to 87a8080 Compare March 11, 2025 05:04

kolesnikovae force-pushed the perf-sysenc-unmarshal-binary-decode branch from dc9d6e6 to 7c794bb Compare March 14, 2025 06:29

ebpf: replace binary.Read with binary.Decode in sysenc.Unmarshal

d620a32

Remove allocations when unmarshaling values by using the new binary.Decode API. Signed-off-by: Anton Kolesnikov <[email protected]> Signed-off-by: Anton Kolesnikov <[email protected]>

lmb force-pushed the perf-sysenc-unmarshal-binary-decode branch from 7c794bb to d620a32 Compare March 17, 2025 11:16

lmb suggested changes Mar 17, 2025

View reviewed changes

lmb approved these changes Mar 17, 2025

View reviewed changes

lmb merged commit 0bfb15f into cilium:main Mar 17, 2025
17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

ebpf: replace `binary.Read` with `binary.Decode` in `sysenc.Unmarshal`#1713

ebpf: replace `binary.Read` with `binary.Decode` in `sysenc.Unmarshal`#1713
lmb merged 1 commit intocilium:mainfrom
kolesnikovae:perf-sysenc-unmarshal-binary-decode

kolesnikovae commented Mar 7, 2025 •

edited

Loading

Uh oh!

Uh oh!

lmb commented Mar 10, 2025

Uh oh!

kolesnikovae commented Mar 11, 2025

Uh oh!

ti-mo commented Mar 12, 2025

Uh oh!

kolesnikovae commented Mar 14, 2025

Uh oh!

lmb left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Comments

Conversation

kolesnikovae commented Mar 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

lmb commented Mar 10, 2025

Uh oh!

kolesnikovae commented Mar 11, 2025

Uh oh!

ti-mo commented Mar 12, 2025

Uh oh!

kolesnikovae commented Mar 14, 2025

Uh oh!

lmb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kolesnikovae commented Mar 7, 2025 •

edited

Loading