[PoC] Introduce new flag SpecCacheDisabled & Parse only the requires BTF types#1755

Closed
burak-ok wants to merge 1 commit into cilium:main from inspektor-gadget:burak/btf_filter

burak-ok commented Apr 16, 2025

Based on #1589

This is a proof-of-concept. The code is not ready for merging but it shows it is possible to significantly reduce the memory consumption (by 27MB).


This PR aims to lower the memory footprint when using the cilium/ebpf library. This is achieved in two ways:

  1. Reducing memory usage: Adding a new flag which disables caching of the BTF spec
  2. Parsing only the needed BTF types

The tradeoff for lowering the memory footprint is, of course, performance while loading eBPF programs etc.

In this PoC, using the new SpecCacheDisabled flag (1) also automatically checks which BTF types are needed and loads/parses only those (2).


Benchmarks:

  1. The old.txt ParseVmlinux result is the base for both rows
  2. The new.txt InspektorGadget run uses the types from #1755
  3. The new.txt ParseVmlinux run is called BenchmarkParseVmlinuxWithoutFilter in this PR
goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf/btf
cpu: 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
                     │   old.txt   │               new.txt                │
                     │   sec/op    │   sec/op     vs base                 │
ParseVmlinux-8         29.28m ± 3%   60.62m ± 8%  +107.06% (p=0.000 n=10)
InspektorGadget-8      29.28m ± 3%   12.94m ± 3%   -55.79% (p=0.000 n=10)
geomean                29.28m        28.01m         -4.32%

                     │   old.txt    │                new.txt                │
                     │     B/op     │     B/op      vs base                 │
ParseVmlinux-8         24.43Mi ± 0%   53.58Mi ± 0%  +119.37% (p=0.000 n=10)
InspektorGadget-8      24.43Mi ± 0%   10.89Mi ± 0%   -55.40% (p=0.000 n=10)
geomean                24.43Mi        24.16Mi         -1.08%

                     │   old.txt   │               new.txt               │
                     │  allocs/op  │  allocs/op   vs base                │
ParseVmlinux-8         271.9k ± 0%   365.1k ± 0%  +34.30% (p=0.000 n=10)
InspektorGadget-8      271.9k ± 0%   110.1k ± 0%  -59.51% (p=0.000 n=10)
geomean                271.9k        200.5k       -26.26%

This flag disables the caching of the BTF which reduces the memory
footprint.
Furthermore this also only parses the needed symbols out of the BTF
instead of reading and interpreting everything.

Signed-off-by: Burak Ok <[email protected]>
Co-authored-by: Alban Crequy <[email protected]>

lmb commented Apr 16, 2025

Seems like vmlinux spec keeps being a problem! Just checking: when you say memory usage you mean heap at idle? Calling https://pkg.go.dev/github.com/cilium/ebpf/btf#FlushKernelSpec does not help?

How would you determine which types to parse from vmlinux?

burak-ok commented

> Seems like vmlinux spec keeps being a problem! Just checking: when you say memory usage you mean heap at idle? Calling https://pkg.go.dev/github.com/cilium/ebpf/btf#FlushKernelSpec does not help?

Yes, that would help lower heap usage after the program has started. But if one sets a low memory limit in a pod spec, one also needs to avoid high memory usage during initialization, that is, while loading every program.

Furthermore, with FlushKernelSpec we need to find and identify every possible call into the ebpf library that might parse and save the whole KernelSpec. With a flag, we can set it once and forget about flushing the kernel spec.
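
The difference between flushing a cache after the fact and never filling it can be sketched with a toy loader. Everything below (`spec`, `loader`, `cacheDisabled`, `flush`) is hypothetical and only mimics the shape of the discussion; it is not how cilium/ebpf implements its kernel spec cache, and SpecCacheDisabled exists only in this PoC.

```go
package main

import (
	"fmt"
	"sync"
)

// spec is a hypothetical stand-in for a parsed kernel BTF spec.
type spec struct{ types []string }

// loader caches the parsed spec unless cacheDisabled is set.
type loader struct {
	mu            sync.Mutex
	cached        *spec
	cacheDisabled bool
	parses        int // how many times the (pretend) expensive parse ran
}

// load returns the cached spec if there is one, otherwise parses again.
func (l *loader) load() *spec {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cached != nil {
		return l.cached
	}
	l.parses++
	s := &spec{types: []string{"task_struct", "nsproxy"}}
	if !l.cacheDisabled {
		l.cached = s
	}
	return s
}

// flush drops the cached spec so the garbage collector can reclaim it.
// The caller has to remember to call it after every code path that loads.
func (l *loader) flush() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.cached = nil
}

func main() {
	cached := &loader{}
	cached.load()
	cached.load()
	fmt.Println("cached parses:", cached.parses, "resident:", cached.cached != nil)
	cached.flush()
	fmt.Println("after flush resident:", cached.cached != nil)

	uncached := &loader{cacheDisabled: true}
	uncached.load()
	uncached.load()
	fmt.Println("uncached parses:", uncached.parses, "resident:", uncached.cached != nil)
}
```

With the flag set, nothing stays resident and no flush call is needed, at the cost of parsing again on every load.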

> How would you determine which types to parse from vmlinux?

For that we read all the relo.TypeNames out of the program at the point where the relocations get applied: https://github.com/cilium/ebpf/pull/1755/files#diff-981ef293a9c93614e843135eb5b207f951babe5604aee33c72f66567ccaa01de

ti-mo commented Apr 17, 2025

@lmb To me it sounds like lazy-decoding could've been a better avenue to explore after all?

Not sure how (in)feasible it is today, but iirc we had btf.Spec.Add() back in the day which was a blocker. Now we have btf.Builder, we could technically, hypothetically, make btf.Spec a querying layer over an encoded btf blob and only inflate what's queried, and cache the results to enable type comparisons. Or, implement comparers on all types if we don't want to cache anything. Seems like some (most?) users care more about keeping both resident and peak memory usage low rather than speed.

lmb commented Apr 21, 2025

My main concern is / was complexity of a lazy decoder. The whole "fixups" concept needs to be redone... I think you are right that peak usage seems more important.

I see two avenues. First, there is some perf to be gained by not unmarshaling into an interface for rawType, I think. Second is the lazy decode you mentioned. I still have some old proofs of concept lying around; I'll push those somewhere.

lmb commented Apr 21, 2025

@burak-ok could you come up with a list of types which you most frequently need from vmlinux? That way we can add a benchmark we can start optimising against. Right now that benchmark is decoding all of vmlinux which isn't useful.

burak-ok commented

I hope the following helps:

Here is a list from a single program which gets loaded every time for Inspektor Gadget:

syscall_trace_enter
task_struct
nsproxy
mnt_namespace

A combined list from multiple programs that get loaded every time for Inspektor Gadget:

pt_regs
file
inode
super_block
socket
syscall_trace_enter
task_struct
nsproxy
mnt_namespace
fanotify_event
pid
trace_event_raw_sched_process_exec

And another list, combining the programs that get loaded every time with 4 gadgets (trace_tcp, trace_dns, trace_exec, top_file):

pt_regs
file
inode
super_block
socket
syscall_trace_enter
task_struct
nsproxy
mnt_namespace
fanotify_event
pid
trace_event_raw_sched_process_exec
fs_struct
path
mount
qstr
vfsmount
dentry
bpf_func_id
mm_struct
syscall_trace_exit
linux_binprm
sock
net
inet_sock

lmb added a commit to lmb/ebpf that referenced this pull request Apr 23, 2025
Add a benchmark which replicates the types used by Inspektor Gadget
for a common configuration.

See cilium#1755 (comment)

Signed-off-by: Lorenz Bauer <[email protected]>
lmb mentioned this pull request Apr 23, 2025

lmb commented Apr 23, 2025

@burak-ok can you take a look at #1763? Doesn't address FlushKernelSpec, but I think that might be better done by relying on weak in Go 1.24.

burak-ok commented

> @burak-ok can you take a look at #1763?

I added results for roughly the same benchmarks that you posted to the top post of this PR.
Comparing these results, your PR saves roughly the same amount of memory and allocations for the InspektorGadget benchmark.
But the ParseVmlinux benchmark shows that my draft PR has more disadvantages for the general use case of reading and parsing every type.

I'll take a deeper look into your PR some time later, thanks for opening it.

> Doesn't address FlushKernelSpec, but I think that might be better done by relying on weak in Go 1.24.

Given our memory limits, I think this would be the best-case scenario for us.

lmb added a commit to lmb/ebpf that referenced this pull request Apr 28, 2025
Add a benchmark which replicates the types used by Inspektor Gadget
for a common configuration. Also add a benchmark which explicitly
iterates all types in vmlinux, which is similar to what pwru does.

See cilium#1755 (comment)

Signed-off-by: Lorenz Bauer <[email protected]>
lmb added a commit to lmb/ebpf that referenced this pull request Apr 30, 2025
Add a benchmark which replicates the types used by Inspektor Gadget
for a common configuration. Also add a benchmark which explicitly
iterates all types in vmlinux, which is similar to what pwru does.

See cilium#1755 (comment)

Signed-off-by: Lorenz Bauer <[email protected]>
lmb added a commit that referenced this pull request May 1, 2025
Add a benchmark which replicates the types used by Inspektor Gadget
for a common configuration. Also add a benchmark which explicitly
iterates all types in vmlinux, which is similar to what pwru does.

See #1755 (comment)

Signed-off-by: Lorenz Bauer <[email protected]>

lmb commented May 7, 2025

The new lazy BTF code is in. @burak-ok could you try the code and report back how much of a difference it makes?

lmb closed this May 7, 2025
gustavo-iniguez-goya added a commit to evilsocket/opensnitch that referenced this pull request Jul 7, 2025
This version introduces a lot of improvements.

It's worth mentioning this one cilium/ebpf#1755,
which reduces memory usage by ~25MB.

It also bumps go version to 1.23.