Conversation
burak-ok
left a comment
There was a problem hiding this comment.
Thank you Alban for making it less memory hungry and faster :)
| fields := bytes.Fields(scanner.Bytes()) | ||
| line := scanner.Bytes() | ||
| // bytes.Fields is expensive, so filter for '[' first | ||
| if !bytes.ContainsRune(line, '[') { |
There was a problem hiding this comment.
I am not that sure (since I don't know the format that well), but if the closing bracket ] is always at the end of the line we could utilize that to improve performance even further
There was a problem hiding this comment.
Do you mean replacing bytes.ContainsRune by something like this?
if strings.HasSuffix(line, "]")That should filter the same amount of lines as bytes.ContainsRune, so it's a matter of if strings.HasSuffix is faster than bytes.ContainsRune. I have not tested it but I suspect it is similar for short lines like the ones in kallsyms, and in any cases would not change the amount of memory allocations.
The printfs for /proc/kallsyms are implemented here:
https://github.com/torvalds/linux/blob/eca631b8fe808748d7585059c4307005ca5c5820/kernel/kallsyms.c#L446-L465
The last change was 3 years ago to add the buildid inside the brackets:
torvalds/linux@9294523#diff-9538b26d3e082f233e6adac664cd2c14cbf2d510d5d7f286eef329c58de87eadR451
This is only for kernels with CONFIG_STACKTRACE_BUILD_ID (which is not set for my kernel). But the way cilium/ebpf parses the line is fine with regards to that change.
So I would say the format seems more or less stable but can change. And testing "contains-[" should be more stable than testing "suffix-]", in case Linux adds new fields at the end.
That is what I meant. Instead of looping through every rune in the string it only needs to check the last one.
That is what I had feared |
|
This is a nice band-aid, but conflicts with what #1578 is trying to do. Reading this PR here made me realize the proposal in #1578 is even slower and more expensive, that will need to be improved, and we shouldn't have multiple ways of scanning kallsyms. As an experiment, I wrote a naive barebones line parser for kallsyms that calls I got your results down to this, basically cutting down another 3/4 of the size and number of allocations: Note: this similarly skips lines containing Perhaps we can write a simple kallsyms iterator on top of This avoids having to allocate an individual |
In tests with pprof in Inspektor Gadget, I see allocations getting
reduced from 34MB to 12MB.
I also added a standalone benchmark.
tl;dr: 12% faster, 42% less allocations in bytes, 48% less allocations
$ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee main.bench
$ go test -run=BenchmarkKernelModuleMemory -bench=. ./internal/kallsyms/... -count 10 | tee patched.bench
$ benchstat main.bench patched.bench
goos: linux
goarch: amd64
pkg: github.com/cilium/ebpf/internal/kallsyms
cpu: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
│ main.bench │ patched.bench │
│ sec/op │ sec/op vs base │
KernelModuleMemory-4 299.5m ± 9% 262.3m ± 11% -12.44% (p=0.000 n=10)
│ main.bench │ patched.bench │
│ B/op │ B/op vs base │
KernelModuleMemory-4 37.70Mi ± 0% 21.67Mi ± 0% -42.53% (p=0.000 n=10)
│ main.bench │ patched.bench │
│ allocs/op │ allocs/op vs base │
KernelModuleMemory-4 432.1k ± 0% 221.9k ± 0% -48.64% (p=0.000 n=10)
Signed-off-by: Alban Crequy <[email protected]>
|
I rebased my PR to fix the conflict with #1578. Your idea of custom scanner/iterator looks interesting. Is it something you want to write after this PR gets merged? |
In tests with pprof in Inspektor Gadget, I see allocations getting reduced from 34MB to 12MB.
I also added a standalone benchmark.
tl;dr: 12% faster, 42% less allocations in bytes, 48% less allocations
xref inspektor-gadget/inspektor-gadget#3571 (comment)
cc @burak-ok