Conversation

@manugupt1 (Contributor) commented Apr 1, 2023

PR captures the reasoning behind this:
https://github.com/containerd/cgroups/pull/275/files#r1152562543

Reduce allocs in ReadUint64 by pre-allocating byte buffer

Before

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkReaduint64$ github.com/containerd/cgroups/v3/cgroup1 -count=1 -timeout 30m

goos: linux
goarch: amd64
pkg: github.com/containerd/cgroups/v3/cgroup1
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkReaduint64-8             105118              9762 ns/op             848 B/op          5 allocs/op
PASS
ok      github.com/containerd/cgroups/v3/cgroup1        1.151s

After

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkReaduint64$ github.com/containerd/cgroups/v3/cgroup1 -count=1 -timeout 30m

goos: linux
goarch: amd64
pkg: github.com/containerd/cgroups/v3/cgroup1
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkReaduint64-8             141484              8322 ns/op             128 B/op          3 allocs/op
PASS
ok      github.com/containerd/cgroups/v3/cgroup1        1.274s
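For orientation, a minimal sketch of the pre-allocated-buffer approach the title describes, assuming a readUint-style helper; strconv.ParseUint stands in for the package's own parseUint, the 128-byte size mirrors the "After" numbers rather than the exact diff, and the path in main is only an example:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readUint reads a single uint64 from a cgroup-style file using a small,
// pre-allocated buffer instead of letting os.ReadFile size one for us.
func readUint(path string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	// Fixed-size buffer: cgroup values are short, so 128 bytes is plenty.
	buf := make([]byte, 128)
	n, err := f.Read(buf)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(buf[:n])), 10, 64)
}

func main() {
	// Example path only; any single-value cgroup or /proc file works.
	v, err := readUint("/proc/self/loginuid")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(v)
}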

@manugupt1 manugupt1 force-pushed the without-sync branch 3 times, most recently from 04e821f to 213ea6a Compare April 1, 2023 23:58
@manugupt1 manugupt1 changed the title Have a byte buffer in ReadUint64 and use io.Copy to call into that buffer Reduce allocs in ReadUint64 by pre-allocating byte buffer Apr 2, 2023
@manugupt1 (Contributor, author):

Cc @fuweid @dcantah @AkihiroSuda

cgroup1/utils.go Outdated
Comment on lines 140 to 148
defer f.Close()
b := new(bytes.Buffer)
b.Reset()
_, err = io.Copy(b, f)
if err != nil {
return 0, err
}

return parseUint(strings.TrimSpace(b.String()), 10, 64)
Member:

Seems like just using a raw byte slice cuts allocs down to 4 (although I get 7 allocs on my machine while testing your change, instead of the 6 you posted). It doesn't need to be 80 bytes either, but > 64 might be necessary, seeing as this code took precautions to trim the string, so I'm assuming there are cases where there may be trailing space in these files.

f, err := os.Open(path)
if err != nil {
	return 0, err
}
defer f.Close()

buf := make([]byte, 80)
n, err := f.Read(buf)
if err != nil {
	return 0, err
}

return parseUint(strings.TrimSpace(string(buf[:n])), 10, 64)

Member:

goos: linux
goarch: amd64
pkg: github.com/containerd/cgroups/v3/cgroup1
cpu: AMD Ryzen 9 5900X 12-Core Processor            
BenchmarkReaduint64-24            849498              1342 ns/op             136 B/op          4 allocs/op
PASS
ok      github.com/containerd/cgroups/v3/cgroup1        1.160s

Contributor (author):

Yeah! It is significantly lower. I used bytes.Buffer because of TrimSpace, but adding extra bytes works as well.
I think our benchmark runs are different as well (mine ends with -8 and yours with -24); maybe that is why the alloc counts differ.

@manugupt1 (Contributor, author) commented Apr 2, 2023:

I think we can benchmark bytes.Buffer (mostly because I think it is better not to be too smart about guessing file lengths while reading from them) in the places where we do ReadFile across the entire package, and see if we can reduce allocations in follow-up PRs.
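A hedged sketch of what such a follow-up benchmark could look like (a hypothetical *_test.go file; the path and benchmark names are placeholders, not code from this repo):

package cgroup1

import (
	"os"
	"strconv"
	"strings"
	"testing"
)

// benchPath is a placeholder; any small single-value file works, though a
// /proc or cgroupfs file is closer to the real call sites.
const benchPath = "/proc/self/loginuid"

// BenchmarkReadFileUint64 measures the os.ReadFile-based pattern used
// elsewhere in the package.
func BenchmarkReadFileUint64(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		v, err := os.ReadFile(benchPath)
		if err != nil {
			b.Fatal(err)
		}
		if _, err := strconv.ParseUint(strings.TrimSpace(string(v)), 10, 64); err != nil {
			b.Fatal(err)
		}
	}
}

// BenchmarkFixedBufferUint64 measures the pre-allocated-buffer pattern
// introduced in this PR.
func BenchmarkFixedBufferUint64(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		f, err := os.Open(benchPath)
		if err != nil {
			b.Fatal(err)
		}
		buf := make([]byte, 128)
		n, err := f.Read(buf)
		f.Close()
		if err != nil {
			b.Fatal(err)
		}
		if _, err := strconv.ParseUint(strings.TrimSpace(string(buf[:n])), 10, 64); err != nil {
			b.Fatal(err)
		}
	}
}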

Member:

That's fine with me

@dcantah (Member) left a comment:

LGTM (just one suggestion)

cgroup1/utils.go Outdated
 	}
-	return parseUint(strings.TrimSpace(string(v)), 10, 64)
+	defer f.Close()
+	b := make([]byte, 80)
Member:

Actually, one more comment. Perhaps we should put a note above this line explaining the reasoning behind 80, e.g.: it's an arbitrary value above 64 to account for trailing space (the most I remember seeing was a single space, but 🤷‍♂️), it should fit on the stack, and the extra 16 bytes over 64 shouldn't hurt anyone.
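One possible wording, offered only as a sketch of such a suggestion (not the exact comment that ended up in the PR):

	// Pre-allocate a fixed-size buffer for the value. The size is somewhat
	// arbitrary: anything above 64 bytes leaves room for a uint64 plus any
	// trailing whitespace, and a buffer this small should stay on the stack.
	b := make([]byte, 80)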

@manugupt1 (Contributor, author) commented Apr 4, 2023:

PTAL and let me know if the comment needs more modification. Could you add a suggestion with the entire wording, so that I can commit it directly? Also, thanks for the quick reviews.

@kolyshkin (Contributor):

Before

goos: linux
goarch: amd64
pkg: github.com/containerd/cgroups/v3/cgroup1
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkPids-8             59758             19905 ns/op            2064 B/op         15 allocs/op

Not sure why you use BenchmarkPids before and BenchmarkReaduint64 after.

@kolyshkin (Contributor):

Technically speaking, the benchmark should use a real cgroupfs file (or any other file from /proc). The reason is that normal files have a size, and os.ReadFile uses that information to prepare a buffer of the needed size. OTOH, many cgroupfs (and /proc) files report a size of 0, which makes os.ReadFile allocate a 512-byte buffer.

I would recommend reading something like /proc/self/loginuid.
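A small sketch illustrating the point, assuming /proc/self/loginuid is available (the regular file used for contrast is just an example): os.ReadFile sizes its buffer from Stat, and these special files report a size of 0, so it falls back to a 512-byte allocation.

package main

import (
	"fmt"
	"os"
)

func main() {
	// /proc/self/loginuid reports a size of 0 even though it has content, so
	// os.ReadFile cannot size its buffer from Stat and falls back to 512
	// bytes; /etc/hostname is a regular file with a real size, for contrast.
	for _, p := range []string{"/proc/self/loginuid", "/etc/hostname"} {
		fi, err := os.Stat(p)
		if err != nil {
			fmt.Println(p, err)
			continue
		}
		data, err := os.ReadFile(p)
		if err != nil {
			fmt.Println(p, err)
			continue
		}
		fmt.Printf("%-22s stat size=%d, bytes actually read=%d\n", p, fi.Size(), len(data))
	}
}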

@kolyshkin (Contributor):

I see the number of allocations going down from 5 to 3 when using Go 1.20.2 (or 6 to 4 when using Go 1.19.6). This is far from the savings reported in the PR description (which is obviously wrong, as it used BenchmarkPids for the "before" measurement).

Of the 2 allocations we save, one comes from the f.Stat call made by os.ReadFile. Indeed, we don't need stat() when working with cgroup files. The other, I guess, comes from inlining (if I move the file reading to a separate function, the number of allocs goes up from 3 to 4).

So yes, the commit makes sense; the description, though, needs to be modified to:

  • show the correct before and after measurements
  • explain that we avoid os.ReadFile because:
    • it calls Stat, which we do not need
    • it pre-allocates 512 bytes while we're fine with 80

In addition, I would change 80 to 128 for alignment.

@kolyshkin (Contributor):

Finally, note that there are a few other implementations in this repo that are very similar to readUint. They are all in the cgroup2 package; I was able to find two:

  • getStatFileContentUint64 (uses os.ReadFile)
  • readSingleFile (uses io.ReadAll)

I think those need to be either consolidated or fixed in a similar manner. Otherwise, we'll have the same problem in a couple of years once everyone has switched to cgroup v2.
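As a rough illustration of the consolidation idea, a hypothetical shared helper that single-value readers such as getStatFileContentUint64 could sit on top of (names, location, and buffer size are illustrative, not actual cgroup2 code):

package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readSingleValue reads one short value from a cgroup interface file using a
// fixed buffer, mirroring the cgroup1 readUint change.
func readSingleValue(dir, file string) (string, error) {
	f, err := os.Open(filepath.Join(dir, file))
	if err != nil {
		return "", err
	}
	defer f.Close()

	buf := make([]byte, 128)
	n, err := f.Read(buf)
	if err != nil && err != io.EOF {
		return "", err
	}
	return strings.TrimSpace(string(buf[:n])), nil
}

// readSingleUint64 shows what a getStatFileContentUint64-style caller might
// look like on top of the shared helper.
func readSingleUint64(dir, file string) (uint64, error) {
	s, err := readSingleValue(dir, file)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(s, 10, 64)
}

func main() {
	// Example only; real callers would pass a cgroup v2 directory and a file
	// such as pids.current.
	v, err := readSingleUint64("/proc/self", "loginuid")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(v)
}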

@dcantah (Member) commented Apr 4, 2023:

@kolyshkin Oops, you beat me to it with your last comment. I was drafting an issue for similar fixes we could make here. I have the same two you found so far; there may be more.

@dcantah (Member) commented Apr 4, 2023:

> Technically speaking, the benchmark should use a real cgroupfs file (or any other file from /proc). The reason is that normal files have a size, and os.ReadFile uses that information to prepare a buffer of the needed size. OTOH, many cgroupfs (and /proc) files report a size of 0, which makes os.ReadFile allocate a 512-byte buffer.

It's probably better to test on some magic file, so +1, but given that the change here is for us to provide the buffer, the ReadFile behavior doesn't come into the picture, right?

@manugupt1 (Contributor, author) commented Apr 5, 2023:

Yes to all of the above. I will do follow-ups in the coming days. I also noticed the same things.

😂

@kolyshkin (Contributor):

> It's probably better to test on some magic file, so +1, but given that the change here is for us to provide the buffer, the ReadFile behavior doesn't come into the picture, right?

os.ReadFile is used in the "before" benchmark.

@dcantah (Member) commented Apr 5, 2023:

> > It's probably better to test on some magic file, so +1, but given that the change here is for us to provide the buffer, the ReadFile behavior doesn't come into the picture, right?
>
> os.ReadFile is used in the "before" benchmark.

Ahh, I see what you were getting at now 👍.

@manugupt1 manugupt1 force-pushed the without-sync branch 2 times, most recently from 31cb059 to d17b8e1 Compare April 5, 2023 05:14
@manugupt1 (Contributor, author):

Sorry; I got mixed up while working. @kolyshkin is right! It is 5 --> 3

@manugupt1 manugupt1 requested a review from kolyshkin April 5, 2023 05:14
Before

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkReaduint64$ github.com/containerd/cgroups/v3/cgroup1 -count=1 -timeout 30m

goos: linux
goarch: amd64
pkg: github.com/containerd/cgroups/v3/cgroup1
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkReaduint64-8   	  105118	      9762 ns/op	     848 B/op	       5 allocs/op
PASS
ok  	github.com/containerd/cgroups/v3/cgroup1	1.151s

After

Running tool: /usr/local/go/bin/go test -benchmem -run=^$ -bench ^BenchmarkReaduint64$ github.com/containerd/cgroups/v3/cgroup1 -count=1 -timeout 30m

goos: linux
goarch: amd64
pkg: github.com/containerd/cgroups/v3/cgroup1
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkReaduint64-8   	  141484	      8322 ns/op	     128 B/op	       3 allocs/op
PASS
ok  	github.com/containerd/cgroups/v3/cgroup1	1.274s

Signed-off-by: Manu Gupta <[email protected]>
@kolyshkin (Contributor) left a comment:

LGTM, except we need to also address these: #278 (comment)

@dcantah (Member) left a comment:

LGTM. @kolyshkin, I was working on fixing the two cases we found for cg2 (which reminds me I should open my issue here..)

@brancz commented Apr 6, 2023:

Ah awesome, I was going to get back to this today, glad you already worked it out! Re-running the benchmarks I now see a total improvement of ~25%!

$ benchstat old.txt new.txt
name          old time/op    new time/op    delta
Readuint64-8    5.19µs ± 2%    3.87µs ± 1%  -25.48%  (p=0.000 n=10+10)

name          old alloc/op   new alloc/op   delta
Readuint64-8      840B ± 0%      120B ± 0%  -85.71%  (p=0.000 n=10+10)

name          old allocs/op  new allocs/op  delta
Readuint64-8      5.00 ± 0%      3.00 ± 0%  -40.00%  (p=0.000 n=10+10)

That means we should reap a total improvement of about 4%! Awesome!

I'm going to go ahead and close my PR.

@manugupt1 (Contributor, author):

Thanks for all the help, @brancz, and for finding this issue.

@fuweid (Member) left a comment:

LGTM
