
Speed up pull with larger buffers while writing to layer tempfile #48602

Closed

akx wants to merge 1 commit into moby:master from akx:48601-pull-opt-1

Conversation

@akx (Contributor) commented Oct 8, 2024

What I did

I investigated dockerd with strace and noticed it was doing a lot of rather small (less than 16k) writes to the GetImageBlob temporary files.

How I did it

This PR improves upon that by allocating larger buffers for the copy-to-tempfile operation, so not as much time is spent in tiny little syscalls.

However, due to an apparent quirk in how Go's io.Copy works (see comments within), just using a buffered writer as the write target doesn't actually end up filling the buffered writer before it flushes to disk, so there's a (likely misplaced!) copy of io.copyBuffer, sans its ReadFrom optimization, vendored in.
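
For reference, a minimal sketch of what such a direct copy loop looks like: essentially io.copyBuffer's inner loop with the ReadFrom/WriteTo delegation stripped out. The name matches the utils.CopyBufferDirect helper referenced in review below; the body here is illustrative rather than the exact vendored code.

package utils

import (
    "errors"
    "io"
)

// CopyBufferDirect copies from src to dst through buf, like io.CopyBuffer,
// but never delegates to a ReadFrom/WriteTo fast path. Every write to dst
// is therefore at most one buffer-sized chunk, and any buffering layered
// on dst stays in control of when data actually hits the disk.
func CopyBufferDirect(dst io.Writer, src io.Reader, buf []byte) (written int64, err error) {
    if len(buf) == 0 {
        return 0, errors.New("empty buffer")
    }
    for {
        nr, er := src.Read(buf)
        if nr > 0 {
            nw, ew := dst.Write(buf[0:nr])
            if nw < 0 || nr < nw {
                nw = 0
                if ew == nil {
                    ew = errors.New("invalid write result")
                }
            }
            written += int64(nw)
            if ew != nil {
                err = ew
                break
            }
            if nr != nw {
                err = io.ErrShortWrite
                break
            }
        }
        if er != nil {
            if er != io.EOF {
                err = er
            }
            break
        }
    }
    return written, err
}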

How to verify it

Empirically, in a Lima Linux aarch64 virtual machine on my MacBook, this seems to improve hyperfine 'docker rmi -f quay.io/singularity/singularity:v4.1.0-arm64 && docker system prune -f && docker pull quay.io/singularity/singularity:v4.1.0-arm64' (image chosen because it's not small, and not local – pulling from a localhost:5000 registry, or pulling a more trivial image, does not show as much improvement).

The results below seem to show a roughly 2% improvement. I'd love for others to replicate (or prove wrong) these results!

docker-ce 5:27.3.1-1~ubuntu.24.04~noble (baseline)

Benchmark 1: docker rmi -f quay.io/singularity/singularity:v4.1.0-arm64 && docker system prune -f && docker pull quay.io/singularity/singularity:v4.1.0-arm64
  Time (mean ± σ):     39.294 s ±  2.167 s    [User: 0.050 s, System: 0.047 s]
  Range (min … max):   37.764 s … 45.146 s    10 runs

This branch

Benchmark 1: docker rmi -f quay.io/singularity/singularity:v4.1.0-arm64 && docker system prune -f && docker pull quay.io/singularity/singularity:v4.1.0-arm64
  Time (mean ± σ):     38.288 s ±  0.634 s    [User: 0.041 s, System: 0.038 s]
  Range (min … max):   37.652 s … 39.400 s    10 runs

Description for the changelog

Improve layer download speed by using larger write buffers

A picture of a cute animal (not mandatory but encouraged)

A closeup of Saarinen, a cat. From the archives, vintage 2008.


@akx (Contributor, Author) commented Oct 8, 2024

(I'll take care of the lint error if there's some review activity on this 👍)

@thaJeztah (Member)

Thanks for contributing! I brought this PR (and your other PR) up in the maintainers call for more people to have a look. There's also some discussion about whether there's possibly an issue / missing optimization in the Go stdlib.

(just to let you know this PR is not overlooked, just needs more eyes 🤗)

Comment thread distribution/pull_v2.go
Comment on lines +256 to +259
// `CopyBufferDirect` avoids a mis-optimization in
// `io.Copy`/`io.CopyBuffer` that would use the `ReadFrom` interface.
// For some reason, `ReadFrom` being used will cause short reads and writes.
_, err = utils.CopyBufferDirect(tmpFileBufWriter, io.TeeReader(reader, ld.verifier), buf)
Contributor:

there's a (likely misplaced!) copy of io.copyBuffer, sans its ReadFrom optimization, vendored in.

Surprisingly, the use of the ReadFrom optimization is the root cause of the suboptimal performance! See https://go.dev/issue/16474. And as mentioned in that proposal description, there is another workaround which does not require the use of any modified stdlib functions.

_, err = pools.Copy(struct{io.Writer}{tmpFileBufWriter}, io.TeeReader(struct{io.Reader}{reader}, ld.verifier))
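
(The anonymous struct wrappers work because a value of type struct{ io.Writer } exposes only the Write method: any ReaderFrom/WriterTo implementations on the concrete values are hidden, so io.Copy/pools.Copy cannot take the fast path and falls back to its plain read/write loop, letting the buffered writer fill up as intended.)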

Comment thread distribution/pull_v2.go
Comment on lines +252 to +255
// Allocate rather large buffers to avoid small
// `write()` syscalls to the underlying temporary file.
var buf = make([]byte, 524288)
var tmpFileBufWriter = bufio.NewWriterSize(tmpFile, 524288)
Contributor:

512 KiB is a very large buffer, sixteen times the default 32 KiB buffer size of our pools.Copy. How did you pick this value? I have my doubts that the increased memory pressure is worth the 2% speed boost. How much performance would we be leaving on the table if pools.Copy was used instead, with the default buffer size?

Contributor (Author):

I can try to run some better numbers tomorrow (it's 10 PM, I'm at home and don't have any sort of Go toolchain on this machine, etc.), but my thinking was that a layer is generally, say, 20 megabytes to multiple gigabytes, so a combined megabyte of buffers seems like peanuts. The buffer size could be made tunable (an envvar, maybe?), though.

Without an explicit buffer here, we (empirically) end up doing a lot of pretty small writes: io.Copy and pools.Copy use the size of the temporary buffer only as the maximum size of a single read, and each read is then written straight to the writer; if that writer is a File, each of those writes is a write() syscall.
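
To illustrate the coalescing, a standalone sketch (not the PR code; the 512 KiB sizes are illustrative, and the anonymous-struct wrapper from the other thread is used so io.CopyBuffer cannot take the ReadFrom fast path):

package copyutil

import (
    "bufio"
    "io"
    "os"
)

// writeCoalesced copies src into dst through a 512 KiB bufio.Writer.
// Small reads from src accumulate in the bufio buffer and reach the
// file in large chunks, so far fewer write() syscalls are made.
func writeCoalesced(dst *os.File, src io.Reader) error {
    w := bufio.NewWriterSize(dst, 512*1024)
    buf := make([]byte, 512*1024)
    // Hide w's ReadFrom method; otherwise io.CopyBuffer delegates to it,
    // and with an empty buffer that effectively bypasses the buffering.
    if _, err := io.CopyBuffer(struct{ io.Writer }{w}, src, buf); err != nil {
        return err
    }
    return w.Flush() // flush the final partial buffer
}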

Contributor:

We have pooled bufio, too.

Suggested change
// Allocate rather large buffers to avoid small
// `write()` syscalls to the underlying temporary file.
var buf = make([]byte, 524288)
var tmpFileBufWriter = bufio.NewWriterSize(tmpFile, 524288)
var tmpFileBufWriter = pools.BufioWriter32KPool.NewWriteCloserWrapper(pools.BufioWriter32KPool.Get(tmpFile), nil)
defer tmpFileBufWriter.Close()
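
(pools here is the pkg/pools package in the Moby tree: it recycles 32 KiB bufio.Readers/bufio.Writers through a sync.Pool, and the wrapper's Close flushes the buffered writer and returns it to the pool.)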

Contributor (Author):

I'll have to dig a little deeper, I think...

I wrote a small standalone program that uses the same basic ideas as dockerd here – basically, it copies bytes from a URL opened with http.Get() to a local file, with:

  • various copy buffer sizes
  • various write buffer sizes
  • the io.Reader cast/hack workaround from the other comment turned on or off

Looking at the results, I'm not really sure I'm measuring the right thing, because it looks like none of this matters in a statistically significant way. (Or I might be measuring the right thing, and none of this actually matters, and it's pretty much as optimal as it can be?)
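
For reference, a stripped-down sketch of that kind of harness (illustrative only: the URL comes from the command line, the two sizes would be varied per run, and the wrapper toggle from the other comment is shown in its "on" position):

package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
    "os"
    "time"
)

func main() {
    url := os.Args[1] // e.g. a large blob on a test registry or file server

    resp, err := http.Get(url)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    f, err := os.CreateTemp("", "blob-*")
    if err != nil {
        panic(err)
    }
    defer os.Remove(f.Name())

    const copyBufSize = 512 * 1024  // varied per run in the real experiment
    const writeBufSize = 512 * 1024 // varied per run in the real experiment

    w := bufio.NewWriterSize(f, writeBufSize)
    buf := make([]byte, copyBufSize)

    start := time.Now()
    // Anonymous-struct wrappers disable the ReadFrom/WriteTo fast paths
    // (the cast/hack workaround from the other comment thread).
    n, err := io.CopyBuffer(struct{ io.Writer }{w}, struct{ io.Reader }{resp.Body}, buf)
    if err != nil {
        panic(err)
    }
    if err := w.Flush(); err != nil {
        panic(err)
    }
    fmt.Printf("copied %d bytes in %v\n", n, time.Since(start))
}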

Member:

I think in this case the largest read size, regardless of buffer size, is going to be 16 KB.
Having a buffered writer could improve things somewhat, but perhaps not in a significant way; in this case you are probably only saving the syscall overhead of the write call.

@cpuguy83 (Member) commented Jul 10, 2025

We discussed this today and came to the conclusion that:

  1. Increasing buffer sizes is not something we want to do
  2. Any such changes need to go into containerd, as the place where you have made these changes is going away and should be considered maintenance-mode only.
  3. In general, I think there are other places in the pull pipeline that would have greater impact in terms of optimization without having to blow up memory usage.

@cpuguy83 cpuguy83 closed this Jul 10, 2025
@thaJeztah thaJeztah removed this from the 29.1.0 milestone Jul 11, 2025
@thompson-shaun thompson-shaun moved this from Open to Complete in 🔦 Maintainer spotlight Jul 17, 2025
