Multipart layer fetch #10177
Conversation
Hi @azr. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed 2531c18 to 0748b0f
@azr Thanks for the PR, this looks promising. I wonder if you were able to get any memory usage data from your tests? A previous effort to use the ECR containerd resolver, which has a similar multipart layer download, showed that it can consume a disproportionate amount of memory, especially when we increase the number of parallel chunks (without providing a significant benefit to latency). The high memory utilization was mainly from the

Also, can you share some information about your test image? Number of layers? Size of individual layers?
/ok-to-test
Hey @swagatbora90, of course! The theory in my mind is that this should use max/worst-case [formula omitted] of memory.

I think memory usage would be better if we were to write directly, in parallel, into a file at different positions, with 'holes', and sort of tell our progress to the checksummer with no-op writers that report where we are, etc. (The download actually was so much faster this way in a test program I wrote, but it was not doing any unpacking, etc.)

I also think it could be nice to have a per-registry parallelism setting, because not all registries are S3-backed, and docker.io seems to throttle things at ~60 MB/s.

Topology of images:

~8GB image, from dive info: [layer breakdown omitted]

~27GB image, from dive info: [layer breakdown omitted]

Here are memory usages, recorded periodically:

~27GB image pull, max_concurrent_downloads: 2, 0 parallelism (before): [graph omitted]

~27GB image pull, max_concurrent_downloads: 2, 110 parallelism, 32 MB chunks: [graph omitted]

GC traces:

8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunk size set to 32: [trace omitted]

8GB image with `GODEBUG=gctrace=1`, parallelism set to 0 (existing code): [trace omitted]

gRPC tracing screenshots from the same run (8GB image with `GODEBUG=gctrace=1`, parallelism set to 110 and chunk size set to 32): [screenshots omitted]
Screenshot from another run for a ~27GB image: after a while, all chunks seem to take about the same amount of time, ~22 s; we've probably reached the write-speed burst limit and are slowly taking more time to do things. [screenshot omitted]
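To illustrate the "write at different positions, with holes" alternative mentioned above, here is a rough, hypothetical Go sketch: each chunk is streamed straight to its own offset in the file, so nothing is buffered in memory, and the digest is verified in one sequential pass afterwards. This is not containerd code; `fetchRange`, `writeAtOffsets`, and `verify` are made-up names, and a real version would need cancellation, resumption, and content-store integration.

```go
// Sketch of parallel chunk writes at fixed offsets plus a sequential digest pass.
package offsetsketch

import (
	"errors"
	"io"
	"os"
	"sync"

	"github.com/opencontainers/go-digest"
)

// fetchRange is assumed to return a reader over bytes [start, end] of the blob,
// e.g. backed by an HTTP Range request.
type fetchRange func(start, end int64) (io.ReadCloser, error)

func writeAtOffsets(f *os.File, size, chunkSize int64, parallelism int, fetch fetchRange) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
		sem      = make(chan struct{}, parallelism) // bounds chunks in flight
	)
	for start := int64(0); start < size; start += chunkSize {
		end := start + chunkSize - 1
		if end >= size {
			end = size - 1
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()

			err := func() error {
				rc, err := fetch(start, end)
				if err != nil {
					return err
				}
				defer rc.Close()
				// Stream directly to the chunk's own offset; ordering no longer matters.
				_, err = io.Copy(io.NewOffsetWriter(f, start), rc)
				return err
			}()
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(start, end)
	}
	wg.Wait()
	return firstErr
}

// verify re-reads the finished file sequentially so the usual digest check still runs.
func verify(f *os.File, expected digest.Digest) error {
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return err
	}
	v := expected.Verifier()
	if _, err := io.Copy(v, f); err != nil {
		return err
	}
	if !v.Verified() {
		return errors.New("digest verification failed")
	}
	return nil
}
```

The trade-off this sketch makes is that the digest pass happens after the download instead of overlapping with it, which is exactly the part the "tell our progress to the checksummer" idea would try to recover.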
Force-pushed c13969f to 8fc47db
@azr Thanks for adding the performance numbers. I ran some tests as well using your patch, and the memory usage looks better than what I saw in the htcat implementation, especially with a high parallelism count. However, I do observe that increasing parallelism does not yield better latency and may lead to higher memory usage (there are a number of other factors to consider here, mainly the type of instance used for testing and the network bandwidth). I limited the test to a single image with a single layer and fixed the chunk size to 20 MB. A lower parallelism count (3 or 4) may be preferable to setting parallelism upwards of 10. I used a c7.12xlarge instance to pull a 3GB single-layer image from an ECR private repo.

Also, the network download time was much faster (see Network Pull time, ~15 sec), while containerd took an additional ~20 sec to complete the pull (before it started unpacking). I calculated the Network Download time by periodically calling
Force-pushed 4e43047 to 4bcf933
Signed-off-by: Adrien Delorme <[email protected]>
Force-pushed 4bcf933 to a196ee6
/test pull-containerd-k8s-e2e-ec2
dmcgowan left a comment:
We should just get this in; there are still maybe a few interface tweaks we can make before the final release that won't affect the functionality.
```go
r.Release(1)
for range parallelism {
	go func() {
		for i := range queue { // first in first out
```
Not sure if there is a race condition in the http2 stdlib around window updating.
I will check this part later; it's not a blocker.
Signed-off-by: Adrien Delorme <[email protected]>
@azr thank you for all the hard work on this, it's an awesome improvement! 🙏
Aw thanks! ❤️ You're welcome!! 🚀
This PR was referenced in the containerd v2.1.0 release notes, which list "Multipart layer fetch ([containerd#10177](containerd#10177))" among the release highlights.







TLDR: this makes pulls of big images ~2x faster (edit: and a bit more in the latest iteration), and closes #9922.
cc: #8160, #4989
Hello containerd people, I have this draft PR I would like to get your eyes on.
It basically makes pulls faster while trying to keep the memory impact small, by fetching consecutive chunks of the layers and immediately pushing them into the pipe (which writes to a file and feeds the digest/checksum verifier); a rough sketch of the general idea is below.
I noticed it made pulls ~2x faster when using the correct settings.
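To make the chunking idea concrete, here is a minimal, hypothetical Go sketch of the general approach described above: HTTP Range requests are fetched by a bounded pool of workers and handed to a single writer strictly in order, so the downstream file/digest pipe still sees one sequential stream. This is not the patch itself; `fetchChunked`, `chunkResult`, and the parameter names are illustrative, and a real implementation would also cancel outstanding requests on error.

```go
// Sketch of fetching a blob in fixed-size chunks via HTTP Range requests while
// feeding a single sequential writer.
package multipartsketch

import (
	"fmt"
	"io"
	"net/http"
)

type chunkResult struct {
	data []byte
	err  error
}

func fetchChunked(url string, size, chunkSize int64, parallelism int, w io.Writer) error {
	nChunks := int((size + chunkSize - 1) / chunkSize)

	queue := make(chan int)                      // chunk indices, dispatched first in, first out
	results := make([]chan chunkResult, nChunks) // one unbuffered channel per chunk
	for i := range results {
		results[i] = make(chan chunkResult)
	}

	// Workers: each fetches one chunk at a time and blocks until the in-order
	// consumer below has taken its result, so buffering stays bounded at
	// roughly parallelism * chunkSize bytes.
	for n := 0; n < parallelism; n++ {
		go func() {
			for i := range queue {
				start := int64(i) * chunkSize
				end := start + chunkSize - 1
				if end >= size {
					end = size - 1
				}
				req, err := http.NewRequest(http.MethodGet, url, nil)
				if err != nil {
					results[i] <- chunkResult{err: err}
					continue
				}
				req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
				resp, err := http.DefaultClient.Do(req)
				if err != nil {
					results[i] <- chunkResult{err: err}
					continue
				}
				if resp.StatusCode != http.StatusPartialContent {
					resp.Body.Close()
					results[i] <- chunkResult{err: fmt.Errorf("unexpected status %s", resp.Status)}
					continue
				}
				data, err := io.ReadAll(resp.Body)
				resp.Body.Close()
				results[i] <- chunkResult{data: data, err: err}
			}
		}()
	}

	// Dispatch indices in order.
	go func() {
		for i := 0; i < nChunks; i++ {
			queue <- i
		}
		close(queue)
	}()

	// Consume strictly in order so the downstream writer sees one sequential
	// stream, exactly as a plain GET would. On error we still drain the
	// remaining chunks (a real implementation would cancel them instead).
	var firstErr error
	for i := 0; i < nChunks; i++ {
		res := <-results[i]
		if res.err != nil && firstErr == nil {
			firstErr = res.err
		}
		if firstErr == nil {
			if _, err := w.Write(res.data); err != nil {
				firstErr = err
			}
		}
	}
	return firstErr
}
```

Because each worker holds at most one chunk and blocks until the in-order consumer takes it, worst-case buffering stays around parallelism × chunk size, which is the behaviour the memory discussion earlier in the thread is probing.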
The settings have a big impact, so I did a bunch of perf tests with different settings. Here are some results on a ~8GB image using an r6id.4xlarge instance, pulling it from S3. Gains are somewhat similar on a ~27GB and a ~100GB image (with a tiny bit of slowdown).
I also tried on NVMe and EBS drives; they are of course slower, but the gains are still the same.
Metrics on an r6id.4xlarge, timing `crictl pull` of an 8.6GB image. The first one with 13 tries is with 0 parallelism (the current code); the rest are tries with different settings.
tmpfs tests: [results omitted]
nvme (885GB) tests: [results omitted]
Observations: I wrote a little Go program to multipart-download big files directly into a file at different positions with different requests, and that was much faster than piping single-threadedly into a file. containerd pipes into a checksummer and then pipes into a file; I think that can, under some conditions, create a sort of thrashing, hence why the parameters matter so much here.
That simple Go program had pretty bad performance with one connection, but I was able to saturate the network with multiple connections, with better or on-par performance compared to aws-crt.
I think that for maximum performance we could try to re-architecture things a bit: concurrently write directly into the temp file, then tell the checksummer our progress so it can verify in parallel, and then carry on as usual.
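As a small illustration of the "no-op writers that tell where we are" idea, a writer can discard bytes and only track an offset, so progress reporting keeps working while the real data is written elsewhere (e.g. with `WriteAt`). This is a hypothetical sketch, not code from the patch; `progressWriter` is a made-up name.

```go
// progressWriter discards data but records how far the sequential stream has
// progressed, so offset/progress tracking can be fed without an extra copy.
package progresssketch

import "sync/atomic"

type progressWriter struct {
	offset atomic.Int64 // bytes "seen" so far
}

// Write discards p but advances the tracked offset.
func (w *progressWriter) Write(p []byte) (int, error) {
	w.offset.Add(int64(len(p)))
	return len(p), nil
}

// Offset reports the current progress, e.g. for status updates.
func (w *progressWriter) Offset() int64 {
	return w.offset.Load()
}
```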