@mxpv
Member

@mxpv mxpv commented Sep 27, 2024

This PR exposes BoltDB options via TOML configuration.
In certain ephemeral environments it's possible to squeeze out significant performance gains through more precise BoltDB configuration, specifically by going async.
In our case we were able to reduce pull time under certain (heavy) conditions from 30s to just a few seconds.

Additionally, this PR introduces optional args in MetaStore to allow (external) snapshotters to configure BoltDB options for the same reasons.
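One common way to add optional arguments without breaking existing callers is Go's functional options pattern. The sketch below is only an illustration of that pattern: `storeConfig`, `Option`, `WithNoSync`, and this `NewMetaStore` signature are hypothetical names, not the API introduced by the PR.

```go
package main

import "fmt"

// storeConfig holds the tunables a caller may override.
// Every name in this sketch is hypothetical.
type storeConfig struct {
	noSync bool
}

// Option mutates the configuration before the store is opened.
type Option func(*storeConfig)

// WithNoSync asks the store to skip the fsync on commit.
func WithNoSync(v bool) Option {
	return func(c *storeConfig) { c.noSync = v }
}

// MetaStore is a stand-in for a snapshotter's metadata store.
type MetaStore struct {
	path string
	cfg  storeConfig
}

// NewMetaStore accepts variadic options, so the old one-argument
// call keeps compiling and keeps the safe default.
func NewMetaStore(path string, opts ...Option) *MetaStore {
	cfg := storeConfig{} // default: synchronous commits
	for _, o := range opts {
		o(&cfg)
	}
	return &MetaStore{path: path, cfg: cfg}
}

func main() {
	def := NewMetaStore("metadata.db")
	fast := NewMetaStore("metadata.db", WithNoSync(true))
	fmt.Println(def.cfg.noSync, fast.cfg.noSync)
}
```

An external snapshotter that does not care about the new behavior passes no options and is unaffected.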

bin/containerd config default

  [plugins.'io.containerd.metadata.v1.bolt']
    content_sharing_policy = 'shared'
    no_sync = true
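As a rough sketch of how the two TOML keys above could be carried from the plugin configuration into the database layer: the struct tags match the keys shown, but `dbOptions` and `toDBOptions` are hypothetical stand-ins, and no TOML decoder is invoked here.

```go
package main

import "fmt"

// BoltConfig mirrors the two TOML keys shown above. The struct tags
// are what a TOML decoder would match on; this sketch fills the
// struct by hand instead of decoding a file.
type BoltConfig struct {
	ContentSharingPolicy string `toml:"content_sharing_policy"`
	NoSync               bool   `toml:"no_sync"`
}

// dbOptions is a hypothetical stand-in for the database library's options.
type dbOptions struct {
	NoSync bool
}

// toDBOptions translates user-facing config into database options.
// Keeping the translation in one place means only this function
// changes if the backend's option set changes.
func toDBOptions(cfg BoltConfig) dbOptions {
	return dbOptions{NoSync: cfg.NoSync}
}

func main() {
	cfg := BoltConfig{ContentSharingPolicy: "shared", NoSync: true}
	fmt.Printf("%+v\n", toDBOptions(cfg))
}
```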

Troubleshooting and research credits belong to @xinyangge-db

We found that the checkpoint image pull timeout is due to frequent lock contentions on containerd’s metadata db, and the lock contention occurs because synchronized file writes are performed while holding the metadata db lock.

Containerd uses a metadata db (backed by BoltDB) to maintain container objects such as created/running containers, pulled container images, created snapshots, etc. Updates to the metadata db are wrapped inside an exclusive transaction such that only one update can be performed at a time. For example, snapshotter.Commit updates the metadata db within a transaction during which concurrent container creation and/or image pulls can be blocked briefly.

When committing a transaction, the BoltDB implementation performs a synchronous write to the metadata db, which can incur a delay of several seconds as this is susceptible to the remote disk write latency.

We can reduce the lock contention by converting the above synchronous writes to asynchronous. The implication is that, if a VM loses power or crashes, then the metadata db can be in a corrupted state coming out of a reboot (as the in-memory writes were not flushed to the disk). Fortunately, this is not a concern when VMs are ephemeral.
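The contention described above can be illustrated with a small stand-alone model. This is not BoltDB code: `metaStore`, `update`, and `runDemo` are hypothetical, and a plain temp file stands in for the metadata db. The point is only that each synchronous commit calls `fsync` while still holding the writer lock, whereas the `noSync` variant releases the lock as soon as the write reaches the page cache.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// metaStore is a toy model of a single-writer metadata store.
type metaStore struct {
	mu     sync.Mutex // one writer at a time, like an exclusive transaction
	f      *os.File
	noSync bool
	syncs  int // fsync calls issued while holding the lock
}

// update appends a record and, unless noSync is set, flushes it to disk
// before releasing the lock. On a slow remote disk that flush is the
// part other writers end up waiting for.
func (s *metaStore) update(record string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, err := s.f.WriteString(record + "\n"); err != nil {
		return err
	}
	if s.noSync {
		return nil // data may be lost if the machine crashes now
	}
	s.syncs++
	return s.f.Sync()
}

// runDemo commits n records from concurrent goroutines and reports
// how many fsyncs happened under the lock.
func runDemo(noSync bool, n int) (int, error) {
	f, err := os.CreateTemp("", "metadata-*.db")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	s := &metaStore{f: f, noSync: noSync}
	errs := make(chan error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs <- s.update(fmt.Sprintf("snapshot-%d", i))
		}(i)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return s.syncs, err
		}
	}
	return s.syncs, nil
}

func main() {
	for _, noSync := range []bool{false, true} {
		n, err := runDemo(noSync, 8)
		fmt.Println("noSync:", noSync, "fsyncs under lock:", n, "err:", err)
	}
}
```

The model counts flushes rather than timing them, so it says nothing about the actual latency savings; those depend on the disk.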

@ningmingxiao
Contributor

ningmingxiao commented Sep 29, 2024

Can we put the boltdb file in a custom directory? BoltDB's stability and performance differ across file systems.

@mxpv
Member Author

mxpv commented Sep 29, 2024

> Can we put the boltdb file in a custom directory? BoltDB's stability and performance differ across file systems.

Not at the moment, at least not at per-plugin granularity. You can, however, use these two top-level settings to tell containerd where to store its data:

	// Root is the path to a directory where containerd will store persistent data
	Root string `toml:"root"`
	// State is the path to a directory where containerd will store transient data
	State string `toml:"state"`

@mxpv mxpv requested a review from samuelkarp September 29, 2024 17:42
@samuelkarp
Member

@mxpv and I had a bit of a chat offline in Slack. I have a couple of concerns with the approach in this PR:

  1. We're exposing implementation details of our use of BoltDB through the configuration file, which limits our ability to replace BoltDB in the future (without going through a deprecation cycle)
  2. We need to continuously mirror all of the options into our struct, which is maintenance toil for BoltDB upgrades that add options
  3. BoltDB may deprecate or remove some options in version bumps. This then impacts containerd users who have set those options; they should be informed of the deprecation and given a migration path
  4. It's possible that BoltDB will have a security vulnerability in the future that we will need to address in containerd. We do not align our support horizon with BoltDB's, and have not needed to constrain dependency updates to happen only at minor containerd version bumps in the past. If there is such a vulnerability and it is only fixed in a newer version of BoltDB where the options have changed, this makes it much more difficult to fix in containerd as we either need to deal with the option change or fork BoltDB ourselves to fix the vulnerability
  5. The set of options that BoltDB exposes is reasonably large, but right now we know of only one use-case for modifying them which is increased performance in ephemeral VMs

Thinking about number 5 above, what if we instead expose a slightly higher abstraction like a "data consistency profile" that can be set to "ephemeral", and internally set options to BoltDB like the ones in the PR description? This would retain our flexibility for dependency upgrades, option additions/removals, and migration to a different backend (if we ever decided we wanted to do so). Our challenge is: (a) are "default" and "ephemeral" really the only two profiles, and (b) if there is a suggestion for some new option, do we tweak the existing "ephemeral" profile or decide to add a new one?
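To make the profile idea concrete, here is a minimal sketch of what such an abstraction could look like. `Profile`, `ProfileDefault`, `ProfileEphemeral`, `dbOptions`, and `optionsFor` are all hypothetical names, and the single `NoSync` field stands in for whatever set of backend options a profile would really control.

```go
package main

import "fmt"

// Profile is a hypothetical user-facing "data consistency profile".
type Profile string

const (
	ProfileDefault   Profile = "default"
	ProfileEphemeral Profile = "ephemeral"
)

// dbOptions is a stand-in for backend-specific options. Users never
// see it, so it can change when the backend is upgraded or replaced.
type dbOptions struct {
	NoSync bool
}

// optionsFor maps a profile to backend options and rejects unknown
// values, so a typo in the config fails loudly instead of silently
// falling back to some other behavior.
func optionsFor(p Profile) (dbOptions, error) {
	switch p {
	case "", ProfileDefault:
		return dbOptions{NoSync: false}, nil
	case ProfileEphemeral:
		return dbOptions{NoSync: true}, nil
	default:
		return dbOptions{}, fmt.Errorf("unknown data consistency profile %q", p)
	}
}

func main() {
	for _, p := range []Profile{"default", "ephemeral", "turbo"} {
		opts, err := optionsFor(p)
		fmt.Println(p, opts, err)
	}
}
```

Adding a third profile later would be a new `case`, which is the flexibility a boolean flag lacks; the open question of whether to tweak an existing profile or add a new one remains a policy decision that code cannot settle.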

@mxpv
Member Author

mxpv commented Oct 1, 2024

To sum up, possible ways to go:

1. Mirror current bolt's options in TOML (currently implemented in this PR).

These change rarely, but we can add a warning banner that these are subject to change with newer bolt versions and out of support scope, so users take responsibility for breaking changes.

1.1 A variation of 1 is something like:

[plugins.'io.containerd.metadata.v1.bolt']
extra_boltdb_raw_options = '{uninterpreted json blob here}'

But we have to mirror bolt.Options anyway, because it can't be unmarshalled directly: it contains func fields. We could probably contribute a fix upstream as a longer-term option.

2. Introduce "default" and "ephemeral" profiles to cover the current use cases we're aware of.

This remains a concern:

> Our challenge is: (a) are "default" and "ephemeral" really the only two profiles, and (b) if there is a suggestion for some new option, do we tweak the existing "ephemeral" profile or decide to add a new one?
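On the func-fields point in variation 1.1, the sketch below shows the general problem using Go's standard `encoding/json` as a stand-in serializer (an assumption: the TOML library containerd uses may behave differently in detail). `rawOptions` imitates a library options struct carrying a func field, and `mirroredOptions` is a hypothetical mirror holding only plain data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawOptions imitates a library options struct that carries a func field.
type rawOptions struct {
	NoSync   bool
	OpenFile func(name string) error
}

// mirroredOptions copies only the plain-data fields, which is the
// mirroring cost discussed above.
type mirroredOptions struct {
	NoSync bool `json:"no_sync"`
}

// canMarshal reports whether encoding/json accepts the value.
func canMarshal(v any) bool {
	_, err := json.Marshal(v)
	return err == nil
}

// roundTrip serializes and deserializes the mirror struct.
func roundTrip(in mirroredOptions) (mirroredOptions, error) {
	var out mirroredOptions
	data, err := json.Marshal(in)
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(data, &out)
	return out, err
}

func main() {
	// encoding/json rejects function values, even when the field is nil.
	fmt.Println("raw marshals:", canMarshal(rawOptions{}))
	fmt.Println("mirror marshals:", canMarshal(mirroredOptions{}))
	fmt.Println(roundTrip(mirroredOptions{NoSync: true}))
}
```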

@mxpv
Member Author

mxpv commented Oct 10, 2024

As discussed in the containerd community meeting:
Another option to consider: the async mode represents a fairly isolated use case. It's basically a choice between playing it safe with sync behavior and solid performance, or going for better performance with the risk of losing some data. There isn't really a middle ground. So, the idea is to add a no_sync flag that lets BoltDB go async if needed. If we want more flexibility later, we can always consider adding extra options.

@mxpv
Member Author

mxpv commented Oct 10, 2024

/test pull-containerd-node-e2e

@containerd containerd deleted a comment from k8s-ci-robot Oct 10, 2024
@mxpv mxpv requested a review from dmcgowan October 10, 2024 18:13
@samuelkarp
Member

> Another option to consider: the async mode represents a fairly isolated use case. It's basically a choice between playing it safe with sync behavior and solid performance, or going for better performance with the risk of losing some data. There isn't really a middle ground. So, the idea is to add a no_sync flag that lets BoltDB go async if needed. If we want more flexibility later, we can always consider adding extra options.

A boolean no_sync flag is fine, but I do think a profile (effectively an enum) is a better option here. Both a no_sync flag and an enum capture the limited, specific use-case concern, but a boolean flag paints us into a bit of a corner if we ever want to have some other specific use-case beyond "go as fast as possible with no concern for data safety" and the default "keep the data safe" modes.

@mxpv
Member Author

mxpv commented Oct 18, 2024

/test pull-containerd-node-e2e

@dmcgowan dmcgowan added this to the 2.1 milestone Mar 6, 2025
@mxpv mxpv changed the title Allow BoltDB configuration in TOML Add no_sync option to boost boltDB performance on ephemeral environments Mar 6, 2025
Member

@dmcgowan dmcgowan left a comment

LGTM but maybe rebase to run the tests again

mxpv added 2 commits April 22, 2025 09:27
Signed-off-by: Maksym Pavlenko <[email protected]>
Member

@dims dims left a comment

LGTM

@github-project-automation github-project-automation bot moved this from Needs Reviewers to Review In Progress in Pull Request Review Apr 22, 2025
@mxpv mxpv added this pull request to the merge queue Apr 22, 2025
Merged via the queue into containerd:main with commit d5534c6 Apr 22, 2025
58 checks passed
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in Pull Request Review Apr 22, 2025
@mxpv mxpv deleted the db branch May 7, 2025 17:14
mansikulkarni96 added a commit to mansikulkarni96/containerd that referenced this pull request Dec 4, 2025
containerd 2.1.0

Welcome to the v2.1.0 release of containerd!

The first minor release of containerd 2.x focuses on continued stability alongside
new features and improvements. This is the first time-based release for containerd.
Most of the feature set and core functionality has long been stable and hardened in production
environments, so now we transition to a balance of timely delivery of new functionality
with the same high confidence in stability and performance.

* Add no_sync option to boost boltDB performance on ephemeral environments ([containerd#10745](containerd#10745))
* Add content create event ([containerd#11006](containerd#11006))
* Erofs snapshotter and differ ([containerd#10705](containerd#10705))

* Update CRI to use transfer service for image pull by default ([containerd#8515](containerd#8515))
* Support multiple cni plugin bin dirs ([containerd#11311](containerd#11311))
* Support container restore through CRI/Kubernetes ([containerd#10365](containerd#10365))
* Add OCI/Image Volume Source support ([containerd#10579](containerd#10579))
* Enable Writable cgroups for unprivileged containers ([containerd#11131](containerd#11131))
* Fix recursive RLock() mutex acquisition ([containerd/go-cni#126](containerd/go-cni#126))
* Support CNI STATUS Verb ([containerd/go-cni#123](containerd/go-cni#123))

* Retry last registry host on 50x responses ([containerd#11484](containerd#11484))
* Multipart layer fetch ([containerd#10177](containerd#10177))
* Enable HTTP debug and trace for transfer based puller ([containerd#10762](containerd#10762))
* Add support for unpacking custom media types  ([containerd#11744](containerd#11744))
* Add dial timeout field to hosts toml configuration ([containerd#11106](containerd#11106))

* Expose Pod assigned IPs to NRI plugins ([containerd#10921](containerd#10921))

* Support multiple uid/gid mappings ([containerd#10722](containerd#10722))
* Fix race between serve and immediate shutdown on the server ([containerd/ttrpc#175](containerd/ttrpc#175))

* Update FreeBSD defaults and re-organize platform defaults ([containerd#11017](containerd#11017))

* Postpone cri config deprecations to v2.2 ([containerd#11684](containerd#11684))
* Remove deprecated dynamic library plugins ([containerd#11683](containerd#11683))
* Remove the support for Schema 1 images ([containerd#11681](containerd#11681))

Please try out the release binaries and report any issues at
https://github.com/containerd/containerd/issues.

* Derek McGowan
* Phil Estes
* Akihiro Suda
* Maksym Pavlenko
* Jin Dong
* Wei Fu
* Sebastiaan van Stijn
* Samuel Karp
* Mike Brown
* Adrien Delorme
* Austin Vazquez
* Akhil Mohan
* Kazuyoshi Kato
* Henry Wang
* Gao Xiang
* ningmingxiao
* Krisztian Litkey
* Yang Yang
* Archit Kulkarni
* Chris Henzie
* Iceber Gu
* Alexey Lunev
* Antonio Ojea
* Davanum Srinivas
* Marat Radchenko
* Michael Zappa
* Paweł Gronowski
* Rodrigo Campos
* Alberto Garcia Hierro
* Amit Barve
* Andrey Smirnov
* Divya
* Etienne Champetier
* Kirtana Ashok
* Philip Laine
* QiPing Wan
* fengwei0328
* zounengren
* Adrian Reber
* Alfred Wingate
* Amal Thundiyil
* Athos Ribeiro
* Brian Goff
* Cesar Talledo
* ChengyuZhu6
* Chongyi Zheng
* Craig Ingram
* Danny Canter
* David Son
* Fupan Li
* HirazawaUi
* Jing Xu
* Jonathan A. Sternberg
* Jose Fernandez
* Kaita Nakamura
* Kohei Tokunaga
* Lei Liu
* Marco Visin
* Mike Baynton
* Qiyuan Liang
* Sameer
* Shiming Zhang
* Swagat Bora
* Teresaliu
* Tony Fang
* Tõnis Tiigi
* Vered Rosen
* Vinayak Goyal
* bo.jiang
* chriskery
* luchenhan
* mahmut
* zhaixiaojuan

* **github.com/Microsoft/hcsshim**                                                 v0.12.9 -> v0.13.0-rc.3
* **github.com/cilium/ebpf**                                                       v0.11.0 -> v0.16.0
* **github.com/containerd/cgroups/v3**                                             v3.0.3 -> v3.0.5
* **github.com/containerd/containerd/api**                                         v1.8.0 -> v1.9.0
* **github.com/containerd/continuity**                                             v0.4.4 -> v0.4.5
* **github.com/containerd/go-cni**                                                 v1.1.10 -> v1.1.12
* **github.com/containerd/imgcrypt/v2**                                            v2.0.0-rc.1 -> v2.0.1
* **github.com/containerd/otelttrpc**                                              ea5083fda723 -> v0.1.0
* **github.com/containerd/platforms**                                              v1.0.0-rc.0 -> v1.0.0-rc.1
* **github.com/containerd/ttrpc**                                                  v1.2.6 -> v1.2.7
* **github.com/containerd/typeurl/v2**                                             v2.2.2 -> v2.2.3
* **github.com/containernetworking/cni**                                           v1.2.3 -> v1.3.0
* **github.com/containernetworking/plugins**                                       v1.5.1 -> v1.7.1
* **github.com/containers/ocicrypt**                                               v1.2.0 -> v1.2.1
* **github.com/davecgh/go-spew**                                                   d8f796af33cc -> v1.1.1
* **github.com/fsnotify/fsnotify**                                                 v1.7.0 -> v1.9.0
* **github.com/go-jose/go-jose/v4**                                                v4.0.4 -> v4.0.5
* **github.com/google/go-cmp**                                                     v0.6.0 -> v0.7.0
* **github.com/grpc-ecosystem/grpc-gateway/v2**                                    v2.22.0 -> v2.26.1
* **github.com/klauspost/compress**                                                v1.17.11 -> v1.18.0
* **github.com/mdlayher/socket**                                                   v0.4.1 -> v0.5.1
* **github.com/moby/spdystream**                                                   v0.4.0 -> v0.5.0
* **github.com/moby/sys/user**                                                     v0.3.0 -> v0.4.0
* **github.com/opencontainers/image-spec**                                         v1.1.0 -> v1.1.1
* **github.com/opencontainers/runtime-spec**                                       v1.2.0 -> v1.2.1
* **github.com/opencontainers/selinux**                                            v1.11.1 -> v1.12.0
* **github.com/pelletier/go-toml/v2**                                              v2.2.3 -> v2.2.4
* **github.com/petermattis/goid**                                                  4fcff4a6cae7 **_new_**
* **github.com/pmezard/go-difflib**                                                5d4384ee4fb2 -> v1.0.0
* **github.com/prometheus/client_golang**                                          v1.20.5 -> v1.22.0
* **github.com/prometheus/common**                                                 v0.55.0 -> v0.62.0
* **github.com/sasha-s/go-deadlock**                                               v0.3.5 **_new_**
* **github.com/smallstep/pkcs7**                                                   v0.1.1 **_new_**
* **github.com/stretchr/testify**                                                  v1.9.0 -> v1.10.0
* **github.com/tchap/go-patricia/v2**                                              v2.3.1 -> v2.3.2
* **github.com/urfave/cli/v2**                                                     v2.27.5 -> v2.27.6
* **github.com/vishvananda/netlink**                                               v1.3.0 -> 0e7078ed04c8
* **github.com/vishvananda/netns**                                                 v0.0.4 -> v0.0.5
* **go.etcd.io/bbolt**                                                             v1.3.11 -> v1.4.0
* **go.opentelemetry.io/auto/sdk**                                                 v1.1.0 **_new_**
* **go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc**  v0.56.0 -> v0.60.0
* **go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp**                v0.56.0 -> v0.60.0
* **go.opentelemetry.io/otel**                                                     v1.31.0 -> v1.35.0
* **go.opentelemetry.io/otel/exporters/otlp/otlptrace**                            v1.31.0 -> v1.35.0
* **go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc**              v1.31.0 -> v1.35.0
* **go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp**              v1.31.0 -> v1.35.0
* **go.opentelemetry.io/otel/metric**                                              v1.31.0 -> v1.35.0
* **go.opentelemetry.io/otel/sdk**                                                 v1.31.0 -> v1.35.0
* **go.opentelemetry.io/otel/trace**                                               v1.31.0 -> v1.35.0
* **go.opentelemetry.io/proto/otlp**                                               v1.3.1 -> v1.5.0
* **golang.org/x/crypto**                                                          v0.28.0 -> v0.36.0
* **golang.org/x/exp**                                                             aacd6d4b4611 -> 2d47ceb2692f
* **golang.org/x/mod**                                                             v0.21.0 -> v0.24.0
* **golang.org/x/net**                                                             v0.30.0 -> v0.38.0
* **golang.org/x/oauth2**                                                          v0.22.0 -> v0.27.0
* **golang.org/x/sync**                                                            v0.8.0 -> v0.14.0
* **golang.org/x/sys**                                                             v0.26.0 -> v0.33.0
* **golang.org/x/term**                                                            v0.25.0 -> v0.30.0
* **golang.org/x/text**                                                            v0.19.0 -> v0.23.0
* **golang.org/x/time**                                                            v0.3.0 -> v0.7.0
* **google.golang.org/genproto/googleapis/api**                                    5fefd90f89a9 -> 56aae31c358a
* **google.golang.org/genproto/googleapis/rpc**                                    324edc3d5d38 -> 56aae31c358a
* **google.golang.org/grpc**                                                       v1.67.1 -> v1.72.0
* **google.golang.org/protobuf**                                                   v1.35.1 -> v1.36.6
* **k8s.io/api**                                                                   v0.31.2 -> v0.32.3
* **k8s.io/apimachinery**                                                          v0.31.2 -> v0.32.3
* **k8s.io/apiserver**                                                             v0.31.2 -> v0.32.3
* **k8s.io/client-go**                                                             v0.31.2 -> v0.32.3
* **k8s.io/cri-api**                                                               v0.31.2 -> v0.32.3
* **k8s.io/kubelet**                                                               v0.31.2 -> v0.32.3
* **k8s.io/utils**                                                                 18e509b52bc8 -> 3ea5e8cea738
* **sigs.k8s.io/json**                                                             bc3834ca7abd -> 9aa6b5e7a4b3
* **sigs.k8s.io/structured-merge-diff/v4**                                         v4.4.1 -> v4.4.2
* **tags.cncf.io/container-device-interface**                                      v0.8.0 -> v1.0.1
* **tags.cncf.io/container-device-interface/specs-go**                             v0.8.0 -> v1.0.0

Previous release can be found at [v2.0.0](https://github.com/containerd/containerd/releases/tag/v2.0.0)
* `containerd-<VERSION>-<OS>-<ARCH>.tar.gz`:         ✅Recommended. Dynamically linked with glibc 2.35 (Ubuntu 22.04).
* `containerd-static-<VERSION>-<OS>-<ARCH>.tar.gz`:  Statically linked. Expected to be used on Linux distributions that do not use glibc >= 2.35. Not position-independent.

In addition to containerd, typically you will have to install [runc](https://github.com/opencontainers/runc/releases)
and [CNI plugins](https://github.com/containernetworking/plugins/releases) from their official sites too.

See also the [Getting Started](https://github.com/containerd/containerd/blob/main/docs/getting-started.md) documentation.