Add no_sync option to boost boltDB performance on ephemeral environments #10745
mxpv merged 5 commits into containerd:main
Can we place the boltdb file in a custom directory? BoltDB's stability and performance differ from one file system to another.
Not at the moment, at least not at per-plugin granularity. Though you can use these to tell containerd where to store its data:

    // Root is the path to a directory where containerd will store persistent data
    Root string `toml:"root"`
    // State is the path to a directory where containerd will store transient data
    State string `toml:"state"`
@mxpv and I had a bit of a chat offline in Slack. I think I have a couple of concerns with the approach in this PR:
Thinking about number 5 above, what if we instead expose a slightly higher abstraction like a "data consistency profile" that can be set to "ephemeral", and internally set options to BoltDB like the ones in the PR description? This would retain our flexibility for dependency upgrades, option additions/removals, and migration to a different backend (if we ever decided we wanted to do so). Our challenge is: (a) are "default" and "ephemeral" really the only two profiles, and (b) if there is a suggestion for some new option, do we tweak the existing "ephemeral" profile or decide to add a new one?
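The "data consistency profile" idea above can be sketched roughly as follows. This is only an illustration under assumptions: the `Profile` type, the `dbOptions` struct, and the `optionsFor` helper are hypothetical names, and which bolt options an "ephemeral" profile would actually set is a guess, not something decided in this thread.

```go
package main

import "fmt"

// Profile is a hypothetical enum-like type for the proposed setting.
type Profile string

const (
	ProfileDefault   Profile = "default"
	ProfileEphemeral Profile = "ephemeral"
)

// dbOptions is a stand-in for the backend options a profile expands to.
// The field names echo bbolt's NoSync/NoGrowSync, but this struct is
// an assumption made for illustration, not containerd code.
type dbOptions struct {
	NoSync     bool
	NoGrowSync bool
}

// optionsFor translates a profile into concrete options, so the TOML
// surface stays stable even if the underlying options change.
func optionsFor(p Profile) (dbOptions, error) {
	switch p {
	case "", ProfileDefault:
		// Keep the data safe: sync on every commit.
		return dbOptions{}, nil
	case ProfileEphemeral:
		// Assumed mapping: trade durability for speed.
		return dbOptions{NoSync: true, NoGrowSync: true}, nil
	default:
		return dbOptions{}, fmt.Errorf("unknown data consistency profile %q", p)
	}
}

func main() {
	for _, p := range []Profile{ProfileDefault, ProfileEphemeral, "fast"} {
		opts, err := optionsFor(p)
		fmt.Printf("%q -> %+v, err=%v\n", p, opts, err)
	}
}
```

Adding a third profile later would mean one more constant and one more `case`, which is the flexibility the comment is after.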
To sum up, possible ways to go:

1. Mirror bolt's current options in TOML (currently implemented in this PR). These change rarely, but we can add a warning banner that they are subject to change with newer bolt versions and are out of support scope, so users take responsibility for breaking changes.

   1.1. A variation of 1. is something like:

        [plugins.'io.containerd.metadata.v1.bolt']
        extra_boltdb_raw_options = '{uninterpreted json blob here}'

   But we have to mirror

2. Introduce "default" and "ephemeral" profiles to cover the current use cases we're aware of. This remains a concern:
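For variation 1.1 above, a minimal sketch of how a raw JSON blob might be decoded is shown here. Everything in it is an assumption for illustration: the `rawBoltOptions` struct, its JSON key names, and the `parseRawOptions` helper are hypothetical and not part of containerd or bbolt.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// rawBoltOptions is a hypothetical subset of options a raw blob might carry.
// The JSON key names are assumptions, not an agreed format.
type rawBoltOptions struct {
	NoSync     bool `json:"no_sync"`
	NoGrowSync bool `json:"no_grow_sync"`
}

// parseRawOptions decodes the blob and rejects unknown keys, so a typo or
// an option dropped by a newer bolt version fails loudly instead of silently.
func parseRawOptions(blob string) (rawBoltOptions, error) {
	var opts rawBoltOptions
	dec := json.NewDecoder(strings.NewReader(blob))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&opts); err != nil {
		return rawBoltOptions{}, fmt.Errorf("invalid bolt options: %w", err)
	}
	return opts, nil
}

func main() {
	opts, err := parseRawOptions(`{"no_sync": true}`)
	fmt.Println(opts, err)

	// A misspelled key is reported as an error.
	_, err = parseRawOptions(`{"no_synk": true}`)
	fmt.Println(err)
}
```

Even with a raw blob, the decoder still needs a struct that mirrors the known options, which is the trade-off this variation runs into.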
As discussed in the containerd community meeting:
/test pull-containerd-node-e2e

A boolean no_sync flag is fine, but I do think a profile (effectively an enum) is a better option here. Both a no_sync flag and an enum capture the limited, specific use-case concern, but a boolean flag paints us into a bit of a corner if we ever want to have some other specific use-case beyond "go as fast as possible with no concern for data safety" and the default "keep the data safe" modes.

/test pull-containerd-node-e2e
dmcgowan left a comment:
LGTM but maybe rebase to run the tests again
Signed-off-by: Maksym Pavlenko <[email protected]>
Referenced in the containerd v2.1.0 release notes: Add no_sync option to boost boltDB performance on ephemeral environments ([containerd#10745](containerd#10745))
The metadata plugin already exposes a no_sync option (PR containerd#10745) that disables F_FULLFSYNC on its bolt DB. This extends the same pattern to the two remaining bolt databases: the mount manager (mounts.db) and the erofs snapshotter (metadata.db).

On macOS, F_FULLFSYNC costs ~8ms per bolt write transaction. Each container create triggers ~14 write transactions across these DBs, capping throughput at ~15 ops/sec. Disabling sync drops per-txn latency from ~8ms to ~0.02ms, a ~300x improvement.

Benchmark (macOS, Apple M-series, bbolt v1.4.3):

    package main

    import (
    	"fmt"
    	"os"
    	"path/filepath"
    	"time"

    	bolt "go.etcd.io/bbolt"
    )

    func bench(label string, noSync bool) {
    	dir, _ := os.MkdirTemp("", "bolt-*")
    	defer os.RemoveAll(dir)
    	db, _ := bolt.Open(filepath.Join(dir, "t.db"), 0600, &bolt.Options{NoSync: noSync, NoGrowSync: noSync})
    	defer db.Close()
    	db.Update(func(tx *bolt.Tx) error { _, e := tx.CreateBucket([]byte("b")); return e })
    	const N = 200
    	start := time.Now()
    	for i := range N {
    		db.Update(func(tx *bolt.Tx) error {
    			return tx.Bucket([]byte("b")).Put([]byte(fmt.Sprintf("k%d", i)), []byte("v"))
    		})
    	}
    	d := time.Since(start)
    	fmt.Printf("%-10s %d writes in %v (%.0f ops/s, %.2f ms/op)\n",
    		label, N, d.Round(time.Millisecond), float64(N)/d.Seconds(), float64(d.Milliseconds())/float64(N))
    }

    func main() { bench("sync:", false); bench("no_sync:", true) }

Results:

    sync:      200 writes in 1.625s (123 ops/sec, 8.12 ms/op)
    no_sync:   200 writes in 5ms (41970 ops/sec, 0.02 ms/op)

Configuration:

    [plugins.'io.containerd.mount-manager.v1.bolt']
      no_sync = true

    [plugins.'io.containerd.snapshotter.v1.erofs']
      no_sync = true

Signed-off-by: David Scott <[email protected]>
This PR exposes BoltDB options via TOML configuration.
In certain ephemeral environments it's possible to squeeze out significant performance gains through more precise boltDB configuration by going async.
In our case we were able to reduce pull time under certain (heavy) conditions from 30s to just a few seconds.
Additionally, this PR introduces optional args in `MetaStore` to allow (external) snapshotters to configure boltDB options for the same reasons.

Troubleshooting and research credits belong to @xinyangge-db
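The "optional args" mentioned above are commonly expressed in Go as variadic functional options. The sketch below shows that pattern only. All names in it (`metaStore`, `metaStoreOpt`, `withNoSync`, `newMetaStore`) are hypothetical stand-ins, not the signatures added by this PR.

```go
package main

import "fmt"

// boltOptions is a stand-in for the database options a store would pass on.
type boltOptions struct {
	NoSync bool
}

// metaStore is a hypothetical store that remembers its path and options.
type metaStore struct {
	dbPath string
	opts   boltOptions
}

// metaStoreOpt mutates a store during construction.
type metaStoreOpt func(*metaStore)

// withNoSync is a hypothetical option that toggles syncing on commit.
func withNoSync(noSync bool) metaStoreOpt {
	return func(ms *metaStore) { ms.opts.NoSync = noSync }
}

// newMetaStore applies any options in order; callers that pass none
// get the default (syncing) behaviour, so existing call sites keep working.
func newMetaStore(dbPath string, opts ...metaStoreOpt) *metaStore {
	ms := &metaStore{dbPath: dbPath}
	for _, apply := range opts {
		apply(ms)
	}
	return ms
}

func main() {
	safe := newMetaStore("/tmp/metadata.db")
	fast := newMetaStore("/tmp/metadata.db", withNoSync(true))
	fmt.Println(safe.opts.NoSync, fast.opts.NoSync)
}
```

Because the options are variadic, an external snapshotter could opt in without changing callers that rely on the defaults.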