Skip to content

Apply peformance tuning to new sandboxes also#42897

Closed
evol262 wants to merge 133 commits intomoby:masterfrom
evol262:feature/ingress-namespace-performance
Closed

Apply peformance tuning to new sandboxes also#42897
evol262 wants to merge 133 commits intomoby:masterfrom
evol262:feature/ingress-namespace-performance

Conversation

@evol262
Copy link

@evol262 evol262 commented Sep 29, 2021

relates to #35082, moby/libnetwork#2491

Previously, values for expire_quiescent_template, conn_reuse_mode,
and expire_nodest_conn were set only system-wide. Also apply them
for new lb_* and ingress_sbox sandboxes, so they are appropriately
propagated

Signed-off-by: Ryan Barry [email protected]

- What I did

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

Previously, values for expire_quiescent_template, conn_reuse_mode,
and expire_nodest_conn were set only system-wide. Also apply them
for new lb_* and ingress_sbox sandboxes, so they are appropriately
propagated

Signed-off-by: Ryan Barry <[email protected]>
Comment on lines +1177 to 1185
err := sb.osSbox.InvokeFunc(func() {
sb.osSbox.ApplyOSTweaks(sb.oslTypes)
})

if err != nil {
logrus.Errorf("Failed to apply performance tuning sysctls the sandbox: %v", err)
}
// Keep this just so performance is not changed
sb.osSbox.ApplyOSTweaks(sb.oslTypes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused why sb.osSbox.ApplyOSTweaks(sb.oslTypes) is called twice (using different mechanisms); could explain?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more or less due to longstanding discussions around inheritance of namespaced vs non-namespaced sysctls: https://lore.kernel.org/lkml/20160221071102.9686.63148.stgit@buzz/

Values set in outside of the namespace do not always propagate, but setting it both in the ns (through InvokeFunc->nsInvoke->netns.Set) provides a "safety valve" against the ongoing changes in kernel namespacing.

kernel.ApplyOSTweaks(loadBalancerConfig)
}
case SandboxTypeIngress:
kernel.ApplyOSTweaks(ingressConfig)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like both Ingress and Loadbalancer are applying the same tweaks. The loadBalancerConfig was already there, but looking at that variable, I see it's only used in this function, but defined as a package level variable. I'm not a huge fan of package level variables (if not needed), especially if they're defined way too far away from where they're used.

To prevent the risk that they diverge (assuming they should always get the same tweaks - at least currently), perhaps either inline the settings we apply. The CheckFn: nil also seems a bit redundant (I think that'd be the default), so I assume we can remove those as well.
Finally, it looks like the description for ApplyOSTweaks no longer matches what it does (it should now mention the ingress network as well?);

// ApplyOSTweaks applies linux configs on sandboxes and ingress networks.
func (n *networkNamespace) ApplyOSTweaks(types []SandboxType) {
	for _, t := range types {
		switch t {
		case SandboxTypeLoadBalancer, SandboxTypeIngress:
			kernel.ApplyOSTweaks(map[string]*kernel.OSValue{
				// disables any special handling on port reuse of existing IPVS connection table entries
				// more info: https://github.com/torvalds/linux/blob/master/Documentation/networking/ipvs-sysctl.txt#L25:1
				"net.ipv4.vs.conn_reuse_mode": {Value: "0"},
				// expires connection from the IPVS connection table when the backend is not available
				// more info: https://github.com/torvalds/linux/blob/master/Documentation/networking/ipvs-sysctl.txt#L126:1
				"net.ipv4.vs.expire_nodest_conn": {Value: "1"},
				// expires persistent connections to destination servers with weights set to 0
				// more info: https://github.com/torvalds/linux/blob/master/Documentation/networking/ipvs-sysctl.txt#L144:1
				"net.ipv4.vs.expire_quiescent_template": {Value: "1"},
			})
		}
	}
}

Or if there's advantages to defining the variable once (not sure if there is), it'd be better to move the variable closer to this function (but I'm slightly in favour of the approach above instead);

var tweaks = map[string]*kernel.OSValue{
	// disables any special handling on port reuse of existing IPVS connection table entries
	// more info: https://github.com/torvalds/linux/blob/master/Documentation/networking/ipvs-sysctl.txt#L25:1
	"net.ipv4.vs.conn_reuse_mode": {Value: "0"},
	// expires connection from the IPVS connection table when the backend is not available
	// more info: https://github.com/torvalds/linux/blob/master/Documentation/networking/ipvs-sysctl.txt#L126:1
	"net.ipv4.vs.expire_nodest_conn": {Value: "1"},
	// expires persistent connections to destination servers with weights set to 0
	// more info: https://github.com/torvalds/linux/blob/master/Documentation/networking/ipvs-sysctl.txt#L144:1
	"net.ipv4.vs.expire_quiescent_template": {Value: "1"},
}

// ApplyOSTweaks applies linux configs on sandboxes and ingress networks.
func (n *networkNamespace) ApplyOSTweaks(types []SandboxType) {
	for _, t := range types {
		switch t {
		case SandboxTypeLoadBalancer, SandboxTypeIngress:
			kernel.ApplyOSTweaks(tweaks)
		}
	}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if these are settings that can be controlled through a com.docker.network option on docker network create (https://docs.docker.com/engine/reference/commandline/network_create/#bridge-driver-options)?

Double checking if this is something that a user could have customised when creating a custom ingress network (and which we would then override with fixed defaults)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can't be set through com.docker.network, unfortunately, but I'm in total agreement about package-level variables which only get moved once. I'll inline them.

@thaJeztah
Copy link
Member

ping @evol262 - could you have a look at my review comments?

@evol262
Copy link
Author

evol262 commented Nov 11, 2021

Whew, github didn't send me any notifications for this. Most of this look like a new neovim setup behaving badly. I'll get a cleaned up version posted shortly

@evol262 evol262 force-pushed the feature/ingress-namespace-performance branch from 2697771 to a9349b2 Compare November 11, 2021 03:49
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evol262 (sorry, thought I commented already, but seems I forgot); looks like this doesn't compile currently; can you check?

@thaJeztah thaJeztah added this to the 21.xx milestone Nov 25, 2021
@evol262 evol262 force-pushed the feature/ingress-namespace-performance branch 2 times, most recently from 3d06dd8 to 943aac0 Compare January 12, 2022 03:50
Ryan Barry and others added 15 commits January 11, 2022 23:01
This includes additional fixes for CVE-2021-39293.

go1.17.1 (released 2021-09-09) includes a security fix to the archive/zip package,
as well as bug fixes to the compiler, linker, the go command, and to the crypto/rand,
embed, go/types, html/template, and net/http packages. See the Go 1.17.1 milestone
on the issue tracker for details:

https://github.com/golang/go/issues?q=milestone%3AGo1.17.1+label%3ACherryPickApproved

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Looks like we don't need sprintf for how it's used. Replacing sprintf makes it
more performant (~2.4x as fast), and less memory, allocations:

    BenchmarkGetRandomName-8      	 8203230	       142.4 ns/op	      37 B/op	       2 allocs/op
    BenchmarkGetRandomNameOld-8   	 3499509	       342.9 ns/op	      85 B/op	       5 allocs/op

Signed-off-by: Sebastiaan van Stijn <[email protected]>
zstd is a compression algorithm that has a very fast decoder, while
providing also good compression ratios.  The fast decoder makes it
suitable for container images, as decompressing the tarballs is a very
expensive operation.

opencontainers/image-spec#788 added support
for zstd to the OCI image specs.

Signed-off-by: Giuseppe Scrivano <[email protected]>
Because FreeBSD uses 64-bit device nodes (see
https://reviews.freebsd.org/rS318736), Linux implementation of
`system.Mknod` & `system.Mkdev` is not sufficient.

This change adds freebsd-specific implementations for `Mknod` and
Mkdev`.

Signed-off-by: Artem Khramov <[email protected]>
This adds support for 2 runtimes on Windows, one that uses the built-in
HCSv1 integration and another which uses containerd with the runhcs
shim.

Signed-off-by: Brian Goff <[email protected]>
Commit dae652e added support for non-privileged
containers to use ICMP_PROTO (used for `ping`). This option cannot be set for
containers that have user-namespaces enabled.

However, the detection looks to be incorrect; HostConfig.UsernsMode was added
in 6993e89 / ee21838,
and the property only has meaning if the daemon is running with user namespaces
enabled. In other situations, the property has no meaning.
As a result of the above, the sysctl would only be set for containers running
with UsernsMode=host on a daemon running with user-namespaces enabled.

This patch adds a check if the daemon has user-namespaces enabled (RemappedRoot
having a non-empty value), or if the daemon is running inside a user namespace
(e.g. rootless mode) to fix the detection.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Docker 17.07 and up allow the CLI to be configured to set default proxy
env-vars to be used (both as build-arg and as env for docker run), see
docker/cli#93, so setting these here should be redundant. If someone
needs these env-vars set, they should be configured in the cli's
`~/.docker/config.json` instead.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
This moves installers that are only used during CI into the Dockerfile. Some
installers are still used in the release-pipeline, so keeping thos for now.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
- using v2.1.0 for the "v1" registry (last release with only v1)
- using v2.3.0 as "current" version (was v2.3.0-rc.0)

Signed-off-by: Sebastiaan van Stijn <[email protected]>
This error was no longer in use after the v1 push code was removed
in 53dad9f.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
This is a horrible thing to do, but CRIU installed here is only used as
part of our CI / integration tests. We should of course remove this
hack ASAP once the opensuse packagers have set up a new key, but at
least this allows us to unblock CI, which is currently completely
broken:

    ADD --chmod=0644 https://download.opensuse.org/repositories/devel:/tools:/criu/Debian_10/Release.key /etc/apt/trusted.gpg.d/criu.gpg.asc
    RUN --mount=type=cache,sharing=locked,id=moby-criu-aptlib,target=/var/lib/apt \
        --mount=type=cache,sharing=locked,id=moby-criu-aptcache,target=/var/cache/apt \
             echo 'deb https://download.opensuse.org/repositories/devel:/tools:/criu/Debian_10/ /' > /etc/apt/sources.list.d/criu.list \
             && apt-get update \
             && apt-get install -y --no-install-recommends criu \
             && install -D /usr/sbin/criu /build/criu

    Hit:1 http://cdn-fastly.deb.debian.org/debian bullseye InRelease
    Hit:2 http://cdn-fastly.deb.debian.org/debian-security bullseye-security InRelease
    Hit:3 http://cdn-fastly.deb.debian.org/debian bullseye-updates InRelease
    Get:4 https://download.opensuse.org/repositories/devel:/tools:/criu/Debian_10  InRelease [1540 B]
    Err:4 https://download.opensuse.org/repositories/devel:/tools:/criu/Debian_10  InRelease
      The following signatures were invalid: EXPKEYSIG 30A8343A498D5A23 devel:tools OBS Project <devel:[email protected]>
    Reading package lists...
    W: GPG error: https://download.opensuse.org/repositories/devel:/tools:/criu/Debian_10  InRelease: The following signatures were invalid: EXPKEYSIG 30A8343A498D5A23 devel:tools OBS Project <devel:[email protected]>
    E: The repository 'https://download.opensuse.org/repositories/devel:/tools:/criu/Debian_10  InRelease' is not signed.

And, checking the signing key (with `apt-key list`):

    /etc/apt/trusted.gpg.d/criu.gpg.asc
    -----------------------------------
    pub   rsa2048 2015-05-03 [SC] [expired: 2021-10-13]
          428E 4E34 8405 CE79 00DB  99C2 30A8 343A 498D 5A23
    uid           [ expired] devel:tools OBS Project <devel:[email protected]>

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Do not use 0701 perms.
0701 dir perms allows anyone to traverse the docker dir.
It happens to allow any user to execute, as an example, suid binaries
from image rootfs dirs because it allows traversal AND critically
container users need to be able to do execute things.

0701 on lower directories also happens to allow any user to modify
     things in, for instance, the overlay upper dir which neccessarily
     has 0755 permissions.

This changes to use 0710 which allows users in the group to traverse.
In userns mode the UID owner is (real) root and the GID is the remapped
root's GID.

This prevents anyone but the remapped root to traverse our directories
(which is required for userns with runc).

Signed-off-by: Brian Goff <[email protected]>
(cherry picked from commit ef7237442147441a7cadcda0600be1186d81ac73)
Signed-off-by: Brian Goff <[email protected]>
(cherry picked from commit 93ac040)
Signed-off-by: Sebastiaan van Stijn <[email protected]>
conorevans and others added 23 commits January 11, 2022 23:01
 * When async is enabled, this option defines the interval (ms) at which the connection
to the fluentd-address is re-established. This option is useful if the address
may resolve to one or more IP addresses, e.g. a Consul service address.

While the change in moby#42979 resolves the issue where a Docker container can be stuck
if the fluentd-address is unavailable, this functionality adds an additional benefit
in that a new and healthy fluentd-address can be resolved, allowing logs to flow once again.

This adds a `fluentd-async-reconnect-interval` log-opt for the fluentd logging driver.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Conor Evans <[email protected]>

Co-authored-by: Sebastiaan van Stijn <[email protected]>
Co-authored-by: Conor Evans <[email protected]>
Trying to reduce the use of libcontainer/devices, as it's considered
to be an "internal" package by runc.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
This system call is only available on the 32- and 64-bit PowerPC, it is
used by modern programming language implementations (such as gcc-go) to
implement coroutine features through userspace context switches.

Other container environment, such as Systemd nspawn already whitelist
this system call in their seccomp profile [1] [2]. As such, it would be
nice to also whitelist it in moby.

This issue was encountered on Alpine Linux GitLab CI system, which uses
moby, when attempting to execute gcc-go compiled software on ppc64le.

[1]: systemd/systemd#9487
[2]: systemd/systemd#9485

Signed-off-by: Sören Tempel <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Use unique names to prevent tests from interfering, using a shorter
name, as there's a maximum length for these.

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Anca Iordache <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Based on randomLocalStore() in libnetwork/ipam/allocator_test.go

Signed-off-by: Sebastiaan van Stijn <[email protected]>
This takes the changes from 1a933e1 and
834272f, and applies them to older API
versions in the docs directory (which are used for the actual documentation).

Signed-off-by: Sebastiaan van Stijn <[email protected]>
followLogs() is getting really long (170+ lines) and complex.
The function has multiple inner functions that mutate its variables.

To refactor the function, this change introduces follow{} struct.
The inner functions are now defined as ordinal methods, which are
accessible from tests.

Signed-off-by: Kazuyoshi Kato <[email protected]>
@evol262 evol262 force-pushed the feature/ingress-namespace-performance branch from 943aac0 to a07867d Compare January 12, 2022 04:03
@evol262 evol262 closed this Jan 12, 2022
@evol262 evol262 deleted the feature/ingress-namespace-performance branch January 12, 2022 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.