Kubernetes Tasks
This section of the Kubernetes documentation contains pages that show how to do individual
tasks. A task page shows how to do a single thing, typically by giving a short sequence of steps.
If you would like to write a task page, see Creating a Documentation Pull Request.
Install Tools
Administer a Cluster
Declarative and imperative paradigms for interacting with the Kubernetes API.
Managing Secrets
Specify configuration and other data for the Pods that run your workload.
Run Applications
Run Jobs
Configure load balancing, port forwarding, or set up firewall or DNS configurations to access
applications in a cluster.
Extend Kubernetes
Understand advanced ways to adapt your Kubernetes cluster to the needs of your work
environment.
TLS
Understand how to protect traffic within your cluster using Transport Layer Security (TLS).
Perform common tasks for managing a DaemonSet, such as performing a rolling update.
Networking
Manage HugePages
Schedule GPUs
Install Tools
Set up Kubernetes tools on your computer.
kubectl
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes
clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and
view logs. For more information including a complete list of kubectl operations, see the kubectl
reference documentation.
kubectl is installable on a variety of Linux platforms, macOS and Windows. Find your preferred
operating system below.
kind
kind lets you run Kubernetes on your local computer. The kind Quick Start page shows you
what you need to do to get up and running with kind.
minikube
Like kind, minikube is a tool that lets you run Kubernetes locally. minikube runs an all-in-one
or a multi-node local Kubernetes cluster on your personal computer (including Windows,
macOS and Linux PCs) so that you can try out Kubernetes, or for daily development work.
You can follow the official Get Started! guide if your focus is on getting the tool installed.
Once you have minikube working, you can use it to run a sample application.
kubeadm
You can use the kubeadm tool to create and manage Kubernetes clusters. It performs the actions
necessary to get a minimum viable, secure cluster up and running in a user friendly way.
Installing kubeadm shows you how to install kubeadm. Once installed, you can use it to create a
cluster.
Install and Set Up kubectl on Linux
Install kubectl binary with curl on Linux
1. Download the latest release for your architecture (x86-64 or ARM64).
Note: To download a specific version, replace the latest-release lookup in the download URL
with that version number.
2. Validate the binary (optional): download the matching checksum file for your architecture
and check the binary against it.
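A sketch of these two steps, based on the upstream guide (shown for x86-64; substitute arm64
for amd64 in the URLs on ARM64 systems):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
If valid, the output is: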
kubectl: OK
If the check fails, sha256sum exits with nonzero status and prints output similar to:
kubectl: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
3. Install kubectl
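# install command as published in the upstream guide
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl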
Note:
If you do not have root access on the target system, you can still install kubectl to the
~/.local/bin directory:
chmod +x kubectl
mkdir -p ~/.local/bin
mv ./kubectl ~/.local/bin/kubectl
# and then append (or prepend) ~/.local/bin to $PATH
• Debian-based distributions
• Red Hat-based distributions
• SUSE-based distributions
1. Update the apt package index and install packages needed to use the Kubernetes apt
repository:
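sudo apt-get update
# apt-transport-https may be a dummy package on newer releases; if so, you can skip it
sudo apt-get install -y apt-transport-https ca-certificates curl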
2. Download the public signing key for the Kubernetes package repositories. The same
signing key is used for all repositories so you can disregard the version in the URL:
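# if the folder /etc/apt/keyrings does not exist, see the note below this list
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg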
3. Add the appropriate Kubernetes apt repository. If you want to use a Kubernetes version
other than v1.28, replace v1.28 with the desired minor version in the command below:
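echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list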
Note: In releases older than Debian 12 and Ubuntu 22.04, /etc/apt/keyrings does not exist by
default, and can be created using sudo mkdir -m 755 /etc/apt/keyrings
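The upstream guide then finishes the installation by updating the package index and installing
kubectl:
sudo apt-get update
sudo apt-get install -y kubectl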
1. Add the Kubernetes yum repository. If you want to use a Kubernetes version other than
v1.28, replace v1.28 with the desired minor version in the command below.
Note: To upgrade kubectl to another minor release, you'll need to bump the version in
/etc/yum.repos.d/kubernetes.repo before running yum update. This procedure is described in
more detail in Changing The Kubernetes Package Repository.
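A sketch, using the pkgs.k8s.io repository definition from the upstream guide:
# this overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
EOF
Then install kubectl:
sudo yum install -y kubectl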
1. Add the Kubernetes zypper repository. If you want to use a Kubernetes version other than
v1.28, replace v1.28 with the desired minor version in the command below.
Note: To upgrade kubectl to another minor release, you'll need to bump the version in
/etc/zypp/repos.d/kubernetes.repo before running zypper update. This procedure is described in
more detail in Changing The Kubernetes Package Repository.
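A sketch, following the same pattern as the yum repository above:
# this overwrites any existing configuration in /etc/zypp/repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/zypp/repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
EOF
Then install kubectl:
sudo zypper install -y kubectl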
• Snap
• Homebrew
If you are on Ubuntu or another Linux distribution that supports the snap package manager,
kubectl is available as a snap application.
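snap install kubectl --classic
kubectl version --client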
If you are on Linux and using Homebrew package manager, kubectl is available for installation.
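brew install kubectl
kubectl version --client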
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able
to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or
port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will
need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, to check
whether it is configured properly, use:
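kubectl cluster-info dump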
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save
you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
• Bash
• Fish
• Zsh
Introduction
The kubectl completion script for Bash can be generated with the command kubectl completion
bash. Sourcing the completion script in your shell enables kubectl autocompletion.
However, the completion script depends on bash-completion, which means that you have to
install this software first (you can test if you have bash-completion already installed by running
type _init_completion).
Install bash-completion
bash-completion is provided by many package managers (see here). You can install it with
apt-get install bash-completion or yum install bash-completion, etc.
To find out, reload your shell and run type _init_completion. If the command succeeds, you're
already set, otherwise add the following to your ~/.bashrc file:
source /usr/share/bash-completion/bash_completion
Reload your shell and verify that bash-completion is correctly installed by typing type
_init_completion.
Bash
You now need to ensure that the kubectl completion script gets sourced in all your shell
sessions. There are two ways in which you can do this:
• User
• System
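A sketch of both approaches:
# User: source the completion script in your ~/.bashrc
echo 'source <(kubectl completion bash)' >> ~/.bashrc
# System: add the completion script to the system-wide bash-completion directory
kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
sudo chmod a+r /etc/bash_completion.d/kubectl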
If you have an alias for kubectl, you can extend shell completion to work with that alias:
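echo 'alias k=kubectl' >> ~/.bashrc
echo 'complete -o default -F __start_kubectl k' >> ~/.bashrc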
source ~/.bashrc
The kubectl completion script for Fish can be generated with the command kubectl completion
fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
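kubectl completion fish | source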
The kubectl completion script for Zsh can be generated with the command kubectl completion
zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
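source <(kubectl completion zsh)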
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
If you get an error like 2: command not found: compdef, then add the following to the
beginning of your ~/.zshrc file:
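autoload -Uz compinit
compinit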
Install kubectl convert plugin
A plugin for the Kubernetes command-line tool kubectl, which allows you to convert manifests
between different API versions. This can be particularly helpful to migrate manifests to a
non-deprecated API version with a newer Kubernetes release. For more info, visit Migrate to
non-deprecated APIs.
1. Download the latest release of the kubectl-convert binary for your architecture (x86-64 or
ARM64).
2. Validate the binary (optional): download the matching checksum file and check the binary
against it, following the same pattern shown above for kubectl. If valid, the output is:
kubectl-convert: OK
If the check fails, sha256sum exits with nonzero status and prints output similar to:
kubectl-convert: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
3. Install kubectl-convert
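# install command as published in the upstream guide
sudo install -o root -g root -m 0755 kubectl-convert /usr/local/bin/kubectl-convert
4. Verify the plugin is successfully installed:
kubectl convert --help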
rm kubectl-convert kubectl-convert.sha256
What's next
• Install Minikube
• See the getting started guides for more about creating clusters.
• Learn how to launch and expose your application.
• If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
• Read the kubectl reference docs
Install and Set Up kubectl on macOS
Before you begin
You must use a kubectl version that is within one minor version difference of your cluster. For
example, a v1.28 client can communicate with v1.27, v1.28, and v1.29 control planes. Using the
latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl binary with curl on macOS
1. Download the latest release for your Mac (Intel or Apple Silicon).
Note: To download a specific version, replace the latest-release lookup in the download URL
with that version number.
2. Validate the binary (optional): download the matching checksum file for your architecture
and check the binary against it.
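A sketch of these steps (shown for Intel Macs; use darwin/arm64 in the URLs on Apple Silicon):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | shasum -a 256 --check
If valid, the output is: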
kubectl: OK
If the check fails, shasum exits with nonzero status and prints output similar to:
kubectl: FAILED
shasum: WARNING: 1 computed checksum did NOT match
chmod +x ./kubectl
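# then move the binary into your PATH and make sure root owns it (per the upstream guide):
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl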
rm kubectl kubectl.sha256
Install with Homebrew on macOS
If you are on macOS and using Homebrew package manager, you can install kubectl with
Homebrew.
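brew install kubectl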
or
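brew install kubernetes-cli
Test to ensure the version you installed is up-to-date:
kubectl version --client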
If you are on macOS and using Macports package manager, you can install kubectl with
Macports.
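sudo port selfupdate
sudo port install kubectl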
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able
to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or
port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will
need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, to check
whether it is configured properly, use:
kubectl cluster-info dump
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell which can save
you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
• Bash
• Fish
• Zsh
Introduction
The kubectl completion script for Bash can be generated with kubectl completion bash.
Sourcing this script in your shell enables kubectl completion.
However, the kubectl completion script depends on bash-completion, which you thus have to
install first.
Warning: There are two versions of bash-completion, v1 and v2. V1 is for Bash 3.2 (which is
the default on macOS), and v2 is for Bash 4.1+. The kubectl completion script doesn't work
correctly with bash-completion v1 and Bash 3.2. It requires bash-completion v2 and Bash
4.1+. Thus, to be able to correctly use kubectl completion on macOS, you have to install and use
Bash 4.1+ (instructions). The following instructions assume that you use Bash 4.1+ (that is, any
Bash version of 4.1 or newer).
Upgrade Bash
The instructions here assume you use Bash 4.1+. You can check your Bash's version by running:
echo $BASH_VERSION
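If it is too old (macOS ships Bash 3.2 by default), you can install a newer Bash with Homebrew:
brew install bash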
Reload your shell and verify that the desired version is being used:
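echo $BASH_VERSION $SHELL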
Install bash-completion
Note: As mentioned, these instructions assume you use Bash 4.1+, which means you will install
bash-completion v2 (in contrast to Bash 3.2 and bash-completion v1, in which case kubectl
completion won't work).
You can test if you have bash-completion v2 already installed with type _init_completion. If
not, you can install it with Homebrew:
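brew install bash-completion@2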
As stated in the output of this command, add the following to your ~/.bash_profile file:
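# (the exact line printed by Homebrew may differ slightly between versions)
brew_etc="$(brew --prefix)/etc" && [[ -r "${brew_etc}/profile.d/bash_completion.sh" ]] && . "${brew_etc}/profile.d/bash_completion.sh"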
Reload your shell and verify that bash-completion v2 is correctly installed with type
_init_completion.
You now have to ensure that the kubectl completion script gets sourced in all your shell
sessions. There are multiple ways to achieve this:
• If you have an alias for kubectl, you can extend shell completion to work with that alias, as
shown in the sketch after this list.
• If you installed kubectl with Homebrew (as explained here), then the kubectl completion
script should already be in /usr/local/etc/bash_completion.d/kubectl. In that case, you
don't need to do anything.
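A sketch of these options (paths assume Homebrew's default prefix):
# source the completion script in your ~/.bash_profile:
echo 'source <(kubectl completion bash)' >> ~/.bash_profile
# or drop the completion script into the bash-completion directory:
kubectl completion bash > $(brew --prefix)/etc/bash_completion.d/kubectl
# if you use an alias for kubectl:
echo 'alias k=kubectl' >> ~/.bash_profile
echo 'complete -o default -F __start_kubectl k' >> ~/.bash_profile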
Note: The Homebrew installation of bash-completion v2 sources all the files in the
BASH_COMPLETION_COMPAT_DIR directory, that's why the latter two methods work.
In any case, after reloading your shell, kubectl completion should be working.
The kubectl completion script for Fish can be generated with the command kubectl completion
fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
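kubectl completion fish | source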
The kubectl completion script for Zsh can be generated with the command kubectl completion
zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
If you get an error like 2: command not found: compdef, then add the following to the
beginning of your ~/.zshrc file:
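autoload -Uz compinit
compinit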
Install kubectl convert plugin
A plugin for the Kubernetes command-line tool kubectl, which allows you to convert manifests
between different API versions. This can be particularly helpful to migrate manifests to a
non-deprecated API version with a newer Kubernetes release. For more info, visit Migrate to
non-deprecated APIs.
1. Download the latest release of the kubectl-convert binary for your Mac (Intel or Apple
Silicon).
2. Validate the binary (optional): download the matching checksum file and check the binary
against it, following the same pattern shown above for kubectl. If valid, the output is:
kubectl-convert: OK
If the check fails, shasum exits with nonzero status and prints output similar to:
kubectl-convert: FAILED
shasum: WARNING: 1 computed checksum did NOT match
chmod +x ./kubectl-convert
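# move the binary into your PATH, make sure root owns it, and verify the plugin:
sudo mv ./kubectl-convert /usr/local/bin/kubectl-convert
sudo chown root: /usr/local/bin/kubectl-convert
kubectl convert --help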
rm kubectl-convert kubectl-convert.sha256
Uninstall kubectl
Depending on how you installed kubectl, use one of the following methods.
which kubectl
sudo rm <path>
Replace <path> with the path to the kubectl binary from the previous step. For example,
sudo rm /usr/local/bin/kubectl.
Install and Set Up kubectl on Windows
Note: To find out the latest stable version (for example, for scripting), take a look at
https://dl.k8s.io/release/stable.txt.
◦ Using PowerShell to automate the verification using the -eq operator to get a True
or False result:
$(Get-FileHash -Algorithm SHA256 .\kubectl.exe).Hash -eq $(Get-Content .\kubectl.exe.sha256)
3. Append or prepend the kubectl binary folder to your PATH environment variable.
Note: Docker Desktop for Windows adds its own version of kubectl to PATH. If you have
installed Docker Desktop before, you may need to place your PATH entry before the one added
by the Docker Desktop installer or remove the Docker Desktop's kubectl.
1. To install kubectl on Windows you can use either Chocolatey package manager, Scoop
command-line installer, or winget package manager.
◦ choco
◦ scoop
◦ winget
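For example:
# Chocolatey
choco install kubernetes-cli
# Scoop
scoop install kubectl
# winget
winget install -e --id Kubernetes.kubectl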
mkdir .kube
cd .kube
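To point kubectl at a remote cluster, the upstream guide then creates an empty config file to
edit:
New-Item config -type file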
Note: Edit the config file with a text editor of your choice, such as Notepad.
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a kubeconfig file, which is
created automatically when you create a cluster using kube-up.sh or successfully deploy a
Minikube cluster. By default, kubectl configuration is located at ~/.kube/config.
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able
to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or
port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will
need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the url response but you can't access your cluster, to check
whether it is configured properly, use:
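kubectl cluster-info dump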
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save
you a lot of typing.
The kubectl completion script for PowerShell can be generated with the command kubectl
completion powershell.
To do so in all your shell sessions, add the following line to your $PROFILE file:
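kubectl completion powershell | Out-String | Invoke-Expression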
This command will regenerate the auto-completion script on every PowerShell start up. You can
also add the generated script directly to your $PROFILE file.
To add the generated script to your $PROFILE file, run the following line in your PowerShell
prompt:
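kubectl completion powershell >> $PROFILE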
Install kubectl convert plugin
A plugin for the Kubernetes command-line tool kubectl, which allows you to convert manifests
between different API versions. This can be particularly helpful to migrate manifests to a
non-deprecated API version with a newer Kubernetes release. For more info, visit Migrate to
non-deprecated APIs.
◦ Using PowerShell to automate the verification using the -eq operator to get a True
or False result:
$($(CertUtil -hashfile .\kubectl-convert.exe SHA256)[1] -replace " ", "") -eq $(type .\kubectl-convert.exe.sha256)
What's next
• Install Minikube
• See the getting started guides for more about creating clusters.
• Learn how to launch and expose your application.
• If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
• Read the kubectl reference docs
Administer a Cluster
Learn common tasks for administering a cluster.
Namespaces Walkthrough
Securing a Cluster
Upgrade A Cluster
Client certificates generated by kubeadm expire after 1 year. This page explains how to manage
certificate renewals with kubeadm. It also covers other tasks related to kubeadm certificate
management.
By default, kubeadm generates all the certificates needed for a cluster to run. You can override
this behavior by providing your own certificates. To do so, you must place them in whatever
directory is specified by the --cert-dir flag or the certificatesDir field of kubeadm's
ClusterConfiguration. By default this is /etc/kubernetes/pki.
If a given certificate and private key pair exists before running kubeadm init, kubeadm does not
overwrite them. This means you can, for example, copy an existing CA into /etc/kubernetes/
pki/ca.crt and /etc/kubernetes/pki/ca.key, and kubeadm will use this CA for signing the rest of
the certificates.
External CA mode
It is also possible to provide only the ca.crt file and not the ca.key file (this is only available for
the root CA file, not other cert pairs). If all other certificates and kubeconfig files are in place,
kubeadm recognizes this condition and activates the "External CA" mode. kubeadm will proceed
without the CA key on disk.
Instead, run the controller-manager standalone with --controllers=csrsigner and point to the
CA certificate and key.
PKI certificates and requirements includes guidance on setting up a cluster to use an external
CA.
You can check certificate expiration with the kubeadm certs check-expiration subcommand. The
command shows expiration/residual time for the client certificates in the /etc/kubernetes/pki
folder and for the client certificate embedded in the kubeconfig files used by kubeadm
(admin.conf, controller-manager.conf and scheduler.conf).
Additionally, kubeadm informs the user if the certificate is externally managed; in this case, the
user should take care of managing certificate renewal manually/using other tools.
On nodes created with kubeadm init, prior to kubeadm version 1.17, there is a bug where you
manually have to modify the contents of kubelet.conf. After kubeadm init finishes, you should
update kubelet.conf to point to the rotated kubelet client certificates, by replacing client-
certificate-data and client-key-data with:
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
This feature is designed for addressing the simplest use cases; if you don't have specific
requirements on certificate renewal and perform Kubernetes version upgrades regularly (less
than 1 year in between each upgrade), kubeadm will take care of keeping your cluster up to
date and reasonably secure.
Note: It is a best practice to upgrade your cluster frequently in order to stay secure.
If you have more complex requirements for certificate renewal, you can opt out from the
default behavior by passing --certificate-renewal=false to kubeadm upgrade apply or to
kubeadm upgrade node.
Warning: Prior to kubeadm version 1.17 there is a bug where the default value for --certificate-
renewal is false for the kubeadm upgrade node command. In that case, you should explicitly set
--certificate-renewal=true.
You can renew your certificates manually at any time with the kubeadm certs renew command.
This command performs the renewal using the CA (or front-proxy-CA) certificate and key stored
in /etc/kubernetes/pki.
After running the command you should restart the control plane Pods. This is required since
dynamic certificate reload is currently not supported for all components and certificates. Static
Pods are managed by the local kubelet and not by the API Server, thus kubectl cannot be used
to delete and restart them. To restart a static Pod you can temporarily remove its manifest file
from /etc/kubernetes/manifests/ and wait for 20 seconds (see the fileCheckFrequency value in
the KubeletConfiguration struct). The kubelet will terminate the Pod if it's no longer in the
manifest directory. You can then move the file back and after another fileCheckFrequency
period, the kubelet will recreate the Pod and the certificate renewal for the component can
complete.
Warning: If you are running an HA cluster, this command needs to be executed on all the
control-plane nodes.
Note: certs renew uses the existing certificates as the authoritative source for attributes
(Common Name, Organization, SAN, etc.) instead of the kubeadm-config ConfigMap. It is
strongly recommended to keep them both in sync.
kubeadm certs renew can renew any specific certificate or, with the subcommand all, it can
renew all of them, as shown below:
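kubeadm certs renew all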
Note:
Clusters built with kubeadm often copy the admin.conf certificate into $HOME/.kube/config, as
instructed in Creating a cluster with kubeadm. On such a system, to update the contents of
$HOME/.kube/config after renewing the admin.conf, you must run the following commands:
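sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config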
Caution: These are advanced topics for users who need to integrate their organization's
certificate infrastructure into a kubeadm-built cluster. If the default kubeadm configuration
satisfies your needs, you should let kubeadm manage certificates instead.
Set up a signer
The Kubernetes Certificate Authority does not work out of the box. You can configure an
external signer such as cert-manager, or you can use the built-in signer.
To activate the built-in signer, you must pass the --cluster-signing-cert-file and --cluster-
signing-key-file flags.
If you're creating a new cluster, you can use a kubeadm configuration file:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    cluster-signing-cert-file: /etc/kubernetes/pki/ca.crt
    cluster-signing-key-file: /etc/kubernetes/pki/ca.key
See Create CertificateSigningRequest for creating CSRs with the Kubernetes API.
To better integrate with external CAs, kubeadm can also produce certificate signing requests
(CSRs). A CSR represents a request to a CA for a signed certificate for a client. In kubeadm
terms, any certificate that would normally be signed by an on-disk CA can be produced as a
CSR instead. A CA, however, cannot be produced as a CSR.
You can create certificate signing requests with kubeadm certs renew --csr-only.
Both the CSR and the accompanying private key are given in the output. You can pass in a
directory with --csr-dir to output the CSRs to the specified location. If --csr-dir is not specified,
the default certificate directory (/etc/kubernetes/pki) is used.
Certificates can be renewed with kubeadm certs renew --csr-only. As with kubeadm init, an
output directory can be specified with the --csr-dir flag.
A CSR contains a certificate's name, domains, and IPs, but it does not specify usages. It is the
responsibility of the CA to specify the correct cert usages when issuing a certificate.
After a certificate is signed using your preferred method, the certificate and the private key
must be copied to the PKI directory (by default /etc/kubernetes/pki).
Certificate authority (CA) rotation
Kubeadm does not support rotation or replacement of CA certificates out of the box.
For more information about manual rotation or replacement of CA, see manual rotation of CA
certificates.
To configure the kubelets in a new kubeadm cluster to obtain properly signed serving
certificates you must pass the following minimal configuration to kubeadm init:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true
If you have already created the cluster you must adapt it by doing the following:
• Find and edit the kubelet-config-1.28 ConfigMap in the kube-system namespace. In that
ConfigMap, the kubelet key has a KubeletConfiguration document as its value. Edit the
KubeletConfiguration document to set serverTLSBootstrap: true.
• On each node, add the serverTLSBootstrap: true field in /var/lib/kubelet/config.yaml and
restart the kubelet with systemctl restart kubelet
The field serverTLSBootstrap: true will enable the bootstrap of kubelet serving certificates by
requesting them from the certificates.k8s.io API. One known limitation is that the CSRs
(Certificate Signing Requests) for these certificates cannot be automatically approved by the
default signer in the kube-controller-manager - kubernetes.io/kubelet-serving. This will require
action from the user or a third party controller.
By default, these serving certificates will expire after one year. Kubeadm sets the
KubeletConfiguration field rotateCertificates to true, which means that close to expiration a
new set of CSRs for the serving certificates will be created and must be approved to complete
the rotation. To understand more see Certificate Rotation.
If you are looking for a solution for automatic approval of these CSRs it is recommended that
you contact your cloud provider and ask if they have a CSR signer that verifies the node
identity with an out of band mechanism.
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
• kubelet-csr-approver
Such a controller is not a secure mechanism unless it not only verifies the CommonName in the
CSR but also verifies the requested IPs and domain names. This would prevent a malicious actor
that has access to a kubelet client certificate from creating CSRs requesting serving certificates
for any IP or domain name.
Rather than sharing the kubeadm-generated admin.conf with additional users, you can use the
kubeadm kubeconfig user command to generate kubeconfig files for them. The command accepts
a mixture of command line flags and kubeadm configuration options. The generated kubeconfig
will be written to stdout and can be piped to a file using kubeadm kubeconfig user ... > somefile.conf.
# example.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Will be used as the target "cluster" in the kubeconfig
clusterName: "kubernetes"
# Will be used as the "server" (IP or DNS name) of this cluster in the kubeconfig
controlPlaneEndpoint: "some-dns-address:6443"
# The cluster CA key and certificate will be loaded from this local directory
certificatesDir: "/etc/kubernetes/pki"
Make sure that these settings match the desired target cluster settings. To see the settings of an
existing cluster use:
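kubectl get cm kubeadm-config -n kube-system -o=jsonpath="{.data.ClusterConfiguration}"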
The following example will generate a kubeconfig file with credentials valid for 24 hours for a
new user johndoe that is part of the appdevs group:
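# flags as documented for kubeadm kubeconfig user; adjust names and validity to your needs
kubeadm kubeconfig user --config example.yaml --org appdevs --client-name johndoe --validity-period 24h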
The Container Runtimes page explains that the systemd driver is recommended for kubeadm
based setups instead of the kubelet's default cgroupfs driver, because kubeadm manages the
kubelet as a systemd service. The page also provides details on how to set up a number of
different container runtimes with the systemd driver by default.
Note:
In v1.22 and later, if the user does not set the cgroupDriver field under KubeletConfiguration,
kubeadm defaults it to systemd.
In Kubernetes v1.28, you can enable automatic detection of the cgroup driver as an alpha
feature. See systemd cgroup driver for more details.
# kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.21.0
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
Kubeadm uses the same KubeletConfiguration for all nodes in the cluster. The
KubeletConfiguration is stored in a ConfigMap object under the kube-system namespace.
Executing the sub commands init, join and upgrade would result in kubeadm writing the
KubeletConfiguration as a file under /var/lib/kubelet/config.yaml and passing it to the local
node kubelet.
See the below section on "Modify the kubelet ConfigMap" for details on how to be explicit
about the value.
If you wish to configure a container runtime to use the cgroupfs driver, you must refer to the
documentation of the container runtime of your choice.
Note: Alternatively, it is possible to replace the old nodes in the cluster with new ones that use
the systemd driver. This requires executing only the first step below before joining the new
nodes and ensuring the workloads can safely move to the new nodes before deleting the old
nodes.
• Either modify the existing cgroupDriver value or add a new field that looks like this:
cgroupDriver: systemd
This field must be present under the kubelet: section of the ConfigMap.
Execute these steps on nodes one at a time to ensure workloads have sufficient time to schedule
on different nodes.
Once the process is complete ensure that all nodes and workloads are healthy.
To modify the components configuration you must manually edit associated cluster objects and
files on disk.
This guide shows the correct sequence of steps that need to be performed to achieve kubeadm
cluster reconfiguration.
The kubectl edit command will open a text editor where you can edit and save the object
directly.
You can use the environment variables KUBECONFIG and KUBE_EDITOR to specify the
location of the kubectl consumed kubeconfig file and preferred text editor.
For example:
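KUBECONFIG=/etc/kubernetes/admin.conf KUBE_EDITOR=nano kubectl edit <parameters>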
Note: Upon saving any changes to these cluster objects, components running on nodes may not
be automatically updated. The steps below instruct you on how to perform that manually.
Warning: Component configuration in ConfigMaps is stored as unstructured data (YAML
string). This means that validation will not be performed upon updating the contents of a
ConfigMap. You have to be careful to follow the documented API format for a particular
component configuration and avoid introducing typos and YAML indentation mistakes.
Applying cluster configuration changes
During cluster creation and upgrade, kubeadm writes its ClusterConfiguration in a ConfigMap
called kubeadm-config in the kube-system namespace.
To change a particular option in the ClusterConfiguration you can edit the ConfigMap with this
command:
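kubectl edit cm -n kube-system kubeadm-config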
Note: The ClusterConfiguration includes a variety of options that affect the configuration of
individual components such as kube-apiserver, kube-scheduler, kube-controller-manager,
CoreDNS, etcd and kube-proxy. Changes to the configuration must be reflected on node
components manually.
kubeadm manages the control plane components as static Pod manifests located in the
directory /etc/kubernetes/manifests. Any changes to the ClusterConfiguration under the
apiServer, controllerManager, scheduler or etcd keys must be reflected in the associated files in
the manifests directory on a control plane node.
Before proceeding with these changes, make sure you have backed up the directory /etc/
kubernetes/.
The <config-file> contents must match the updated ClusterConfiguration. The <component-
name> value must be the name of the component.
Note: Updating a file in /etc/kubernetes/manifests will tell the kubelet to restart the static Pod
for the corresponding component. Try doing these changes one node at a time to leave the
cluster without downtime.
Applying kubelet configuration changes
During cluster creation and upgrade, kubeadm writes its KubeletConfiguration in a ConfigMap
called kubelet-config in the kube-system namespace.
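To change a particular option in the KubeletConfiguration, you can edit the ConfigMap with
this command:
kubectl edit cm -n kube-system kubelet-config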
Note: Do these changes one node at a time to allow workloads to be rescheduled properly.
Note: During kubeadm upgrade, kubeadm downloads the KubeletConfiguration from the
kubelet-config ConfigMap and overwrites the contents of /var/lib/kubelet/config.yaml. This
means that node local configuration must be applied either by flags in
/var/lib/kubelet/kubeadm-flags.env or by manually updating the contents of
/var/lib/kubelet/config.yaml after kubeadm upgrade, and then restarting the kubelet.
Applying kube-proxy configuration changes
kubeadm writes the KubeProxyConfiguration in a ConfigMap called kube-proxy in the
kube-system namespace. To change a particular option in the KubeProxyConfiguration, you can
edit the ConfigMap with this command:
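kubectl edit cm -n kube-system kube-proxy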
Once the kube-proxy ConfigMap is updated, you can restart all kube-proxy Pods:
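kubectl delete po -n kube-system -l k8s-app=kube-proxy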
Applying CoreDNS configuration changes
kubeadm deploys CoreDNS as a Deployment called coredns and with a Service kube-dns, both
in the kube-system namespace.
To update any of the CoreDNS settings, you can edit the Deployment and Service objects:
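kubectl edit deployment -n kube-system coredns
kubectl edit service -n kube-system kube-dns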
Once the CoreDNS changes are applied you can delete the CoreDNS Pods:
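kubectl delete po -n kube-system -l k8s-app=kube-dns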
Note: kubeadm does not allow CoreDNS configuration during cluster creation and upgrade.
This means that if you execute kubeadm upgrade apply, your changes to the CoreDNS objects
will be lost and must be reapplied.
kubeadm writes Labels, Taints, CRI socket and other information on the Node object for a
particular Kubernetes node. To change any of the contents of this Node object you can use:
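kubectl edit no <node-name>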
The main source of control plane configuration is the ClusterConfiguration object stored in the
cluster. To extend the static Pod manifests configuration, patches can be used.
These patch files must remain as files on the control plane nodes to ensure that they can be
used by the kubeadm upgrade ... --patches <directory>.
If reconfiguration is done to the ClusterConfiguration and static Pod manifests on disk, the set
of node specific patches must be updated accordingly.
What's next
• Upgrading kubeadm clusters
• Customizing components with the kubeadm API
• Certificate management with kubeadm
• Find more about kubeadm set-up
To see information about upgrading clusters created using older versions of kubeadm, please
refer to the following pages instead:
Additional information
• The instructions below outline when to drain each node during the upgrade process. If
you are performing a minor version upgrade for any kubelet, you must first drain the
node (or nodes) that you are upgrading. In the case of control plane nodes, they could be
running CoreDNS Pods or other critical workloads. For more information see Draining
nodes.
• All containers are restarted after upgrade, because the container spec hash value is
changed.
• To verify that the kubelet service has successfully restarted after the kubelet has been
upgraded, you can execute systemctl status kubelet or view the service logs with
journalctl -xeu kubelet.
• Usage of the --config flag of kubeadm upgrade with kubeadm configuration API types
with the purpose of reconfiguring the cluster is not recommended and can have
unexpected results. Follow the steps in Reconfiguring a kubeadm cluster instead.
Note: The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been
deprecated and frozen starting from September 13, 2023. Using the new package repositories
hosted at pkgs.k8s.io is strongly recommended and required in order to install
Kubernetes versions released after September 13, 2023. The deprecated legacy repositories,
and their contents, might be removed at any time in the future and without a further notice
period. The new package repositories provide downloads for Kubernetes versions starting with
v1.24.0.
1. Upgrade kubeadm:
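# Debian-based example; replace x in 1.28.x-* with the latest patch version
# (use the equivalent yum/zypper commands on other distributions)
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*' && \
sudo apt-mark hold kubeadm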
2. Verify that the download works and has the expected version:
kubeadm version
3. Verify the upgrade plan by running kubeadm upgrade plan. This command checks that your
cluster can be upgraded, and fetches the versions you can upgrade to. It also shows a table
with the component config version states.
Note: kubeadm upgrade also automatically renews the certificates that it manages on this
node. To opt-out of certificate renewal the flag --certificate-renewal=false can be used.
For more information see the certificate management guide.
Note: If kubeadm upgrade plan shows any component configs that require manual
upgrade, users must provide a config file with replacement configs to kubeadm upgrade
apply via the --config command line flag. Failing to do so will cause kubeadm upgrade
apply to exit with an error and not perform an upgrade.
4. Choose a version to upgrade to, and run the appropriate command. For example:
# replace x with the patch version you picked for this upgrade
sudo kubeadm upgrade apply v1.28.x
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with
upgrading your kubelets if you haven't already done so.
Note: For versions earlier than v1.28, kubeadm defaulted to a mode that upgrades the addons
(including CoreDNS and kube-proxy) immediately during kubeadm upgrade apply, regardless of
whether there are other control plane instances that have not been upgraded. This may cause
compatibility problems. Since v1.28, kubeadm defaults to a mode that checks whether all the
control plane instances have been upgraded before starting to upgrade the addons. You must
perform control plane instance upgrades sequentially, or at least ensure that the last control
plane instance upgrade is not started until all the other control plane instances have been
upgraded completely; the addons upgrade is then performed after the last control plane
instance is upgraded. If you want to keep the old upgrade behavior, enable the
UpgradeAddonsBeforeControlPlane feature gate with kubeadm upgrade apply
--feature-gates=UpgradeAddonsBeforeControlPlane=true. The Kubernetes project does not in
general recommend enabling this feature gate; you should instead change your upgrade process
or cluster addons so that you do not need to enable the legacy behavior. The
UpgradeAddonsBeforeControlPlane feature gate will be removed in a future release.
Your Container Network Interface (CNI) provider may have its own upgrade instructions
to follow. Check the addons page to find your CNI provider and see whether additional
upgrade steps are required.
This step is not required on additional control plane nodes if the CNI provider runs as a
DaemonSet.
For the other control plane nodes, use sudo kubeadm upgrade node instead of sudo kubeadm
upgrade apply. Also calling kubeadm upgrade plan and upgrading the CNI provider plugin is no
longer needed.
Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
Upgrade kubelet and kubectl
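On Debian-based distributions this looks roughly like the following (replace x in 1.28.x-* with
the latest patch version; use the equivalent yum/zypper commands on other distributions):
sudo apt-mark unhold kubelet kubectl && \
sudo apt-get update && sudo apt-get install -y kubelet='1.28.x-*' kubectl='1.28.x-*' && \
sudo apt-mark hold kubelet kubectl
Restart the kubelet:
sudo systemctl daemon-reload
sudo systemctl restart kubelet
Finally, bring the node back online by marking it schedulable:
kubectl uncordon <node-to-drain>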
The following pages show how to upgrade Linux and Windows worker nodes:
Verify that all nodes are available again by running kubectl get nodes. The STATUS column
should show Ready for all your nodes, and the version number should be updated.
During upgrade kubeadm writes the following backup folders under /etc/kubernetes/tmp:
• kubeadm-backup-etcd-<date>-<time>
• kubeadm-backup-manifests-<date>-<time>
kubeadm-backup-etcd contains a backup of the local etcd member data for this control plane
Node. In case of an etcd upgrade failure and if the automatic rollback does not work, the
contents of this folder can be manually restored in /var/lib/etcd. In case external etcd is used
this backup folder will be empty.
kubeadm-backup-manifests contains a backup of the static Pod manifest files for this control
plane Node. In case of an upgrade failure and if the automatic rollback does not work, the
contents of this folder can be manually restored in /etc/kubernetes/manifests. If for some reason
there is no difference between a pre-upgrade and post-upgrade manifest file for a certain
component, a backup file for it will not be written.
How it works
kubeadm upgrade apply does the following:
kubeadm upgrade node does the following on additional control plane nodes:
• Killercoda
• Play with Kubernetes
• Familiarize yourself with the process for upgrading the rest of your kubeadm cluster. You
will want to upgrade the control plane nodes before upgrading your Linux Worker nodes.
Note: The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been
deprecated and frozen starting from September 13, 2023. Using the new package repositories
hosted at pkgs.k8s.io is strongly recommended and required in order to install
Kubernetes versions released after September 13, 2023. The deprecated legacy repositories,
and their contents, might be removed at any time in the future and without a further notice
period. The new package repositories provide downloads for Kubernetes versions starting with
v1.24.0.
Upgrade kubeadm:
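# Debian-based example; replace x in 1.28.x-* with the latest patch version
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*' && \
sudo apt-mark hold kubeadm
Then call kubeadm upgrade; for worker nodes this upgrades the local kubelet configuration:
sudo kubeadm upgrade node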
Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
What's next
• See how to Upgrade Windows nodes.
This page explains how to upgrade a Windows node created with kubeadm.
• Killercoda
• Play with Kubernetes
Your Kubernetes server must be at or later than version 1.17. To check the version, enter
kubectl version.
• Familiarize yourself with the process for upgrading the rest of your kubeadm cluster. You
will want to upgrade the control plane nodes before upgrading your Windows nodes.
1. From a machine with access to the Kubernetes API, prepare the node for maintenance by
marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
node/ip-172-31-85-18 cordoned
node/ip-172-31-85-18 drained
1. From the Windows node, call the following command to sync new kubelet configuration:
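kubeadm upgrade node
1. From the Windows node, upgrade and restart the kubelet: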
stop-service kubelet
curl.exe -Lo <path-to-kubelet.exe> "https://dl.k8s.io/v1.28.4/bin/windows/amd64/kubelet.exe"
restart-service kubelet
Note: If you are running kube-proxy in a HostProcess container within a Pod, and not as a
Windows Service, you can upgrade kube-proxy by applying a newer version of your kube-
proxy manifests.
1. From a machine with access to the Kubernetes API, bring the node back online by
marking it schedulable:
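kubectl uncordon <node-to-uncordon>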
What's next
• See how to Upgrade Linux nodes.
Note: This guide only covers a part of the Kubernetes upgrade process. Please see the upgrade
guide for more information about upgrading Kubernetes clusters.
Note: This step is only needed upon upgrading a cluster to another minor release. If you're
upgrading to another patch release within the same minor release (e.g. v1.28.5 to v1.28.7), you
don't need to follow this guide. However, if you're still using the legacy package repositories,
you'll need to migrate to the new community-owned package repositories before upgrading
(see the next section for more details on how to do this).
Note: The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been
deprecated and frozen starting from September 13, 2023. Using the new package repositories
hosted at pkgs.k8s.io is strongly recommended and required in order to install
Kubernetes versions released after September 13, 2023. The deprecated legacy repositories,
and their contents, might be removed at any time in the future and without a further notice
period. The new package repositories provide downloads for Kubernetes versions starting with
v1.24.0.
If you're unsure whether you're using the community-owned package repositories or the legacy
package repositories, take the following steps to verify:
Print the contents of the file that defines the Kubernetes apt repository:
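pager /etc/apt/sources.list.d/kubernetes.list
With the community-owned package repositories, the file contains a single line similar to:
deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.27/deb/ /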
If you see a URL similar to the one above, you're using the Kubernetes package repositories
and this guide applies to you.
Otherwise, it's strongly recommended to migrate to the Kubernetes package repositories as
described in the official announcement.
Print the contents of the file that defines the Kubernetes yum repository:
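cat /etc/yum.repos.d/kubernetes.repo
With the community-owned package repositories, the file looks similar to: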
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl
If you see a gpgkey URL similar to the one above, you're using the Kubernetes package
repositories and this guide applies to you.
Otherwise, it's strongly recommended to migrate to the Kubernetes package repositories as
described in the official announcement.
Print the contents of the file that defines the Kubernetes zypper repository:
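cat /etc/zypp/repos.d/kubernetes.repo
With the community-owned package repositories, the file looks similar to: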
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl
If you see a gpgkey URL similar to the one above, you're using the Kubernetes package
repositories and this guide applies to you.
Otherwise, it's strongly recommended to migrate to the Kubernetes package repositories as
described in the official announcement.
Note:
The URL used for the Kubernetes package repositories is not limited to pkgs.k8s.io, it can also
be one of:
• pkgs.k8s.io
• pkgs.kubernetes.io
• packages.kubernetes.io
1. Open the file that defines the Kubernetes apt repository using a text editor of your
choice:
nano /etc/apt/sources.list.d/kubernetes.list
You should see a single line with the URL that contains your current Kubernetes minor
version. For example, if you're using v1.27, you should see this:
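deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.27/deb/ /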
2. Change the version in the URL to the next available minor release, for example:
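deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /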
3. Save the file and exit your text editor. Continue following the relevant upgrade
instructions.
1. Open the file that defines the Kubernetes yum repository using a text editor of your
choice:
nano /etc/yum.repos.d/kubernetes.repo
You should see a file with two URLs that contain your current Kubernetes minor version.
For example, if you're using v1.27, you should see this:
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
2. Change the version in these URLs to the next available minor release, for example:
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
3. Save the file and exit your text editor. Continue following the relevant upgrade
instructions.
What's next
• See how to Upgrade Linux nodes.
• See how to Upgrade Windows nodes.
Since the announcement of dockershim deprecation in Kubernetes 1.20, there were questions
on how this will affect various workloads and Kubernetes installations. Our Dockershim
Removal FAQ is there to help you to understand the problem better.
Dockershim was removed from Kubernetes with the release of v1.24. If you use Docker Engine
via dockershim as your container runtime and wish to upgrade to v1.24, it is recommended that
you either migrate to another runtime or find an alternative means to obtain Docker Engine
support. Check out the container runtimes section to know your options.
The version of Kubernetes with dockershim (1.23) is out of support, and v1.24 will soon be out
of support as well. Make sure to report issues you encounter with the migration so that they
can be fixed in a timely manner and your cluster is ready for dockershim removal. After v1.24
runs out of support, you will need to contact your Kubernetes provider for support or upgrade
multiple versions at a time if there are critical issues affecting your cluster.
Your cluster might have more than one kind of node, although this is not a common
configuration.
What's next
• Check out container runtimes to understand your options for an alternative.
• If you find a defect or other technical concern relating to migrating away from
dockershim, you can report an issue to the Kubernetes project.
Install containerd. For more information see containerd's installation documentation and for
specific prerequisite follow the containerd guide.
Replace <node-to-drain> with the name of your node you are draining.
Install Containerd
Follow the guide for detailed steps to install containerd.
• Linux
• Windows (PowerShell)
1. Install the containerd.io package from the official Docker repositories. Instructions for
setting up the Docker repository for your respective Linux distribution and installing the
containerd.io package can be found at Getting started with containerd.
2. Configure containerd:
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
3. Restart containerd:
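sudo systemctl restart containerd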
Start a PowerShell session, set $Version to the desired version (for example: $Version="1.4.3"),
and then run the following commands:
1. Download containerd:
curl.exe -L https://github.com/containerd/containerd/releases/download/v$Version/containerd-$Version-windows-amd64.tar.gz -o containerd-windows-amd64.tar.gz
2. Extract the archive:
tar.exe xvf .\containerd-windows-amd64.tar.gz
3. Start containerd:
.\containerd.exe --register-service
Start-Service containerd
Users using kubeadm should be aware that the kubeadm tool stores the CRI socket for each
host as an annotation in the Node object for that host. To change it you can execute the
following command on a machine that has the kubeadm /etc/kubernetes/admin.conf file.
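kubectl edit no <node-name>
This will start a text editor where you can edit the CRI socket.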
Note that new CRI socket paths must be prefixed with unix:// ideally.
• Save the changes in the text editor, which will update the Node object.
• CentOS
• Debian
• Fedora
• Ubuntu
Caution: Docker's instructions for uninstalling Docker Engine create a risk of deleting
containerd. Be careful when executing commands.
Uncordon the node
kubectl uncordon <node-to-uncordon>
Replace <node-to-uncordon> with the name of your node you previously drained.
This page shows you how to migrate your Docker Engine nodes to use cri-dockerd instead of
dockershim. You should follow these steps in these scenarios:
• You want to switch away from dockershim and still use Docker Engine to run containers
in Kubernetes.
• You want to upgrade to Kubernetes v1.28 and your existing cluster relies on dockershim,
in which case you must migrate from dockershim and cri-dockerd is one of your options.
To learn more about the removal of dockershim, read the FAQ page.
What is cri-dockerd?
In Kubernetes 1.23 and earlier, you could use Docker Engine with Kubernetes, relying on a
built-in component of Kubernetes named dockershim. The dockershim component was removed
in the Kubernetes 1.24 release; however, a third-party replacement, cri-dockerd, is available.
The cri-dockerd adapter lets you use Docker Engine through the Container Runtime Interface.
Note: If you already use cri-dockerd, you aren't affected by the dockershim removal. Before you
begin, check whether your nodes use the dockershim.
If you want to migrate to cri-dockerd so that you can continue using Docker Engine as your
container runtime, you should do the following for each affected node:
1. Install cri-dockerd.
2. Cordon and drain the node.
3. Configure the kubelet to use cri-dockerd.
4. Restart the kubelet.
5. Verify that the node is healthy.
You should perform the following steps for each node that you want to migrate to cri-dockerd.
The kubeadm tool stores the node's socket as an annotation on the Node object in the control
plane. To modify this socket for each affected node:
What's next
• Read the dockershim removal FAQ.
• Learn how to migrate from Docker Engine with dockershim to containerd.
Depending on the way you run your cluster, the container runtime for the nodes may have
been pre-configured or you need to configure it. If you're using a managed Kubernetes service,
there might be vendor-specific ways to check what container runtime is configured for the
nodes. The method described on this page should work whenever the execution of kubectl is
allowed.
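To find out which container runtime is used on a node, run:
kubectl get nodes -o wide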
The output is similar to the following. The column CONTAINER-RUNTIME outputs the runtime
and its version.
If your runtime shows as Docker Engine, you still might not be affected by the removal of
dockershim in Kubernetes v1.24. Check the runtime endpoint to see if you use dockershim. If
you don't use dockershim, you aren't affected.
Find out more information about container runtimes on Container Runtimes page.
Note: If you currently use Docker Engine in your nodes with cri-dockerd, you aren't affected by
the dockershim removal.
You can check which socket you use by checking the kubelet configuration on your nodes.
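1. On each node, print the kubelet's command line, for example:
tr \\0 ' ' < /proc/"$(pgrep kubelet)"/cmdline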
If you don't have tr or pgrep, check the command line for the kubelet process manually.
2. In the output, look for the --container-runtime flag and the --container-runtime-endpoint
flag.
◦ If your nodes use Kubernetes v1.23 and earlier and these flags aren't present or if
the --container-runtime flag is not remote, you use the dockershim socket with
Docker Engine. The --container-runtime command line argument is not available in
Kubernetes v1.27 and later.
◦ If the --container-runtime-endpoint flag is present, check the socket name to find
out which runtime you use. For example, unix:///run/containerd/containerd.sock is
the containerd endpoint.
If you want to change the Container Runtime on a Node from Docker Engine to containerd,
you can find out more information on migrating from Docker Engine to containerd, or, if you
want to continue using Docker Engine in Kubernetes v1.24 and later, migrate to a CRI-
compatible adapter like cri-dockerd.
With containerd v1.6.0-v1.6.3, if you do not upgrade the CNI plugins and/or declare the CNI
config version, you might encounter the following "Incompatible CNI versions" or "Failed to
destroy network for sandbox" error conditions.
If the version of your CNI plugin does not correctly match the plugin version in the config
because the config version is later than the plugin version, the containerd log will likely show
an error message on startup of a pod similar to:
incompatible CNI versions; config is \"1.0.0\", plugin supports [\"0.1.0\" \"0.2.0\" \"0.3.0\" \"0.3.1\"
\"0.4.0\"]"
To fix this issue, update your CNI plugins and CNI config files.
If the version of the plugin is missing in the CNI plugin config, the pod may run. However,
stopping the pod generates an error similar to:
This error leaves the pod in the not-ready state with a network namespace still attached. To
recover from this problem, edit the CNI config file to add the missing version information. The
next attempt to stop the pod should be successful.
If you're using containerd v1.6.0-v1.6.3 and encountered "Incompatible CNI versions" or "Failed
to destroy network for sandbox" errors, consider updating your CNI plugins and editing the
CNI config files.
1. Bring the node back into your cluster by restarting your container runtime and kubelet.
Uncordon the node (kubectl uncordon <nodename>).
Please see the documentation from your plugin and networking provider for further
instructions on configuring your system.
On Kubernetes, containerd runtime adds a loopback interface, lo, to pods as a default behavior.
The containerd runtime configures the loopback interface via a CNI plugin, loopback. The
loopback plugin is distributed as part of the containerd release packages that have the cni
designation. containerd v1.6.0 and later includes a CNI v1.0.0-compatible loopback plugin as
well as other default CNI plugins. The configuration for the loopback plugin is done internally
by containerd, and is set to use CNI v1.0.0. This also means that the version of the loopback
plugin must be v1.0.0 or later when this newer version containerd is started.
The following bash command generates an example CNI config. Here, the 1.0.0 value for the
config version is assigned to the cniVersion field for use when containerd invokes the CNI
bridge plugin.
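A sketch of such a command, adapted from containerd's documentation (the subnets and the
file name are illustrative placeholders; adjust them to your environment):
cat << EOF | sudo tee /etc/cni/net.d/10-containerd-net.conflist
{
  "cniVersion": "1.0.0",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "promiscMode": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [{ "subnet": "10.88.0.0/16" }],
          [{ "subnet": "2001:db8:4860::/64" }]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" },
          { "dst": "::/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
EOF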
Update the IP address ranges in the preceding example with ones that are based on your use
case and network addressing plan.
This page explains how your cluster could be using Docker as a container runtime, provides
details on the role that dockershim plays when in use, and shows steps you can take to check
whether any workloads could be affected by dockershim removal.
When an alternative container runtime is used, executing Docker commands may either not
work or yield unexpected output. This is how you can find whether you have a dependency on
Docker:
1. Make sure no privileged Pods execute Docker commands (like docker ps), restart the
Docker service (commands such as systemctl restart docker.service), or modify Docker-
specific files such as /etc/docker/daemon.json.
2. Check for any private registries or image mirror settings in the Docker configuration file
(like /etc/docker/daemon.json). Those typically need to be reconfigured for another
container runtime.
3. Check that scripts and apps running on nodes outside of your Kubernetes infrastructure
do not execute Docker commands. It might be:
◦ SSH to nodes to troubleshoot;
◦ Node startup scripts;
◦ Monitoring and security agents installed on nodes directly.
4. Check for third-party tools that perform the privileged operations mentioned above. See
Migrating telemetry and security agents from dockershim for more information.
5. Make sure there are no indirect dependencies on dockershim behavior. This is an edge
case and unlikely to affect your application. Some tooling may be configured to react to
Docker-specific behaviors, for example, raise alert on specific metrics or search for a
specific log message as part of troubleshooting instructions. If you have such tooling
configured, test the behavior on a test cluster before migration.
In its earliest releases, Kubernetes offered compatibility with one container runtime: Docker.
Later in the Kubernetes project's history, cluster operators wanted to adopt additional container
runtimes. The CRI was designed to allow this kind of flexibility - and the kubelet began
supporting CRI. However, because Docker existed before the CRI specification was invented,
the Kubernetes project created an adapter component, dockershim. The dockershim adapter
allows the kubelet to interact with Docker as if Docker were a CRI compatible runtime.
You can read about it in Kubernetes Containerd integration goes GA blog post.
Switching to containerd as a container runtime eliminates the middleman. All the same
containers can be run by container runtimes like containerd as before. But now, because
containers are scheduled directly with the container runtime, they are not visible to Docker. So
any Docker tooling or fancy UI you might have used before to check on these containers is no
longer available.
You cannot get container information using docker ps or docker inspect commands. As you
cannot list containers, you cannot get logs, stop containers, or execute something inside a
container using docker exec.
Note: If you're running workloads via Kubernetes, the best way to stop a container is through
the Kubernetes API rather than directly through the container runtime (this advice applies for
all container runtimes, not only Docker).
You can still pull images or build them using the docker build command. However, images built
or pulled by Docker are not visible to the container runtime or to Kubernetes. They need to be
pushed to a registry before Kubernetes can use them.
Known issues
Some filesystem metrics are missing and the metrics format is different
container_fs_inodes_free
container_fs_inodes_total
container_fs_io_current
container_fs_io_time_seconds_total
container_fs_io_time_weighted_seconds_total
container_fs_limit_bytes
container_fs_read_seconds_total
container_fs_reads_merged_total
container_fs_sector_reads_total
container_fs_sector_writes_total
container_fs_usage_bytes
container_fs_write_seconds_total
container_fs_writes_merged_total
Workaround
1. Find the latest cAdvisor release with the name pattern vX.Y.Z-containerd-cri (for
example, v0.42.0-containerd-cri).
2. Follow the steps in cAdvisor Kubernetes Daemonset to create the daemonset.
3. Point the installed metrics collector to use the cAdvisor /metrics endpoint which provides
the full set of Prometheus container metrics.
Alternatives:
What's next
• Read Migrating from dockershim to understand your next steps
• Read the dockershim deprecation FAQ article for more information.
Migrating telemetry and security agents
from dockershim
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
Kubernetes' support for direct integration with Docker Engine is deprecated and has been
removed. Most apps do not have a direct dependency on runtime hosting containers. However,
there are still a lot of telemetry and monitoring agents that have a dependency on Docker to
collect container metadata, logs, and metrics. This document aggregates information on how to
detect these dependencies as well as links on how to migrate these agents to use generic tools
or alternative runtimes.
Historically, Kubernetes was written to work specifically with Docker Engine. Kubernetes took
care of networking and scheduling, relying on Docker Engine for launching and running
containers (within Pods) on a node. Some information that is relevant to telemetry, such as a
pod name, is only available from Kubernetes components. Other data, such as container
metrics, is not the responsibility of the container runtime. Early telemetry agents needed to
query the container runtime and Kubernetes to report an accurate picture. Over time,
Kubernetes gained the ability to support multiple runtimes, and now supports any runtime that
is compatible with the container runtime interface.
Some telemetry agents rely specifically on Docker Engine tooling. For example, an agent might
run a command such as docker ps or docker top to list containers and processes or docker logs
to receive streamed logs. If nodes in your existing cluster use Docker Engine, and you switch to
a different container runtime, these commands will not work any longer.
If a pod wants to make calls to the dockerd running on the node, the pod must either:
• mount the filesystem containing the Docker daemon's privileged socket, as a volume; or
• mount the specific path of the Docker daemon's privileged socket directly, also as a
volume.
For example: on COS images, Docker exposes its Unix domain socket at /var/run/docker.sock
This means that the pod spec will include a hostPath volume mount of /var/run/docker.sock.
Here's a sample shell script to find Pods that have a mount directly mapping the Docker socket.
This script outputs the namespace and name of the pod. You can remove the grep '/var/run/
docker.sock' to review other mounts.
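The script itself is not included above; a minimal sketch of what it could look like, using
kubectl with a JSONPath template to print each Pod's namespace, name, and hostPath volume paths:
kubectl get pods --all-namespaces \
-o=jsonpath='{range .items[*]}{"\n"}{.metadata.namespace}{":\t"}{.metadata.name}{":\t"}{range .spec.volumes[*]}{.hostPath.path}{", "}{end}{end}' \
| sort \
| grep '/var/run/docker.sock'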
Note: There are alternative ways for a pod to access Docker on the host. For instance, the
parent directory /var/run may be mounted instead of the full path (like in this example). The
script above only detects the most common uses.
If your cluster nodes are customized and install additional security and telemetry agents on the
node, check with the agent vendor to verify whether it has any dependency on Docker.
This section is intended to aggregate information about various telemetry and security agents
that may have a dependency on container runtimes.
We keep a work-in-progress version of the migration instructions for various telemetry and
security agent vendors in a Google doc. Please contact the vendor to get up-to-date instructions
for migrating from dockershim.
No changes are needed: everything should work seamlessly on the runtime switch.
Datadog
How to migrate: Docker deprecation in Kubernetes. The pod that accesses Docker Engine may
have a name containing any of:
• datadog-agent
• datadog
• dd-agent
Dynatrace
CRI-O support announcement: Get automated full-stack visibility into your CRI-O Kubernetes
containers (Beta)
The pod accessing Docker may have a name containing:
• dynatrace-oneagent
Falco
How to migrate:
Migrate Falco from dockershim Falco supports any CRI-compatible runtime (containerd is used
in the default configuration); the documentation explains all details. The pod accessing Docker
may have a name containing:
• falco
Check the documentation for Prisma Cloud, under the "Install Prisma Cloud on a CRI (non-Docker)
cluster" section. The pod accessing Docker may be named like:
• twistlock-defender-ds
SignalFx (Splunk)
The SignalFx Smart Agent (deprecated) uses several different monitors for Kubernetes including
kubernetes-cluster, kubelet-stats/kubelet-metrics, and docker-container-stats. The kubelet-stats
monitor was previously deprecated by the vendor, in favor of kubelet-metrics. The docker-
container-stats monitor is the one affected by dockershim removal. Do not use the docker-
container-stats monitor with container runtimes other than Docker Engine.
1. Remove docker-container-stats from the list of configured monitors. Note that keeping this
monitor enabled with a non-dockershim runtime results in incorrect metrics being reported when
Docker is installed on the node, and no metrics when Docker is not installed.
2. Enable and configure kubelet-metrics monitor.
Note: The set of collected metrics will change. Review your alerting rules and dashboards.
• signalfx-agent
Flame does not support container runtimes other than Docker. See https://github.com/yahoo/kubectl-flame/issues/51
2. Generate a new certificate authority (CA). --batch sets automatic mode; --req-cn specifies
the Common Name (CN) for the CA's new root certificate.
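The commands for this step are not shown above; a sketch of what they could look like with
easyrsa3, assuming the tool has already been downloaded and MASTER_IP is set for your cluster:
./easyrsa init-pki
./easyrsa --batch "--req-cn=${MASTER_IP}@`date +%s`" build-ca nopass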
The argument --subject-alt-name sets the possible IPs and DNS names the API server will
be accessed with. The MASTER_CLUSTER_IP is usually the first IP from the service CIDR
that is specified as the --service-cluster-ip-range argument for both the API server and
the controller manager component. The argument --days is used to set the number of
days after which the certificate expires. The sample below also assumes that you are using
cluster.local as the default DNS domain name.
./easyrsa --subject-alt-name="IP:${MASTER_IP},"\
"IP:${MASTER_CLUSTER_IP},"\
"DNS:kubernetes,"\
"DNS:kubernetes.default,"\
"DNS:kubernetes.default.svc,"\
"DNS:kubernetes.default.svc.cluster,"\
"DNS:kubernetes.default.svc.cluster.local" \
--days=10000 \
build-server-full server nopass
5. Fill in and add the following parameters into the API server start parameters:
--client-ca-file=/yourdirectory/ca.crt
--tls-cert-file=/yourdirectory/server.crt
--tls-private-key-file=/yourdirectory/server.key
openssl
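1. The first step (generating the CA key) is not shown above; a sketch, producing the ca.key
file that step 2 refers to:
openssl genrsa -out ca.key 2048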
2. Use the ca.key to generate a ca.crt (use the -days argument to set the certificate validity
period):
openssl req -x509 -new -nodes -key ca.key -subj "/CN=${MASTER_IP}" -days 10000 -out ca.crt
Be sure to substitute the values marked with angle brackets (e.g. <MASTER_IP>) with
real values before saving this to a file (e.g. csr.conf). Note that the value for
MASTER_CLUSTER_IP is the service cluster IP for the API server as described in
previous subsection. The sample below also assumes that you are using cluster.local as
the default DNS domain name.
[ req ]
default_bits = 2048
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
C = <country>
ST = <state>
L = <city>
O = <organization>
OU = <organization unit>
CN = <MASTER_IP>
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster
DNS.5 = kubernetes.default.svc.cluster.local
IP.1 = <MASTER_IP>
IP.2 = <MASTER_CLUSTER_IP>
[ v3_ext ]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints=CA:FALSE
keyUsage=keyEncipherment,dataEncipherment
extendedKeyUsage=serverAuth,clientAuth
subjectAltName=@alt_names
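The intermediate steps that produce server.key and server.csr are not shown above; a sketch,
using the csr.conf file defined above and the filenames referenced in step 6:
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr -config csr.conf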
6. Generate the server certificate using the ca.key, ca.crt and server.csr:
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 10000 \
-extensions v3_ext -extfile csr.conf -sha256
Finally, add the same parameters into the API server start parameters.
cfssl
1. Download, unpack and prepare the command line tools as shown below.
Note that you may need to adapt the sample commands based on the hardware
architecture and cfssl version you are using.
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssl_1.5.0_linux_amd64 -o cfssl
chmod +x cfssl
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssljson_1.5.0_linux_amd64 -o cfssljson
chmod +x cfssljson
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssl-certinfo_1.5.0_linux_amd64 -o cfssl-certinfo
chmod +x cfssl-certinfo
2. Create a directory to hold the artifacts and initialize cfssl:
mkdir cert
cd cert
../cfssl print-defaults config > config.json
../cfssl print-defaults csr > csr.json
3. Create a JSON config file for generating the CA file, for example, ca-config.json:
{
"signing": {
"default": {
"expiry": "8760h"
},
"profiles": {
"kubernetes": {
"usages": [
"signing",
"key encipherment",
"server auth",
"client auth"
],
"expiry": "8760h"
}
}
}
}
4. Create a JSON config file for CA certificate signing request (CSR), for example, ca-
csr.json. Be sure to replace the values marked with angle brackets with real values you
want to use.
{
"CN": "kubernetes",
"key": {
"algo": "rsa",
"size": 2048
},
"names":[{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
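5. The step that generates the CA key and certificate is not shown above; a sketch of the usual
cfssl invocation, producing the ca.pem and ca-key.pem files used in the later steps:
../cfssl gencert -initca ca-csr.json | ../cfssljson -bare ca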
6. Create a JSON config file for generating keys and certificates for the API server, for
example, server-csr.json. Be sure to replace the values in angle brackets with real values
you want to use. The <MASTER_CLUSTER_IP> is the service cluster IP for the API server
as described in previous subsection. The sample below also assumes that you are using
cluster.local as the default DNS domain name.
{
"CN": "kubernetes",
"hosts": [
"127.0.0.1",
"<MASTER_IP>",
"<MASTER_CLUSTER_IP>",
"kubernetes",
"kubernetes.default",
"kubernetes.default.svc",
"kubernetes.default.svc.cluster",
"kubernetes.default.svc.cluster.local"
],
"key": {
"algo": "rsa",
"size": 2048
},
"names": [{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
7. Generate the key and certificate for the API server, which are by default saved into the
files server-key.pem and server.pem respectively:
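The command for this step is not shown above; a sketch, assuming the ca.pem, ca-key.pem,
ca-config.json and server-csr.json files created in the earlier steps:
../cfssl gencert -ca=ca.pem -ca-key=ca-key.pem \
--config=ca-config.json -profile=kubernetes \
server-csr.json | ../cfssljson -bare server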
Certificates API
You can use the certificates.k8s.io API to provision x509 certificates to use for authentication as
documented in the Managing TLS in a cluster task page.
Define a default memory resource limit for a namespace, so that every new Pod in that
namespace has a memory resource limit configured.
Define a default CPU resource limit for a namespace, so that every new Pod in that namespace
has a CPU resource limit configured.
Configure Minimum and Maximum Memory Constraints for a Namespace
Define a range of valid memory resource limits for a namespace, so that every new Pod in that
namespace falls within the range you configure.
Define a range of valid CPU resource limits for a namespace, so that every new Pod in that
namespace falls within the range you configure.
This page shows how to configure default memory requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. Once you have a namespace that has a
default memory limit, and you then try to create a Pod with a container that does not specify its
own memory limit, then the control plane assigns the default memory limit to that container.
Kubernetes assigns a default memory request under certain conditions that are explained later
in this topic.
• Killercoda
• Play with Kubernetes
admin/resource/memory-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: mem-limit-range
spec:
limits:
- default:
memory: 512Mi
defaultRequest:
memory: 256Mi
type: Container
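The commands that create the namespace and the LimitRange are not shown above; they could look
like the following. The https://k8s.io/examples/ URL is the usual location for the manifest
named above; substitute a local file path if you saved it yourself.
kubectl create namespace default-mem-example
kubectl apply -f https://k8s.io/examples/admin/resource/memory-defaults.yaml --namespace=default-mem-example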
Now if you create a Pod in the default-mem-example namespace, and any container within that
Pod does not specify its own values for memory request and memory limit, then the control
plane applies default values: a memory request of 256MiB and a memory limit of 512MiB.
Here's an example manifest for a Pod that has one container. The container does not specify a
memory request and limit.
admin/resource/memory-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo
spec:
containers:
- name: default-mem-demo-ctr
image: nginx
The output shows that the Pod's container has a memory request of 256 MiB and a memory
limit of 512 MiB. These are the default values specified by the LimitRange.
containers:
- image: nginx
imagePullPolicy: Always
name: default-mem-demo-ctr
resources:
limits:
memory: 512Mi
requests:
memory: 256Mi
admin/resource/memory-defaults-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo-2
spec:
containers:
- name: default-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1Gi"
The output shows that the container's memory request is set to match its memory limit. Notice
that the container was not assigned the default memory request value of 256Mi.
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
admin/resource/memory-defaults-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo-3
spec:
containers:
- name: default-mem-demo-3-ctr
image: nginx
resources:
requests:
memory: "128Mi"
The output shows that the container's memory request is set to the value specified in the
container's manifest. The container is limited to use no more than 512MiB of memory, which
matches the default memory limit for the namespace.
resources:
limits:
memory: 512Mi
requests:
memory: 128Mi
Note: A LimitRange does not check the consistency of the default values it applies. This means
that a default value for the limit that is set by LimitRange may be less than the request value
specified for the container in the spec that a client submits to the API server. If that happens,
the final Pod will not be schedulable. See Constraints on resource limits and requests for more
details.
Motivation for default memory limits and requests
If your namespace has a memory resource quota configured, it is helpful to have a default value
in place for memory limit. Here are three of the restrictions that a resource quota imposes on a
namespace:
• For every Pod that runs in the namespace, the Pod and each of its containers must have a
memory limit. (If you specify a memory limit for every container in a Pod, Kubernetes
can infer the Pod-level memory limit by adding up the limits for its containers).
• Memory limits apply a resource reservation on the node where the Pod in question is
scheduled. The total amount of memory reserved for all Pods in the namespace must not
exceed a specified limit.
• The total amount of memory actually used by all Pods in the namespace must also not
exceed a specified limit.
If any Pod in that namespace that includes a container does not specify its own memory limit,
the control plane applies the default memory limit to that container, and the Pod can be allowed
to run in a namespace that is restricted by a memory ResourceQuota.
Clean up
Delete your namespace:
What's next
For cluster administrators
This page shows how to configure default CPU requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. If you create a Pod within a namespace
that has a default CPU limit, and any container in that Pod does not specify its own CPU limit,
then the control plane assigns the default CPU limit to that container.
Kubernetes assigns a default CPU request, but only under certain conditions that are explained
later in this page.
• Killercoda
• Play with Kubernetes
If you're not already familiar with what Kubernetes means by 1.0 CPU, read meaning of CPU.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
admin/resource/cpu-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-limit-range
spec:
limits:
- default:
cpu: 1
defaultRequest:
cpu: 0.5
type: Container
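As in the memory example, the commands that create the namespace and the LimitRange are not
shown above; a sketch (substitute a local file path for the manifest if needed):
kubectl create namespace default-cpu-example
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-defaults.yaml --namespace=default-cpu-example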
Now if you create a Pod in the default-cpu-example namespace, and any container in that Pod
does not specify its own values for CPU request and CPU limit, then the control plane applies
default values: a CPU request of 0.5 and a default CPU limit of 1.
Here's a manifest for a Pod that has one container. The container does not specify a CPU
request and limit.
admin/resource/cpu-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo
spec:
containers:
- name: default-cpu-demo-ctr
image: nginx
The output shows that the Pod's only container has a CPU request of 500m cpu (which you can
read as “500 millicpu”), and a CPU limit of 1 cpu. These are the default values specified by the
LimitRange.
containers:
- image: nginx
imagePullPolicy: Always
name: default-cpu-demo-ctr
resources:
limits:
cpu: "1"
requests:
cpu: 500m
What if you specify a container's limit, but not its
request?
Here's a manifest for a Pod that has one container. The container specifies a CPU limit, but not
a request:
admin/resource/cpu-defaults-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo-2
spec:
containers:
- name: default-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1"
The output shows that the container's CPU request is set to match its CPU limit. Notice that the
container was not assigned the default CPU request value of 0.5 cpu:
resources:
limits:
cpu: "1"
requests:
cpu: "1"
admin/resource/cpu-defaults-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo-3
spec:
containers:
- name: default-cpu-demo-3-ctr
image: nginx
resources:
requests:
cpu: "0.75"
The output shows that the container's CPU request is set to the value you specified at the time
you created the Pod (in other words: it matches the manifest). However, the same container's
CPU limit is set to 1 cpu, which is the default CPU limit for that namespace.
resources:
limits:
cpu: "1"
requests:
cpu: 750m
• For every Pod that runs in the namespace, each of its containers must have a CPU limit.
• CPU limits apply a resource reservation on the node where the Pod in question is
scheduled. The total amount of CPU that is reserved for use by all Pods in the namespace
must not exceed a specified limit.
If any Pod in that namespace that includes a container does not specify its own CPU limit, the
control plane applies the default CPU limit to that container, and the Pod can be allowed to run
in a namespace that is restricted by a CPU ResourceQuota.
Clean up
Delete your namespace:
This page shows how to set minimum and maximum values for memory used by containers
running in a namespace. You specify minimum and maximum memory values in a LimitRange
object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot be created
in the namespace.
• Killercoda
• Play with Kubernetes
Each node in your cluster must have at least 1 GiB of memory available for Pods.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
admin/resource/memory-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: mem-min-max-demo-lr
spec:
limits:
- max:
memory: 1Gi
min:
memory: 500Mi
type: Container
The output shows the minimum and maximum memory constraints as expected. But notice that
even though you didn't specify default values in the configuration file for the LimitRange, they
were created automatically.
limits:
- default:
memory: 1Gi
defaultRequest:
memory: 1Gi
max:
memory: 1Gi
min:
memory: 500Mi
type: Container
Now whenever you define a Pod within the constraints-mem-example namespace, Kubernetes
performs these steps:
• If any container in that Pod does not specify its own memory request and limit, the
control plane assigns the default memory request and limit to that container.
• Verify that every container in that Pod requests at least 500 MiB of memory.
• Verify that every container in that Pod requests no more than 1024 MiB (1 GiB) of
memory.
Here's a manifest for a Pod that has one container. Within the Pod spec, the sole container
specifies a memory request of 600 MiB and a memory limit of 800 MiB. These satisfy the
minimum and maximum memory constraints imposed by the LimitRange.
admin/resource/memory-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo
spec:
containers:
- name: constraints-mem-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "600Mi"
Verify that the Pod is running and that its container is healthy:
The output shows that the container within that Pod has a memory request of 600 MiB and a
memory limit of 800 MiB. These satisfy the constraints imposed by the LimitRange for this
namespace:
resources:
limits:
memory: 800Mi
requests:
memory: 600Mi
admin/resource/memory-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-2
spec:
containers:
- name: constraints-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1.5Gi"
requests:
memory: "800Mi"
The output shows that the Pod does not get created, because it defines a container that requests
more memory than is allowed:
admin/resource/memory-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-3
spec:
containers:
- name: constraints-mem-demo-3-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "100Mi"
The output shows that the Pod does not get created, because it defines a container that requests
less memory than the enforced minimum:
admin/resource/memory-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-4
spec:
containers:
- name: constraints-mem-demo-4-ctr
image: nginx
The output shows that the Pod's only container has a memory request of 1 GiB and a memory
limit of 1 GiB. How did that container get those values?
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
Because your Pod did not define any memory request and limit for that container, the cluster
applied a default memory request and limit from the LimitRange.
This means that the definition of that Pod shows those values. You can check it using kubectl
describe:
At this point, your Pod might be running or it might not be running. Recall that a prerequisite
for this task is that your Nodes have at least 1 GiB of memory. If each of your Nodes has only 1
GiB of memory, then there is not enough allocatable memory on any Node to accommodate a
memory request of 1 GiB. If you happen to be using Nodes with 2 GiB of memory, then you
probably have enough space to accommodate the 1 GiB request.
• Each Node in a cluster has 2 GiB of memory. You do not want to accept any Pod that
requests more than 2 GiB of memory, because no Node in the cluster can support the
request.
• A cluster is shared by your production and development departments. You want to allow
production workloads to consume up to 8 GiB of memory, but you want development
workloads to be limited to 512 MiB. You create separate namespaces for production and
development, and you apply memory constraints to each namespace.
Clean up
Delete your namespace:
kubectl delete namespace constraints-mem-example
What's next
For cluster administrators
This page shows how to set minimum and maximum values for the CPU resources used by
containers and Pods in a namespace. You specify minimum and maximum CPU values in a
LimitRange object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot
be created in the namespace.
• Killercoda
• Play with Kubernetes
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
admin/resource/cpu-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-min-max-demo-lr
spec:
limits:
- max:
cpu: "800m"
min:
cpu: "200m"
type: Container
The output shows the minimum and maximum CPU constraints as expected. But notice that
even though you didn't specify default values in the configuration file for the LimitRange, they
were created automatically.
limits:
- default:
cpu: 800m
defaultRequest:
cpu: 800m
max:
cpu: 800m
min:
cpu: 200m
type: Container
Now whenever you create a Pod in the constraints-cpu-example namespace (or some other
client of the Kubernetes API creates an equivalent Pod), Kubernetes performs these steps:
• If any container in that Pod does not specify its own CPU request and limit, the control
plane assigns the default CPU request and limit to that container.
• Verify that every container in that Pod specifies a CPU request that is greater than or
equal to 200 millicpu.
• Verify that every container in that Pod specifies a CPU limit that is less than or equal to
800 millicpu.
Note: When creating a LimitRange object, you can specify limits on huge-pages or GPUs as
well. However, when both default and defaultRequest are specified on these resources, the two
values must be the same.
Here's a manifest for a Pod that has one container. The container manifest specifies a CPU
request of 500 millicpu and a CPU limit of 800 millicpu. These satisfy the minimum and
maximum CPU constraints imposed by the LimitRange for this namespace.
admin/resource/cpu-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo
spec:
containers:
- name: constraints-cpu-demo-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "500m"
Verify that the Pod is running and that its container is healthy:
The output shows that the Pod's only container has a CPU request of 500 millicpu and CPU
limit of 800 millicpu. These satisfy the constraints imposed by the LimitRange.
resources:
limits:
cpu: 800m
requests:
cpu: 500m
admin/resource/cpu-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-2
spec:
containers:
- name: constraints-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1.5"
requests:
cpu: "500m"
The output shows that the Pod does not get created, because it defines an unacceptable
container. That container is not acceptable because it specifies a CPU limit that is too large:
admin/resource/cpu-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-3
spec:
containers:
- name: constraints-cpu-demo-3-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "100m"
The output shows that the Pod does not get created, because it defines an unacceptable
container. That container is not acceptable because it specifies a CPU request that is lower than
the enforced minimum:
admin/resource/cpu-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-4
spec:
containers:
- name: constraints-cpu-demo-4-ctr
image: vish/stress
resources:
limits:
cpu: 800m
requests:
cpu: 800m
Because that container did not specify its own CPU request and limit, the control plane applied
the default CPU request and limit from the LimitRange for this namespace.
At this point, your Pod may or may not be running. Recall that a prerequisite for this task is
that your Nodes must have at least 1 CPU available for use. If each of your Nodes has only 1
CPU, then there might not be enough allocatable CPU on any Node to accommodate a request
of 800 millicpu. If you happen to be using Nodes with 2 CPU, then you probably have enough
CPU to accommodate the 800 millicpu request.
• Each Node in a cluster has 2 CPU. You do not want to accept any Pod that requests more
than 2 CPU, because no Node in the cluster can support the request.
• A cluster is shared by your production and development departments. You want to allow
production workloads to consume up to 3 CPU, but you want development workloads to
be limited to 1 CPU. You create separate namespaces for production and development,
and you apply CPU constraints to each namespace.
Clean up
Delete your namespace:
This page shows how to set quotas for the total amount of memory and CPU that can be used by
all Pods running in a namespace. You specify quotas in a ResourceQuota object.
• Killercoda
• Play with Kubernetes
Create a ResourceQuota
Here is a manifest for an example ResourceQuota:
admin/resource/quota-mem-cpu.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: mem-cpu-demo
spec:
hard:
requests.cpu: "1"
requests.memory: 1Gi
limits.cpu: "2"
limits.memory: 2Gi
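The commands that create a namespace and the ResourceQuota are not shown above; a sketch, using
quota-mem-cpu-example as an illustrative namespace name:
kubectl create namespace quota-mem-cpu-example
kubectl apply -f https://k8s.io/examples/admin/resource/quota-mem-cpu.yaml --namespace=quota-mem-cpu-example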
The ResourceQuota places these requirements on the namespace:
• For every Pod in the namespace, each container must have a memory request, memory
limit, cpu request, and cpu limit.
• The memory request total for all Pods in that namespace must not exceed 1 GiB.
• The memory limit total for all Pods in that namespace must not exceed 2 GiB.
• The CPU request total for all Pods in that namespace must not exceed 1 cpu.
• The CPU limit total for all Pods in that namespace must not exceed 2 cpu.
Create a Pod
Here is a manifest for an example Pod:
admin/resource/quota-mem-cpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: quota-mem-cpu-demo
spec:
containers:
- name: quota-mem-cpu-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
cpu: "800m"
requests:
memory: "600Mi"
cpu: "400m"
Verify that the Pod is running and that its (only) container is healthy:
The output shows the quota along with how much of the quota has been used. You can see that
the memory and CPU requests and limits for your Pod do not exceed the quota.
status:
hard:
limits.cpu: "2"
limits.memory: 2Gi
requests.cpu: "1"
requests.memory: 1Gi
used:
limits.cpu: 800m
limits.memory: 800Mi
requests.cpu: 400m
requests.memory: 600Mi
If you have the jq tool, you can also query (using JSONPath) for just the used values, and
pretty-print that part of the output. For example:
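A sketch of such a command; mem-cpu-demo is the ResourceQuota defined above, and the namespace
name is the illustrative one used earlier:
kubectl get resourcequota mem-cpu-demo --namespace=quota-mem-cpu-example -o jsonpath='{ .status.used }' | jq .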
admin/resource/quota-mem-cpu-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: quota-mem-cpu-demo-2
spec:
containers:
- name: quota-mem-cpu-demo-2-ctr
image: redis
resources:
limits:
memory: "1Gi"
cpu: "800m"
requests:
memory: "700Mi"
cpu: "400m"
In the manifest, you can see that the Pod has a memory request of 700 MiB. Notice that the sum
of the used memory request and this new memory request exceeds the memory request quota:
600 MiB + 700 MiB > 1 GiB.
The second Pod does not get created. The output shows that creating the second Pod would
cause the memory request total to exceed the memory request quota.
Discussion
As you have seen in this exercise, you can use a ResourceQuota to restrict the memory request
total for all Pods running in a namespace. You can also restrict the totals for memory limit, cpu
request, and cpu limit.
Instead of managing total resource use within a namespace, you might want to restrict
individual Pods, or the containers in those Pods. To achieve that kind of limiting, use a
LimitRange.
Clean up
Delete your namespace:
What's next
For cluster administrators
This page shows how to set a quota for the total number of Pods that can run in a Namespace.
You specify quotas in a ResourceQuota object.
• Killercoda
• Play with Kubernetes
Create a ResourceQuota
Here is an example manifest for a ResourceQuota:
admin/resource/quota-pod.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: pod-demo
spec:
hard:
pods: "2"
The output shows that the namespace has a quota of two Pods, and that currently there are no
Pods; that is, none of the quota is used.
spec:
hard:
pods: "2"
status:
hard:
pods: "2"
used:
pods: "0"
admin/resource/quota-pod-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: pod-quota-demo
spec:
selector:
matchLabels:
purpose: quota-demo
replicas: 3
template:
metadata:
labels:
purpose: quota-demo
spec:
containers:
- name: pod-quota-demo
image: nginx
In that manifest, replicas: 3 tells Kubernetes to attempt to create three new Pods, all running the
same application.
The output shows that even though the Deployment specifies three replicas, only two Pods
were created because of the quota you defined earlier:
spec:
...
replicas: 3
...
status:
availableReplicas: 2
...
lastUpdateTime: 2021-04-02T20:57:05Z
message: 'unable to create pods: pods "pod-quota-demo-1650323038-" is forbidden:
exceeded quota: pod-demo, requested: pods=1, used: pods=2, limited: pods=2'
Choice of resource
In this task you have defined a ResourceQuota that limited the total number of Pods, but you
could also limit the total number of other kinds of objects. For example, you might decide to
limit how many CronJobs can exist in a single namespace.
Clean up
Delete your namespace:
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes
NetworkPolicy.
Syntax
Example
The Calico pods begin with calico. Check to make sure each one has a status of Running.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes
NetworkPolicy.
Use Cilium for NetworkPolicy
This page shows how to use Cilium for NetworkPolicy.
• Killercoda
• Play with Kubernetes
To start minikube (minimal version required: v1.5.2 or later), first check your minikube version:
minikube version
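Then start minikube with CNI networking enabled so that Cilium can manage Pod networking (the
exact flag may vary with your minikube version):
minikube start --network-plugin=cni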
For minikube you can install Cilium using its CLI tool. To do so, first download the latest
version of the CLI with the following command:
Then extract the downloaded file to your /usr/local/bin directory with the following command:
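Neither command is reproduced above; a sketch of both, based on the Cilium quick-installation
instructions. The release URL and asset name are illustrative and assume an amd64 Linux host;
check the Cilium documentation for the current commands.
# download the latest stable release of the Cilium CLI
curl -LO https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
# extract the binary into /usr/local/bin
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin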
After running the above commands, you can now install Cilium with the following command:
cilium install
Cilium will then automatically detect the cluster configuration and create and install the
appropriate components for a successful installation. The components are:
• Certificate Authority (CA) in Secret cilium-ca and certificates for Hubble (Cilium's
observability layer).
• Service accounts.
• Cluster roles.
• ConfigMap.
• Agent DaemonSet and an Operator Deployment.
After the installation, you can view the overall status of the Cilium deployment with the cilium
status command. See the expected output of the status command here.
The remainder of the Getting Started Guide explains how to enforce both L3/L4 (i.e., IP address
+ port) security policies, as well as L7 (e.g., HTTP) security policies using an example
application.
A cilium Pod runs on each node in your cluster and enforces network policy on the traffic to/
from Pods on that node using Linux BPF.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes
NetworkPolicy with Cilium. Have fun, and if you have questions, contact us using the Cilium
Slack Channel.
What's next
Once you have installed the Kube-router addon, you can follow the Declare Network Policy to
try out Kubernetes NetworkPolicy.
What's next
Once you have installed Romana, you can follow the Declare Network Policy to try out
Kubernetes NetworkPolicy.
The Weave Net addon for Kubernetes comes with a Network Policy Controller that
automatically monitors Kubernetes for any NetworkPolicy annotations on all namespaces and
configures iptables rules to allow or block traffic as directed by the policies.
Each Node has a weave Pod, and all Pods are Running and 2/2 READY. (2/2 means that each
Pod has weave and weave-npc.)
What's next
Once you have installed the Weave Net addon, you can follow the Declare Network Policy to
try out Kubernetes NetworkPolicy. If you have any question, contact us at #weave-community
on Slack or Weave User Group.
• Killercoda
• Play with Kubernetes
When accessing the Kubernetes API for the first time, use the Kubernetes command-line tool,
kubectl.
To access a cluster, you need to know the location of the cluster and have credentials to access
it. Typically, this is automatically set up when you work through a Getting started guide, or
someone else set up the cluster and provided you with credentials and a location.
Check the location and credentials that kubectl knows about with this command:
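The command is not shown above; it is the standard kubectl subcommand for inspecting your
kubeconfig:
kubectl config view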
kubectl handles locating and authenticating to the API server. If you want to directly access the
REST API with an http client like curl or wget, or a browser, there are multiple ways you can
locate and authenticate against the API server:
1. Run kubectl in proxy mode (recommended). This method is recommended, since it uses
the stored apiserver location and verifies the identity of the API server using a self-signed
cert. No man-in-the-middle (MITM) attack is possible using this method.
2. Alternatively, you can provide the location and credentials directly to the http client. This
works with client code that is confused by proxies. To protect against man in the middle
attacks, you'll need to import a root cert into your browser.
Using the Go or Python client libraries gives you access in a way that is equivalent to kubectl in proxy mode.
The following command runs kubectl in a mode where it acts as a reverse proxy. It handles
locating the API server and authenticating.
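The command itself is not shown above; using the port that the curl example below expects, it
could look like:
kubectl proxy --port=8080 &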
Then you can explore the API with curl, wget, or a browser, like so:
curl http://localhost:8080/api/
{
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.1.149:443"
}
]
}
It is possible to avoid using kubectl proxy by passing an authentication token directly to the
API server, like this:
# Check all possible clusters, as your .KUBECONFIG may have multiple contexts:
kubectl config view -o jsonpath='{"Cluster name\tServer\n"}{range .clusters[*]}{.name}{"\t"}
{.cluster.server}{"\n"}{end}'
# Select name of cluster you want to interact with from above output:
export CLUSTER_NAME="some_server_name"
# Wait for the token controller to populate the secret with a token:
while ! kubectl describe secret default-token | grep -E '^token' >/dev/null; do
echo "waiting for token..." >&2
sleep 1
done
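The remaining steps of this example are not shown above; a sketch of how it could continue,
deriving the API server address from the selected cluster and reading the token from the
default-token Secret that the loop above waits for (the Secret itself is assumed to have been
created earlier in the example):
# Point to the API server of the selected cluster
APISERVER=$(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$CLUSTER_NAME\")].cluster.server}")
# Get the token value
TOKEN=$(kubectl get secret default-token -o jsonpath='{.data.token}' | base64 --decode)
# Explore the API with the token
curl -X GET $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure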
The above example uses the --insecure flag. This leaves it subject to MITM attacks. When
kubectl accesses the cluster it uses a stored root certificate and client certificates to access the
server. (These are installed in the ~/.kube directory). Since cluster certificates are typically self-
signed, it may take special configuration to get your http client to use the root certificate.
On some clusters, the API server does not require authentication; it may serve on localhost, or
be protected by a firewall. There is not a standard for this. Controlling Access to the Kubernetes
API describes how you can configure this as a cluster administrator.
Kubernetes officially supports client libraries for Go, Python, Java, dotnet, JavaScript, and
Haskell. There are other client libraries that are provided and maintained by their authors, not
the Kubernetes team. See client libraries for accessing the API from other languages and how
they authenticate.
Go client
Note: client-go defines its own API objects, so if needed, import API definitions from client-go
rather than from the main repository. For example, import "k8s.io/client-go/kubernetes" is
correct.
The Go client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
package main
import (
"context"
"fmt"
"k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
// uses the current context in kubeconfig
// path-to-kubeconfig -- for example, /root/.kube/config
config, _ := clientcmd.BuildConfigFromFlags("", "<path-to-kubeconfig>")
// creates the clientset
clientset, _ := kubernetes.NewForConfig(config)
// access the API to list pods
pods, _ := clientset.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{})
fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
}
If the application is deployed as a Pod in the cluster, see Accessing the API from within a Pod.
Python client
To use Python client, run the following command: pip install kubernetes. See Python Client
Library page for more installation options.
The Python client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
config.load_kube_config()
v1=client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
Java client
The Java client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
package io.kubernetes.client.examples;
import io.kubernetes.client.ApiClient;
import io.kubernetes.client.ApiException;
import io.kubernetes.client.Configuration;
import io.kubernetes.client.apis.CoreV1Api;
import io.kubernetes.client.models.V1Pod;
import io.kubernetes.client.models.V1PodList;
import io.kubernetes.client.util.ClientBuilder;
import io.kubernetes.client.util.KubeConfig;
import java.io.FileReader;
import java.io.IOException;
/**
* A simple example of how to use the Java API from an application outside a kubernetes cluster
*
* <p>Easiest way to run this: mvn exec:java
* -Dexec.mainClass="io.kubernetes.client.examples.KubeConfigFileClientExample"
*
*/
public class KubeConfigFileClientExample {
public static void main(String[] args) throws IOException, ApiException {

// file path to your KubeConfig
String kubeConfigPath = System.getenv("HOME") + "/.kube/config";

// load the out-of-cluster config, a kubeconfig from the file path above
ApiClient client =
ClientBuilder.kubeconfig(KubeConfig.loadKubeConfig(new FileReader(kubeConfigPath))).build();

// set the global default api-client to the one loaded from the kubeconfig above
Configuration.setDefaultApiClient(client);

// the CoreV1Api loads the default api-client from the global configuration;
// from here, call api.listPodForAllNamespaces(...) as in the upstream example
CoreV1Api api = new CoreV1Api();
}
}
dotnet client
To use the dotnet client, run the following command: dotnet add package KubernetesClient
--version 1.6.1. See the dotnet Client Library page for more installation options. See
https://github.com/kubernetes-client/csharp/releases to see which versions are supported.
The dotnet client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
using System;
using k8s;
namespace simple
{
internal class PodList
{
private static void Main(string[] args)
{
var config = KubernetesClientConfiguration.BuildDefaultConfig();
IKubernetes client = new Kubernetes(config);
Console.WriteLine("Starting Request!");
JavaScript client
To install JavaScript client, run the following command: npm install @kubernetes/client-node.
See https://github.com/kubernetes-client/javascript/releases to see which versions are
supported.
The JavaScript client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
k8sApi.listNamespacedPod('default').then((res) => {
console.log(res.body);
});
Haskell client
The Haskell client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
exampleWithKubeConfig :: IO ()
exampleWithKubeConfig = do
oidcCache <- atomically $ newTVar $ Map.fromList []
(mgr, kcfg) <- mkKubeClientConfig oidcCache $ KubeConfigFile "/path/to/kubeconfig"
dispatchMime
mgr
kcfg
(CoreV1.listPodForAllNamespaces (Accept MimeJSON))
>>= print
What's next
• Accessing the Kubernetes API from a Pod
• Killercoda
• Play with Kubernetes
[
{
"op": "add",
"path": "/status/capacity/example.com~1dongle",
"value": "4"
}
]
Note that Kubernetes does not need to know what a dongle is or what a dongle is for. The
preceding PATCH request tells Kubernetes that your Node has four things that you call
dongles.
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request. Replace <your-node-name>
with the name of your Node:
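The request is not shown above; a sketch, assuming kubectl proxy is listening on its default
port 8001:
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status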
Note: In the preceding request, ~1 is the encoding for the character / in the patch path. The
operation path value in JSON-Patch is interpreted as a JSON-Pointer. For more details, see IETF
RFC 6901, section 3.
"capacity": {
"cpu": "2",
"memory": "2049008Ki",
"example.com/dongle": "4",
Capacity:
cpu: 2
memory: 2049008Ki
example.com/dongle: 4
Now, application developers can create Pods that request a certain number of dongles. See
Assign Extended Resources to a Container.
Discussion
Extended resources are similar to memory and CPU resources. For example, just as a Node has
a certain amount of memory and CPU to be shared by all components running on the Node, it
can have a certain number of dongles to be shared by all components running on the Node.
And just as application developers can create Pods that request a certain amount of memory
and CPU, they can create Pods that request a certain number of dongles.
Extended resources are opaque to Kubernetes; Kubernetes does not know anything about what
they are. Kubernetes knows only that a Node has a certain number of them. Extended resources
must be advertised in integer amounts. For example, a Node can advertise four dongles, but not
4.5 dongles.
Storage example
Suppose a Node has 800 GiB of a special kind of disk storage. You could create a name for the
special storage, say example.com/special-storage. Then you could advertise it in chunks of a
certain size, say 100 GiB. In that case, your Node would advertise that it has eight resources of
type example.com/special-storage.
Capacity:
...
example.com/special-storage: 8
If you want to allow arbitrary requests for special storage, you could advertise special storage
in chunks of size 1 byte. In that case, you would advertise 800Gi resources of type example.com/
special-storage.
Capacity:
...
example.com/special-storage: 800Gi
Then a Container could request any number of bytes of special storage, up to 800Gi.
Clean up
Here is a PATCH request that removes the dongle advertisement from a Node.
[
{
"op": "remove",
"path": "/status/capacity/example.com~1dongle",
}
]
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request. Replace <your-node-name>
with the name of your Node:
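As before, the request itself is not shown; it mirrors the earlier PATCH, with the remove
operation and the default kubectl proxy port 8001:
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "remove", "path": "/status/capacity/example.com~1dongle"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status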
What's next
For application developers
◦ Killercoda
◦ Play with Kubernetes
To check the version, enter kubectl version.
• This guide assumes your nodes use the AMD64 or Intel 64 CPU architecture.
If you see "dns-autoscaler" in the output, DNS horizontal autoscaling is already enabled, and
you can skip to Tuning autoscaling parameters.
If you don't see a Deployment for DNS services, you can also look for it by name:
Deployment/<your-deployment-name>
where <your-deployment-name> is the name of your DNS Deployment. For example, if the
name of your Deployment for DNS is coredns, your scale target is Deployment/coredns.
Note: CoreDNS is the default DNS service for Kubernetes. CoreDNS sets the label k8s-
app=kube-dns so that it can work in clusters that originally used kube-dns.
admin/dns/dns-horizontal-autoscaler.yaml
kind: ServiceAccount
apiVersion: v1
metadata:
name: kube-dns-autoscaler
namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: system:kube-dns-autoscaler
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["list", "watch"]
- apiGroups: [""]
resources: ["replicationcontrollers/scale"]
verbs: ["get", "update"]
- apiGroups: ["apps"]
resources: ["deployments/scale", "replicasets/scale"]
verbs: ["get", "update"]
# Remove the configmaps rule once below issue is fixed:
# kubernetes-incubator/cluster-proportional-autoscaler#16
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: system:kube-dns-autoscaler
subjects:
- kind: ServiceAccount
name: kube-dns-autoscaler
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:kube-dns-autoscaler
apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-dns-autoscaler
namespace: kube-system
labels:
k8s-app: kube-dns-autoscaler
kubernetes.io/cluster-service: "true"
spec:
selector:
matchLabels:
k8s-app: kube-dns-autoscaler
template:
metadata:
labels:
k8s-app: kube-dns-autoscaler
spec:
priorityClassName: system-cluster-critical
securityContext:
seccompProfile:
type: RuntimeDefault
supplementalGroups: [ 65534 ]
fsGroup: 65534
nodeSelector:
kubernetes.io/os: linux
containers:
- name: autoscaler
image: registry.k8s.io/cpa/cluster-proportional-autoscaler:1.8.4
resources:
requests:
cpu: "20m"
memory: "10Mi"
command:
- /cluster-proportional-autoscaler
- --namespace=kube-system
- --configmap=kube-dns-autoscaler
# Should keep target in sync with cluster/addons/dns/kube-dns.yaml.base
- --target=<SCALE_TARGET>
# When cluster is using large nodes(with more cores), "coresPerReplica" should dominate.
# If using small nodes, "nodesPerReplica" should dominate.
- --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":
16,"preventSinglePointFailure":true,"includeUnschedulableNodes":true}}
- --logtostderr=true
- --v=2
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
serviceAccountName: kube-dns-autoscaler
Go to the directory that contains your configuration file, and enter this command to create the
Deployment:
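The command itself is not shown above; assuming the manifest was saved under the filename shown
earlier, it could look like:
kubectl apply -f dns-horizontal-autoscaler.yaml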
deployment.apps/dns-autoscaler created
linear: '{"coresPerReplica":256,"min":1,"nodesPerReplica":16}'
Modify the fields according to your needs. The "min" field indicates the minimal number of DNS
backends. The actual number of backends is calculated using this equation:
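The equation is not reproduced above; for the linear mode of cluster-proportional-autoscaler it
is:
replicas = max( ceil( cores * 1/coresPerReplica ), ceil( nodes * 1/nodesPerReplica ) )
The result is then kept at or above the configured "min" value.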
Note that the values of both coresPerReplica and nodesPerReplica are floats.
The idea is that when a cluster is using nodes that have many cores, coresPerReplica dominates.
When a cluster is using nodes that have fewer cores, nodesPerReplica dominates.
There are other supported scaling patterns. For details, see cluster-proportional-autoscaler.
deployment.apps/dns-autoscaler scaled
This option works if dns-autoscaler is under your own control, which means no one will re-
create it:
Option 3: Delete the dns-autoscaler manifest file from the master node
This option works if dns-autoscaler is under control of the (deprecated) Addon Manager, and
you have write access to the master node.
Sign in to the master node and delete the corresponding manifest file. The common path for this
dns-autoscaler is:
/etc/kubernetes/addons/dns-horizontal-autoscaler/dns-horizontal-autoscaler.yaml
After the manifest file is deleted, the Addon Manager will delete the dns-autoscaler
Deployment.
• An autoscaler Pod runs a client that polls the Kubernetes API server for the number of
nodes and cores in the cluster.
• A desired replica count is calculated and applied to the DNS backends based on the
current schedulable nodes and cores and the given scaling parameters.
• The scaling parameters and data points are provided via a ConfigMap to the autoscaler,
and it refreshes its parameters table every poll interval to be up to date with the latest
desired scaling parameters.
• Changes to the scaling parameters are allowed without rebuilding or restarting the
autoscaler Pod.
• The autoscaler provides a controller interface to support two control patterns: linear and
ladder.
What's next
• Read about Guaranteed Scheduling For Critical Add-On Pods.
• Learn more about the implementation of cluster-proportional-autoscaler.
• Killercoda
• Play with Kubernetes
The pre-installed default StorageClass may not fit well with your expected workload; for
example, it might provision storage that is too expensive. If this is the case, you can either
change the default StorageClass or disable it completely to avoid dynamic provisioning of
storage.
Deleting the default StorageClass may not work, as it may be re-created automatically by the
addon manager running in your cluster. Please consult the docs for your installation for details
about addon manager and how to disable individual addons.
Please note that at most one StorageClass can be marked as default. If two or more of
them are marked as default, a PersistentVolumeClaim without storageClassName
explicitly specified cannot be created.
What's next
• Learn more about PersistentVolumes.
This page shows how to migrate nodes to use event based updates for container status. The
event-based implementation reduces node resource consumption by the kubelet, compared to
the legacy approach that relies on polling. You may know this feature as evented Pod lifecycle
event generator (PLEG). That's the name used internally within the Kubernetes project for a key
implementation detail.
3. Start the container runtime with the container event generation enabled.
◦ Containerd
◦ CRI-O
Version 1.7+
Version 1.26+
Check whether CRI-O is already configured to emit CRI events by verifying that its
configuration contains:
enable_pod_events = true
To enable it, start the CRI-O daemon with the flag --enable-pod-events=true or use a
drop-in config with the following lines:
[crio.runtime]
enable_pod_events = true
Your Kubernetes server must be at or later than version 1.26. To check the version, enter
kubectl version.
4. Verify that the kubelet is using event-based container stage change monitoring. To check,
look for the term EventedPLEG in the kubelet logs.
If you have set --v to 4 and above, you might see more entries that indicate that the
kubelet is using event-based container state monitoring.
I0314 11:12:42.009542 1110177 evented.go:238] "Evented PLEG: Generated pod status from
the received event" podUID=3b2c6172-b112-447a-ba96-94e7022912dc
I0314 11:12:44.623326 1110177 evented.go:238] "Evented PLEG: Generated pod status from
the received event" podUID=b3fba5ea-a8c5-4b76-8f43-481e17e8ec40
I0314 11:12:44.714564 1110177 evented.go:238] "Evented PLEG: Generated pod status from
the received event" podUID=b3fba5ea-a8c5-4b76-8f43-481e17e8ec40
What's next
• Learn more about the design in the Kubernetes Enhancement Proposal (KEP): Kubelet
Evented PLEG for Better Performance.
• Killercoda
• Play with Kubernetes
kubectl get pv
This list also includes the name of the claims that are bound to each volume for easier
identification of dynamically provisioned volumes.
Note:
On Windows, you must double quote any JSONPath template that contains spaces (not
single quote as shown above for bash). This in turn means that you must use a single
quote or escaped double quote around any literals in the template. For example:
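A sketch of such a command, assuming the reclaim-policy patch used in this task and a placeholder volume name:
kubectl patch pv <your-pv-name> -p "{\"spec\":{\"persistentVolumeReclaimPolicy\":\"Retain\"}}"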
Afterwards, verify that your chosen PersistentVolume has the right policy:
kubectl get pv
In the output, you can see that the volume bound to claim default/claim3 now has reclaim
policy Retain. It will not be automatically deleted when a user deletes claim
default/claim3.
What's next
• Learn more about PersistentVolumes.
• Learn more about PersistentVolumeClaims.
References
• PersistentVolume
◦ Pay attention to the .spec.persistentVolumeReclaimPolicy field of PersistentVolume.
• PersistentVolumeClaim
Since cloud providers develop and release at a different pace compared to the Kubernetes
project, abstracting the provider-specific code to the cloud-controller-manager binary allows
cloud vendors to evolve independently from the core Kubernetes code.
Administration
Requirements
Every cloud has its own set of requirements for running its cloud provider integration;
these should not be too different from the requirements for running kube-controller-manager.
As a general rule of thumb you'll need:
• Cloud authentication/authorization: your cloud may require a token or IAM rules to allow
access to its APIs.
• Kubernetes authentication/authorization: cloud-controller-manager may need RBAC rules
set to speak to the Kubernetes API server.
• High availability: like kube-controller-manager, you may want a highly available setup for
the cloud controller manager using leader election (on by default).
Running cloud-controller-manager
Keep in mind that setting up your cluster to use the cloud controller manager will change your
cluster behaviour in a few ways:
• Node controller - responsible for updating Kubernetes Nodes using cloud APIs and
deleting Kubernetes Nodes that were deleted on your cloud.
• Service controller - responsible for configuring load balancers on your cloud for Services
of type LoadBalancer.
• Route controller - responsible for setting up network routes on your cloud.
• Any other features you would like to implement if you are running an out-of-tree
provider.
Examples
If you are using a cloud that is currently supported in Kubernetes core and would like to adopt
cloud controller manager, see the cloud controller manager in kubernetes core.
For cloud controller managers not in Kubernetes core, you can find the respective projects in
repositories maintained by cloud vendors or by SIGs.
For providers already in Kubernetes core, you can run the in-tree cloud controller manager as a
DaemonSet in your cluster, use the following as a guideline:
admin/cloud/ccm-example.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:cloud-controller-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cloud-controller-manager
  template:
    metadata:
      labels:
        k8s-app: cloud-controller-manager
    spec:
      serviceAccountName: cloud-controller-manager
      containers:
      - name: cloud-controller-manager
        # for in-tree providers we use registry.k8s.io/cloud-controller-manager
        # this can be replaced with any other image for out-of-tree providers
        image: registry.k8s.io/cloud-controller-manager:v1.8.0
        command:
        - /usr/local/bin/cloud-controller-manager
        - --cloud-provider=[YOUR_CLOUD_PROVIDER] # Add your own cloud provider here!
        - --leader-elect=true
        - --use-service-account-credentials
        # these flags will vary for every cloud provider
        - --allocate-node-cidrs=true
        - --configure-cloud-routes=true
        - --cluster-cidr=172.17.0.0/16
      tolerations:
      # this is required so CCM can bootstrap itself
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      # these tolerations are to have the daemonset runnable on control plane nodes
      # remove them if your control plane nodes should not run pods
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      # this is to restrict CCM to only run on master nodes
      # the node selector may vary depending on your cluster setup
      nodeSelector:
        node-role.kubernetes.io/master: ""
Limitations
Running the cloud controller manager comes with a few possible limitations. Although these
limitations are being addressed in upcoming releases, it's important to be aware of them for
production workloads.
Cloud controller manager does not implement any of the volume controllers found in kube-
controller-manager as the volume integrations also require coordination with kubelets. As we
evolve CSI (container storage interface) and add stronger support for flex volume plugins,
necessary support will be added to cloud controller manager so that clouds can fully integrate
with volumes. Learn more about out-of-tree CSI volume plugins here.
Scalability
The cloud-controller-manager queries your cloud provider's APIs to retrieve information for all
nodes. For very large clusters, consider possible bottlenecks such as resource requirements and
API rate limiting.
The goal of the cloud controller manager project is to decouple development of cloud features
from the core Kubernetes project. Unfortunately, many aspects of the Kubernetes project make
assumptions that cloud provider features are tightly integrated into the project. As a result,
adopting this new architecture can create several situations where a request is made for
information from a cloud provider, but the cloud controller manager may not be able to return
that information until the original request is complete.
A good example of this is the TLS bootstrapping feature in the Kubelet. TLS bootstrapping
assumes that the Kubelet has the ability to ask the cloud provider (or a local metadata service)
for all its address types (private, public, etc) but cloud controller manager cannot set a node's
address types without being initialized in the first place which requires that the kubelet has TLS
certificates to communicate with the apiserver.
As this initiative evolves, changes will be made to address these issues in upcoming releases.
What's next
To build and develop your own cloud controller manager, read Developing Cloud Controller
Manager.
You may be interested in using this capability if any of the below are true:
• API calls to a cloud provider service are required to retrieve authentication information
for a registry.
• Credentials have short expiration times and requesting new credentials frequently is
required.
• Storing registry credentials on disk or in imagePullSecrets is not acceptable.
This guide demonstrates how to configure the kubelet's image credential provider plugin
mechanism.
Your Kubernetes server must be at or later than version v1.26. To check the version, enter
kubectl version.
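The kubelet is pointed at the credential provider configuration file and at the directory holding the provider binaries via two flags that are referenced throughout this page; a sketch of the relevant kubelet arguments (the file paths are illustrative):
kubelet --image-credential-provider-config=/etc/kubernetes/credential-provider-config.yaml \
  --image-credential-provider-bin-dir=/usr/local/bin/credential-providers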
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
# providers is a list of credential provider helper plugins that will be enabled by the kubelet.
# Multiple providers may match against a single image, in which case credentials
# from all providers will be returned to the kubelet. If multiple providers are called
# for a single image, the results are combined. If providers return overlapping
# auth keys, the value from the provider earlier in this list is used.
providers:
  # name is the required name of the credential provider. It must match the name of the
  # provider executable as seen by the kubelet. The executable must be in the kubelet's
  # bin directory (set by the --image-credential-provider-bin-dir flag).
  - name: ecr-credential-provider
    # matchImages is a required list of strings used to match against images in order to
    # determine if this provider should be invoked. If one of the strings matches the
    # requested image from the kubelet, the plugin will be invoked and given a chance
    # to provide credentials. Images are expected to contain the registry domain
    # and URL path.
    #
    # Each entry in matchImages is a pattern which can optionally contain a port and a path.
    # Globs can be used in the domain, but not in the port or the path. Globs are supported
    # as subdomains like '*.k8s.io' or 'k8s.*.io', and top-level-domains such as 'k8s.*'.
    # Matching partial subdomains like 'app*.k8s.io' is also supported. Each glob can only match
    # a single subdomain segment, so `*.io` does **not** match `*.k8s.io`.
    #
    # A match exists between an image and a matchImage when all of the below are true:
    # - Both contain the same number of domain parts and each part matches.
    # - The URL path of a matchImages entry must be a prefix of the target image URL path.
    # - If the matchImages contains a port, then the port must match in the image as well.
    #
    # Example values of matchImages:
    #   - 123456789.dkr.ecr.us-east-1.amazonaws.com
    #   - *.azurecr.io
    #   - gcr.io
    #   - *.*.registry.io
    #   - registry.io:8080/path
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
      - "*.dkr.ecr.*.amazonaws.com.cn"
      - "*.dkr.ecr-fips.*.amazonaws.com"
      - "*.dkr.ecr.us-iso-east-1.c2s.ic.gov"
      - "*.dkr.ecr.us-isob-east-1.sc2s.sgov.gov"
    # defaultCacheDuration is the default duration the plugin will cache credentials in-memory
    # if a cache duration is not provided in the plugin response. This field is required.
    defaultCacheDuration: "12h"
    # Required input version of the exec CredentialProviderRequest. The returned CredentialProviderResponse
    # MUST use the same encoding version as the input. Current supported values are:
    # - credentialprovider.kubelet.k8s.io/v1
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    # Arguments to pass to the command when executing it.
    # +optional
    # args:
    #   - --example-argument
    # Env defines additional environment variables to expose to the process. These
    # are unioned with the host's environment, as well as variables client-go uses
    # to pass argument to the plugin.
    # +optional
    env:
      - name: AWS_PROFILE
        value: example_profile
The providers field is a list of enabled plugins used by the kubelet. Each entry has a few
required fields:
• name: the name of the plugin which MUST match the name of the executable binary that
exists in the directory passed into --image-credential-provider-bin-dir.
• matchImages: a list of strings used to match against images in order to determine if this
provider should be invoked. More on this below.
• defaultCacheDuration: the default duration the kubelet will cache credentials in-memory
if a cache duration was not specified by the plugin.
• apiVersion: the API version that the kubelet and the exec plugin will use when
communicating.
Each credential provider can also be given optional args and environment variables. Consult
the plugin implementors to determine what set of arguments and environment variables are
required for a given plugin.
The matchImages field for each credential provider is used by the kubelet to determine whether
a plugin should be invoked for a given image that a Pod is using. Each entry in matchImages is
an image pattern which can optionally contain a port and a path. Globs can be used in the
domain, but not in the port or the path. Globs are supported as subdomains like *.k8s.io or
k8s.*.io, and top-level domains such as k8s.*. Matching partial subdomains like app*.k8s.io is also
supported. Each glob can only match a single subdomain segment, so *.io does NOT match
*.k8s.io.
A match exists between an image name and a matchImages entry when all of the below are true:
• Both contain the same number of domain parts and each part matches.
• The URL path of the matchImages entry must be a prefix of the target image URL path.
• If the matchImages entry contains a port, then the port must match in the image as well.
Some example values of matchImages patterns are:
• 123456789.dkr.ecr.us-east-1.amazonaws.com
• *.azurecr.io
• gcr.io
• *.*.registry.io
• foo.registry.io:8080/path
What's next
• Read the details about CredentialProviderConfig in the kubelet configuration API (v1)
reference.
• Read the kubelet credential provider API reference (v1).
Configure Quotas for API Objects
This page shows how to configure quotas for API objects, including PersistentVolumeClaims
and Services. A quota restricts the number of objects, of a particular type, that can be created in
a namespace. You specify quotas in a ResourceQuota object.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
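For example (the namespace name quota-object-example is illustrative and is reused in the commands later in this task):
kubectl create namespace quota-object-example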
Create a ResourceQuota
Here is the configuration file for a ResourceQuota object:
admin/resource/quota-objects.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-quota-demo
spec:
  hard:
    persistentvolumeclaims: "1"
    services.loadbalancers: "2"
    services.nodeports: "0"
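To create the ResourceQuota and then view its detailed state, commands along these lines can be used (the manifest URL and namespace follow the illustrative example file path and namespace name above):
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects.yaml --namespace=quota-object-example
kubectl get resourcequota object-quota-demo --namespace=quota-object-example --output=yaml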
The output shows the hard limits together with the current usage, which is initially zero:
status:
  hard:
    persistentvolumeclaims: "1"
    services.loadbalancers: "2"
    services.nodeports: "0"
  used:
    persistentvolumeclaims: "0"
    services.loadbalancers: "0"
    services.nodeports: "0"
Create a PersistentVolumeClaim
Here is the configuration file for a PersistentVolumeClaim object:
admin/resource/quota-objects-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-quota-demo
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
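Create the PersistentVolumeClaim and check its status; the commands likely look like this (again using the illustrative example path and namespace):
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects-pvc.yaml --namespace=quota-object-example
kubectl get persistentvolumeclaims --namespace=quota-object-example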
The output shows that the PersistentVolumeClaim exists and has status Pending:
NAME STATUS
pvc-quota-demo Pending
Here is the configuration file for a second PersistentVolumeClaim:
admin/resource/quota-objects-pvc-2.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-quota-demo-2
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
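Attempt to create the second claim with a command along these lines (illustrative path and namespace):
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects-pvc-2.yaml --namespace=quota-object-example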
The output shows that the second PersistentVolumeClaim was not created, because it would
have exceeded the quota for the namespace.
Notes
These are the strings used to identify API resources that can be constrained by quotas:
Clean up
Delete your namespace:
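Assuming the illustrative namespace name used earlier in this task:
kubectl delete namespace quota-object-example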
What's next
For cluster administrators
Kubernetes keeps many aspects of how pods execute on nodes abstracted from the user. This is
by design. However, some workloads require stronger guarantees in terms of latency and/or
performance in order to operate acceptably. The kubelet provides methods to enable more
complex workload placement policies while keeping the abstraction free from explicit
placement directives.
For detailed information on resource management, please refer to the Resource Management
for Pods and Containers documentation.
Your Kubernetes server must be at or later than version v1.26. To check the version, enter
kubectl version.
If you are running an older version of Kubernetes, please look at the documentation for the
version you are actually running.
CPU Management Policies
By default, the kubelet uses CFS quota to enforce pod CPU limits. When the node runs many
CPU-bound pods, the workload can move to different CPU cores depending on whether the pod
is throttled and which CPU cores are available at scheduling time. Many workloads are not
sensitive to this migration and thus work fine without any intervention.
However, in workloads where CPU cache affinity and scheduling latency significantly affect
workload performance, the kubelet allows alternative CPU management policies to determine
some placement preferences on the node.
Configuration
The CPU Manager policy is set with the --cpu-manager-policy kubelet flag or the
cpuManagerPolicy field in KubeletConfiguration. There are two supported policies: none (the
default) and static.
The CPU manager periodically writes resource updates through the CRI in order to reconcile
in-memory CPU assignments with cgroupfs. The reconcile frequency is set through a new
Kubelet configuration value --cpu-manager-reconcile-period. If not specified, it defaults to the
same duration as --node-status-update-frequency.
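A minimal KubeletConfiguration sketch that selects the static policy and sets an explicit reconcile period (the 10s value is only an example; when using static, the kubelet also requires a non-zero CPU reservation, as noted later in this section):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s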
The behavior of the static policy can be fine-tuned using the --cpu-manager-policy-options flag.
The flag takes a comma-separated list of key=value policy options. If you disable the
CPUManagerPolicyOptions feature gate then you cannot fine-tune CPU manager policies. In
that case, the CPU manager operates only using its default settings.
In addition to the top-level CPUManagerPolicyOptions feature gate, the policy options are split
into two groups: alpha quality (hidden by default) and beta quality (visible by default). The
groups are guarded respectively by the CPUManagerPolicyAlphaOptions and
CPUManagerPolicyBetaOptions feature gates. Diverging from the Kubernetes standard, these
feature gates guard groups of options, because it would have been too cumbersome to add a
feature gate for each individual option.
Since the CPU manager policy can only be applied when the kubelet spawns new pods, simply
changing from "none" to "static" won't apply to existing pods. So in order to properly change
the CPU manager policy on a node, perform the following steps:
1. Drain the node.
2. Stop the kubelet.
3. Remove the old CPU manager state file (by default /var/lib/kubelet/cpu_manager_state).
4. Edit the kubelet configuration to change the CPU manager policy to the desired value.
5. Start the kubelet.
Repeat this process for every node that needs its CPU manager policy changed. Skipping this
process will result in the kubelet crashlooping with the following error:
could not restore state from checkpoint: configured policy "static" differs from state checkpoint
policy "none", please drain this node and delete the CPU manager checkpoint file
"/var/lib/kubelet/cpu_manager_state" before restarting Kubelet
None policy
The none policy explicitly enables the existing default CPU affinity scheme, providing no
affinity beyond what the OS scheduler does automatically. Limits on CPU usage for
Guaranteed pods and Burstable pods are enforced using CFS quota.
Static policy
The static policy allows containers in Guaranteed pods with integer CPU requests access to
exclusive CPUs on the node. This exclusivity is enforced using the cpuset cgroup controller.
Note: System services such as the container runtime and the kubelet itself can continue to run
on these exclusive CPUs. The exclusivity only extends to other pods.
Note: CPU Manager doesn't support offlining and onlining of CPUs at runtime. Also, if the set
of online CPUs changes on the node, the node must be drained and CPU manager manually
reset by deleting the state file cpu_manager_state in the kubelet root directory.
This policy manages a shared pool of CPUs that initially contains all CPUs in the node. The
amount of exclusively allocatable CPUs is equal to the total number of CPUs in the node minus
any CPU reservations by the kubelet --kube-reserved or --system-reserved options. From 1.17,
the CPU reservation list can be specified explicitly by kubelet --reserved-cpus option. The
explicit CPU list specified by --reserved-cpus takes precedence over the CPU reservation
specified by --kube-reserved and --system-reserved. CPUs reserved by these options are taken,
in integer quantity, from the initial shared pool in ascending order by physical core ID. This
shared pool is the set of CPUs on which any containers in BestEffort and Burstable pods run.
Containers in Guaranteed pods with fractional CPU requests also run on CPUs in the shared
pool. Only containers that are both part of a Guaranteed pod and have integer CPU requests are
assigned exclusive CPUs.
Note: The kubelet requires a CPU reservation greater than zero be made using either --kube-
reserved and/or --system-reserved or --reserved-cpus when the static policy is enabled. This is
because zero CPU reservation would allow the shared pool to become empty.
As Guaranteed pods whose containers fit the requirements for being statically assigned are
scheduled to the node, CPUs are removed from the shared pool and placed in the cpuset for the
container. CFS quota is not used to bound the CPU usage of these containers as their usage is
bound by the scheduling domain itself. In other words, the number of CPUs in the container
cpuset is equal to the integer CPU limit specified in the pod spec. This static assignment
increases CPU affinity and decreases context switches due to throttling for the CPU-bound
workload.
spec:
  containers:
  - name: nginx
    image: nginx
This pod runs in the BestEffort QoS class because no resource requests or limits are specified. It
runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
This pod runs in the Burstable QoS class because resource requests do not equal limits and the
cpu quantity is not specified. It runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
This pod runs in the Burstable QoS class because resource requests do not equal limits. It runs
in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
This pod runs in the Guaranteed QoS class because requests are equal to limits. And the
container's resource limit for the CPU resource is an integer greater than or equal to one. The
nginx container is granted 2 exclusive CPUs.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"
This pod runs in the Guaranteed QoS class because requests are equal to limits. But the
container's resource limit for the CPU resource is a fraction. It runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
This pod runs in the Guaranteed QoS class because only limits are specified and requests are set
equal to limits when not explicitly specified. And the container's resource limit for the CPU
resource is an integer greater than or equal to one. The nginx container is granted 2 exclusive
CPUs.
You can toggle groups of options on and off based upon their maturity level using the following
feature gates:
The following policy options exist for the static CPUManager policy:
If the full-pcpus-only policy option is specified, the static policy will always allocate full
physical cores. By default, without this option, the static policy allocates CPUs using a
topology-aware best-fit allocation. On SMT enabled systems, the policy can allocate individual
virtual cores, which correspond to hardware threads. This can lead to different containers
sharing the same physical cores; this behaviour in turn contributes to the noisy neighbours
problem. With the option enabled, the pod will be admitted by the kubelet only if the CPU
request of all its containers can be fulfilled by allocating full physical cores. If the pod does not
pass the admission, it will be put in Failed state with the message SMTAlignmentError.
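The policy options are passed to the kubelet as a comma-separated list of key=value pairs; for example, enabling this option might look like:
--cpu-manager-policy=static --cpu-manager-policy-options=full-pcpus-only=true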
If the align-by-socket policy option is specified, CPUs will be considered aligned at the socket
boundary when deciding how to allocate CPUs to a container. By default, the CPUManager
aligns CPU allocations at the NUMA boundary, which could result in performance degradation
if CPUs need to be pulled from more than one NUMA node to satisfy the allocation. Although it
tries to ensure that all CPUs are allocated from the minimum number of NUMA nodes, there is
no guarantee that those NUMA nodes will be on the same socket. By directing the
CPUManager to explicitly align CPUs at the socket boundary rather than the NUMA boundary,
we are able to avoid such issues. Note, this policy option is not compatible with the
TopologyManager single-numa-node policy and does not apply to hardware where the number
of sockets is greater than the number of NUMA nodes.
In order to extract the best performance, optimizations related to CPU isolation, memory and
device locality are required. However, in Kubernetes, these optimizations are handled by a
disjoint set of components.
Topology Manager is a Kubelet component that aims to coordinate the set of components that
are responsible for these optimizations.
Your Kubernetes server must be at or later than version v1.18. To check the version, enter
kubectl version.
The Topology Manager is a Kubelet component, which acts as a source of truth so that other
Kubelet components can make topology aligned resource allocation choices.
The Topology Manager provides an interface for components, called Hint Providers, to send and
receive topology information. Topology Manager has a set of node level policies which are
explained below.
The Topology Manager receives topology information from the Hint Providers as a bitmask
denoting the NUMA Nodes available and a preferred allocation indication. The Topology
Manager policies perform a set of operations on the hints provided and converge on the hint
determined by the policy to give the optimal result; if an undesirable hint is stored, the
preferred field for the hint is set to false. In the current policies, preferred is the narrowest
preferred mask.
The selected hint is stored as part of the Topology Manager. Depending on the policy
configured the pod can be accepted or rejected from the node based on the selected hint. The
hint is then stored in the Topology Manager for use by the Hint Providers when making the
resource allocation decisions.
If these conditions are met, the Topology Manager will align the requested resources.
In order to customise how this alignment is carried out, the Topology Manager provides two
distinct knobs: scope and policy.
The scope defines the granularity at which you would like resource alignment to be performed
(e.g. at the pod or container level). And the policy defines the actual strategy used to carry out
the alignment (e.g. best-effort, restricted, single-numa-node, etc.). Details on the various scopes
and policies available today can be found below.
Note: To align CPU resources with other requested resources in a Pod spec, the CPU Manager
should be enabled and a proper CPU Manager policy should be configured on a Node. See
CPU Management Policies.
Note: To align memory (and hugepages) resources with other requested resources in a Pod
Spec, the Memory Manager should be enabled and proper Memory Manager policy should be
configured on a Node. Examine Memory Manager documentation.
Topology Manager Scopes
The Topology Manager can deal with the alignment of resources in a couple of distinct scopes:
• container (default)
• pod
Either option can be selected at kubelet startup, with the --topology-manager-scope
flag.
container scope
Within this scope, the Topology Manager performs a number of sequential resource alignments,
i.e., for each container (in a pod) a separate alignment is computed. In other words, there is no
notion of grouping the containers to a specific set of NUMA nodes, for this particular scope. In
effect, the Topology Manager performs an arbitrary alignment of individual containers to
NUMA nodes.
The notion of grouping containers to a common set of NUMA nodes is, by design, provided by
the next scope: the pod scope.
pod scope
To select the pod scope, start the kubelet with the command line option --topology-manager-
scope=pod.
This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is,
the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all
containers) to either a single NUMA node or a common set of NUMA nodes. The following
examples illustrate the alignments produced by the Topology Manager on different occasions:
The total amount of a particular resource demanded for the entire pod is calculated according
to the effective requests/limits formula, and thus, for each resource, this total value is equal to
the maximum of:
• the sum of all app container requests, and
• the maximum of the init container requests.
Using the pod scope in tandem with single-numa-node Topology Manager policy is specifically
valuable for workloads that are latency sensitive or for high-throughput applications that
perform IPC. By combining both options, you are able to place all containers in a pod onto a
single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for
that pod.
In the case of single-numa-node policy, a pod is accepted only if a suitable set of NUMA nodes
is present among possible allocations. Reconsider the example above:
• a set containing only a single NUMA node - it leads to pod being admitted,
• whereas a set containing more NUMA nodes - it results in pod rejection (because instead
of one NUMA node, two or more NUMA nodes are required to satisfy the allocation).
To recap, Topology Manager first computes a set of NUMA nodes and then tests it against
Topology Manager policy, which either leads to the rejection or admission of the pod.
Topology Manager supports four allocation policies. You can set a policy via the kubelet flag
--topology-manager-policy. The four supported policies are:
• none (default)
• best-effort
• restricted
• single-numa-node
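For illustration, the scope and policy can also be set through KubeletConfiguration fields rather than flags; a minimal sketch (the values shown are examples only):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod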
Note: If the Topology Manager is configured with the pod scope, the container considered by
the policy reflects the requirements of the entire pod, and thus each container from the pod
will receive the same topology alignment decision.
none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a Pod, the kubelet, with best-effort topology management policy, calls
each Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, Topology Manager will store this and admit the pod to the node anyway.
The Hint Providers can then use this information when making the resource allocation decision.
restricted policy
For each container in a Pod, the kubelet, with restricted topology management policy, calls each
Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, Topology Manager will reject this pod from the node. This will result in a pod in a
Terminated state with a pod admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule
the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod.
An external control loop could be also implemented to trigger a redeployment of pods that have
the Topology Affinity error.
If the pod is admitted, the Hint Providers can then use this information when making the
resource allocation decision.
single-numa-node policy
For each container in a Pod, the kubelet, with single-numa-node topology management policy,
calls each Hint Provider to discover their resource availability. Using this information, the
Topology Manager determines if a single NUMA Node affinity is possible. If it is, Topology
Manager will store this and the Hint Providers can then use this information when making the
resource allocation decision. If, however, this is not possible then the Topology Manager will
reject the pod from the node. This will result in a pod in a Terminated state with a pod
admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule
the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the
Pod. An external control loop could also be implemented to trigger a redeployment of pods that
have the Topology Affinity error.
You can toggle groups of options on and off based upon their maturity level using the following
feature gates:
You will still have to enable each option using the TopologyManagerPolicyOptions kubelet
option.
If the prefer-closest-numa-nodes policy option is specified, the best-effort and restricted policies
will favor sets of NUMA nodes with shorter distance between them when making admission
decisions. You can enable this option by adding prefer-closest-numa-nodes=true to the
Topology Manager policy options. By default, without this option, Topology Manager aligns
resources on either a single NUMA node or the minimum number of NUMA nodes (in cases
where more than one NUMA node is required). However, the TopologyManager is not aware of
NUMA distances and does not take them into account when making admission decisions. This
limitation surfaces in multi-socket, as well as single-socket multi NUMA systems, and can cause
significant performance degradation in latency-critical execution and high-throughput
applications if the Topology Manager decides to align resources on non-adjacent NUMA nodes.
spec:
  containers:
  - name: nginx
    image: nginx
This pod runs in the BestEffort QoS class because no resource requests or limits are specified.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
This pod runs in the Burstable QoS class because requests are less than limits.
If the selected policy is anything other than none, the Topology Manager would consider these
Pod specifications. The Topology Manager would consult the Hint Providers to get topology
hints. In the case of the static policy, the CPU Manager would return the default topology hint,
because these Pods do not explicitly request CPU resources.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
This pod with integer CPU request runs in the Guaranteed QoS class because requests are equal
to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
This pod with sharing CPU request runs in the Guaranteed QoS class because requests are
equal to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
      requests:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
This pod runs in the BestEffort QoS class because there are no CPU and memory requests.
The Topology Manager would consider the above pods. The Topology Manager would consult
the Hint Providers, which are CPU and Device Manager to get topology hints for the pods.
In the case of the Guaranteed pod with integer CPU request, the static CPU Manager policy
would return topology hints relating to the exclusive CPU and the Device Manager would send
back hints for the requested device.
In the case of the Guaranteed pod with sharing CPU request, the static CPU Manager policy
would return default topology hint as there is no exclusive CPU request and the Device
Manager would send back hints for the requested device.
In the above two cases of the Guaranteed pod, the none CPU Manager policy would return
default topology hint.
In the case of the BestEffort pod, the static CPU Manager policy would send back the default
topology hint as there is no CPU request and the Device Manager would send back the hints for
each of the requested devices.
Using this information the Topology Manager calculates the optimal hint for the pod and stores
this information, which will be used by the Hint Providers when they are making their resource
assignments.
Known Limitations
1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more
than 8 NUMA nodes there will be a state explosion when trying to enumerate the
possible NUMA affinities and generating their hints.
Your Kubernetes server must be at or later than version v1.12. To check the version, enter
kubectl version.
Introduction
DNS is a built-in Kubernetes service launched automatically using the addon manager cluster
add-on.
DNS names also need domains. You configure the local domain in the kubelet with the flag --
cluster-domain=<default-local-domain>.
The DNS server supports forward lookups (A and AAAA records), port lookups (SRV records),
reverse IP address lookups (PTR records), and more. For more information, see DNS for Services
and Pods.
If a Pod's dnsPolicy is set to default, it inherits the name resolution configuration from the node
that the Pod runs on. The Pod's DNS resolution should behave the same as the node. But see
Known issues.
If you don't want this, or if you want a different DNS config for pods, you can use the kubelet's
--resolv-conf flag. Set this flag to "" to prevent Pods from inheriting DNS. Set it to a valid file
path to specify a file other than /etc/resolv.conf for DNS inheritance.
CoreDNS
CoreDNS is a general-purpose authoritative DNS server that can serve as cluster DNS,
complying with the DNS specifications.
CoreDNS ConfigMap options
CoreDNS is a DNS server that is modular and pluggable, with plugins adding new
functionalities. The CoreDNS server can be configured by maintaining a Corefile, which is the
CoreDNS configuration file. As a cluster administrator, you can modify the ConfigMap for the
CoreDNS Corefile to change how DNS service discovery behaves for that cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
You can modify the default CoreDNS behavior by modifying the ConfigMap.
CoreDNS has the ability to configure stub-domains and upstream nameservers using the
forward plugin.
Example
Suppose a cluster operator has a Consul domain server located at "10.150.0.1", and all Consul
names have the suffix ".consul.local". To configure this in CoreDNS, the cluster administrator
creates the following stanza in the CoreDNS ConfigMap.
consul.local:53 {
    errors
    cache 30
    forward . 10.150.0.1
}
To explicitly force all non-cluster DNS lookups to go through a specific nameserver at
172.16.0.1, point the forward directive to the nameserver instead of /etc/resolv.conf:
forward . 172.16.0.1
The final ConfigMap along with the default Corefile configuration looks like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 172.16.0.1
        cache 30
        loop
        reload
        loadbalance
    }
    consul.local:53 {
        errors
        cache 30
        forward . 10.150.0.1
    }
Note: CoreDNS does not support FQDNs for stub-domains and nameservers (eg: "ns.foo.com").
During translation, all FQDN nameservers will be omitted from the CoreDNS config.
What's next
• Read Debugging DNS Resolution
Your cluster must be configured to use the CoreDNS addon or its precursor, kube-dns.
Your Kubernetes server must be at or later than version v1.6. To check the version, enter
kubectl version.
admin/dns/dnsutils.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3
    command:
      - sleep
      - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
Note: This example creates a pod in the default namespace. DNS name resolution for services
depends on the namespace of the pod. For more information, review DNS for Services and Pods.
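Create that Pod; using the example manifest path shown above, the command is likely:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml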
pod/dnsutils created
Once that Pod is running, you can exec nslookup in that environment. If you see something like
the following, DNS is working correctly.
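For example, querying the kubernetes Service in the default namespace:
kubectl exec -i -t dnsutils -- nslookup kubernetes.default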
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
Take a look inside the resolv.conf file. (See Customizing DNS Service and Known issues below
for more information)
Verify that the search path and name server are set up like the following (note that search path
may vary for different cloud providers):
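You can read the file from the same Pod; in a typical cluster the output might look similar to the sketch below, where the nameserver should match the cluster DNS Service IP (10.0.0.10 in the examples in this section):
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5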
Errors such as the following indicate a problem with the CoreDNS (or kube-dns) add-on or with
associated Services:
or
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Use the kubectl get pods command to verify that the DNS pod is running.
Note: The value for label k8s-app is kube-dns for both CoreDNS and kube-dns deployments.
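For example, using the label selector from the note above:
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns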
If you see that no CoreDNS Pod is running or that the Pod has failed/completed, the DNS add-
on may not be deployed by default in your current environment and you will have to deploy it
manually.
Use the kubectl logs command to see logs for the DNS containers.
For CoreDNS:
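A command along these lines retrieves the logs using the same label selector:
kubectl logs --namespace=kube-system -l k8s-app=kube-dns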
.:53
2018/08/15 14:37:17 [INFO] CoreDNS-1.2.2
2018/08/15 14:37:17 [INFO] linux/amd64, go1.10.3, 2e322f6
CoreDNS-1.2.2
linux/amd64, go1.10.3, 2e322f6
2018/08/15 14:37:17 [INFO] plugin/reload: Running configuration MD5 =
24e6c59e83ce706f07bcc82c31b1ea1c
Verify that the DNS service is up by using the kubectl get service command.
Note: The service name is kube-dns for both CoreDNS and kube-dns deployments.
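For example:
kubectl get svc --namespace=kube-system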
If you have created the Service, or if it should have been created by default but does not
appear, see debugging Services for more information.
You can verify that DNS endpoints are exposed by using the kubectl get endpoints command.
If you do not see the endpoints, see the endpoints section in the debugging Services
documentation.
For additional Kubernetes DNS examples, see the cluster-dns examples in the Kubernetes
GitHub repository.
You can verify if queries are being received by CoreDNS by adding the log plugin to the
CoreDNS configuration (aka Corefile). The CoreDNS Corefile is held in a ConfigMap named
coredns. To edit it, use the command:
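In a standard deployment this is likely:
kubectl -n kube-system edit configmap coredns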
Then add log in the Corefile section per the example below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        log
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
After saving the changes, it may take up to a minute or two for Kubernetes to propagate these
changes to the CoreDNS pods.
Next, make some queries and view the logs per the sections above in this document. If
CoreDNS pods are receiving the queries, you should see them in the logs.
.:53
2018/08/15 14:37:15 [INFO] CoreDNS-1.2.0
2018/08/15 14:37:15 [INFO] linux/amd64, go1.10.3, 2e322f6
CoreDNS-1.2.0
linux/amd64, go1.10.3, 2e322f6
2018/09/07 15:29:04 [INFO] plugin/reload: Running configuration MD5 =
162475cdf272d8aa601e6fe67a6ad42f
2018/09/07 15:29:04 [INFO] Reloading complete
172.17.0.18:41675 - [07/Sep/2018:15:29:11 +0000] 59925 "A IN kubernetes.default.svc.cluster.local.
udp 54 false 512" NOERROR qr,aa,rd,ra 106 0.000066649s
CoreDNS must be able to list service and endpoint related resources to properly resolve service
names.
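You can inspect the permissions granted to CoreDNS by describing its ClusterRole, which is commonly named system:coredns (the name may differ in your installation):
kubectl describe clusterrole system:coredns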
Expected output:
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [list watch]
namespaces [] [] [list watch]
pods [] [] [list watch]
services [] [] [list watch]
endpointslices.discovery.k8s.io [] [] [list watch]
If any permissions are missing, edit the ClusterRole to add them:
...
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
...
DNS queries that don't specify a namespace are limited to the pod's namespace.
If the namespace of the pod and service differ, the DNS query must include the namespace of
the service.
To learn more about name resolution, see DNS for Services and Pods.
Known issues
Some Linux distributions (e.g. Ubuntu) use a local DNS resolver by default (systemd-resolved).
Systemd-resolved moves and replaces /etc/resolv.conf with a stub file that can cause a fatal
forwarding loop when resolving names in upstream servers. This can be fixed manually by
using kubelet's --resolv-conf flag to point to the correct resolv.conf (With systemd-resolved, this
is /run/systemd/resolve/resolv.conf). kubeadm automatically detects systemd-resolved, and
adjusts the kubelet flags accordingly.
Kubernetes installs do not configure the nodes' resolv.conf files to use the cluster DNS by
default, because that process is inherently distribution-specific. This should probably be
implemented eventually.
Linux's libc (a.k.a. glibc) has a limit of 3 DNS nameserver records by default, and Kubernetes
needs to consume 1 nameserver record. This means that if a local installation already uses 3
nameservers, some of those entries will be lost. To work around this limit, the node can run
dnsmasq, which will provide more nameserver entries. You can also use the kubelet's
--resolv-conf flag.
If you are using Alpine version 3.17 or earlier as your base image, DNS may not work properly
due to a design issue with Alpine: before musl version 1.2.4, musl did not include TCP fallback
to the DNS stub resolver, meaning any DNS response larger than 512 bytes would fail. Please
upgrade your images to Alpine version 3.18 or above.
What's next
• See Autoscaling the DNS Service in a Cluster.
• Read DNS for Services and Pods
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
Your Kubernetes server must be at or later than version v1.8. To check the version, enter
kubectl version.
Make sure you've configured a network provider with network policy support. There are a
number of network providers that support NetworkPolicy, including:
• Antrea
• Calico
• Cilium
• Kube-router
• Romana
• Weave Net
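Create an nginx Deployment; the command that produces the output below is likely:
kubectl create deployment nginx --image=nginx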
deployment.apps/nginx created
Expose the Deployment through a Service called nginx.
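For example:
kubectl expose deployment nginx --port=80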
service/nginx exposed
The above commands create a Deployment with an nginx Pod and expose the Deployment
through a Service named nginx. The nginx Pod and Deployment are found in the default
namespace.
service/networking/nginx-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: access-nginx
spec:
  podSelector:
    matchLabels:
      app: nginx
  ingress:
  - from:
    - podSelector:
        matchLabels:
          access: "true"
The name of a NetworkPolicy object must be a valid DNS subdomain name.
Note: NetworkPolicy includes a podSelector which selects the grouping of Pods to which the
policy applies. You can see this policy selects Pods with the label app=nginx. The label was
automatically added to the Pod in the nginx Deployment. An empty podSelector selects all pods
in the namespace.
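Apply the policy to produce the output below; with the example manifest path shown above, the command is likely:
kubectl apply -f https://k8s.io/examples/service/networking/nginx-policy.yaml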
networkpolicy.networking.k8s.io/access-nginx created
Background
Since cloud providers develop and release at a different pace compared to the Kubernetes
project, abstracting the provider-specific code to the cloud-controller-manager binary allows
cloud vendors to evolve independently from the core Kubernetes code.
Developing
Out of tree
Many cloud providers publish their controller manager code as open source. If you are creating
a new cloud-controller-manager from scratch, you could take an existing out-of-tree cloud
controller manager as your starting point.
In tree
For in-tree cloud providers, you can run the in-tree cloud controller manager as a DaemonSet in
your cluster. See Cloud Controller Manager Administration for more details.
For example, to turn off all API versions except v1, pass --runtime-config=api/all=false,api/
v1=true to the kube-apiserver.
What's next
Read the full documentation for the kube-apiserver component.
This page shows how to enable and configure encryption of API data at rest.
Note:
This task covers encryption for resource data stored using the Kubernetes API. For example,
you can encrypt Secret objects, including the key-value data they contain.
If you want to encrypt data in filesystems that are mounted into containers, you instead need to
either:
• This task assumes that you are running the Kubernetes API server as a static pod on each
control plane node.
• Your cluster's control plane must use etcd v3.x (major version 3, any minor version).
• To encrypt a custom resource, your cluster must be running Kubernetes v1.26 or newer.
• To use a wildcard to match resources, your cluster must be running Kubernetes v1.27 or
newer.
Caution: IMPORTANT: For high-availability configurations (with two or more control plane
nodes), the encryption configuration file must be the same! Otherwise, the kube-apiserver
component cannot decrypt data stored in etcd.
Each resources array item is a separate config and contains a complete configuration. The
resources.resources field is an array of Kubernetes resource names (resource or resource.group)
that should be encrypted, such as Secrets, ConfigMaps, or other resources.
If custom resources are added to EncryptionConfiguration and the cluster version is 1.26 or
newer, any newly created custom resources mentioned in the EncryptionConfiguration will be
encrypted. Any custom resources that existed in etcd prior to that version and configuration
will be unencrypted until they are next written to storage. This is the same behavior as built-in
resources. See the Ensure all secrets are encrypted section.
The providers array is an ordered list of the possible encryption providers to use for the APIs
that you listed. Each provider supports multiple keys - the keys are tried in order for
decryption, and if the provider is the first provider, the first key is used for encryption.
Only one provider type may be specified per entry (identity or aescbc may be provided, but not
both in the same item). The first provider in the list is used to encrypt resources written into the
storage. When reading resources from storage, each provider that matches the stored data
attempts in order to decrypt the data. If no provider can read the stored data due to a mismatch
in format or secret key, an error is returned which prevents clients from accessing that
resource.
EncryptionConfiguration supports the use of wildcards to specify the resources that should be
encrypted. Use '*.<group>' to encrypt all resources within a group (for example '*.apps' in the
example above) or '*.*' to encrypt all resources. '*.' can be used to encrypt all resources in the
core group. '*.*' will encrypt all resources, even custom resources that are added after API
server start.
Note: Use of wildcards that overlap within the same resource list or across multiple entries is
not allowed, since part of the configuration would be ineffective. The resources list's processing
order and precedence are determined by the order it's listed in the configuration.
Opting out of encryption for specific resources while wildcard is enabled can be achieved by
adding a new resources array item with the resource name, followed by the providers array
item with the identity provider. For example, if '*.*' is enabled and you want to opt-out
encryption for the events resource, add a new item to the resources array with events as the
resource name, followed by the providers array item with identity. The new item should look
like this:
  - resources:
      - events
    providers:
      - identity: {}
Ensure that the new item is listed before the wildcard '*.*' item in the resources array to give it
precedence.
For more detailed information about the EncryptionConfiguration struct, please refer to the
encryption configuration API.
Caution: If any resource is not readable via the encryption config (because keys were
changed), the only recourse is to delete that key from the underlying etcd directly. Calls that
attempt to read that resource will fail until it is deleted or a valid decryption key is provided.
Available providers
Before you configure encryption-at-rest for data in your cluster's Kubernetes API, you need to
select which provider(s) you will use.
The identity provider is the default if you do not specify otherwise. The identity provider
does not encrypt stored data and provides no additional confidentiality protection.
Key storage
Encrypting secret data with a locally managed key protects against an etcd compromise, but it
fails to protect against a host compromise. Since the encryption keys are stored on the host in
the EncryptionConfiguration YAML file, a skilled attacker can access that file and extract the
encryption keys.
The KMS provider uses envelope encryption: Kubernetes encrypts resources using a data key,
and then encrypts that data key using the managed encryption service. Kubernetes generates a
unique data key for each resource. The API server stores an encrypted version of the data key in
etcd alongside the ciphertext; when reading the resource, the API server calls the managed
encryption service and provides both the ciphertext and the (encrypted) data key. Within the
managed encryption service, the provider uses a key encryption key to decipher the data key,
and then recovers the plain text. Communication between the control plane and the KMS
requires in-transit protection, such as TLS.
Using envelope encryption creates dependence on the key encryption key, which is not stored
in Kubernetes. In the KMS case, an attacker who intends to get unauthorised access to the
plaintext values would need to compromise etcd and the third-party KMS provider.
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - aescbc:
          keys:
            - name: key1
              # See the following text for more details about the secret value
              secret: <BASE 64 ENCODED SECRET>
      - identity: {} # this fallback allows reading unencrypted secrets;
                     # for example, during initial migration
1. Generate a 32-byte random key and base64 encode it. If you're on Linux or macOS, run
the following command:
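A typical way to do this is:
head -c 32 /dev/urandom | base64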
You will need to mount the new encryption config file into the kube-apiserver static Pod.
Here is an example of how to do that:
---
#
# This is a fragment of a manifest for a static Pod.
# Check whether this is correct for your cluster and for your API server.
#
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 10.20.30.40:443
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    ...
    - --encryption-provider-config=/etc/kubernetes/enc/enc.yaml  # add this line
    volumeMounts:
    ...
    - name: enc                           # add this line
      mountPath: /etc/kubernetes/enc      # add this line
      readOnly: true                      # add this line
    ...
  volumes:
  ...
  - name: enc                             # add this line
    hostPath:                             # add this line
      path: /etc/kubernetes/enc           # add this line
      type: DirectoryOrCreate             # add this line
  ...
Caution: Your config file contains keys that can decrypt the contents in etcd, so you must
properly restrict permissions on your control-plane nodes so only the user who runs the kube-
apiserver can read it.
If you have multiple API servers in your cluster, you should deploy the changes in turn to each
API server.
Make sure that you use the same encryption configuration on each control plane host.
Data is encrypted when written to etcd. After restarting your kube-apiserver, any newly
created or updated Secret (or other resource kinds configured in EncryptionConfiguration)
should be encrypted when stored.
To check this, you can use the etcdctl command line program to retrieve the contents of your
secret data.
This example shows how to check this for encrypting the Secret API.
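First, create a Secret to inspect; a command consistent with the example key and value shown later in this section is:
kubectl create secret generic secret1 -n default --from-literal=mykey=mydata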
2. Using the etcdctl command line tool, read that Secret out of etcd:
where [...] must be the additional arguments for connecting to the etcd server.
For example:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get /registry/secrets/default/secret1 | hexdump -C
00000000 2f 72 65 67 69 73 74 72 79 2f 73 65 63 72 65 74 |/registry/secret|
00000010 73 2f 64 65 66 61 75 6c 74 2f 73 65 63 72 65 74 |s/default/secret|
00000020 31 0a 6b 38 73 3a 65 6e 63 3a 61 65 73 63 62 63 |1.k8s:enc:aescbc|
00000030 3a 76 31 3a 6b 65 79 31 3a c7 6c e7 d3 09 bc 06 |:v1:key1:.l.....|
00000040 25 51 91 e4 e0 6c e5 b1 4d 7a 8b 3d b9 c2 7c 6e |%Q...l..Mz.=..|n|
00000050 b4 79 df 05 28 ae 0d 8e 5f 35 13 2c c0 18 99 3e |.y..(..._5.,...>|
[...]
00000110 23 3a 0d fc 28 ca 48 2d 6b 2d 46 cc 72 0b 70 4c |#:..(.H-k-F.r.pL|
00000120 a5 fc 35 43 12 4e 60 ef bf 6f fe cf df 0b ad 1f |..5C.N`..o......|
00000130 82 c4 88 53 02 da 3e 66 ff 0a |...S..>f..|
0000013a
3. Verify the stored Secret is prefixed with k8s:enc:aescbc:v1: which indicates the aescbc
provider has encrypted the resulting data. Confirm that the key name shown in etcd
matches the key name specified in the EncryptionConfiguration mentioned above. In this
example, you can see that the encryption key named key1 is used in etcd and in
EncryptionConfiguration.
4. Verify the Secret is correctly decrypted when retrieved via the API:
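For example, using the Secret name and namespace from the steps above:
kubectl get secret secret1 -n default -o yaml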
The output should contain mykey: bXlkYXRh, with contents of mydata encoded using
base64; read decoding a Secret to learn how to completely decode the Secret.
It's often not enough to make sure that new objects get encrypted: you also want that
encryption to apply to the objects that are already stored.
For this example, you have configured your cluster so that Secrets are encrypted on write.
Performing a replace operation on each Secret encrypts that content at rest, even though the
objects themselves are otherwise unchanged.
You can make this change across all Secrets in your cluster:
# Run this as an administrator that can read and write all Secrets
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
The command above reads all Secrets and then updates them with the same data, in order to
apply server side encryption.
Note:
If an error occurs due to a conflicting write, retry the command. It is safe to run that command
more than once.
For larger clusters, you may wish to subdivide the Secrets by namespace, or script an update.
1. Generate a new key and add it as the second key entry for the current provider on all
servers (see the configuration sketch after this list)
2. Restart all kube-apiserver processes to ensure each server can decrypt using the new key
3. Make the new key the first entry in the keys array so that it is used for encryption in the
config
4. Restart all kube-apiserver processes to ensure each server now encrypts using the new
key
5. Run kubectl get secrets --all-namespaces -o json | kubectl replace -f - to encrypt all
existing Secrets with the new key
6. Remove the old decryption key from the config after you have backed up etcd with the
new key in use and updated all Secrets
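As a sketch of what steps 1 through 3 produce, an EncryptionConfiguration during rotation might look like the following; the key names, placeholder secrets, and the trailing identity fallback are illustrative assumptions, not part of the original text:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key2   # new key, promoted to the first entry in step 3
              secret: <BASE64-ENCODED 32-BYTE KEY>
            - name: key1   # old key, kept so existing data can still be decrypted
              secret: <BASE64-ENCODED 32-BYTE KEY>
      - identity: {}       # illustrative fallback for reading unencrypted data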
To allow automatic reloading, configure the API server to run with: --encryption-provider-
config-automatic-reload=true
What's next
• Read about decrypting data that are already stored at rest
• Learn more about the EncryptionConfiguration configuration API (v1).
Note:
This task covers encryption for resource data stored using the Kubernetes API. For example,
you can encrypt Secret objects, including the key-value data they contain.
If you wanted to manage encryption for data in filesystems that are mounted into containers,
you instead need to either:
• This task assumes that you are running the Kubernetes API server as a static pod on each
control plane node.
• Your cluster's control plane must use etcd v3.x (major version 3, any minor version).
• To encrypt a custom resource, your cluster must be running Kubernetes v1.26 or newer.
• You should have some API data that are already encrypted.
The format of that configuration file is YAML, representing a configuration API kind named
EncryptionConfiguration. You can see an example configuration in Encryption at rest
configuration.
If --encryption-provider-config is set, check which resources (such as secrets) are configured for
encryption, and what provider is used. Make sure that the preferred provider for that resource
type is not identity; you only set identity (no encryption) as default when you want to disable
encryption at rest. Verify that the first-listed provider for a resource is something other than
identity, which means that any new information written to resources of that type will be
encrypted as configured. If you do see identity as the first-listed provider for any resource, this
means that those resources are being written out to etcd without encryption.
First, find the API server configuration files. On each control plane node, the static Pod manifest
for the kube-apiserver specifies a command line argument, --encryption-provider-config. You are
likely to find that this file is mounted into the static Pod using a hostPath volume mount. Once
you locate the volume you can find the file on the node filesystem and inspect it.
To disable encryption at rest, place the identity provider as the first entry in your encryption
configuration file.
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            # Do not use this (invalid) example key for encryption
            - name: example
              secret: 2KfZgdiq2K0g2YrYpyDYs9mF2LPZhQ==
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {} # add this line
      - aescbc:
          keys:
            - name: example
              secret: 2KfZgdiq2K0g2YrYpyDYs9mF2LPZhQ==
and restart the kube-apiserver Pod on this node.
If you have multiple API servers in your cluster, you should deploy the changes in turn to each
API server.
Make sure that you use the same encryption configuration on each control plane host.
Force decryption
Once you have replaced all existing encrypted resources with backing data that don't use
encryption, you can remove the encryption settings from the kube-apiserver.
• --encryption-provider-config
• --encryption-provider-config-automatic-reload
If you have multiple API servers in your cluster, you should again deploy the changes in turn to
each API server.
Make sure that you use the same encryption configuration on each control plane host.
What's next
• Learn more about the EncryptionConfiguration configuration API (v1).
Key Terms
masq/non-masq example
The agent configuration file must be written in YAML or JSON syntax, and may contain three
optional keys: nonMasqueradeCIDRs, masqLinkLocal, and resyncInterval.
Traffic to the 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 ranges will NOT be masqueraded; any
other traffic (assumed to be internet traffic) will be masqueraded. An example of a local
destination from a pod could be its node's IP address, another node's address, or one of the IP
addresses in the cluster's IP range. Any other traffic is masqueraded by default. The entries
below show the default set of rules that are applied by the ip-masq-agent:
By default, in GCE/Google Kubernetes Engine, if network policy is enabled or you are using a
cluster CIDR not in the 10.0.0.0/8 range, the ip-masq-agent will run in your cluster. If you are
running in another environment, you can add the ip-masq-agent DaemonSet to your cluster.
Create an ip-masq-agent
To create an ip-masq-agent, run the following kubectl command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/ip-masq-agent/master/ip-masq-agent.yaml
You must also apply the appropriate node label to any nodes in your cluster that you want the
agent to run on.
In most cases, the default set of rules should be sufficient; however, if this is not the case for
your cluster, you can create and apply a ConfigMap to customize the IP ranges that are affected.
For example, to allow only 10.0.0.0/8 to be considered by the ip-masq-agent, you can create the
following ConfigMap in a file called "config".
Note:
It is important that the file is called config since, by default, that will be used as the key for
lookup by the ip-masq-agent:
nonMasqueradeCIDRs:
- 10.0.0.0/8
resyncInterval: 60s
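The step that loads this file into the cluster is not shown in this extract. Assuming the DaemonSet mounts a ConfigMap named ip-masq-agent in the kube-system namespace, as the upstream example manifest does, the ConfigMap would be created from the file like this:

kubectl create configmap ip-masq-agent --from-file=config --namespace=kube-system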
This will update a file located at /etc/config/ip-masq-agent which is periodically checked every
resyncInterval and applied to the cluster node. After the resync interval has expired, you should
see the iptables rules reflect your changes:
By default, the link local range (169.254.0.0/16) is also handled by the ip-masq agent, which sets
up the appropriate iptables rules. To have the ip-masq-agent ignore link local, you can set
masqLinkLocal to true in the ConfigMap.
nonMasqueradeCIDRs:
- 10.0.0.0/8
resyncInterval: 60s
masqLinkLocal: true
Limit Storage Consumption
This example demonstrates how to limit the amount of storage consumed in a namespace.
The following resources are used in the demonstration: ResourceQuota, LimitRange, and
PersistentVolumeClaim.
To check the version, enter kubectl version.
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi
max.
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
spec:
  limits:
    - type: PersistentVolumeClaim
      max:
        storage: 2Gi
      min:
        storage: 1Gi
Minimum storage requests are used when the underlying storage provider requires certain
minimums. For example, AWS EBS volumes have a 1Gi minimum requirement.
In this example, a 6th PVC in the namespace would be rejected because it exceeds the
maximum count of 5. Alternatively, a 5Gi maximum quota when combined with the 2Gi max
limit above, cannot have 3 PVCs where each has 2Gi. That would be 6Gi requested for a
namespace capped at 5Gi.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: "5Gi"
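A minimal usage sketch; the file names and the quota-example namespace are illustrative assumptions, not from the original. Save the two manifests above, apply them to a namespace, and inspect the resulting quota:

kubectl apply -f storagelimits.yaml -f storagequota.yaml --namespace=quota-example
kubectl describe quota storagequota --namespace=quota-example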
Summary
A limit range can put a ceiling on how much storage is requested while a resource quota can
effectively cap the storage consumed by a namespace through claim counts and cumulative
storage capacity. This allows a cluster-admin to plan their cluster's storage budget without risk
of any one project going over its allotment.
By decoupling the interoperability logic between Kubernetes and the underlying cloud
infrastructure, the cloud-controller-manager component enables cloud providers to release
features at a different pace compared to the main Kubernetes project.
Background
As part of the cloud provider extraction effort, all cloud specific controllers must be moved out
of the kube-controller-manager. All existing clusters that run cloud controllers in the kube-
controller-manager must migrate to instead run the controllers in a cloud provider specific
cloud-controller-manager.
Leader Migration provides a mechanism in which HA clusters can safely migrate "cloud
specific" controllers between the kube-controller-manager and the cloud-controller-manager
via a shared resource lock between the two components while upgrading the replicated control
plane. For a single-node control plane, or if unavailability of controller managers can be
tolerated during the upgrade, Leader Migration is not needed and this guide can be ignored.
This guide walks you through the manual process of upgrading the control plane from kube-
controller-manager with built-in cloud provider to running both kube-controller-manager and
cloud-controller-manager. If you use a tool to deploy and manage the cluster, please refer to the
documentation of the tool and the cloud provider for specific instructions of the migration.
The control plane nodes should run kube-controller-manager with Leader Election enabled,
which is the default. As of version N, an in-tree cloud provider must be set with the --cloud-
provider flag, and cloud-controller-manager should not yet be deployed.
The out-of-tree cloud provider must have built a cloud-controller-manager with Leader
Migration implementation. If the cloud provider imports k8s.io/cloud-provider and k8s.io/
controller-manager of version v0.21.0 or later, Leader Migration will be available. However, for
version before v0.22.0, Leader Migration is alpha and requires feature gate
ControllerManagerLeaderMigration to be enabled in cloud-controller-manager.
This guide assumes that kubelet of each control plane node starts kube-controller-manager and
cloud-controller-manager as static pods defined by their manifests. If the components run in a
different setting, please adjust the steps accordingly.
For authorization, this guide assumes that the cluster uses RBAC. If another authorization mode
grants permissions to kube-controller-manager and cloud-controller-manager components,
please grant the needed access in a way that matches the mode.
The default permissions of the controller managers allow access only to their main Lease. In
order for the migration to work, access to an additional Lease is required.
You can grant kube-controller-manager full access to the leases API by modifying the
system::leader-locking-kube-controller-manager role. This task guide assumes that the name of
the migration lease is cloud-provider-extraction-migration.
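A sketch of the extra RBAC rule this implies, assuming the lease name above; the verb list and the exact edit are illustrative, not a literal upstream manifest:

# Additional rule for the system::leader-locking-kube-controller-manager Role
# (and the cloud-controller-manager counterpart) in the kube-system namespace.
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  resourceNames: ["cloud-provider-extraction-migration"]
  verbs: ["create", "list", "get", "update"]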
Leader Migration optionally takes a configuration file representing the state of controller-to-
manager assignment. At this moment, with in-tree cloud provider, kube-controller-manager
runs route, service, and cloud-node-lifecycle. The following example configuration shows the
assignment.
Leader Migration can be enabled without a configuration. Please see Default Configuration for
details.
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: kube-controller-manager
  - name: service
    component: kube-controller-manager
  - name: cloud-node-lifecycle
    component: kube-controller-manager
Alternatively, because the controllers can run under either controller manager, setting
component to * for both sides makes the configuration file consistent between both parties of
the migration.
# wildcard version
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: *
  - name: service
    component: *
  - name: cloud-node-lifecycle
    component: *
On each control plane node, save the content to /etc/leadermigration.conf, and update the
manifest of kube-controller-manager so that the file is mounted inside the container at the same
location. Also, update the same manifest to add the following arguments:
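The arguments themselves are not listed in this extract; based on the cloud-controller-manager instructions later in this section, they are presumably:

--enable-leader-migration
--leader-migration-config=/etc/leadermigration.conf

The LeaderMigrationConfiguration shown next is the version N + 1 variant, in which the migrated controllers are assigned to cloud-controller-manager: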
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: cloud-controller-manager
  - name: service
    component: cloud-controller-manager
  - name: cloud-node-lifecycle
    component: cloud-controller-manager
When creating control plane nodes of version N + 1, the content should be deployed to /etc/
leadermigration.conf. The manifest of cloud-controller-manager should be updated to mount
the configuration file in the same manner as kube-controller-manager of version N. Similarly,
add --enable-leader-migration and --leader-migration-config=/etc/leadermigration.conf to the
arguments of cloud-controller-manager.
Create a new control plane node of version N + 1 with the updated cloud-controller-manager
manifest, and with the --cloud-provider flag set to external for kube-controller-manager. kube-
controller-manager of version N + 1 MUST NOT have Leader Migration enabled because, with
an external cloud provider, it does not run the migrated controllers anymore, and thus it is not
involved in the migration.
Please refer to Cloud Controller Manager Administration for more detail on how to deploy
cloud-controller-manager.
The control plane now contains nodes of both version N and N + 1. The nodes of version N run
kube-controller-manager only, and those of version N + 1 run both kube-controller-manager
and cloud-controller-manager. The migrated controllers, as specified in the configuration, are
running under either kube-controller-manager of version N or cloud-controller-manager of
version N + 1 depending on which controller manager holds the migration lease. No controller
will ever be running under both controller managers at any time.
In a rolling manner, create a new control plane node of version N + 1 and bring down one of
version N until the control plane contains only nodes of version N + 1. If a rollback from
version N + 1 to N is required, add nodes of version N with Leader Migration enabled for kube-
controller-manager back to the control plane, replacing one of version N + 1 each time until
there are only nodes of version N.
(Optional) Disable Leader Migration
Now that the control plane has been upgraded to run both kube-controller-manager and cloud-
controller-manager of version N + 1, Leader Migration has finished its job and can be safely
disabled to save one Lease resource. It is safe to re-enable Leader Migration for the rollback in
the future.
Default Configuration
Starting with Kubernetes 1.22, Leader Migration provides a default configuration suitable for the
default controller-to-manager assignment. The default configuration can be enabled by setting
--enable-leader-migration but without --leader-migration-config=.
For kube-controller-manager and cloud-controller-manager, if there are no flags that enable any
in-tree cloud provider or change ownership of controllers, the default configuration can be used
to avoid manual creation of the configuration file.
If your cloud provider provides an implementation of Node IPAM controller, you should switch
to the implementation in cloud-controller-manager. Disable Node IPAM controller in kube-
controller-manager of version N + 1 by adding --controllers=*,-nodeipam to its flags. Then add
nodeipam to the list of migrated controllers.
What's next
• Read the Controller Manager Leader Migration enhancement proposal.
Namespaces Walkthrough
Kubernetes namespaces help different projects, teams, or customers to share a Kubernetes
cluster.
This example demonstrates how to use Kubernetes namespaces to subdivide your cluster.
Prerequisites
This example assumes the following:
Assuming you have a fresh cluster, you can inspect the available namespaces by doing the
following:
The development team would like to maintain a space in the cluster where they can get a view
on the list of Pods, Services, and Deployments they use to build and run their application. In
this space, Kubernetes resources come and go, and the restrictions on who can or cannot
modify resources are relaxed to enable agile development.
The operations team would like to maintain a space in the cluster where they can enforce strict
procedures on who can or cannot manipulate the set of Pods, Services, and Deployments that
run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two
namespaces: development and production.
admin/namespace-dev.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    name: development
Save the following contents into file namespace-prod.yaml which describes a production
namespace:
admin/namespace-prod.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    name: production
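The commands that create these namespaces are not included in this extract; having saved both manifests locally, they would be created with something like:

kubectl create -f namespace-dev.yaml
kubectl create -f namespace-prod.yaml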
To be sure things are right, let's list all of the namespaces in our cluster.
Users interacting with one namespace do not see the content in another namespace.
To demonstrate this, let's spin up a simple Deployment and Pods in the development
namespace.
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://130.211.122.180
  name: lithe-cocoa-92103_kubernetes
contexts:
- context:
    cluster: lithe-cocoa-92103_kubernetes
    user: lithe-cocoa-92103_kubernetes
  name: lithe-cocoa-92103_kubernetes
current-context: lithe-cocoa-92103_kubernetes
kind: Config
preferences: {}
users:
- name: lithe-cocoa-92103_kubernetes
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
    token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
- name: lithe-cocoa-92103_kubernetes-basic-auth
  user:
    password: h5M0FtUUIflBSdI7
    username: admin
lithe-cocoa-92103_kubernetes
The next step is to define a context for the kubectl client to work in each namespace. The value
of "cluster" and "user" fields are copied from the current context.
By default, the above commands add two contexts that are saved into file .kube/config. You can
now view the contexts and alternate against the two new request contexts depending on which
namespace you wish to work against.
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://130.211.122.180
  name: lithe-cocoa-92103_kubernetes
contexts:
- context:
    cluster: lithe-cocoa-92103_kubernetes
    user: lithe-cocoa-92103_kubernetes
  name: lithe-cocoa-92103_kubernetes
- context:
    cluster: lithe-cocoa-92103_kubernetes
    namespace: development
    user: lithe-cocoa-92103_kubernetes
  name: dev
- context:
    cluster: lithe-cocoa-92103_kubernetes
    namespace: production
    user: lithe-cocoa-92103_kubernetes
  name: prod
current-context: lithe-cocoa-92103_kubernetes
kind: Config
preferences: {}
users:
- name: lithe-cocoa-92103_kubernetes
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
    token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
- name: lithe-cocoa-92103_kubernetes-basic-auth
  user:
    password: h5M0FtUUIflBSdI7
    username: admin
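The context switch itself is not shown in this extract; it is presumably:

kubectl config use-context dev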
At this point, all requests we make to the Kubernetes cluster from the command line are scoped
to the development namespace.
admin/snowflake-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: snowflake
  name: snowflake
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snowflake
  template:
    metadata:
      labels:
        app: snowflake
    spec:
      containers:
      - image: registry.k8s.io/serve_hostname
        imagePullPolicy: Always
        name: snowflake
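The command that applies this manifest is not shown here; assuming you saved it locally as snowflake-deployment.yaml (the file reference above suggests admin/snowflake-deployment.yaml in the examples repository), it would be something like:

kubectl apply -f snowflake-deployment.yaml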
We have created a deployment whose replica size is 2 that is running the pod called snowflake
with a basic container that serves the hostname.
And this is great, developers are able to do what they want, and they do not have to worry
about affecting content in the production namespace.
Let's switch to the production namespace and show how resources in one namespace are
hidden from the other.
At this point, it should be clear that the resources users create in one namespace are hidden
from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can
provide different authorization rules for each namespace.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for
the data.
You can find in-depth information about etcd in the official documentation.
Prerequisites
• Run etcd as a cluster of odd members.
• etcd is a leader-based distributed system. Ensure that the leader periodically sends
heartbeats on time to all followers to keep the cluster stable.
Performance and stability of the cluster is sensitive to network and disk I/O. Any
resource starvation can lead to heartbeat timeout, causing instability of the cluster. An
unstable etcd indicates that no leader is elected. Under such circumstances, a cluster
cannot make any changes to its current state, which implies no new pods can be
scheduled.
• Keeping etcd clusters stable is critical to the stability of Kubernetes clusters. Therefore,
run etcd clusters on dedicated machines or isolated environments for guaranteed
resource requirements.
• The minimum recommended etcd versions to run in production are 3.4.22+ and 3.5.6+.
Resource requirements
Operating etcd with limited resources is suitable only for testing purposes. For deploying in
production, advanced hardware configuration is required. Before deploying etcd in production,
see resource requirement reference.
etcd --listen-client-urls=http://$PRIVATE_IP:2379 \
--advertise-client-urls=http://$PRIVATE_IP:2379
For durability and high availability, run etcd as a multi-node cluster in production and back it
up periodically. A five-member cluster is recommended in production. For more information,
see FAQ documentation.
Configure an etcd cluster either by static member information or by dynamic discovery. For
more information on clustering, see etcd clustering documentation.
For an example, consider a five-member etcd cluster running with the following client URLs:
http://$IP1:2379, http://$IP2:2379, http://$IP3:2379, http://$IP4:2379, and http://$IP5:2379. To
start a Kubernetes API server:
etcd --listen-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379 \
  --advertise-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379
Make sure the IP<n> variables are set to your client IP addresses.
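The sentence above mentions starting a Kubernetes API server, yet only the etcd invocation is shown. The corresponding API server setting would presumably be the --etcd-servers flag listing all five members, for example:

kube-apiserver --etcd-servers=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379 [other flags]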
To secure etcd, either set up firewall rules or use the security features provided by etcd. etcd
security features depend on x509 Public Key Infrastructure (PKI). To begin, establish secure
communication channels by generating a key and certificate pair. For example, use key pairs
peer.key and peer.cert for securing communication between etcd members, and client.key and
client.cert for securing communication between etcd and its clients. See the example scripts
provided by the etcd project to generate key pairs and CA files for client authentication.
Securing communication
To configure etcd with secure peer communication, specify the flags --peer-key-file=peer.key and
--peer-cert-file=peer.cert, and use HTTPS as the URL scheme.
Similarly, to configure etcd with secure client communication, specify the flags
--key-file=k8sclient.key and --cert-file=k8sclient.cert, and use HTTPS as the URL scheme. Here is
an example of a client command that uses secure communication:
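The example command itself is missing from this extract; a sketch using etcdctl with the trust material described above (the endpoint and file names are illustrative):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=etcd.ca --cert=k8sclient.cert --key=k8sclient.key \
  member list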
After configuring secure communication, restrict the access of etcd cluster to only the
Kubernetes API servers. Use TLS authentication to do so.
For example, consider key pairs k8sclient.key and k8sclient.cert that are trusted by the CA
etcd.ca. When etcd is configured with --client-cert-auth along with TLS, it verifies the
certificates from clients by using system CAs or the CA passed in by --trusted-ca-file flag.
Specifying flags --client-cert-auth=true and --trusted-ca-file=etcd.ca will restrict the access to
clients with the certificate k8sclient.cert.
Once etcd is configured correctly, only clients with valid certificates can access it. To give
Kubernetes API servers the access, configure them with the flags --etcd-certfile=k8sclient.cert,
--etcd-keyfile=k8sclient.key and --etcd-cafile=ca.cert.
Note: etcd authentication is not currently supported by Kubernetes. For more information, see
the related issue Support Basic Auth for Etcd v2.
Though etcd keeps unique member IDs internally, it is recommended to use a unique name for
each member to avoid human errors. For example, consider a three-member etcd cluster. Let the
URLs be, member1=http://10.0.0.1, member2=http://10.0.0.2, and member3=http://10.0.0.3.
When member1 fails, replace it with member4=http://10.0.0.4.
1. If each Kubernetes API server is configured to communicate with all etcd members,
remove the failed member from the --etcd-servers flag, then restart each
Kubernetes API server.
2. If each Kubernetes API server communicates with a single etcd member, then stop
the Kubernetes API server that communicates with the failed etcd.
3. Stop the etcd server on the broken node. It is possible that other clients besides the
Kubernetes API server are causing traffic to etcd, and it is desirable to stop all traffic to
prevent writes to the data directory.
export ETCD_NAME="member4"
export ETCD_INITIAL_CLUSTER="member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380,member4=http://10.0.0.4:2380"
export ETCD_INITIAL_CLUSTER_STATE=existing
etcd [flags]
1. If each Kubernetes API server is configured to communicate with all etcd members,
add the newly added member to the --etcd-servers flag, then restart each
Kubernetes API server.
2. If each Kubernetes API server communicates with a single etcd member, start the
Kubernetes API server that was stopped in step 2. Then configure Kubernetes API
server clients to again route requests to the Kubernetes API server that was
stopped. This can often be done by configuring a load balancer.
Backing up an etcd cluster can be accomplished in two ways: etcd built-in snapshot and volume
snapshot.
Built-in snapshot
etcd supports built-in snapshot. A snapshot may either be taken from a live member with the
etcdctl snapshot save command or by copying the member/snap/db file from an etcd data
directory that is not currently used by an etcd process. Taking the snapshot will not affect the
performance of the member.
Below is an example for taking a snapshot of the keyspace served by $ENDPOINT to the file
snapshot.db:
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db
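The table that follows appears to be the output of checking the snapshot status; the command itself is missing from this extract, but it is typically something like:

ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db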
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 | 10 | 7 | 2.1 MB |
+----------+----------+------------+------------+
Volume snapshot
If etcd is running on a storage volume that supports backup, such as Amazon Elastic Block
Store, back up etcd data by taking a snapshot of the storage volume.
You can also take the snapshot using the various options supported by etcdctl. For example,
ETCDCTL_API=3 etcdctl -h
lists the options available from etcdctl. You can take a snapshot by specifying the endpoint,
certificates, and so on, as shown below:
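The command itself is not included in this extract; a sketch consistent with the placeholders described in the next sentence (the endpoint is illustrative):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=<trusted-ca-file> --cert=<cert-file> --key=<key-file> \
  snapshot save <backup-file-location>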
where trusted-ca-file, cert-file and key-file can be obtained from the description of the etcd Pod.
Before starting the restore operation, a snapshot file must be present. It can either be a snapshot
file from a previous backup operation, or from a remaining data directory.
Here is an example:
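The example command is missing from this extract; based on the variant shown just below, it was presumably the same restore invocation with the environment variable set inline:

ETCDCTL_API=3 etcdctl --data-dir <data-dir-location> snapshot restore snapshot.db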
where <data-dir-location> is a directory that will be created during the restore process.
Yet another example would be to first export the ETCDCTL_API environment variable:
export ETCDCTL_API=3
etcdctl --data-dir <data-dir-location> snapshot restore snapshot.db
For more information and examples on restoring a cluster from a snapshot file, see etcd disaster
recovery documentation.
If the access URLs of the restored cluster are changed from the previous cluster, the Kubernetes
API server must be reconfigured accordingly. In this case, restart Kubernetes API servers with
the flag --etcd-servers=$NEW_ETCD_CLUSTER instead of the flag --etcd-
servers=$OLD_ETCD_CLUSTER. Replace $NEW_ETCD_CLUSTER and
$OLD_ETCD_CLUSTER with the respective IP addresses. If a load balancer is used in front of
an etcd cluster, you might need to update the load balancer instead.
If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In
this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled
pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd
cluster and potentially reconfigure Kubernetes API servers to fix the issue.
Note:
If any API servers are running in your cluster, you should not attempt to restore instances of
etcd. Instead, follow these steps to restore etcd:
Note: Before you start an upgrade, please back up your etcd cluster first.
You can also run the defragmentation tool as a Kubernetes CronJob, to make sure that
defragmentation happens regularly. See etcd-defrag-cronjob.yaml for details.
The kubelet exposes a feature named 'Node Allocatable' that helps to reserve compute resources
for system daemons. Kubernetes recommends that cluster administrators configure 'Node
Allocatable' based on their workload density on each node.
Your Kubernetes server must be at or later than version 1.8. To check the version, enter kubectl
version. Your Kubernetes server must be at or later than version 1.17 to use the kubelet
command line option --reserved-cpus to set an explicitly reserved CPU list.
Node Allocatable
[Figure: node capacity]
'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are
available for pods. The scheduler does not over-subscribe 'Allocatable'. 'CPU', 'memory' and
'ephemeral-storage' are supported as of now.
Node Allocatable is exposed as part of v1.Node object in the API and as part of kubectl describe
node in the CLI.
Resources can be reserved for two categories of system daemons in the kubelet.
To properly enforce node allocatable constraints on the node, you must enable the new cgroup
hierarchy via the --cgroups-per-qos flag. This flag is enabled by default. When enabled, the
kubelet will parent all end-user pods under a cgroup hierarchy managed by the kubelet.
The kubelet supports manipulation of the cgroup hierarchy on the host using a cgroup driver.
The driver is configured via the --cgroup-driver flag.
• cgroupfs is the default driver that performs direct manipulation of the cgroup filesystem
on the host in order to manage cgroup sandboxes.
• systemd is an alternative driver that manages cgroup sandboxes using transient slices for
resources that are supported by that init system.
Depending on the configuration of the associated container runtime, operators may have to
choose a particular cgroup driver to ensure proper system behavior. For example, if operators
use the systemd cgroup driver provided by the containerd runtime, the kubelet must be
configured to use the systemd cgroup driver.
Kube Reserved
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the
kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for
system daemons that are run as pods. kube-reserved is typically a function of pod density on
the nodes.
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the
specified number of process IDs for kubernetes system daemons.
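As an illustration of how these reservations are expressed on the kubelet command line (the values here are arbitrary examples, not recommendations):

--kube-reserved=cpu=100m,memory=100Mi,ephemeral-storage=1Gi,pid=1000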
To optionally enforce kube-reserved on kubernetes system daemons, specify the parent control
group for kube daemons as the value for --kube-reserved-cgroup kubelet flag.
It is recommended that the kubernetes system daemons are placed under a top level control
group (runtime.slice on systemd machines for example). Each system daemon should ideally
run within its own child control group. Refer to the design proposal for more details on
recommended control group hierarchy.
Note that Kubelet does not create --kube-reserved-cgroup if it doesn't exist. The kubelet will
fail to start if an invalid cgroup is specified. With systemd cgroup driver, you should follow a
specific pattern for the name of the cgroup you define: the name should be the value you set for
--kube-reserved-cgroup, with .slice appended.
System Reserved
system-reserved is meant to capture resource reservation for OS system daemons like sshd,
udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is
not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is
also recommended (user.slice in systemd world).
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the
specified number of process IDs for OS system daemons.
To optionally enforce system-reserved on system daemons, specify the parent control group for
OS system daemons as the value for --system-reserved-cgroup kubelet flag.
It is recommended that the OS system daemons are placed under a top level control group
(system.slice on systemd machines for example).
Note that kubelet does not create --system-reserved-cgroup if it doesn't exist. kubelet will fail
if an invalid cgroup is specified. With systemd cgroup driver, you should follow a specific
pattern for the name of the cgroup you define: the name should be the value you set for --
system-reserved-cgroup, with .slice appended.
reserved-cpus is meant to define an explicit CPU set for OS system daemons and kubernetes
system daemons. reserved-cpus is for systems that do not intend to define separate top level
cgroups for OS system daemons and kubernetes system daemons with regard to cpuset
resource. If the Kubelet does not have --system-reserved-cgroup and --kube-reserved-cgroup,
the explicit cpuset provided by reserved-cpus will take precedence over the CPUs defined by --
kube-reserved and --system-reserved options.
This option is specifically designed for Telco/NFV use cases where uncontrolled interrupts or
timers may impact workload performance. You can use this option to define an explicit cpuset
for the system and Kubernetes daemons as well as for interrupts and timers, so that the
remaining CPUs on the system can be used exclusively for workloads, with less impact from
uncontrolled interrupts or timers. Moving the system daemons, Kubernetes daemons, and
interrupts or timers onto the explicit cpuset defined by this option requires a mechanism
outside Kubernetes. For example, on CentOS you can do this using the tuned toolset.
Eviction Thresholds
Memory pressure at the node level leads to System OOMs which affects the entire node and all
pods running on it. Nodes can go offline temporarily until memory has been reclaimed. To
avoid (or reduce the probability of) system OOMs kubelet provides out of resource
management. Evictions are supported for memory and ephemeral-storage only. By reserving
some memory via --eviction-hard flag, the kubelet attempts to evict pods whenever memory
availability on the node drops below the reserved value. Hypothetically, if system daemons did
not exist on a node, pods cannot use more than capacity - eviction-hard. For this reason,
resources reserved for evictions are not available for pods.
The kubelet enforces 'Allocatable' across pods by default. Enforcement is performed by evicting
pods whenever the overall usage across all pods exceeds 'Allocatable'. More details on the
eviction policy can be found on the node-pressure eviction page. This enforcement is controlled
by specifying the pods value to the kubelet flag --enforce-node-allocatable.
General Guidelines
System daemons are expected to be treated similar to Guaranteed pods. System daemons can
burst within their bounding control groups and this behavior needs to be managed as part of
kubernetes deployments. For example, kubelet should have its own control group and share
kube-reserved resources with the container runtime. However, Kubelet cannot burst and use up
all available Node resources if kube-reserved is enforced.
Be extra careful while enforcing system-reserved reservation since it can lead to critical system
services being CPU starved, OOM killed, or unable to fork on the node. The recommendation is
to enforce system-reserved only if a user has profiled their nodes exhaustively to come up with
precise estimates and is confident in their ability to recover if any process in that group is oom-
killed.
The resource requirements of kube system daemons may grow over time as more and more
features are added. Over time, the Kubernetes project will attempt to bring down the utilization
of node system daemons, but that is not a priority as of now. So expect a drop in Allocatable
capacity in future releases.
Example Scenario
Here is an example to illustrate Node Allocatable computation:
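The node and reservation figures for this scenario are not reproduced in this extract, but they can be reconstructed from the results quoted below: a node with 16 CPUs, 32Gi of memory, and 100Gi of local storage, with kubelet flags roughly like the following (a reconstruction, not the original listing):

--kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard=memory.available<500Mi,nodefs.available<10%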
Under this scenario, 'Allocatable' will be 14.5 CPUs, 28.5Gi of memory and 88Gi of local storage.
Scheduler ensures that the total memory requests across all pods on this node does not exceed
28.5Gi and storage doesn't exceed 88Gi. Kubelet evicts pods whenever the overall memory
usage across pods exceeds 28.5Gi, or if overall disk usage exceeds 88Gi. If all processes on the
node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons exceed their
reservation, kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi
or storage is greater than 90Gi.
This document describes how to run Kubernetes Node components such as kubelet, CRI, OCI,
and CNI without root privileges, by using a user namespace.
Note:
This document describes how to run Kubernetes Node components (and hence pods) as a non-
root user.
If you are just looking for how to run a pod as a non-root user, see SecurityContext.
• Enable Cgroup v2
• Enable systemd with user session
• Configure several sysctl values, depending on host Linux distribution
• Ensure that your unprivileged user is listed in /etc/subuid and /etc/subgid
• Enable the KubeletInUserNamespace feature gate
minikube also supports running Kubernetes inside Rootless Docker or Rootless Podman.
• Rootless Docker
• Rootless Podman
sysbox
Sysbox is an open-source container runtime (similar to "runc") that supports running system-
level workloads such as Docker and Kubernetes inside unprivileged containers isolated with the
Linux user namespace.
Sysbox supports running Kubernetes inside unprivileged containers without requiring Cgroup
v2 and without the KubeletInUserNamespace feature gate. It does this by exposing specially
crafted /proc and /sys filesystems inside the container plus several other advanced OS
virtualization techniques.
K3s
Usernetes
Usernetes supports both containerd and CRI-O as CRI runtimes. Usernetes supports multi-node
clusters using Flannel (VXLAN).
Note: This section is intended to be read by developers of Kubernetes distributions, not by end
users.
If you are trying to run Kubernetes in a user-namespaced container such as Rootless Docker/
Podman or LXC/LXD, you are all set, and you can go to the next subsection.
Otherwise you have to create a user namespace by yourself, by calling unshare(2) with
CLONE_NEWUSER.
A user namespace can be also unshared by using command line tools such as:
• unshare(1)
• RootlessKit
• become-root
After unsharing the user namespace, you will also have to unshare other namespaces such as
mount namespace.
You do not need to call chroot() nor pivot_root() after unsharing the mount namespace,
however, you have to mount writable filesystems on several directories in the namespace.
At least, the following directories need to be writable in the namespace (not outside the
namespace):
• /etc
• /run
• /var/logs
• /var/lib/kubelet
• /var/lib/cni
• /var/lib/containerd (for containerd)
• /var/lib/containers (for CRI-O)
In addition to the user namespace, you also need to have a writable cgroup tree with cgroup v2.
Note: Kubernetes support for running Node components in user namespaces requires cgroup
v2. Cgroup v1 is not supported.
Otherwise you have to create a systemd unit with Delegate=yes property to delegate a cgroup
tree with writable permission.
On your node, systemd must already be configured to allow delegation; for more details, see
cgroup v2 in the Rootless Containers documentation.
Configuring network
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
The network namespace of the Node components has to have a non-loopback interface, which
can be for example configured with slirp4netns, VPNKit, or lxc-user-nic(1).
The network namespaces of the Pods can be configured with regular CNI plugins. For multi-
node networking, Flannel (VXLAN, 8472/UDP) is known to work.
Ports such as the kubelet port (10250/TCP) and NodePort service ports have to be exposed from
the Node network namespace to the host with an external port forwarder, such as RootlessKit,
slirp4netns, or socat(1).
You can use the port forwarder from K3s. See Running K3s in Rootless Mode for more details.
The implementation can be found in the pkg/rootlessports package of k3s.
Configuring CRI
The kubelet relies on a container runtime. You should deploy a container runtime such as
containerd or CRI-O and ensure that it is running within the user namespace before the kubelet
starts.
• containerd
• CRI-O
Running CRI plugin of containerd in a user namespace is supported since containerd 1.4.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Disable AppArmor
  disable_apparmor = true
  # Ignore an error during setting oom_score_adj
  restrict_oom_score_adj = true
  # Disable hugetlb cgroup v2 controller (because systemd does not support delegating hugetlb controller)
  disable_hugetlb_controller = true

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  snapshotter = "fuse-overlayfs"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  # We use cgroupfs that is delegated by systemd, so we do not use SystemdCgroup driver
  # (unless you run another systemd in the namespace)
  SystemdCgroup = false
The default path of the configuration file is /etc/containerd/config.toml. The path can be
specified with containerd -c /path/to/containerd/config.toml.
[crio]
  storage_driver = "overlay"
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  storage_option = ["overlay.mount_program=/usr/local/bin/fuse-overlayfs"]

[crio.runtime]
  # We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
  # (unless you run another systemd in the namespace)
  cgroup_manager = "cgroupfs"
The default path of the configuration file is /etc/crio/crio.conf. The path can be specified with
crio --config /path/to/crio/crio.conf.
Configuring kubelet
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
# We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
# (unless you run another systemd in the namespace)
cgroupDriver: "cgroupfs"
When the KubeletInUserNamespace feature gate is enabled, the kubelet ignores errors that may
happen during setting the following sysctl values on the node.
• vm.overcommit_memory
• vm.panic_on_oom
• kernel.panic
• kernel.panic_on_oops
• kernel.keys.root_maxkeys
• kernel.keys.root_maxbytes.
Within a user namespace, the kubelet also ignores any error raised from trying to open /dev/
kmsg. This feature gate also allows kube-proxy to ignore an error during setting
RLIMIT_NOFILE.
The KubeletInUserNamespace feature gate was introduced in Kubernetes v1.22 with "alpha"
status.
Running kubelet in a user namespace without using this feature gate is also possible by
mounting a specially crafted proc filesystem (as done by Sysbox), but not officially supported.
Configuring kube-proxy
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables" # or "userspace"
conntrack:
  # Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
Caveats
• Most of "non-local" volume drivers such as nfs and iscsi do not work. Local volumes like
local, hostPath, emptyDir, configMap, secret, and downwardAPI are known to work.
• Some CNI plugins may not work. Flannel (VXLAN) is known to work.
For more on this, see the Caveats and Future work page on the rootlesscontaine.rs website.
See Also
• rootlesscontaine.rs
• Rootless Containers 2020 (KubeCon NA 2020)
• Running kind with Rootless Docker
• Usernetes
• Running K3s with rootless mode
• KEP-2033: Kubelet-in-UserNS (aka Rootless mode)
1. You do not require your applications to be highly available during the node drain, or
2. You have read about the PodDisruptionBudget concept, and have configured
PodDisruptionBudgets for applications that need them.
If availability is important for any applications that run or could run on the node(s) that you are
draining, configure PodDisruptionBudgets first and then continue following this guide.
Note: By default kubectl drain ignores certain system pods on the node that cannot be killed;
see the kubectl drain documentation for more details.
When kubectl drain returns successfully, that indicates that all of the pods (except the ones
excluded as described in the previous paragraph) have been safely evicted (respecting the
desired graceful termination period, and respecting the PodDisruptionBudget you have
defined). It is then safe to bring down the node by powering down its physical machine or, if
running on a cloud platform, deleting its virtual machine.
Note:
If any new Pods tolerate the node.kubernetes.io/unschedulable taint, then those Pods might be
scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.
If you or another API user directly set the nodeName field for a Pod (bypassing the scheduler),
then the Pod is bound to the specified node and will run there, even though you have drained
that node and marked it unschedulable.
First, identify the name of the node you wish to drain. You can list all of the nodes in your
cluster with
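(The command is not shown in this extract; it is presumably the standard listing subcommand.)

kubectl get nodes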
If there are pods managed by a DaemonSet, you will need to specify --ignore-daemonsets with
kubectl to successfully drain the node. The kubectl drain subcommand on its own does not
actually drain a node of its DaemonSet pods: the DaemonSet controller (part of the control
plane) immediately replaces missing Pods with new equivalent Pods. The DaemonSet controller
also creates Pods that ignore unschedulable taints, which allows the new Pods to launch onto a
node that you are draining.
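The drain invocation itself is not shown in this extract; a typical form, with <node name> as a placeholder:

kubectl drain <node name> --ignore-daemonsets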
Once it returns (without giving an error), you can power down the node (or equivalently, if on a
cloud platform, delete the virtual machine backing the node). If you leave the node in the
cluster during the maintenance operation, you need to run
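(The command is missing from this extract; it is presumably the standard uncordon subcommand, with <node name> as a placeholder.)

kubectl uncordon <node name>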
afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
For example, if you have a StatefulSet with three replicas and have set a PodDisruptionBudget
for that set specifying minAvailable: 2, kubectl drain only evicts a pod from the StatefulSet if all
three replica pods are healthy; if you then issue multiple drain commands in parallel,
Kubernetes respects the PodDisruptionBudget and ensures that only 1 (calculated as replicas -
minAvailable) Pod is unavailable at any given time. Any drains that would cause the number of
healthy replicas to fall below the specified budget are blocked.
What's next
• Follow steps to protect your application by configuring a Pod Disruption Budget.
Securing a Cluster
This document covers topics related to protecting a cluster from accidental or malicious access
and provides recommendations on overall security.
To check the version, enter kubectl version.
Kubernetes expects that all API communication in the cluster is encrypted by default with TLS,
and the majority of installation methods will allow the necessary certificates to be created and
distributed to the cluster components. Note that some components and installation methods
may enable local ports over HTTP and administrators should familiarize themselves with the
settings of each component to identify potentially unsecured traffic.
API Authentication
Choose an authentication mechanism for the API servers to use that matches the common
access patterns when you install a cluster. For instance, small, single-user clusters may wish to
use a simple certificate or static Bearer token approach. Larger clusters may wish to integrate
an existing OIDC or LDAP server that allows users to be subdivided into groups.
All API clients must be authenticated, even those that are part of the infrastructure like nodes,
proxies, the scheduler, and volume plugins. These clients are typically service accounts or use
x509 client certificates, and they are created automatically at cluster startup or are set up as part
of the cluster installation.
API Authorization
Once authenticated, every API call is also expected to pass an authorization check. Kubernetes
ships an integrated Role-Based Access Control (RBAC) component that matches an incoming
user or group to a set of permissions bundled into roles. These permissions combine verbs (get,
create, delete) with resources (pods, services, nodes) and can be namespace-scoped or cluster-
scoped. A set of out-of-the-box roles are provided that offer reasonable default separation of
responsibility depending on what actions a client might want to perform. It is recommended
that you use the Node and RBAC authorizers together, in combination with the NodeRestriction
admission plugin.
As with authentication, simple and broad roles may be appropriate for smaller clusters, but as
more users interact with the cluster, it may become necessary to separate teams into separate
namespaces with more limited roles.
With authorization, it is important to understand how updates on one object may cause actions
in other places. For instance, a user may not be able to create pods directly, but allowing them
to create a deployment, which creates pods on their behalf, will let them create those pods
indirectly. Likewise, deleting a node from the API will result in the pods scheduled to that node
being terminated and recreated on other nodes. The out-of-the box roles represent a balance
between flexibility and common use cases, but more limited roles should be carefully reviewed
to prevent accidental escalation. You can make roles specific to your use case if the out-of-box
ones don't meet your needs.
Resource quota limits the number or capacity of resources granted to a namespace. This is most
often used to limit the amount of CPU, memory, or persistent disk a namespace can allocate,
but can also control how many pods, services, or volumes exist in each namespace.
Limit ranges restrict the maximum or minimum size of some of the resources above, to prevent
users from requesting unreasonably high or low values for commonly reserved resources like
memory, or to provide default limits when none are specified.
A pod definition contains a security context that allows it to request access to run as a specific
Linux user on a node (like root), access to run privileged or access the host network, and other
controls that would otherwise allow it to run unfettered on a hosting node.
You can configure Pod security admission to enforce use of a particular Pod Security Standard
in a namespace, or to detect breaches.
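A minimal sketch of what that looks like, assuming a hypothetical namespace name; the pod-security.kubernetes.io labels select the Pod Security Standard and the admission mode:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app   # hypothetical namespace name
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted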
Generally, most application workloads need limited access to host resources so they can
successfully run as a root process (uid 0) without access to host information. However,
considering the privileges associated with the root user, you should write application containers
to run as a non-root user. Similarly, administrators who wish to prevent client applications from
escaping their containers should apply the Baseline or Restricted Pod Security Standard.
Preventing containers from loading unwanted kernel modules
The Linux kernel automatically loads kernel modules from disk if needed in certain
circumstances, such as when a piece of hardware is attached or a filesystem is mounted. Of
particular relevance to Kubernetes, even unprivileged processes can cause certain network-
protocol-related kernel modules to be loaded, just by creating a socket of the appropriate type.
This may allow an attacker to exploit a security hole in a kernel module that the administrator
assumed was not in use.
To prevent specific modules from being automatically loaded, you can uninstall them from the
node, or add rules to block them. On most Linux distributions, you can do that by creating a file
such as /etc/modprobe.d/kubernetes-blacklist.conf with contents like:
# SCTP is not used in most Kubernetes clusters, and has also had
# vulnerabilities in the past.
blacklist sctp
To block module loading more generically, you can use a Linux Security Module (such as
SELinux) to completely deny the module_request permission to containers, preventing the
kernel from loading modules for containers under any circumstances. (Pods would still be able
to use modules that had been loaded manually, or modules that were loaded by the kernel on
behalf of some more-privileged process.)
The network policies for a namespace allow application authors to restrict which pods in other
namespaces may access pods and ports within their namespaces. Many of the supported
Kubernetes networking providers now respect network policy.
Quota and limit ranges can also be used to control whether users may request node ports or
load-balanced services, which on many clusters can control whether those users' applications
are visible outside of the cluster.
Additional protections may be available that control network rules on a per-plugin or per-
environment basis, such as per-node firewalls, physically separating cluster nodes to prevent
cross talk, or advanced networking policy.
Cloud platforms (AWS, Azure, GCE, etc.) often expose metadata services locally to instances. By
default these APIs are accessible by pods running on an instance and can contain cloud
credentials for that node, or provisioning data such as kubelet credentials. These credentials can
be used to escalate within the cluster or to other cloud services under the same account.
When running Kubernetes on a cloud platform, limit permissions given to instance credentials,
use network policies to restrict pod access to the metadata API, and avoid using provisioning
data to deliver secrets.
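As a sketch of the second recommendation, a NetworkPolicy can allow general egress while excluding the metadata endpoint. The namespace and the 169.254.169.254 address are assumptions (that address is the conventional metadata endpoint on the major clouds, but verify it for your platform), and the policy only takes effect if your network plugin enforces NetworkPolicy:
kubectl apply --namespace=development -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cloud-metadata
spec:
  podSelector: {}        # applies to every Pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32   # cloud metadata endpoint
EOF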
Controlling which nodes pods may access
By default, there are no restrictions on which nodes may run a pod. Kubernetes offers a rich set
of policies for controlling placement of pods onto nodes and the taint-based pod placement and
eviction that are available to end users. For many clusters use of these policies to separate
workloads can be a convention that authors adopt or enforce via tooling.
As an administrator, a beta admission plugin PodNodeSelector can be used to force pods within
a namespace to default or require a specific node selector, and if end users cannot alter
namespaces, this can strongly limit the placement of all of the pods in a specific workload.
Write access to the etcd backend for the API is equivalent to gaining root on the entire cluster,
and read access can be used to escalate fairly quickly. Administrators should always use strong
credentials from the API servers to their etcd server, such as mutual auth via TLS client
certificates, and it is often recommended to isolate the etcd servers behind a firewall that only
the API servers may access.
Caution: Allowing other components within the cluster to access the master etcd instance with
read or write access to the full keyspace is equivalent to granting cluster-admin access. Using
separate etcd instances for non-master components or using etcd ACLs to restrict read and
write access to a subset of the keyspace is strongly recommended.
The audit logger is a beta feature that records actions taken by the API for later analysis in the
event of a compromise. It is recommended to enable audit logging and archive the audit file on
a secure server.
Alpha and beta Kubernetes features are in active development and may have limitations or bugs
that result in security vulnerabilities. Always assess the value an alpha or beta feature may
provide against the possible risk to your security posture. When in doubt, disable features you
do not use.
The shorter the lifetime of a secret or credential the harder it is for an attacker to make use of
that credential. Set short lifetimes on certificates and automate their rotation. Use an
authentication provider that can control how long issued tokens are available and use short
lifetimes where possible. If you use service-account tokens in external integrations, plan to
rotate those tokens frequently. For example, once the bootstrap phase is complete, a bootstrap
token used for setting up nodes should be revoked or its authorization removed.
Review third party integrations before enabling them
Many third party integrations to Kubernetes may alter the security profile of your cluster.
When enabling an integration, always review the permissions that an extension requests before
granting it access. For example, many security integrations may request access to view all
secrets on your cluster which is effectively making that component a cluster admin. When in
doubt, restrict the integration to functioning in a single namespace if possible.
Components that create pods may also be unexpectedly powerful if they can do so inside
namespaces like the kube-system namespace, because those pods can gain access to service
account secrets or run with elevated permissions if those service accounts are granted access to
permissive PodSecurityPolicies.
If you use Pod Security admission and allow any component to create Pods within a namespace
that permits privileged Pods, those Pods may be able to escape their containers and use this
widened access to elevate their privileges.
You should not allow untrusted components to create Pods in any system namespace (those
with names that start with kube-) nor in any namespace where that access grant allows the
possibility of privilege escalation.
In general, the etcd database will contain any information accessible via the Kubernetes API
and may grant an attacker significant visibility into the state of your cluster. Always encrypt
your backups using a well reviewed backup and encryption solution, and consider using full
disk encryption where possible.
Kubernetes supports optional encryption at rest for information in the Kubernetes API. This
lets you ensure that when Kubernetes stores data for objects (for example, Secret or ConfigMap
objects), the API server writes an encrypted representation of the object. That encryption
means that even someone who has access to etcd backup data is unable to view the content of
those objects. In Kubernetes 1.28 you can also encrypt custom resources; encryption-at-rest for
extension APIs defined in CustomResourceDefinitions was added to Kubernetes as part of the
v1.26 release.
Join the kubernetes-announce group for emails about security announcements. See the security
reporting page for more on how to report vulnerabilities.
What's next
• Security Checklist for additional information on Kubernetes security guidance.
The configuration file must be a JSON or YAML representation of the parameters in this struct.
Make sure the kubelet has read permissions on the file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
Note: In this example, because the default value of only one evictionHard parameter is changed,
the default values of the other evictionHard parameters are not inherited; they are set to zero.
To provide custom values, you should provide all of the threshold values.
The imagefs is an optional filesystem that container runtimes use to store container images and
container writable layers.
Start the kubelet with the --config flag set to the path of the kubelet's config file. The kubelet
will then load its config from this file.
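A minimal sketch; the path is an assumption, so point --config at wherever you actually saved the file:
kubelet --config=/var/lib/kubelet/config.yaml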
Note that command line flags which target the same value as a config file will override that
value. This helps ensure backwards compatibility with the command-line API.
Note that relative file paths in the kubelet config file are resolved relative to the location of the
kubelet config file, whereas relative paths in command line flags are resolved relative to the
kubelet's current working directory.
Note that some default values differ between command-line flags and the kubelet config file. If
--config is provided and the values are not specified via the command line, the defaults for the
KubeletConfiguration version apply. In the above example, this version is kubelet.config.k8s.io/
v1beta1.
You can only set --config-dir if you set the environment variable
KUBELET_CONFIG_DROPIN_DIR_ALPHA for the kubelet process (the value of that variable
does not matter). For Kubernetes v1.28, the kubelet returns an error if you specify --config-dir
without that variable set, and startup fails. You cannot specify the drop-in configuration
directory using the kubelet configuration file; only the CLI argument --config-dir can set it.
One can use the kubelet configuration directory in a similar way to the kubelet config file.
Note: The suffix of a valid kubelet drop-in configuration file must be .conf. For instance: 99-
kubelet-address.conf
For instance, you may want a baseline kubelet configuration for all nodes, but you may want to
customize the address field. This can be done as follows:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "200Mi"

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
This produces the same outcome as if you used the single configuration file used in the earlier
example.
What's next
• Learn more about kubelet configuration by checking the KubeletConfiguration reference.
Viewing namespaces
List the current namespaces in a cluster using kubectl get namespaces. You can also see the details of a specific namespace with kubectl describe namespace <name>; for the default namespace the output looks similar to:
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default
---- -------- --- --- ---
Container cpu - - 100m
Note that these details show both resource quota (if present) as well as resource limit ranges.
Resource quota tracks aggregate usage of resources in the Namespace and allows cluster
operators to define Hard resource usage limits that a Namespace may consume.
A limit range defines min/max constraints on the amount of resources a single entity can
consume in a Namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: <insert-namespace-name-here>
Then run kubectl create -f ./my-namespace.yaml, substituting the name of the file you saved.
There's an optional field finalizers, which allows observers to purge resources whenever the
namespace is deleted. Keep in mind that if you specify a nonexistent finalizer, the namespace
will be created but will get stuck in the Terminating state if the user tries to delete it.
This delete is asynchronous, so for a time you will see the namespace in the Terminating state.
Assuming you have a fresh cluster, you can introspect the available namespaces by running
kubectl get namespaces.
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
In a scenario where an organization is using a shared Kubernetes cluster for development and
production use cases:
• The development team would like to maintain a space in the cluster where they can get a
view on the list of Pods, Services, and Deployments they use to build and run their
application. In this space, Kubernetes resources come and go, and the restrictions on who
can or cannot modify resources are relaxed to enable agile development.
• The operations team would like to maintain a space in the cluster where they can enforce
strict procedures on who can or cannot manipulate the set of Pods, Services, and
Deployments that run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two
namespaces: development and production. Let's create two new namespaces to hold our work.
To be sure things are right, list all of the namespaces in our cluster.
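A sketch of those steps using the imperative commands (the namespaces could equally be created from YAML manifests as shown earlier):
kubectl create namespace development
kubectl create namespace production
kubectl get namespaces --show-labels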
A Kubernetes namespace provides the scope for Pods, Services, and Deployments in the cluster.
Users interacting with one namespace do not see the content in another namespace. To
demonstrate this, let's spin up a simple Deployment and Pods in the development namespace.
We have created a deployment whose replica size is 2 that is running the pod called snowflake
with a basic container that serves the hostname.
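A hedged sketch of creating that Deployment; the serve_hostname image name is an assumption taken from the usual Kubernetes examples:
kubectl create deployment snowflake \
  --image=registry.k8s.io/serve_hostname \
  --replicas=2 \
  -n development
kubectl get deployment,pods -n development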
And this is great, developers are able to do what they want, and they do not have to worry
about affecting content in the production namespace.
Let's switch to the production namespace and show how resources in one namespace are
hidden from the other. The production namespace should be empty, and the following
commands should return nothing.
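For example (a sketch, assuming the production namespace created earlier):
kubectl get deployment -n production
kubectl get pods -n production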
At this point, it should be clear that the resources users create in one namespace are hidden
from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can
provide different authorization rules for each namespace.
Each user community wants to be able to work in isolation from other communities, with its own
resources, policies, and constraints.
A cluster operator may create a Namespace for each unique user community.
What's next
• Learn more about setting the namespace preference.
• Learn more about setting the namespace for a request
• See namespaces design.
Upgrade A Cluster
This page provides an overview of the steps you should follow to upgrade a Kubernetes cluster.
The way that you upgrade a cluster depends on how you initially deployed it and on any
subsequent changes.
Upgrade approaches
kubeadm
If your cluster was deployed using the kubeadm tool, refer to Upgrading kubeadm clusters for
detailed information on how to upgrade the cluster.
Once you have upgraded the cluster, remember to install the latest version of kubectl.
Manual deployments
Caution: These steps do not account for third-party extensions such as network and storage
plugins.
You should manually update the control plane following this sequence: etcd (all instances), kube-apiserver (all control plane hosts), kube-controller-manager, kube-scheduler, and the cloud controller manager (if you use one).
For each node in your cluster, drain that node and then either replace it with a new node that
uses the 1.28 kubelet, or upgrade the kubelet on that node and bring the node back into service.
Caution: Draining nodes before upgrading kubelet ensures that pods are re-admitted and
containers are re-created, which may be necessary to resolve some security issues or other
important bugs.
Other deployments
Refer to the documentation for your cluster deployment tool to learn the recommended set up
steps for maintenance.
Post-upgrade tasks
Switch your cluster's storage API version
The objects that are serialized into etcd for a cluster's internal representation of the Kubernetes
resources active in the cluster are written using a particular version of the API.
When the supported API changes, these objects may need to be rewritten in the newer API.
Failure to do this will eventually result in resources that are no longer decodable or usable by
the Kubernetes API server.
For each affected object, fetch it using the latest supported API and then write it back also using
the latest supported API.
Update manifests
You can use kubectl convert command to convert manifests between different API versions. For
example:
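A hedged sketch; it assumes the separate kubectl convert plugin is installed and that a local pod.yaml currently uses an older API version:
kubectl convert -f pod.yaml --output-version v1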
The kubectl tool replaces the contents of pod.yaml with a manifest that sets kind to Pod
(unchanged), but with a revised apiVersion.
Device Plugins
If your cluster is running device plugins and the node needs to be upgraded to a Kubernetes
release with a newer device plugin API version, device plugins must be upgraded to support
both versions before the node is upgraded, in order to guarantee that device allocations continue
to complete successfully during the upgrade.
Refer to API compatibility and Kubelet Device Manager API Versions for more details.
• Killercoda
• Play with Kubernetes
You also need to create a sample Deployment to experiment with the different types of
cascading deletion. You will need to recreate the Deployment for each type.
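A sketch of creating that sample Deployment, assuming the standard nginx Deployment example published with the Kubernetes documentation:
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
kubectl get pods -l app=nginx -o yaml   # each Pod lists its owning ReplicaSet, as below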
apiVersion: v1
...
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nginx-deployment-6b474476c4
    uid: 4fdcd81c-bd5d-41f7-97af-3a3b759af9a7
...
Using kubectl
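A hedged sketch, assuming the sample Deployment above is named nginx-deployment:
kubectl delete deployment nginx-deployment --cascade=foreground
# in a second terminal, inspect the object while it is terminating:
kubectl get deployment nginx-deployment -o json
While the delete is in progress, the object carries the foregroundDeletion finalizer, similar to: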
"kind": "Deployment",
"apiVersion": "apps/v1",
"metadata": {
"name": "nginx-deployment",
"namespace": "default",
"uid": "d1ce1b02-cae8-4288-8a53-30e84d8fa505",
"resourceVersion": "1363097",
"creationTimestamp": "2021-07-08T20:24:37Z",
"deletionTimestamp": "2021-07-08T20:27:39Z",
"finalizers": [
"foregroundDeletion"
]
...
You can delete objects using background cascading deletion using kubectl or the Kubernetes
API.
Kubernetes uses background cascading deletion by default, and does so even if you run the
following commands without the --cascade flag or the propagationPolicy argument.
Using kubectl
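A hedged sketch, again assuming a Deployment named nginx-deployment (because background is the default, the --cascade flag can also be omitted):
kubectl delete deployment nginx-deployment --cascade=background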
"kind": "Status",
"apiVersion": "v1",
...
"status": "Success",
"details": {
"name": "nginx-deployment",
"group": "apps",
"kind": "deployments",
"uid": "cc9eefb9-2d49-4445-b1c1-d261c9396456"
}
Using kubectl
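A hedged sketch of orphaning the dependents, assuming the same Deployment name:
kubectl delete deployment nginx-deployment --cascade=orphan
While the delete is in progress, the object carries the orphan finalizer, similar to: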
"kind": "Deployment",
"apiVersion": "apps/v1",
"namespace": "default",
"uid": "6f577034-42a0-479d-be21-78018c466f1f",
"creationTimestamp": "2021-07-09T16:46:37Z",
"deletionTimestamp": "2021-07-09T16:47:08Z",
"deletionGracePeriodSeconds": 0,
"finalizers": [
"orphan"
],
...
You can check that the Pods managed by the Deployment are still running:
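For example (a sketch; the app=nginx label is taken from the sample Deployment assumed above):
kubectl get pods -l app=nginx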
What's next
• Learn about owners and dependents in Kubernetes.
• Learn about Kubernetes finalizers.
• Learn about garbage collection.
• Killercoda
• Play with Kubernetes
The version of Kubernetes that you need depends on which KMS API version you have selected.
Kubernetes recommends using KMS v2.
• If you selected KMS API v2, you should use Kubernetes v1.28 (if you are running a
different version of Kubernetes that also supports the v2 KMS API, switch to the
documentation for that version of Kubernetes).
• If you selected KMS API v1 to support clusters prior to version v1.27 or if you have a
legacy KMS plugin that only supports KMS v1, any supported Kubernetes version will
work. This API is deprecated as of Kubernetes v1.28. Kubernetes does not recommend the
use of this API.
KMS v1
KMS v2
• For version 1.25 and 1.26, enabling the feature via kube-apiserver feature gate is required.
Set --feature-gates=KMSv2=true to configure a KMS v2 provider. For environments where
all API servers are running version 1.28 or later, and you do not require the ability to
downgrade to Kubernetes v1.27, you can enable the KMSv2KDF feature gate (a beta
feature) for more robust data encryption key generation. The Kubernetes project
recommends enabling KMS v2 KDF if those preconditions are met.
Caution:
The KMS v2 API and implementation changed in incompatible ways in-between the alpha
release in v1.25 and the beta release in v1.27. Attempting to upgrade from old versions with the
alpha feature enabled will result in data loss.
Running mixed API server versions with some servers at v1.27, and others at v1.28 with the
KMSv2KDF feature gate enabled is not supported - and is likely to result in data loss.
The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The
data is encrypted using a data encryption key (DEK). The DEKs are encrypted with a key
encryption key (KEK) that is stored and managed in a remote KMS.
With KMS v2, there are two ways for the API server to generate a DEK. Kubernetes defaults to
generating a new DEK at API server startup, which is then reused for resource encryption.
However, if you use KMS v2 and enable the KMSv2KDF feature gate, then Kubernetes instead
generates a new DEK per encryption: the API server uses a key derivation function to generate
single use data encryption keys from a secret seed combined with some random data.
Whichever approach you configure, the DEK or seed is also rotated whenever the KEK is
rotated (see Understanding key_id and Key Rotation section below for more details).
The KMS provider uses gRPC to communicate with a specific KMS plugin over a UNIX domain
socket. The KMS plugin, which is implemented as a gRPC server and deployed on the same
host(s) as the Kubernetes control plane, is responsible for all communication with the remote
KMS.
Caution:
If you are running virtual machine (VM) based nodes that leverage VM state store with this
feature, using KMS v2 is insecure and an information security risk unless you also explicitly
enable the KMSv2KDF feature gate.
With KMS v2, the API server uses AES-GCM with a 12 byte nonce (8 byte atomic counter and 4
bytes random data) for encryption. The following issues could occur if the VM is saved and
restored:
1. The counter value may be lost or corrupted if the VM is saved in an inconsistent state or
restored improperly. This can lead to a situation where the same counter value is used
twice, resulting in the same nonce being used for two different messages.
2. If the VM is restored to a previous state, the counter value may be set back to its previous
value, resulting in the same nonce being used again.
Although both of these cases are partially mitigated by the 4 byte random nonce, this can
compromise the security of the encryption.
If you have enabled the KMSv2KDF feature gate and are using KMS v2 (not KMS v1), the API
server generates single use data encryption keys from a secret seed. This eliminates the need for
a counter based nonce while avoiding nonce collision concerns. It also removes any specific
concerns with using KMS v2 and VM state store.
KMS v1
• apiVersion: API Version for KMS provider. Leave this value empty or set it to v1.
• name: Display name of the KMS plugin. Cannot be changed once set.
• endpoint: Listen address of the gRPC server (KMS plugin). The endpoint is a UNIX
domain socket.
• cachesize: Number of data encryption keys (DEKs) to be cached in the clear. When
cached, DEKs can be used without another call to the KMS; whereas DEKs that are not
cached require a call to the KMS to unwrap.
• timeout: How long should kube-apiserver wait for kms-plugin to respond before
returning an error (default is 3 seconds).
KMS v2
Refer to your cloud provider for instructions on enabling the cloud provider-specific KMS
plugin.
You can develop a KMS plugin gRPC server using a stub file available for Go. For other
languages, you use a proto file to create a stub file that you can use to develop the gRPC server
code.
KMS v1
• Using Go: Use the functions and data structures in the stub file: api.pb.go to develop the
gRPC server code
• Using languages other than Go: Use the protoc compiler with the proto file: api.proto to
generate a stub file for the specific language
KMS v2
• Using Go: A high level library is provided to make the process easier. Low level
implementations can use the functions and data structures in the stub file: api.pb.go to
develop the gRPC server code
• Using languages other than Go: Use the protoc compiler with the proto file: api.proto to
generate a stub file for the specific language
Then use the functions and data structures in the stub file to develop the server code.
Notes
KMS v1
In response to procedure call Version, a compatible KMS plugin should return v1beta1 as
VersionResponse.version.
• message version: v1beta1
• All messages from KMS provider have the version field set to v1beta1.
The plugin is implemented as a gRPC server that listens at UNIX domain socket. The
plugin deployment should create a file on the file system to run the gRPC unix domain
socket connection. The API server (gRPC client) is configured with the KMS provider
(gRPC server) unix domain socket endpoint in order to communicate with it. An abstract
Linux socket may be used by starting the endpoint with /@, i.e. unix:///@foo. Care must
be taken when using this type of socket as they do not have concept of ACL (unlike
traditional file based sockets). However, they are subject to Linux networking namespace,
so will only be accessible to containers within the same pod unless host networking is
used.
KMS v2
In response to procedure call Status, a compatible KMS plugin should return v2beta1 as
StatusResponse.version, "ok" as StatusResponse.healthz and a key_id (remote KMS KEK
ID) as StatusResponse.key_id.
The API server polls the Status procedure call approximately every minute when
everything is healthy, and every 10 seconds when the plugin is not healthy. Plugins must
take care to optimize this call as it will be under constant load.
• Encryption
The EncryptRequest procedure call provides the plaintext and a UID for logging purposes.
The response must include the ciphertext, the key_id for the KEK used, and, optionally,
any metadata that the KMS plugin needs to aid in future DecryptRequest calls (via the
annotations field). The plugin must guarantee that any distinct plaintext results in a
distinct response (ciphertext, key_id, annotations).
If the plugin returns a non-empty annotations map, all map keys must be fully qualified
domain names such as example.com. An example use case of annotation is
{"kms.example.io/remote-kms-auditid":"<audit ID used by the remote KMS>"}
The API server does not perform the EncryptRequest procedure call at a high rate. Plugin
implementations should still aim to keep each request's latency at under 100 milliseconds.
• Decryption
The DecryptRequest procedure call provides the (ciphertext, key_id, annotations) from
EncryptRequest and a UID for logging purposes. As expected, it is the inverse of the
EncryptRequest call. Plugins must verify that the key_id is one that they understand -
they must not attempt to decrypt data unless they are sure that it was encrypted by them
at an earlier time.
The API server may perform thousands of DecryptRequest procedure calls on startup to
fill its watch cache. Thus plugin implementations must perform these calls as quickly as
possible, and should aim to keep each request's latency at under 10 milliseconds.
Understanding key_id and Key Rotation
• The key_id is the public, non-secret name of the remote KMS KEK that is currently in use.
It may be logged during regular operation of the API server, and thus must not contain
any private data. Plugin implementations are encouraged to use a hash to avoid leaking
any data. The KMS v2 metrics take care to hash this value before exposing it via the /
metrics endpoint.
The API server considers the key_id returned from the Status procedure call to be
authoritative. Thus, a change to this value signals to the API server that the remote KEK
has changed, and data encrypted with the old KEK should be marked stale when a no-op
write is performed (as described below). If an EncryptRequest procedure call returns a
key_id that is different from Status, the response is thrown away and the plugin is
considered unhealthy. Thus implementations must guarantee that the key_id returned
from Status will be the same as the one returned by EncryptRequest. Furthermore,
plugins must ensure that the key_id is stable and does not flip-flop between values (i.e.
during a remote KEK rotation).
Plugins must not re-use key_ids, even in situations where a previously used remote KEK
has been reinstated. For example, if a plugin was using key_id=A, switched to key_id=B,
and then went back to key_id=A - instead of reporting key_id=A the plugin should report
some derivative value such as key_id=A_001 or use a new value such as key_id=C.
Since the API server polls Status about every minute, key_id rotation is not immediate.
Furthermore, the API server will coast on the last valid state for about three minutes.
Thus if a user wants to take a passive approach to storage migration (i.e. by waiting), they
must schedule a migration to occur at 3 + N + M minutes after the remote KEK has been
rotated (N is how long it takes the plugin to observe the key_id change and M is the
desired buffer to allow config changes to be processed - a minimum M of five minutes is
recommended). Note that no API server restart is required to perform KEK rotation.
Caution: Because you don't control the number of writes performed with the DEK, the
Kubernetes project recommends rotating the KEK at least every 90 days.
The plugin is implemented as a gRPC server that listens at UNIX domain socket. The
plugin deployment should create a file on the file system to run the gRPC unix domain
socket connection. The API server (gRPC client) is configured with the KMS provider
(gRPC server) unix domain socket endpoint in order to communicate with it. An abstract
Linux socket may be used by starting the endpoint with /@, i.e. unix:///@foo. Care must
be taken when using this type of socket as they do not have concept of ACL (unlike
traditional file based sockets). However, they are subject to Linux networking namespace,
so will only be accessible to containers within the same pod unless host networking is
used.
The KMS plugin can communicate with the remote KMS using any protocol supported by the
KMS. All configuration data, including authentication credentials the KMS plugin uses to
communicate with the remote KMS, are stored and managed by the KMS plugin independently.
The KMS plugin can encode the ciphertext with additional metadata that may be required
before sending it to the KMS for decryption (KMS v2 makes this process easier by providing a
dedicated annotations field).
Ensure that the KMS plugin runs on the same host(s) as the Kubernetes API server(s).
1. Create a new EncryptionConfiguration file using the appropriate properties for the kms
provider to encrypt resources like Secrets and ConfigMaps. If you want to encrypt an
extension API that is defined in a CustomResourceDefinition, your cluster must be
running Kubernetes v1.26 or newer.
KMS v1
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - kms:
          name: myKmsPluginFoo
          endpoint: unix:///tmp/socketfile.sock
          cachesize: 100
          timeout: 3s
      - kms:
          name: myKmsPluginBar
          endpoint: unix:///tmp/socketfile.sock
          cachesize: 100
          timeout: 3s
KMS v2
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - kms:
          apiVersion: v2
          name: myKmsPluginFoo
          endpoint: unix:///tmp/socketfile.sock
          timeout: 3s
      - kms:
          apiVersion: v2
          name: myKmsPluginBar
          endpoint: unix:///tmp/socketfile.sock
          timeout: 3s
The following table summarizes the health check endpoints for each KMS version:
Single Healthcheck means that the only health check endpoint is /healthz/kms-providers.
Individual Healthchecks means that each KMS plugin has an associated health check endpoint
based on its location in the encryption config: /healthz/kms-provider-0, /healthz/kms-
provider-1 etc.
These healthcheck endpoint paths are hard coded and generated/controlled by the server. The
indices for individual healthchecks correspond to the order in which the KMS encryption
config is processed.
At a high level, restarting an API server when a KMS plugin is unhealthy is unlikely to make
the situation better. It can make the situation significantly worse by throwing away the API
server's DEK cache. Thus the general recommendation is to ignore the API server KMS healthz
checks for liveness purposes, i.e. /livez?exclude=kms-providers.
Until the steps defined in Ensuring all secrets are encrypted are performed, the providers list
should end with the identity: {} provider to allow unencrypted data to be read. Once all
resources are encrypted, the identity provider should be removed to prevent the API server
from honoring unencrypted data.
For details about the EncryptionConfiguration format, please check the API server encryption
API reference.
Verifying that the data is encrypted
When encryption at rest is correctly configured, resources are encrypted on write. After
restarting your kube-apiserver, any newly created or updated Secret or other resource types
configured in EncryptionConfiguration should be encrypted when stored. To verify, you can
use the etcdctl command line program to retrieve the contents of your secret data.
2. Using the etcdctl command line, read that secret out of etcd:
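A hedged sketch, assuming a Secret named secret1 was created in the default namespace for this test (for example with kubectl create secret generic secret1 -n default --from-literal=mykey=mydata):
ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret1 [...] | hexdump -C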
where [...] contains the additional arguments for connecting to the etcd server.
3. Verify the stored secret is prefixed with k8s:enc:kms:v1: for KMS v1 or prefixed with
k8s:enc:kms:v2: for KMS v2, which indicates that the kms provider has encrypted the
resulting data.
4. Verify that the secret is correctly decrypted when retrieved via the API:
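For example (a sketch, using the same assumed secret1):
kubectl get secret secret1 -n default -o yaml
The output should contain mykey: bXlkYXRh, which is mydata encoded in base64.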
The following command reads all secrets and then updates them to apply server side
encryption. If an error occurs due to a conflicting write, retry the command. For larger clusters,
you may wish to subdivide the secrets by namespace or script an update.
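A sketch of such an update for all namespaces:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -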
1. Add the kms provider as the first entry in the configuration file as shown in the following
example.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: myKmsPlugin
          endpoint: unix:///tmp/socketfile.sock
      - aescbc:
          keys:
            - name: key1
              secret: <BASE 64 ENCODED SECRET>
3. Run the following command to force all secrets to be re-encrypted using the kms
provider.
1. Place the identity provider as the first entry in the configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}
      - kms:
          apiVersion: v2
          name: myKmsPlugin
          endpoint: unix:///tmp/socketfile.sock
• Killercoda
• Play with Kubernetes
Your Kubernetes server must be at or later than version v1.9. To check the version, enter
kubectl version.
About CoreDNS
CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. Like
Kubernetes, the CoreDNS project is hosted by the CNCF.
You can use CoreDNS instead of kube-dns in your cluster by replacing kube-dns in an existing
deployment, or by using tools like kubeadm that will deploy and upgrade the cluster for you.
Installing CoreDNS
For manual deployment or replacement of kube-dns, see the documentation at the CoreDNS
GitHub project.
Migrating to CoreDNS
Upgrading an existing cluster with kubeadm
In Kubernetes version 1.21, kubeadm removed its support for kube-dns as a DNS application.
For kubeadm v1.28, the only supported cluster DNS application is CoreDNS.
You can move to CoreDNS when you use kubeadm to upgrade a cluster that is using kube-dns.
In this case, kubeadm generates the CoreDNS configuration ("Corefile") based upon the kube-
dns ConfigMap, preserving configurations for stub domains, and upstream name server.
Upgrading CoreDNS
You can check the version of CoreDNS that kubeadm installs for each version of Kubernetes in
the page CoreDNS version in Kubernetes.
CoreDNS can be upgraded manually in case you want to only upgrade CoreDNS or use your
own custom image. There is a helpful guideline and walkthrough available to ensure a smooth
upgrade. Make sure the existing CoreDNS configuration ("Corefile") is retained when upgrading
your cluster.
If you are upgrading your cluster using the kubeadm tool, kubeadm can take care of retaining
the existing CoreDNS configuration automatically.
Tuning CoreDNS
When resource utilisation is a concern, it may be useful to tune the configuration of CoreDNS.
For more details, check out the documentation on scaling CoreDNS.
What's next
You can configure CoreDNS to support many more use cases than kube-dns does by modifying
the CoreDNS configuration ("Corefile"). For more information, see the documentation for the
kubernetes CoreDNS plugin, or read the Custom DNS Entries for Kubernetes. in the CoreDNS
blog.
• Killercoda
• Play with Kubernetes
Introduction
NodeLocal DNSCache improves Cluster DNS performance by running a DNS caching agent on
cluster nodes as a DaemonSet. In today's architecture, Pods in 'ClusterFirst' DNS mode reach
out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint
via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the
DNS caching agent running on the same node, thereby avoiding iptables DNAT rules and
connection tracking. The local caching agent will query kube-dns service for cache misses of
cluster hostnames ("cluster.local" suffix by default).
Motivation
• With the current DNS architecture, it is possible that Pods with the highest DNS QPS
have to reach out to a different node, if there is no local kube-dns/CoreDNS instance.
Having a local cache will help improve the latency in such scenarios.
• Skipping iptables DNAT and connection tracking will help reduce conntrack races and
avoid UDP DNS entries filling up conntrack table.
• Connections from the local caching agent to kube-dns service can be upgraded to TCP.
TCP conntrack entries will be removed on connection close in contrast with UDP entries
that have to timeout (default nf_conntrack_udp_timeout is 30 seconds)
• Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped
UDP packets and DNS timeouts usually up to 30s (3 retries + 10s timeout). Since the
nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
• Negative caching can be re-enabled, thereby reducing the number of queries for the kube-
dns service.
Architecture Diagram
This is the path followed by DNS Queries after NodeLocal DNSCache is enabled:
Configuration
Note: The local listen IP address for NodeLocal DNSCache can be any address that can be
guaranteed to not collide with any existing IP in your cluster. It's recommended to use an
address with a local scope, for example, from the 'link-local' range '169.254.0.0/16' for IPv4 or
from the 'Unique Local Address' range in IPv6 'fd00::/8'.
• If using IPv6, the CoreDNS configuration file needs to enclose all the IPv6 addresses into
square brackets if used in 'IP:Port' format. If you are using the sample manifest from the
previous point, this will require you to modify the configuration line L70 like this:
"health [__PILLAR__LOCAL__DNS__]:8080"
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/
__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kube
dns/g" nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/
__PILLAR__DNS__DOMAIN__/$domain/g; s/,__PILLAR__DNS__SERVER__//g; s/
__PILLAR__CLUSTER__DNS__/$kubedns/g" nodelocaldns.yaml
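After the substitution, apply the manifest (a sketch; it assumes you are working from the upstream nodelocaldns.yaml example manifest):
kubectl create -f nodelocaldns.yaml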
Once enabled, the node-local-dns Pods will run in the kube-system namespace on each of the
cluster nodes. This Pod runs CoreDNS in cache mode, so all CoreDNS metrics exposed by the
different plugins will be available on a per-node basis.
You can disable this feature by removing the DaemonSet, using kubectl delete -f <manifest>.
You should also revert any changes you made to the kubelet configuration.
The default cache size is 10000 entries, which uses about 30 MB when completely filled.
This would be the memory usage for each server block (if the cache gets completely filled).
Memory usage can be reduced by specifying smaller cache sizes.
The number of concurrent queries is linked to the memory demand, because each extra
goroutine used for handling a query requires an amount of memory. You can set an upper limit
using the max_concurrent option in the forward plugin.
If a node-local-dns Pod attempts to use more memory than is available (because of total system
resources, or because of a configured resource limit), the operating system may shut down that
pod's container. If this happens, the container that is terminated (“OOMKilled”) does not clean
up the custom packet filtering rules that it previously added during startup. The node-local-dns
container should get restarted (since managed as part of a DaemonSet), but this will lead to a
brief DNS downtime each time that the container fails: the packet filtering rules direct DNS
queries to a local Pod that is unhealthy.
You can determine a suitable memory limit by running node-local-dns pods without a limit and
measuring the peak usage. You can also set up and use a VerticalPodAutoscaler in recommender
mode, and then check its recommendations.
This document describes how to configure and use kernel parameters within a Kubernetes
cluster using the sysctl interface.
Note: Starting from Kubernetes version 1.23, the kubelet supports the use of either / or . as
separators for sysctl names. Starting from Kubernetes version 1.25, setting Sysctls for a Pod
supports setting sysctls with slashes. For example, you can represent the same sysctl name as
kernel.shm_rmid_forced using a period as the separator, or as kernel/shm_rmid_forced using a
slash as a separator. For more sysctl parameter conversion method details, please refer to the
page sysctl.d(5) from the Linux man-pages project.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured
to communicate with your cluster. It is recommended to run this tutorial on a cluster with at
least two nodes that are not acting as control plane hosts. If you do not already have a cluster,
you can create one by using minikube or you can use one of these Kubernetes playgrounds:
• Killercoda
• Play with Kubernetes
For some steps, you also need to be able to reconfigure the command line options for the
kubelets running on your cluster.
Listing all Sysctl Parameters
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime.
Parameters are available via the /proc/sys/ virtual process file system. The parameters cover
various subsystems such as the kernel (common prefix: kernel.), networking (common prefix: net.),
virtual memory (common prefix: vm.), and MDADM (common prefix: dev.). To list all parameters, you can run:
sudo sysctl -a
In addition to being properly namespaced, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod:
• must not have any influence on any other pod on the node
• must not allow harming the node's health
• must not allow gaining CPU or memory resources outside of the resource limits of a pod.
By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls
are supported in the safe set:
• kernel.shm_rmid_forced,
• net.ipv4.ip_local_port_range,
• net.ipv4.tcp_syncookies,
• net.ipv4.ping_group_range (since Kubernetes 1.18),
• net.ipv4.ip_unprivileged_port_start (since Kubernetes 1.22).
Note:
• The net.* sysctls are not allowed with host networking enabled.
• The net.ipv4.tcp_syncookies sysctl is not namespaced on Linux kernel version 4.4 or
lower.
This list will be extended in future Kubernetes versions when the kubelet supports better
isolation mechanisms.
All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on
a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.
With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very
special situations such as high-performance or real-time application tuning. Unsafe sysctls are
enabled on a node-by-node basis with a flag of the kubelet; for example:
kubelet --allowed-unsafe-sysctls \
'kernel.msg*,net.core.somaxconn' ...
The following sysctls are known to be namespaced. This list could change in future versions of
the Linux kernel.
• kernel.shm*,
• kernel.msg*,
• kernel.sem,
• fs.mqueue.*,
• Those net.* that can be set in container networking namespace. However, there are
exceptions (e.g., net.netfilter.nf_conntrack_max and
net.netfilter.nf_conntrack_expect_max can be set in container networking namespace but
are unnamespaced before Linux 5.12.2).
Sysctls with no namespace are called node-level sysctls. If you need to set them, you must
configure them manually on each node's operating system, or use a DaemonSet with
privileged containers to do so.
Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all
containers in the same pod.
This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two
unsafe sysctls net.core.somaxconn and kernel.msgmax. There is no distinction between safe and
unsafe sysctls in the specification.
Warning: Only modify sysctl parameters after you understand their effects, to avoid
destabilizing your operating system.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"
  ...
Warning: Due to their nature of being unsafe, the use of unsafe sysctls is at-your-own-risk and
can lead to severe problems like wrong behavior of containers, resource shortage or complete
breakage of a node.
It is good practice to consider nodes with special sysctl settings as tainted within a cluster, and
only schedule pods onto them which need those sysctl settings. It is suggested to use the
Kubernetes taints and toleration feature to implement this.
A pod with the unsafe sysctls will fail to launch on any node which has not enabled those two
unsafe sysctls explicitly. As with node-level sysctls, it is recommended to use the taints and
tolerations feature, or taints on nodes, to schedule those pods onto the right nodes.
The Kubernetes Memory Manager enables the feature of guaranteed memory (and hugepages)
allocation for pods in the Guaranteed QoS class.
The Memory Manager employs hint generation protocol to yield the most suitable NUMA
affinity for a pod. The Memory Manager feeds the central manager (Topology Manager) with
these affinity hints. Based on both the hints and Topology Manager policy, the pod is rejected
or admitted to the node.
Moreover, the Memory Manager ensures that the memory which a pod requests is allocated
from a minimum number of NUMA nodes.
• Killercoda
• Play with Kubernetes
Your Kubernetes server must be at or later than version v1.21. To check the version, enter
kubectl version.
To align memory resources with other requested resources in a Pod spec:
• the CPU Manager should be enabled and proper CPU Manager policy should be
configured on a Node. See control CPU Management Policies;
• the Topology Manager should be enabled and proper Topology Manager policy should be
configured on a Node. See control Topology Management Policies.
Starting from v1.22, the Memory Manager is enabled by default through MemoryManager
feature gate.
Preceding v1.22, the kubelet must be started with the following flag:
--feature-gates=MemoryManager=true
The Memory Manager is a Hint Provider, and it provides topology hints for the Topology
Manager which then aligns the requested resources according to these topology hints. It also
enforces cgroups (i.e. cpuset.mems) for pods. The complete flow diagram concerning pod
admission and deployment process is illustrated in Memory Manager KEP: Design Overview
and below:
During this process, the Memory Manager updates its internal counters stored in Node Map
and Memory Maps to manage guaranteed memory allocation.
The Memory Manager updates the Node Map during the startup and runtime as follows.
Startup
This occurs once a node administrator employs --reserved-memory (section Reserved memory
flag). In this case, the Node Map becomes updated to reflect this reservation as illustrated in
Memory Manager KEP: Memory Maps at start-up (with examples).
The administrator must provide --reserved-memory flag when Static policy is configured.
Runtime
Reference Memory Manager KEP: Memory Maps at runtime (with examples) illustrates how a
successful pod deployment affects the Node Map, and it also relates to how potential Out-of-
Memory (OOM) situations are handled further by Kubernetes or operating system.
An important topic in the context of Memory Manager operation is the management of NUMA
groups. Each time a pod's memory request exceeds the capacity of a single NUMA node, the
Memory Manager attempts to create a group that comprises several NUMA nodes and offers
extended memory capacity. The problem is solved as elaborated in Memory Manager KEP:
How to enable the guaranteed memory allocation over many NUMA nodes?. Also,
Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)
illustrates how the management of groups occurs.
Policies
Memory Manager supports two policies. You can select a policy via a kubelet flag --memory-
manager-policy:
• None (default)
• Static
None policy
This is the default policy and does not affect the memory allocation in any way. It acts the same
as if the Memory Manager is not present at all.
The None policy returns default topology hint. This special hint denotes that Hint Provider
(Memory Manager in this case) has no preference for NUMA affinity with any resource.
Static policy
In the case of the Guaranteed pod, the Static Memory Manager policy returns topology hints
relating to the set of NUMA nodes where the memory can be guaranteed, and reserves the
memory through updating the internal NodeMap object.
In the case of the BestEffort or Burstable pod, the Static Memory Manager policy sends back the
default topology hint as there is no request for the guaranteed memory, and does not reserve
the memory in the internal NodeMap object.
The Node Allocatable mechanism is commonly used by node administrators to reserve K8S
node system resources for the kubelet or operating system processes in order to enhance the
node stability. A dedicated set of flags can be used for this purpose to set the total amount of
reserved memory for a node. This pre-configured value is subsequently utilized to calculate the
real amount of node's "allocatable" memory available to pods.
The Kubernetes scheduler incorporates "allocatable" to optimise pod scheduling process. The
foregoing flags include --kube-reserved, --system-reserved and --eviction-threshold. The sum of
their values will account for the total amount of reserved memory.
A new --reserved-memory flag was added to Memory Manager to allow for this total reserved
memory to be split (by a node administrator) and accordingly reserved across many NUMA
nodes.
The flag specifies a comma-separated list of memory reservations of different memory types per
NUMA node. Memory reservations across multiple NUMA nodes can be specified using
semicolon as separator. This parameter is only useful in the context of the Memory Manager
feature. The Memory Manager will not use this reserved memory for the allocation of container
workloads.
For example, if you have a NUMA node "NUMA0" with 10Gi of memory available, and the --
reserved-memory was specified to reserve 1Gi of memory at "NUMA0", the Memory Manager
assumes that only 9Gi is available for containers.
You can omit this parameter, however, you should be aware that the quantity of reserved
memory from all NUMA nodes should be equal to the quantity of memory specified by the
Node Allocatable feature. If at least one node allocatable parameter is non-zero, you will need
to specify --reserved-memory for at least one NUMA node. In fact, eviction-hard threshold
value is equal to 100Mi by default, so if Static policy is used, --reserved-memory is obligatory.
Also, avoid the following configurations:
1. duplicates, i.e. the same NUMA node or memory type, but with a different value;
2. setting zero limit for any of memory types;
3. NUMA node IDs that do not exist in the machine hardware;
4. memory type names different than memory or hugepages-<size> (hugepages of particular
<size> should also exist).
Syntax:
--reserved-memory N:memory-type1=value1,memory-type2=value2,...
Example usage:
--reserved-memory 0:memory=1Gi,hugepages-1Gi=2Gi
or
--reserved-memory '0:memory=1Gi;1:memory=2Gi'
When you specify values for the --reserved-memory flag, you must comply with the settings that
you previously provided via the Node Allocatable feature flags. That is, the following rule must be
obeyed for each memory type: the sum of the --reserved-memory values across all NUMA nodes
must equal kube-reserved + system-reserved + eviction-threshold for that memory type.
If you do not follow this rule, the Memory Manager will show an error on startup.
In other words, the example above illustrates that for the conventional memory
(type=memory), we reserve 3Gi in total, i.e.:
• --kube-reserved=cpu=500m,memory=50Mi
• --system-reserved=cpu=123m,memory=333Mi
• --eviction-hard=memory.available<500Mi
Note: The default hard eviction threshold is 100MiB, and not zero. Remember to increase the
quantity of memory that you reserve by setting --reserved-memory by that hard eviction
threshold. Otherwise, the kubelet will not start Memory Manager and display an error.
Here is an example of a correct configuration:
--feature-gates=MemoryManager=true
--kube-reserved=cpu=4,memory=4Gi
--system-reserved=cpu=1,memory=1Gi
--memory-manager-policy=Static
--reserved-memory '0:memory=3Gi;1:memory=2148Mi'
The following excerpts from pod manifests assign a pod to the Guaranteed QoS class.
Pod with integer CPU(s) runs in the Guaranteed QoS class, when requests are equal to limits:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
Also, a pod sharing CPU(s) runs in the Guaranteed QoS class, when requests are equal to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
Notice that both CPU and memory requests must be specified (and equal to the corresponding
limits) for a Pod to be placed in the Guaranteed QoS class.
Troubleshooting
The following means can be used to troubleshoot the reason why a pod could not be deployed
or became rejected at a node:
• a node has not enough resources available to satisfy the pod's request
• the pod's request is rejected due to particular Topology Manager policy constraints
Use kubectl describe pod <id> or kubectl get events to obtain detailed error message:
Warning TopologyAffinityError 10m kubelet, dell8 Resources cannot be allocated with
Topology locality
System logs
The set of hints that Memory Manager generated for the pod can be found in the logs. Also, the
set of hints generated by CPU Manager should be present in the logs.
Topology Manager merges these hints to calculate a single best hint. The best hint should be
also present in the logs.
The best hint indicates where to allocate all the resources. Topology Manager tests this hint
against its current policy, and based on the verdict, it either admits the pod to the node or
rejects it.
Also, search the logs for occurrences associated with the Memory Manager, e.g. to find out
information about cgroups and cpuset.mems updates.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed
spec:
  containers:
  - name: guaranteed
    image: consumer
    imagePullPolicy: Never
    resources:
      limits:
        cpu: "2"
        memory: 150Gi
      requests:
        cpu: "2"
        memory: 150Gi
    command: ["sleep", "infinity"]
Next, let us log into the node where it was deployed and examine the state file in /var/lib/
kubelet/memory_manager_state:
{
"policyName":"Static",
"machineState":{
"0":{
"numberOfAssignments":1,
"memoryMap":{
"hugepages-1Gi":{
"total":0,
"systemReserved":0,
"allocatable":0,
"reserved":0,
"free":0
},
"memory":{
"total":134987354112,
"systemReserved":3221225472,
"allocatable":131766128640,
"reserved":131766128640,
"free":0
}
},
"nodes":[
0,
1
]
},
"1":{
"numberOfAssignments":1,
"memoryMap":{
"hugepages-1Gi":{
"total":0,
"systemReserved":0,
"allocatable":0,
"reserved":0,
"free":0
},
"memory":{
"total":135286722560,
"systemReserved":2252341248,
"allocatable":133034381312,
"reserved":29295144960,
"free":103739236352
}
},
"nodes":[
0,
1
]
}
},
"entries":{
"fa9bdd38-6df9-4cf9-aa67-8c4814da37a8":{
"guaranteed":[
{
"numaAffinity":[
0,
1
],
"type":"memory",
"size":161061273600
}
]
}
},
"checksum":4142013182
}
It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
"numaAffinity":[
0,
1
],
The term "pinned" means that the pod's memory consumption is constrained (through the
cgroups configuration) to these NUMA nodes.
This automatically implies that Memory Manager instantiated a new group that comprises
these two NUMA nodes, i.e. 0 and 1 indexed NUMA nodes.
Notice that the management of groups is handled in a relatively complex manner; further
elaboration is provided in the relevant sections of the Memory Manager KEP.
For example, the total amount of free "conventional" memory in the group can be computed by
adding up the free memory available at every NUMA node in the group, i.e., in the "memory"
section of NUMA node 0 ("free":0) and NUMA node 1 ("free":103739236352). So, the total
amount of free "conventional" memory in this group is equal to 0 + 103739236352 bytes.
The line "systemReserved":3221225472 indicates that the administrator of this node reserved
3221225472 bytes (i.e. 3Gi) to serve kubelet and system processes at NUMA node 0, by using --
reserved-memory flag.
The kubelet provides a PodResourceLister gRPC service to enable discovery of resources and associated metadata. By using its List gRPC endpoint, information about reserved memory for each container can be retrieved; it is contained in the protobuf ContainerMemory message. This information can only be retrieved for pods in the Guaranteed QoS class.
What's next
• Memory Manager KEP: Design Overview
• Memory Manager KEP: Memory Maps at start-up (with examples)
• Memory Manager KEP: Memory Maps at runtime (with examples)
• Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)
• Memory Manager KEP: The Concept of Node Map and Memory Maps
• Memory Manager KEP: How to enable the guaranteed memory allocation over many
NUMA nodes?
Verify Signed Kubernetes Artifacts
FEATURE STATE: Kubernetes v1.26 [beta]
URL=https://dl.k8s.io/release/v1.28.4/bin/linux/amd64
BINARY=kubectl
FILES=(
"$BINARY"
"$BINARY.sig"
"$BINARY.cert"
)
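A hedged sketch of downloading and verifying these files with cosign (the certificate identity and OIDC issuer shown are assumptions; consult the Kubernetes release documentation for the authoritative values):
for FILE in "${FILES[@]}"; do
  curl -fsSLO "$URL/$FILE"
done
cosign verify-blob "$BINARY" \
  --signature "$BINARY".sig \
  --certificate "$BINARY".cert \
  --certificate-identity krel-staging@k8s-releng-prod.iam.gserviceaccount.com \
  --certificate-oidc-issuer https://accounts.google.com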
Note:
Pick one image from this list and verify its signature using the cosign verify command:
To verify all signed control plane images for the latest stable version (v1.28.4), please run the
following commands:
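A hedged sketch of such a verification loop (the image list, certificate identity, and OIDC issuer here are assumptions; check the release documentation for the authoritative values):
for image in kube-apiserver kube-controller-manager kube-proxy kube-scheduler; do
  cosign verify "registry.k8s.io/${image}:v1.28.4" \
    --certificate-identity krel-trust@k8s-releng-prod.iam.gserviceaccount.com \
    --certificate-oidc-issuer https://accounts.google.com
done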
Once you have verified an image, you can specify the image by its digest in your Pod manifests
as per this example:
registry-url/image-name@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
For more information, please refer to the Image Pull Policy section.
• Installation
• Configuration Options
• Killercoda
• Play with Kubernetes
To check the version, enter kubectl version.
Each node in your cluster must have at least 300 MiB of memory.
A few of the steps on this page require you to run the metrics-server service in your cluster. If
you have the metrics-server running, you can skip those steps.
If you are running Minikube, run the following command to enable the metrics-server:
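For example, with the minikube addons command:
minikube addons enable metrics-server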
To see whether the metrics-server is running, or another provider of the resource metrics API
(metrics.k8s.io), run the following command:
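For example:
kubectl get apiservices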
If the resource metrics API is available, the output includes a reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
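For example (mem-example is the namespace used by the manifests on this page):
kubectl create namespace mem-example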
In this exercise, you create a Pod that has one Container. The Container has a memory request
of 100 MiB and a memory limit of 200 MiB. Here's the configuration file for the Pod:
pods/resource/memory-request-limit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
The args section in the configuration file provides arguments for the Container when it starts.
The "--vm-bytes", "150M" arguments tell the Container to attempt to allocate 150 MiB of
memory.
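Create the Pod in the mem-example namespace and then inspect it, applying the example manifest from k8s.io/examples at the path shown above:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit.yaml --namespace=mem-example
kubectl get pod memory-demo --namespace=mem-example
kubectl get pod memory-demo --output=yaml --namespace=mem-example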
The output shows that the one Container in the Pod has a memory request of 100 MiB and a
memory limit of 200 MiB.
...
resources:
  requests:
    memory: 100Mi
  limits:
    memory: 200Mi
...
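Fetch the Pod's metrics with kubectl top (this relies on the metrics-server from the prerequisites):
kubectl top pod memory-demo --namespace=mem-example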
The output shows that the Pod is using about 162,900,000 bytes of memory, which is about 150
MiB. This is greater than the Pod's 100 MiB request, but within the Pod's 200 MiB limit.
pods/resource/memory-request-limit-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-2
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-2-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
In the args section of the configuration file, you can see that the Container will attempt to
allocate 250 MiB of memory, which is well above the 100 MiB limit.
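Create the Pod and check on it, following the same pattern as before:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-2.yaml --namespace=mem-example
kubectl get pod memory-demo-2 --namespace=mem-example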
At this point, the Container might be running or killed. Repeat the preceding command until
the Container is killed:
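For a more detailed view of the Container status, request the Pod as YAML:
kubectl get pod memory-demo-2 --output=yaml --namespace=mem-example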
The output shows that the Container was killed because it is out of memory (OOM):
lastState:
  terminated:
    containerID: 65183c1877aaec2e8427bc95609cc52677a454b56fcb24340dbd22917c23b10f
    exitCode: 137
    finishedAt: 2017-06-20T20:52:19Z
    reason: OOMKilled
    startedAt: null
The Container in this exercise can be restarted, so the kubelet restarts it. Repeat this command
several times to see that the Container is repeatedly killed and restarted:
The output shows that the Container is killed, restarted, killed again, restarted again, and so on:
The output shows that the Container starts and fails repeatedly:
The output includes a record of the Container being killed because of an out-of-memory
condition:
Warning OOMKilling Memory cgroup out of memory: Kill process 4481 (stress) score 1994 or
sacrifice child
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node has
enough available memory to satisfy the Pod's memory request.
In this exercise, you create a Pod that has a memory request so big that it exceeds the capacity
of any Node in your cluster. Here is the configuration file for a Pod that has one Container with
a request for 1000 GiB of memory, which likely exceeds the capacity of any Node in your
cluster.
pods/resource/memory-request-limit-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-3
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-3-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "1000Gi"
      limits:
        memory: "1000Gi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
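Create the Pod and check its status:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-3.yaml --namespace=mem-example
kubectl get pod memory-demo-3 --namespace=mem-example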
The output shows that the Pod status is PENDING. That is, the Pod is not scheduled to run on
any Node, and it will remain in the PENDING state indefinitely:
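To see why, view detailed information about the Pod, including events:
kubectl describe pod memory-demo-3 --namespace=mem-example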
The output shows that the Container cannot be scheduled because of insufficient memory on
the Nodes:
Events:
... Reason Message
------ -------
... FailedScheduling No nodes are available that match all of the following predicates::
Insufficient memory (3).
Memory units
The memory resource is measured in bytes. You can express memory as a plain integer or a
fixed-point integer with one of these suffixes: E, P, T, G, M, K, Ei, Pi, Ti, Gi, Mi, Ki. For example,
the following represent approximately the same value:
128974848, 129e6, 129M, 123Mi
• The Container has no upper bound on the amount of memory it uses. The Container
could use all of the memory available on the Node where it is running which in turn
could invoke the OOM Killer. Further, in case of an OOM Kill, a container with no
resource limits will have a greater chance of being killed.
• The Container is running in a namespace that has a default memory limit, and the
Container is automatically assigned the default limit. Cluster administrators can use a
LimitRange to specify a default value for the memory limit.
• The Pod can have bursts of activity where it makes use of memory that happens to be
available.
• The amount of memory a Pod can use during a burst is limited to some reasonable
amount.
Clean up
Delete your namespace. This deletes all the Pods that you created for this task:
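For example:
kubectl delete namespace mem-example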
What's next
For app developers
• Killercoda
• Play with Kubernetes
Your cluster must have at least 1 CPU available for use to run the task examples.
A few of the steps on this page require you to run the metrics-server service in your cluster. If
you have the metrics-server running, you can skip those steps.
If you are running Minikube, run the following command to enable metrics-server:
To see whether metrics-server (or another provider of the resource metrics API, metrics.k8s.io)
is running, type the following command:
If the resource metrics API is available, the output will include a reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a Namespace so that the resources you create in this exercise are isolated from the rest
of your cluster.
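For example (cpu-example is the namespace used by the manifests on this page):
kubectl create namespace cpu-example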
In this exercise, you create a Pod that has one container. The container has a request of 0.5 CPU
and a limit of 1 CPU. Here is the configuration file for the Pod:
pods/resource/cpu-request-limit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
The args section of the configuration file provides arguments for the container when it starts.
The -cpus "2" argument tells the Container to attempt to use 2 CPUs.
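Create the Pod in the cpu-example namespace and then inspect it, applying the example manifest from k8s.io/examples at the path shown above:
kubectl apply -f https://k8s.io/examples/pods/resource/cpu-request-limit.yaml --namespace=cpu-example
kubectl get pod cpu-demo --namespace=cpu-example
kubectl get pod cpu-demo --output=yaml --namespace=cpu-example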
The output shows that the one container in the Pod has a CPU request of 500 milliCPU and a
CPU limit of 1 CPU.
resources:
  limits:
    cpu: "1"
  requests:
    cpu: 500m
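Fetch the Pod's metrics:
kubectl top pod cpu-demo --namespace=cpu-example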
This example output shows that the Pod is using 974 milliCPU, which is slightly less than the
limit of 1 CPU specified in the Pod configuration.
Recall that by setting -cpus "2", you configured the Container to attempt to use 2 CPUs, but the
Container is only being allowed to use about 1 CPU. The container's CPU use is being throttled,
because the container is attempting to use more CPU resources than its limit.
Note: Another possible explanation for the CPU use being below 1.0 is that the Node might not
have enough CPU resources available. Recall that the prerequisites for this exercise require
your cluster to have at least 1 CPU available for use. If your Container runs on a Node that has
only 1 CPU, the Container cannot use more than 1 CPU regardless of the CPU limit specified
for the Container.
CPU units
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
• 1 AWS vCPU
• 1 GCP Core
• 1 Azure vCore
• 1 Hyperthread on a bare-metal Intel processor with Hyperthreading
Fractional values are allowed. A Container that requests 0.5 CPU is guaranteed half as much
CPU as a Container that requests 1 CPU. You can use the suffix m to mean milli. For example
100m CPU, 100 milliCPU, and 0.1 CPU are all the same. Precision finer than 1m is not allowed.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same
amount of CPU on a single-core, dual-core, or 48-core machine.
In this exercise, you create a Pod that has a CPU request so big that it exceeds the capacity of
any Node in your cluster. Here is the configuration file for a Pod that has one Container. The
Container requests 100 CPU, which is likely to exceed the capacity of any Node in your cluster.
pods/resource/cpu-request-limit-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo-2
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr-2
    image: vish/stress
    resources:
      limits:
        cpu: "100"
      requests:
        cpu: "100"
    args:
    - -cpus
    - "2"
The output shows that the Pod status is Pending. That is, the Pod has not been scheduled to run
on any Node, and it will remain in the Pending state indefinitely:
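To see why, view detailed information about the Pod, including events:
kubectl describe pod cpu-demo-2 --namespace=cpu-example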
The output shows that the Container cannot be scheduled because of insufficient CPU resources
on the Nodes:
Events:
Reason Message
------ -------
FailedScheduling No nodes are available that match all of the following predicates::
Insufficient cpu (3).
Delete your Pod:
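For example:
kubectl delete pod cpu-demo-2 --namespace=cpu-example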
• The Container has no upper bound on the CPU resources it can use. The Container could
use all of the CPU resources available on the Node where it is running.
• The Container is running in a namespace that has a default CPU limit, and the Container
is automatically assigned the default limit. Cluster administrators can use a LimitRange to
specify a default value for the CPU limit.
• The Pod can have bursts of activity where it makes use of CPU resources that happen to
be available.
• The amount of CPU resources a Pod can use during a burst is limited to some reasonable
amount.
Clean up
Delete your namespace:
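For example:
kubectl delete namespace cpu-example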
What's next
For app developers
This page shows how to configure Group Managed Service Accounts (GMSA) for Pods and
containers that will run on Windows nodes. Group Managed Service Accounts are a specific
type of Active Directory account that provides automatic password management, simplified
service principal name (SPN) management, and the ability to delegate the management to other
administrators across multiple servers.
Two webhooks need to be configured on the Kubernetes cluster to populate and validate GMSA
credential spec references at the Pod or container level:
1. A mutating webhook that expands references to GMSAs (by name from a Pod
specification) into the full credential spec in JSON form within the Pod spec.
2. A validating webhook that ensures all references to GMSAs are authorized to be used by the Pod's service account.
Installing the above webhooks and associated objects requires the steps below:
1. Create a certificate key pair (that will be used to allow the webhook container to
communicate to the cluster)
4. Create the validating and mutating webhook configurations referring to the deployment.
A script can be used to deploy and configure the GMSA webhooks and associated objects
mentioned above. The script can be run with a --dry-run=server option to allow you to review
the changes that would be made to your cluster.
The YAML template used by the script may also be used to deploy the webhooks and associated objects manually (with appropriate substitutions for the parameters).
The following are the steps for generating a GMSA credential spec manually, first in JSON format and then converting it to YAML:
4. Convert the credspec file from JSON to YAML format and apply the necessary header
fields apiVersion, kind, metadata and credspec to make it a GMSACredentialSpec custom
resource that can be configured in Kubernetes.
The following YAML configuration describes a GMSA credential spec named gmsa-WebApp1:
apiVersion: windows.k8s.io/v1
kind: GMSACredentialSpec
metadata:
  name: gmsa-WebApp1   # This is an arbitrary name but it will be used as a reference
credspec:
  ActiveDirectoryConfig:
    GroupManagedServiceAccounts:
    - Name: WebApp1    # Username of the GMSA account
      Scope: CONTOSO   # NETBIOS Domain Name
    - Name: WebApp1    # Username of the GMSA account
      Scope: contoso.com # DNS Domain Name
  CmsPlugins:
  - ActiveDirectory
  DomainJoinConfig:
    DnsName: contoso.com # DNS Domain Name
    DnsTreeName: contoso.com # DNS Domain Name Root
    Guid: 244818ae-87ac-4fcd-92ec-e79e5252348a # GUID
    MachineAccountName: WebApp1 # Username of the GMSA account
    NetBiosName: CONTOSO # NETBIOS Domain Name
    Sid: S-1-5-21-2126449477-2524075714-3094792973 # SID of GMSA
The above credential spec resource may be saved as gmsa-Webapp1-credspec.yaml and applied to the cluster using: kubectl apply -f gmsa-Webapp1-credspec.yaml
To authorize specific service accounts to use this credential spec, bind them to a ClusterRole (named webapp1-role here) that grants the use verb on it. For example, the following RoleBinding grants that access to the default service account in the default namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-default-svc-account-read-on-gmsa-WebApp1
  namespace: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp1-role
  apiGroup: rbac.authorization.k8s.io
To use a credential spec for all containers in a Pod, set the Pod-level securityContext.windowsOptions.gmsaCredentialSpecName field, as in the following Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: with-creds
  name: with-creds
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: with-creds
  template:
    metadata:
      labels:
        run: with-creds
    spec:
      securityContext:
        windowsOptions:
          gmsaCredentialSpecName: gmsa-webapp1
      containers:
      - image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        imagePullPolicy: Always
        name: iis
      nodeSelector:
        kubernetes.io/os: windows
Individual containers in a Pod spec can also specify the desired GMSA credspec using a per-
container securityContext.windowsOptions.gmsaCredentialSpecName field. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: with-creds
  name: with-creds
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: with-creds
  template:
    metadata:
      labels:
        run: with-creds
    spec:
      containers:
      - image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        imagePullPolicy: Always
        name: iis
        securityContext:
          windowsOptions:
            gmsaCredentialSpecName: gmsa-Webapp1
      nodeSelector:
        kubernetes.io/os: windows
As Pod specs with GMSA fields populated (as described above) are applied in a cluster, the following sequence of events takes place:
1. The mutating webhook resolves and expands all references to GMSA credential spec
resources to the contents of the GMSA credential spec.
2. The validating webhook ensures the service account associated with the Pod is
authorized for the use verb on the specified GMSA credential spec.
3. The container runtime configures each Windows container with the specified GMSA
credential spec so that the container can assume the identity of the GMSA in Active
Directory and access services in the domain using that identity.
Authenticating to network shares using hostname or
FQDN
If you are experiencing issues connecting to SMB shares from Pods using hostname or FQDN,
but are able to access the shares via their IPv4 address then make sure the following registry
key is set on the Windows nodes.
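A sketch of the key in question, run from an elevated prompt on each Windows node (the exact key and value here are an assumption; confirm them against the reference linked below):
reg add "HKLM\SYSTEM\CurrentControlSet\Services\hns\State" /v EnableCompartmentNamespace /t REG_DWORD /d 1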
Running Pods will then need to be recreated to pick up the behavior changes. More information on how this registry key is used can be found here.
Troubleshooting
If you are having difficulties getting GMSA to work in your environment, there are a few
troubleshooting steps you can take.
First, make sure the credspec has been passed to the Pod. To do this you will need to exec into
one of your Pods and check the output of the nltest.exe /parentdomain command.
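For example (the Pod name is a placeholder for one of the Pods created above):
kubectl exec -it <pod-name> -- nltest.exe /parentdomain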
In the example below the Pod did not get the credspec correctly:
If your Pod did get the credspec correctly, then next check communication with the domain.
First, from inside of your Pod, quickly do an nslookup to find the root of your domain.
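For example, from inside the Pod (substitute your own domain name):
nslookup domain.example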
If the DNS and communication test passes, next you will need to check if the Pod has
established secure channel communication with the domain. To do this, again, exec into your
Pod and run the nltest.exe /query command.
nltest.exe /query
This tells us that, for some reason, the Pod was unable to log on to the domain using the account specified in the credspec. You can try to repair the secure channel by running the following:
nltest /sc_reset:domain.example
If the command is successful you will see output similar to this:
If the above corrects the error, you can automate the step by adding the following lifecycle
hook to your Pod spec. If it did not correct the error, you will need to examine your credspec
again and confirm that it is correct and complete.
image: registry.domain.example/iis-auth:1809v1
lifecycle:
  postStart:
    exec:
      command: ["powershell.exe","-command","do { Restart-Service -Name netlogon } while ( $($Result = (nltest.exe /query); if ($Result -like '*0x0 NERR_Success*') {return $true} else {return $false}) -eq $false)"]
imagePullPolicy: IfNotPresent
If you add the lifecycle section shown above to your Pod spec, the Pod will execute the commands listed to restart the netlogon service until the nltest.exe /query command exits without error.
This page assumes that you are familiar with Quality of Service for Kubernetes Pods.
This page shows how to resize CPU and memory resources assigned to containers of a running
pod without restarting the pod or its containers. A Kubernetes node allocates resources for a
pod based on its requests, and restricts the pod's resource usage based on the limits specified in
the pod's containers.
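As a hedged sketch of what a resize request looks like (the namespace, Pod, and container names below are placeholders, and the InPlacePodVerticalScaling feature gate must be enabled on your cluster), you can patch a running Pod's container resources directly:
kubectl -n <namespace> patch pod <pod-name> --patch \
  '{"spec":{"containers":[{"name":"<container-name>", "resources":{"requests":{"cpu":"800m"}, "limits":{"cpu":"800m"}}}]}}'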
• Container's resource requests and limits are mutable for CPU and memory resources.
• allocatedResources field in containerStatuses of the Pod's status reflects the resources
allocated to the pod's containers.
• resources field in containerStatuses of the Pod's status reflects the actual resource
requests and limits that are configured on the running containers as reported by the
container runtime.
• resize field in the Pod's status shows the status of the last requested pending resize. It can
have the following values:
◦ Proposed: This value indicates an acknowledgement of the requested resize and
that the request was validated and recorded.
◦ InProgress: This value indicates that the node has accepted the resize request and is
in the process of applying it to the pod's containers.
◦ Deferred: This value means that the requested resize cannot be granted at this time,