Kubernetes Tasks
This section of the Kubernetes documentation contains pages that show how to do individual
tasks. A task page shows how to do a single thing, typically by giving a short sequence of steps.
If you would like to write a task page, see Creating a Documentation Pull Request.
Install Tools
Administer a Cluster
Declarative and imperative paradigms for interacting with the Kubernetes API.
Managing Secrets
Specify configuration and other data for the Pods that run your workload.
Run Applications
Run Jobs
Configure load balancing, port forwarding, or set up firewall or DNS configurations to access
applications in a cluster.
Extend Kubernetes
Understand advanced ways to adapt your Kubernetes cluster to the needs of your work
environment.
TLS
Understand how to protect traffic within your cluster using Transport Layer Security (TLS).
Perform common tasks for managing a DaemonSet, such as performing a rolling update.
Networking
Manage HugePages
Schedule GPUs
Install Tools
Set up Kubernetes tools on your computer.
kubectl
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes
clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and
view logs. For more information including a complete list of kubectl operations, see the kubectl
reference documentation.
kubectl is installable on a variety of Linux platforms, macOS and Windows. Find your preferred
operating system below.
kind
kind lets you run Kubernetes on your local computer. The kind Quick Start page shows you
what you need to do to get up and running with kind.
minikube
Like kind, minikube is a tool that lets you run Kubernetes locally. minikube runs an all-in-one
or a multi-node local Kubernetes cluster on your personal computer (including Windows,
macOS and Linux PCs) so that you can try out Kubernetes, or for daily development work.
You can follow the official Get Started! guide if your focus is on getting the tool installed.
Once you have minikube working, you can use it to run a sample application.
kubeadm
You can use the kubeadm tool to create and manage Kubernetes clusters. It performs the actions
necessary to get a minimum viable, secure cluster up and running in a user friendly way.
Installing kubeadm shows you how to install kubeadm. Once installed, you can use it to create a
cluster.
Install and Set Up kubectl on Linux
Install kubectl binary with curl on Linux
1. Download the latest release for your architecture (x86-64 or ARM64).
Note: To download a specific version, replace the latest-release lookup in the download URL
with that version number.
2. Validate the binary (optional): download the matching checksum file for your architecture
and check the binary against it.
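A sketch of these two steps, based on the upstream guide (shown for x86-64; substitute arm64
for amd64 in the URLs on ARM64 systems):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
If valid, the output is: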
kubectl: OK
If the check fails, sha256sum exits with nonzero status and prints output similar to:
kubectl: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
3. Install kubectl
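# install command as published in the upstream guide
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl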
Note:
If you do not have root access on the target system, you can still install kubectl to the
~/.local/bin directory:
chmod +x kubectl
mkdir -p ~/.local/bin
mv ./kubectl ~/.local/bin/kubectl
# and then append (or prepend) ~/.local/bin to $PATH
• Debian-based distributions
• Red Hat-based distributions
• SUSE-based distributions
1. Update the apt package index and install packages needed to use the Kubernetes apt
repository:
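sudo apt-get update
# apt-transport-https may be a dummy package on newer releases; if so, you can skip it
sudo apt-get install -y apt-transport-https ca-certificates curl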
2. Download the public signing key for the Kubernetes package repositories. The same
signing key is used for all repositories so you can disregard the version in the URL:
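# if the folder /etc/apt/keyrings does not exist, see the note below this list
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg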
3. Add the appropriate Kubernetes apt repository. If you want to use a Kubernetes version
other than v1.28, replace v1.28 with the desired minor version in the command below:
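echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list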
Note: In releases older than Debian 12 and Ubuntu 22.04, /etc/apt/keyrings does not exist by
default, and can be created using sudo mkdir -m 755 /etc/apt/keyrings
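The upstream guide then finishes the installation by updating the package index and installing
kubectl:
sudo apt-get update
sudo apt-get install -y kubectl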
1. Add the Kubernetes yum repository. If you want to use a Kubernetes version other than
v1.28, replace v1.28 with the desired minor version in the command below.
Note: To upgrade kubectl to another minor release, you'll need to bump the version in
/etc/yum.repos.d/kubernetes.repo before running yum update. This procedure is described in
more detail in Changing The Kubernetes Package Repository.
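A sketch, using the pkgs.k8s.io repository definition from the upstream guide:
# this overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
EOF
Then install kubectl:
sudo yum install -y kubectl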
1. Add the Kubernetes zypper repository. If you want to use a Kubernetes version other than
v1.28, replace v1.28 with the desired minor version in the command below.
Note: To upgrade kubectl to another minor release, you'll need to bump the version in
/etc/zypp/repos.d/kubernetes.repo before running zypper update. This procedure is described in
more detail in Changing The Kubernetes Package Repository.
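A sketch, following the same pattern as the yum repository above:
# this overwrites any existing configuration in /etc/zypp/repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/zypp/repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
EOF
Then install kubectl:
sudo zypper install -y kubectl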
• Snap
• Homebrew
If you are on Ubuntu or another Linux distribution that supports the snap package manager,
kubectl is available as a snap application.
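snap install kubectl --classic
kubectl version --client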
If you are on Linux and using Homebrew package manager, kubectl is available for installation.
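brew install kubectl
kubectl version --client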
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able
to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or
port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will
need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, to check
whether it is configured properly, use:
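kubectl cluster-info dump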
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save
you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
• Bash
• Fish
• Zsh
Introduction
The kubectl completion script for Bash can be generated with the command kubectl completion
bash. Sourcing the completion script in your shell enables kubectl autocompletion.
However, the completion script depends on bash-completion, which means that you have to
install this software first (you can test if you have bash-completion already installed by running
type _init_completion).
Install bash-completion
bash-completion is provided by many package managers (see here). You can install it with
apt-get install bash-completion or yum install bash-completion, etc.
To find out, reload your shell and run type _init_completion. If the command succeeds, you're
already set, otherwise add the following to your ~/.bashrc file:
source /usr/share/bash-completion/bash_completion
Reload your shell and verify that bash-completion is correctly installed by typing type
_init_completion.
Bash
You now need to ensure that the kubectl completion script gets sourced in all your shell
sessions. There are two ways in which you can do this:
• User
• System
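A sketch of both approaches:
# User: source the completion script in your ~/.bashrc
echo 'source <(kubectl completion bash)' >> ~/.bashrc
# System: add the completion script to the system-wide bash-completion directory
kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null
sudo chmod a+r /etc/bash_completion.d/kubectl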
If you have an alias for kubectl, you can extend shell completion to work with that alias:
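echo 'alias k=kubectl' >> ~/.bashrc
echo 'complete -o default -F __start_kubectl k' >> ~/.bashrc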
source ~/.bashrc
The kubectl completion script for Fish can be generated with the command kubectl completion
fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
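kubectl completion fish | source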
The kubectl completion script for Zsh can be generated with the command kubectl completion
zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
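source <(kubectl completion zsh)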
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
If you get an error like 2: command not found: compdef, then add the following to the
beginning of your ~/.zshrc file:
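autoload -Uz compinit
compinit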
Install kubectl convert plugin
A plugin for the Kubernetes command-line tool kubectl, which allows you to convert manifests
between different API versions. This can be particularly helpful to migrate manifests to a
non-deprecated API version with a newer Kubernetes release. For more info, visit Migrate to
non-deprecated APIs.
1. Download the latest release of the kubectl-convert binary for your architecture (x86-64 or
ARM64).
2. Validate the binary (optional): download the matching checksum file and check the binary
against it, following the same pattern shown above for kubectl. If valid, the output is:
kubectl-convert: OK
If the check fails, sha256sum exits with nonzero status and prints output similar to:
kubectl-convert: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
3. Install kubectl-convert
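# install command as published in the upstream guide
sudo install -o root -g root -m 0755 kubectl-convert /usr/local/bin/kubectl-convert
4. Verify the plugin is successfully installed:
kubectl convert --help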
rm kubectl-convert kubectl-convert.sha256
What's next
• Install Minikube
• See the getting started guides for more about creating clusters.
• Learn how to launch and expose your application.
• If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
• Read the kubectl reference docs
Install and Set Up kubectl on macOS
Before you begin
You must use a kubectl version that is within one minor version difference of your cluster. For
example, a v1.28 client can communicate with v1.27, v1.28, and v1.29 control planes. Using the
latest compatible version of kubectl helps avoid unforeseen issues.
Install kubectl binary with curl on macOS
1. Download the latest release for your Mac (Intel or Apple Silicon).
Note: To download a specific version, replace the latest-release lookup in the download URL
with that version number.
2. Validate the binary (optional): download the matching checksum file for your architecture
and check the binary against it.
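A sketch of these steps (shown for Intel Macs; use darwin/arm64 in the URLs on Apple Silicon):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | shasum -a 256 --check
If valid, the output is: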
kubectl: OK
If the check fails, shasum exits with nonzero status and prints output similar to:
kubectl: FAILED
shasum: WARNING: 1 computed checksum did NOT match
chmod +x ./kubectl
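# then move the binary into your PATH and make sure root owns it (per the upstream guide):
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl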
rm kubectl kubectl.sha256
Install with Homebrew on macOS
If you are on macOS and using Homebrew package manager, you can install kubectl with
Homebrew.
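brew install kubectl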
or
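brew install kubernetes-cli
Test to ensure the version you installed is up-to-date:
kubectl version --client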
If you are on macOS and using Macports package manager, you can install kubectl with
Macports.
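sudo port selfupdate
sudo port install kubectl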
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able
to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or
port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will
need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the URL response but you can't access your cluster, to check
whether it is configured properly, use:
kubectl cluster-info dump
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell which can save
you a lot of typing.
Below are the procedures to set up autocompletion for Bash, Fish, and Zsh.
• Bash
• Fish
• Zsh
Introduction
The kubectl completion script for Bash can be generated with kubectl completion bash.
Sourcing this script in your shell enables kubectl completion.
However, the kubectl completion script depends on bash-completion, which you thus have to
install first.
Warning: There are two versions of bash-completion, v1 and v2. V1 is for Bash 3.2 (which is
the default on macOS), and v2 is for Bash 4.1+. The kubectl completion script doesn't work
correctly with bash-completion v1 and Bash 3.2. It requires bash-completion v2 and Bash
4.1+. Thus, to be able to correctly use kubectl completion on macOS, you have to install and use
Bash 4.1+ (instructions). The following instructions assume that you use Bash 4.1+ (that is, any
Bash version of 4.1 or newer).
Upgrade Bash
The instructions here assume you use Bash 4.1+. You can check your Bash's version by running:
echo $BASH_VERSION
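If it is too old (macOS ships Bash 3.2 by default), you can install a newer Bash with Homebrew:
brew install bash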
Reload your shell and verify that the desired version is being used:
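echo $BASH_VERSION $SHELL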
Install bash-completion
Note: As mentioned, these instructions assume you use Bash 4.1+, which means you will install
bash-completion v2 (in contrast to Bash 3.2 and bash-completion v1, in which case kubectl
completion won't work).
You can test if you have bash-completion v2 already installed with type _init_completion. If
not, you can install it with Homebrew:
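brew install bash-completion@2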
As stated in the output of this command, add the following to your ~/.bash_profile file:
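# (the exact line printed by Homebrew may differ slightly between versions)
brew_etc="$(brew --prefix)/etc" && [[ -r "${brew_etc}/profile.d/bash_completion.sh" ]] && . "${brew_etc}/profile.d/bash_completion.sh"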
Reload your shell and verify that bash-completion v2 is correctly installed with type
_init_completion.
You now have to ensure that the kubectl completion script gets sourced in all your shell
sessions. There are multiple ways to achieve this:
• If you have an alias for kubectl, you can extend shell completion to work with that alias, as
shown in the sketch after this list.
• If you installed kubectl with Homebrew (as explained here), then the kubectl completion
script should already be in /usr/local/etc/bash_completion.d/kubectl. In that case, you
don't need to do anything.
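A sketch of these options (paths assume Homebrew's default prefix):
# source the completion script in your ~/.bash_profile:
echo 'source <(kubectl completion bash)' >> ~/.bash_profile
# or drop the completion script into the bash-completion directory:
kubectl completion bash > $(brew --prefix)/etc/bash_completion.d/kubectl
# if you use an alias for kubectl:
echo 'alias k=kubectl' >> ~/.bash_profile
echo 'complete -o default -F __start_kubectl k' >> ~/.bash_profile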
Note: The Homebrew installation of bash-completion v2 sources all the files in the
BASH_COMPLETION_COMPAT_DIR directory, that's why the latter two methods work.
In any case, after reloading your shell, kubectl completion should be working.
The kubectl completion script for Fish can be generated with the command kubectl completion
fish. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following line to your ~/.config/fish/config.fish file:
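kubectl completion fish | source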
The kubectl completion script for Zsh can be generated with the command kubectl completion
zsh. Sourcing the completion script in your shell enables kubectl autocompletion.
To do so in all your shell sessions, add the following to your ~/.zshrc file:
source <(kubectl completion zsh)
If you have an alias for kubectl, kubectl autocompletion will automatically work with it.
If you get an error like 2: command not found: compdef, then add the following to the
beginning of your ~/.zshrc file:
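autoload -Uz compinit
compinit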
Install kubectl convert plugin
A plugin for the Kubernetes command-line tool kubectl, which allows you to convert manifests
between different API versions. This can be particularly helpful to migrate manifests to a
non-deprecated API version with a newer Kubernetes release. For more info, visit Migrate to
non-deprecated APIs.
1. Download the latest release of the kubectl-convert binary for your Mac (Intel or Apple
Silicon).
2. Validate the binary (optional): download the matching checksum file and check the binary
against it, following the same pattern shown above for kubectl. If valid, the output is:
kubectl-convert: OK
If the check fails, shasum exits with nonzero status and prints output similar to:
kubectl-convert: FAILED
shasum: WARNING: 1 computed checksum did NOT match
chmod +x ./kubectl-convert
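# move the binary into your PATH, make sure root owns it, and verify the plugin:
sudo mv ./kubectl-convert /usr/local/bin/kubectl-convert
sudo chown root: /usr/local/bin/kubectl-convert
kubectl convert --help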
rm kubectl-convert kubectl-convert.sha256
Uninstall kubectl
Depending on how you installed kubectl, use one of the following methods.
which kubectl
sudo rm <path>
Replace <path> with the path to the kubectl binary from the previous step. For example,
sudo rm /usr/local/bin/kubectl.
Install and Set Up kubectl on Windows
Note: To find out the latest stable version (for example, for scripting), take a look at
https://dl.k8s.io/release/stable.txt.
◦ Using PowerShell to automate the verification using the -eq operator to get a True
or False result:
$(Get-FileHash -Algorithm SHA256 .\kubectl.exe).Hash -eq $(Get-Content .\kubectl.exe.sha256)
3. Append or prepend the kubectl binary folder to your PATH environment variable.
Note: Docker Desktop for Windows adds its own version of kubectl to PATH. If you have
installed Docker Desktop before, you may need to place your PATH entry before the one added
by the Docker Desktop installer or remove the Docker Desktop's kubectl.
1. To install kubectl on Windows you can use either Chocolatey package manager, Scoop
command-line installer, or winget package manager.
◦ choco
◦ scoop
◦ winget
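For example:
# Chocolatey
choco install kubernetes-cli
# Scoop
scoop install kubectl
# winget
winget install -e --id Kubernetes.kubectl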
mkdir .kube
cd .kube
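To point kubectl at a remote cluster, the upstream guide then creates an empty config file to
edit:
New-Item config -type file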
Note: Edit the config file with a text editor of your choice, such as Notepad.
Verify kubectl configuration
In order for kubectl to find and access a Kubernetes cluster, it needs a kubeconfig file, which is
created automatically when you create a cluster using kube-up.sh or successfully deploy a
Minikube cluster. By default, kubectl configuration is located at ~/.kube/config.
kubectl cluster-info
If you see a URL response, kubectl is correctly configured to access your cluster.
If you see a message similar to the following, kubectl is not configured correctly or is not able
to connect to a Kubernetes cluster.
The connection to the server <server-name:port> was refused - did you specify the right host or
port?
For example, if you are intending to run a Kubernetes cluster on your laptop (locally), you will
need a tool like Minikube to be installed first and then re-run the commands stated above.
If kubectl cluster-info returns the url response but you can't access your cluster, to check
whether it is configured properly, use:
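kubectl cluster-info dump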
kubectl provides autocompletion support for Bash, Zsh, Fish, and PowerShell, which can save
you a lot of typing.
The kubectl completion script for PowerShell can be generated with the command kubectl
completion powershell.
To do so in all your shell sessions, add the following line to your $PROFILE file:
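kubectl completion powershell | Out-String | Invoke-Expression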
This command will regenerate the auto-completion script on every PowerShell start up. You can
also add the generated script directly to your $PROFILE file.
To add the generated script to your $PROFILE file, run the following line in your PowerShell
prompt:
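kubectl completion powershell >> $PROFILE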
Install kubectl convert plugin
A plugin for the Kubernetes command-line tool kubectl, which allows you to convert manifests
between different API versions. This can be particularly helpful to migrate manifests to a
non-deprecated API version with a newer Kubernetes release. For more info, visit Migrate to
non-deprecated APIs.
◦ Using PowerShell to automate the verification using the -eq operator to get a True
or False result:
$($(CertUtil -hashfile .\kubectl-convert.exe SHA256)[1] -replace " ", "") -eq $(type .\kubectl-convert.exe.sha256)
What's next
• Install Minikube
• See the getting started guides for more about creating clusters.
• Learn how to launch and expose your application.
• If you need access to a cluster you didn't create, see the Sharing Cluster Access document.
• Read the kubectl reference docs
Administer a Cluster
Learn common tasks for administering a cluster.
Namespaces Walkthrough
Securing a Cluster
Upgrade A Cluster
Client certificates generated by kubeadm expire after 1 year. This page explains how to manage
certificate renewals with kubeadm. It also covers other tasks related to kubeadm certificate
management.
By default, kubeadm generates all the certificates needed for a cluster to run. You can override
this behavior by providing your own certificates. To do so, you must place them in whatever
directory is specified by the --cert-dir flag or the certificatesDir field of kubeadm's
ClusterConfiguration. By default this is /etc/kubernetes/pki.
If a given certificate and private key pair exists before running kubeadm init, kubeadm does not
overwrite them. This means you can, for example, copy an existing CA into /etc/kubernetes/
pki/ca.crt and /etc/kubernetes/pki/ca.key, and kubeadm will use this CA for signing the rest of
the certificates.
External CA mode
It is also possible to provide only the ca.crt file and not the ca.key file (this is only available for
the root CA file, not other cert pairs). If all other certificates and kubeconfig files are in place,
kubeadm recognizes this condition and activates the "External CA" mode. kubeadm will proceed
without the CA key on disk.
Instead, run the controller-manager standalone with --controllers=csrsigner and point to the
CA certificate and key.
PKI certificates and requirements includes guidance on setting up a cluster to use an external
CA.
You can check certificate expiration with the kubeadm certs check-expiration subcommand. The
command shows expiration/residual time for the client certificates in the /etc/kubernetes/pki
folder and for the client certificate embedded in the kubeconfig files used by kubeadm
(admin.conf, controller-manager.conf and scheduler.conf).
Additionally, kubeadm informs the user if the certificate is externally managed; in this case, the
user should take care of managing certificate renewal manually/using other tools.
On nodes created with kubeadm init, prior to kubeadm version 1.17, there is a bug where you
manually have to modify the contents of kubelet.conf. After kubeadm init finishes, you should
update kubelet.conf to point to the rotated kubelet client certificates, by replacing client-
certificate-data and client-key-data with:
client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
This feature is designed for addressing the simplest use cases; if you don't have specific
requirements on certificate renewal and perform Kubernetes version upgrades regularly (less
than 1 year in between each upgrade), kubeadm will take care of keeping your cluster up to
date and reasonably secure.
Note: It is a best practice to upgrade your cluster frequently in order to stay secure.
If you have more complex requirements for certificate renewal, you can opt out from the
default behavior by passing --certificate-renewal=false to kubeadm upgrade apply or to
kubeadm upgrade node.
Warning: Prior to kubeadm version 1.17 there is a bug where the default value for --certificate-
renewal is false for the kubeadm upgrade node command. In that case, you should explicitly set
--certificate-renewal=true.
You can renew your certificates manually at any time with the kubeadm certs renew command.
This command performs the renewal using the CA (or front-proxy-CA) certificate and key stored
in /etc/kubernetes/pki.
After running the command you should restart the control plane Pods. This is required since
dynamic certificate reload is currently not supported for all components and certificates. Static
Pods are managed by the local kubelet and not by the API Server, thus kubectl cannot be used
to delete and restart them. To restart a static Pod you can temporarily remove its manifest file
from /etc/kubernetes/manifests/ and wait for 20 seconds (see the fileCheckFrequency value in
the KubeletConfiguration struct). The kubelet will terminate the Pod if it's no longer in the
manifest directory. You can then move the file back and after another fileCheckFrequency
period, the kubelet will recreate the Pod and the certificate renewal for the component can
complete.
Warning: If you are running an HA cluster, this command needs to be executed on all the
control-plane nodes.
Note: certs renew uses the existing certificates as the authoritative source for attributes
(Common Name, Organization, SAN, etc.) instead of the kubeadm-config ConfigMap. It is
strongly recommended to keep them both in sync.
kubeadm certs renew can renew any specific certificate or, with the subcommand all, it can
renew all of them, as shown below:
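kubeadm certs renew all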
Note:
Clusters built with kubeadm often copy the admin.conf certificate into $HOME/.kube/config, as
instructed in Creating a cluster with kubeadm. On such a system, to update the contents of
$HOME/.kube/config after renewing the admin.conf, you must run the following commands:
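sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config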
Caution: These are advanced topics for users who need to integrate their organization's
certificate infrastructure into a kubeadm-built cluster. If the default kubeadm configuration
satisfies your needs, you should let kubeadm manage certificates instead.
Set up a signer
The Kubernetes Certificate Authority does not work out of the box. You can configure an
external signer such as cert-manager, or you can use the built-in signer.
To activate the built-in signer, you must pass the --cluster-signing-cert-file and --cluster-
signing-key-file flags.
If you're creating a new cluster, you can use a kubeadm configuration file:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    cluster-signing-cert-file: /etc/kubernetes/pki/ca.crt
    cluster-signing-key-file: /etc/kubernetes/pki/ca.key
See Create CertificateSigningRequest for creating CSRs with the Kubernetes API.
To better integrate with external CAs, kubeadm can also produce certificate signing requests
(CSRs). A CSR represents a request to a CA for a signed certificate for a client. In kubeadm
terms, any certificate that would normally be signed by an on-disk CA can be produced as a
CSR instead. A CA, however, cannot be produced as a CSR.
You can create certificate signing requests with kubeadm certs renew --csr-only.
Both the CSR and the accompanying private key are given in the output. You can pass in a
directory with --csr-dir to output the CSRs to the specified location. If --csr-dir is not specified,
the default certificate directory (/etc/kubernetes/pki) is used.
Certificates can be renewed with kubeadm certs renew --csr-only. As with kubeadm init, an
output directory can be specified with the --csr-dir flag.
A CSR contains a certificate's name, domains, and IPs, but it does not specify usages. It is the
responsibility of the CA to specify the correct cert usages when issuing a certificate.
After a certificate is signed using your preferred method, the certificate and the private key
must be copied to the PKI directory (by default /etc/kubernetes/pki).
Certificate authority (CA) rotation
Kubeadm does not support rotation or replacement of CA certificates out of the box.
For more information about manual rotation or replacement of CA, see manual rotation of CA
certificates.
To configure the kubelets in a new kubeadm cluster to obtain properly signed serving
certificates you must pass the following minimal configuration to kubeadm init:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true
If you have already created the cluster you must adapt it by doing the following:
• Find and edit the kubelet-config-1.28 ConfigMap in the kube-system namespace. In that
ConfigMap, the kubelet key has a KubeletConfiguration document as its value. Edit the
KubeletConfiguration document to set serverTLSBootstrap: true.
• On each node, add the serverTLSBootstrap: true field in /var/lib/kubelet/config.yaml and
restart the kubelet with systemctl restart kubelet
The field serverTLSBootstrap: true will enable the bootstrap of kubelet serving certificates by
requesting them from the certificates.k8s.io API. One known limitation is that the CSRs
(Certificate Signing Requests) for these certificates cannot be automatically approved by the
default signer in the kube-controller-manager - kubernetes.io/kubelet-serving. This will require
action from the user or a third party controller.
By default, these serving certificates will expire after one year. Kubeadm sets the
KubeletConfiguration field rotateCertificates to true, which means that close to expiration a
new set of CSRs for the serving certificates will be created and must be approved to complete
the rotation. To understand more see Certificate Rotation.
If you are looking for a solution for automatic approval of these CSRs it is recommended that
you contact your cloud provider and ask if they have a CSR signer that verifies the node
identity with an out of band mechanism.
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
• kubelet-csr-approver
Such a controller is not a secure mechanism unless it not only verifies the CommonName in the
CSR but also verifies the requested IPs and domain names. This would prevent a malicious actor
that has access to a kubelet client certificate from creating CSRs requesting serving certificates
for any IP or domain name.
Rather than sharing the kubeadm-generated admin.conf with additional users, you can use the
kubeadm kubeconfig user command to generate kubeconfig files for them. The command accepts
a mixture of command line flags and kubeadm configuration options. The generated kubeconfig
will be written to stdout and can be piped to a file using kubeadm kubeconfig user ... > somefile.conf.
# example.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Will be used as the target "cluster" in the kubeconfig
clusterName: "kubernetes"
# Will be used as the "server" (IP or DNS name) of this cluster in the kubeconfig
controlPlaneEndpoint: "some-dns-address:6443"
# The cluster CA key and certificate will be loaded from this local directory
certificatesDir: "/etc/kubernetes/pki"
Make sure that these settings match the desired target cluster settings. To see the settings of an
existing cluster use:
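kubectl get cm kubeadm-config -n kube-system -o=jsonpath="{.data.ClusterConfiguration}"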
The following example will generate a kubeconfig file with credentials valid for 24 hours for a
new user johndoe that is part of the appdevs group:
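# flags as documented for kubeadm kubeconfig user; adjust names and validity to your needs
kubeadm kubeconfig user --config example.yaml --org appdevs --client-name johndoe --validity-period 24h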
The Container Runtimes page explains that the systemd driver is recommended for kubeadm
based setups instead of the kubelet's default cgroupfs driver, because kubeadm manages the
kubelet as a systemd service. The page also provides details on how to set up a number of
different container runtimes with the systemd driver by default.
Note:
In v1.22 and later, if the user does not set the cgroupDriver field under KubeletConfiguration,
kubeadm defaults it to systemd.
In Kubernetes v1.28, you can enable automatic detection of the cgroup driver as an alpha
feature. See systemd cgroup driver for more details.
# kubeadm-config.yaml
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.21.0
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
Kubeadm uses the same KubeletConfiguration for all nodes in the cluster. The
KubeletConfiguration is stored in a ConfigMap object under the kube-system namespace.
Executing the sub commands init, join and upgrade would result in kubeadm writing the
KubeletConfiguration as a file under /var/lib/kubelet/config.yaml and passing it to the local
node kubelet.
See the below section on "Modify the kubelet ConfigMap" for details on how to be explicit
about the value.
If you wish to configure a container runtime to use the cgroupfs driver, you must refer to the
documentation of the container runtime of your choice.
Note: Alternatively, it is possible to replace the old nodes in the cluster with new ones that use
the systemd driver. This requires executing only the first step below before joining the new
nodes and ensuring the workloads can safely move to the new nodes before deleting the old
nodes.
• Either modify the existing cgroupDriver value or add a new field that looks like this:
cgroupDriver: systemd
This field must be present under the kubelet: section of the ConfigMap.
Execute these steps on nodes one at a time to ensure workloads have sufficient time to schedule
on different nodes.
Once the process is complete ensure that all nodes and workloads are healthy.
To modify the components configuration you must manually edit associated cluster objects and
files on disk.
This guide shows the correct sequence of steps that need to be performed to achieve kubeadm
cluster reconfiguration.
The kubectl edit command will open a text editor where you can edit and save the object
directly.
You can use the environment variables KUBECONFIG and KUBE_EDITOR to specify the
location of the kubectl consumed kubeconfig file and preferred text editor.
For example:
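KUBECONFIG=/etc/kubernetes/admin.conf KUBE_EDITOR=nano kubectl edit <parameters>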
Note: Upon saving any changes to these cluster objects, components running on nodes may not
be automatically updated. The steps below instruct you on how to perform that manually.
Warning: Component configuration in ConfigMaps is stored as unstructured data (YAML
string). This means that validation will not be performed upon updating the contents of a
ConfigMap. You have to be careful to follow the documented API format for a particular
component configuration and avoid introducing typos and YAML indentation mistakes.
Applying cluster configuration changes
During cluster creation and upgrade, kubeadm writes its ClusterConfiguration in a ConfigMap
called kubeadm-config in the kube-system namespace.
To change a particular option in the ClusterConfiguration you can edit the ConfigMap with this
command:
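kubectl edit cm -n kube-system kubeadm-config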
Note: The ClusterConfiguration includes a variety of options that affect the configuration of
individual components such as kube-apiserver, kube-scheduler, kube-controller-manager,
CoreDNS, etcd and kube-proxy. Changes to the configuration must be reflected on node
components manually.
kubeadm manages the control plane components as static Pod manifests located in the
directory /etc/kubernetes/manifests. Any changes to the ClusterConfiguration under the
apiServer, controllerManager, scheduler or etcd keys must be reflected in the associated files in
the manifests directory on a control plane node.
Before proceeding with these changes, make sure you have backed up the directory /etc/
kubernetes/.
The <config-file> contents must match the updated ClusterConfiguration. The <component-
name> value must be the name of the component.
Note: Updating a file in /etc/kubernetes/manifests will tell the kubelet to restart the static Pod
for the corresponding component. Try doing these changes one node at a time to leave the
cluster without downtime.
Applying kubelet configuration changes
During cluster creation and upgrade, kubeadm writes its KubeletConfiguration in a ConfigMap
called kubelet-config in the kube-system namespace.
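To change a particular option in the KubeletConfiguration, you can edit the ConfigMap with
this command:
kubectl edit cm -n kube-system kubelet-config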
Note: Do these changes one node at a time to allow workloads to be rescheduled properly.
Note: During kubeadm upgrade, kubeadm downloads the KubeletConfiguration from the
kubelet-config ConfigMap and overwrites the contents of /var/lib/kubelet/config.yaml. This
means that node local configuration must be applied either by flags in
/var/lib/kubelet/kubeadm-flags.env or by manually updating the contents of
/var/lib/kubelet/config.yaml after kubeadm upgrade, and then restarting the kubelet.
Applying kube-proxy configuration changes
kubeadm writes the KubeProxyConfiguration in a ConfigMap called kube-proxy in the
kube-system namespace. To change a particular option in the KubeProxyConfiguration, you can
edit the ConfigMap with this command:
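kubectl edit cm -n kube-system kube-proxy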
Once the kube-proxy ConfigMap is updated, you can restart all kube-proxy Pods:
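kubectl delete po -n kube-system -l k8s-app=kube-proxy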
Applying CoreDNS configuration changes
kubeadm deploys CoreDNS as a Deployment called coredns and with a Service kube-dns, both
in the kube-system namespace.
To update any of the CoreDNS settings, you can edit the Deployment and Service objects:
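kubectl edit deployment -n kube-system coredns
kubectl edit service -n kube-system kube-dns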
Once the CoreDNS changes are applied you can delete the CoreDNS Pods:
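kubectl delete po -n kube-system -l k8s-app=kube-dns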
Note: kubeadm does not allow CoreDNS configuration during cluster creation and upgrade.
This means that if you execute kubeadm upgrade apply, your changes to the CoreDNS objects
will be lost and must be reapplied.
kubeadm writes Labels, Taints, CRI socket and other information on the Node object for a
particular Kubernetes node. To change any of the contents of this Node object you can use:
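kubectl edit no <node-name>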
The main source of control plane configuration is the ClusterConfiguration object stored in the
cluster. To extend the static Pod manifests configuration, patches can be used.
These patch files must remain as files on the control plane nodes to ensure that they can be
used by the kubeadm upgrade ... --patches <directory>.
If reconfiguration is done to the ClusterConfiguration and static Pod manifests on disk, the set
of node specific patches must be updated accordingly.
What's next
• Upgrading kubeadm clusters
• Customizing components with the kubeadm API
• Certificate management with kubeadm
• Find more about kubeadm set-up
To see information about upgrading clusters created using older versions of kubeadm, please
refer to the following pages instead:
Additional information
• The instructions below outline when to drain each node during the upgrade process. If
you are performing a minor version upgrade for any kubelet, you must first drain the
node (or nodes) that you are upgrading. In the case of control plane nodes, they could be
running CoreDNS Pods or other critical workloads. For more information see Draining
nodes.
• All containers are restarted after upgrade, because the container spec hash value is
changed.
• To verify that the kubelet service has successfully restarted after the kubelet has been
upgraded, you can execute systemctl status kubelet or view the service logs with
journalctl -xeu kubelet.
• Usage of the --config flag of kubeadm upgrade with kubeadm configuration API types
with the purpose of reconfiguring the cluster is not recommended and can have
unexpected results. Follow the steps in Reconfiguring a kubeadm cluster instead.
Note: The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been
deprecated and frozen starting from September 13, 2023. Using the new package repositories
hosted at pkgs.k8s.io is strongly recommended and required in order to install
Kubernetes versions released after September 13, 2023. The deprecated legacy repositories,
and their contents, might be removed at any time in the future and without a further notice
period. The new package repositories provide downloads for Kubernetes versions starting with
v1.24.0.
1. Upgrade kubeadm:
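# Debian-based example; replace x in 1.28.x-* with the latest patch version
# (use the equivalent yum/zypper commands on other distributions)
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*' && \
sudo apt-mark hold kubeadm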
2. Verify that the download works and has the expected version:
kubeadm version
3. Verify the upgrade plan by running kubeadm upgrade plan. This command checks that your
cluster can be upgraded, and fetches the versions you can upgrade to. It also shows a table
with the component config version states.
Note: kubeadm upgrade also automatically renews the certificates that it manages on this
node. To opt-out of certificate renewal the flag --certificate-renewal=false can be used.
For more information see the certificate management guide.
Note: If kubeadm upgrade plan shows any component configs that require manual
upgrade, users must provide a config file with replacement configs to kubeadm upgrade
apply via the --config command line flag. Failing to do so will cause kubeadm upgrade
apply to exit with an error and not perform an upgrade.
4. Choose a version to upgrade to, and run the appropriate command. For example:
# replace x with the patch version you picked for this upgrade
sudo kubeadm upgrade apply v1.28.x
[upgrade/kubelet] Now that your control plane is upgraded, please proceed with
upgrading your kubelets if you haven't already done so.
Note: For versions earlier than v1.28, kubeadm defaulted to a mode that upgrades the addons
(including CoreDNS and kube-proxy) immediately during kubeadm upgrade apply, regardless of
whether there are other control plane instances that have not been upgraded. This may cause
compatibility problems. Since v1.28, kubeadm defaults to a mode that checks whether all the
control plane instances have been upgraded before starting to upgrade the addons. You must
perform control plane instance upgrades sequentially, or at least ensure that the last control
plane instance upgrade is not started until all the other control plane instances have been
upgraded completely; the addons upgrade is then performed after the last control plane
instance is upgraded. If you want to keep the old upgrade behavior, enable the
UpgradeAddonsBeforeControlPlane feature gate with kubeadm upgrade apply
--feature-gates=UpgradeAddonsBeforeControlPlane=true. The Kubernetes project does not in
general recommend enabling this feature gate; you should instead change your upgrade process
or cluster addons so that you do not need to enable the legacy behavior. The
UpgradeAddonsBeforeControlPlane feature gate will be removed in a future release.
Your Container Network Interface (CNI) provider may have its own upgrade instructions
to follow. Check the addons page to find your CNI provider and see whether additional
upgrade steps are required.
This step is not required on additional control plane nodes if the CNI provider runs as a
DaemonSet.
For the other control plane nodes, use sudo kubeadm upgrade node instead of sudo kubeadm
upgrade apply. Also calling kubeadm upgrade plan and upgrading the CNI provider plugin is no
longer needed.
Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
Upgrade kubelet and kubectl
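On Debian-based distributions this looks roughly like the following (replace x in 1.28.x-* with
the latest patch version; use the equivalent yum/zypper commands on other distributions):
sudo apt-mark unhold kubelet kubectl && \
sudo apt-get update && sudo apt-get install -y kubelet='1.28.x-*' kubectl='1.28.x-*' && \
sudo apt-mark hold kubelet kubectl
Restart the kubelet:
sudo systemctl daemon-reload
sudo systemctl restart kubelet
Finally, bring the node back online by marking it schedulable:
kubectl uncordon <node-to-drain>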
The following pages show how to upgrade Linux and Windows worker nodes:
Verify that all nodes are available again by running kubectl get nodes. The STATUS column
should show Ready for all your nodes, and the version number should be updated.
During upgrade kubeadm writes the following backup folders under /etc/kubernetes/tmp:
• kubeadm-backup-etcd-<date>-<time>
• kubeadm-backup-manifests-<date>-<time>
kubeadm-backup-etcd contains a backup of the local etcd member data for this control plane
Node. In case of an etcd upgrade failure and if the automatic rollback does not work, the
contents of this folder can be manually restored in /var/lib/etcd. In case external etcd is used
this backup folder will be empty.
kubeadm-backup-manifests contains a backup of the static Pod manifest files for this control
plane Node. In case of an upgrade failure and if the automatic rollback does not work, the
contents of this folder can be manually restored in /etc/kubernetes/manifests. If for some reason
there is no difference between a pre-upgrade and post-upgrade manifest file for a certain
component, a backup file for it will not be written.
How it works
kubeadm upgrade apply does the following:
kubeadm upgrade node does the following on additional control plane nodes:
• Killercoda
• Play with Kubernetes
• Familiarize yourself with the process for upgrading the rest of your kubeadm cluster. You
will want to upgrade the control plane nodes before upgrading your Linux Worker nodes.
Note: The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been
deprecated and frozen starting from September 13, 2023. Using the new package repositories
hosted at pkgs.k8s.io is strongly recommended and required in order to install
Kubernetes versions released after September 13, 2023. The deprecated legacy repositories,
and their contents, might be removed at any time in the future and without a further notice
period. The new package repositories provide downloads for Kubernetes versions starting with
v1.24.0.
Upgrade kubeadm:
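# Debian-based example; replace x in 1.28.x-* with the latest patch version
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.x-*' && \
sudo apt-mark hold kubeadm
Then call kubeadm upgrade; for worker nodes this upgrades the local kubelet configuration:
sudo kubeadm upgrade node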
Prepare the node for maintenance by marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
What's next
• See how to Upgrade Windows nodes.
This page explains how to upgrade a Windows node created with kubeadm.
• Killercoda
• Play with Kubernetes
Your Kubernetes server must be at or later than version 1.17. To check the version, enter
kubectl version.
• Familiarize yourself with the process for upgrading the rest of your kubeadm cluster. You
will want to upgrade the control plane nodes before upgrading your Windows nodes.
1. From a machine with access to the Kubernetes API, prepare the node for maintenance by
marking it unschedulable and evicting the workloads:
# replace <node-to-drain> with the name of your node you are draining
kubectl drain <node-to-drain> --ignore-daemonsets
node/ip-172-31-85-18 cordoned
node/ip-172-31-85-18 drained
1. From the Windows node, call the following command to sync new kubelet configuration:
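kubeadm upgrade node
1. From the Windows node, upgrade and restart the kubelet: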
stop-service kubelet
curl.exe -Lo <path-to-kubelet.exe> "https://dl.k8s.io/v1.28.4/bin/windows/amd64/kubelet.exe"
restart-service kubelet
Note: If you are running kube-proxy in a HostProcess container within a Pod, and not as a
Windows Service, you can upgrade kube-proxy by applying a newer version of your kube-
proxy manifests.
1. From a machine with access to the Kubernetes API, bring the node back online by
marking it schedulable:
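kubectl uncordon <node-to-uncordon>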
What's next
• See how to Upgrade Linux nodes.
Note: This guide only covers a part of the Kubernetes upgrade process. Please see the upgrade
guide for more information about upgrading Kubernetes clusters.
Note: This step is only needed upon upgrading a cluster to another minor release. If you're
upgrading to another patch release within the same minor release (e.g. v1.28.5 to v1.28.7), you
don't need to follow this guide. However, if you're still using the legacy package repositories,
you'll need to migrate to the new community-owned package repositories before upgrading
(see the next section for more details on how to do this).
Note: The legacy package repositories (apt.kubernetes.io and yum.kubernetes.io) have been
deprecated and frozen starting from September 13, 2023. Using the new package repositories
hosted at pkgs.k8s.io is strongly recommended and required in order to install
Kubernetes versions released after September 13, 2023. The deprecated legacy repositories,
and their contents, might be removed at any time in the future and without a further notice
period. The new package repositories provide downloads for Kubernetes versions starting with
v1.24.0.
If you're unsure whether you're using the community-owned package repositories or the legacy
package repositories, take the following steps to verify:
Print the contents of the file that defines the Kubernetes apt repository:
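pager /etc/apt/sources.list.d/kubernetes.list
With the community-owned package repositories, the file contains a single line similar to:
deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.27/deb/ /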
If you see a URL similar to the one above, you're using the Kubernetes package repositories
and this guide applies to you.
Otherwise, it's strongly recommended to migrate to the Kubernetes package repositories as
described in the official announcement.
Print the contents of the file that defines the Kubernetes yum repository:
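cat /etc/yum.repos.d/kubernetes.repo
With the community-owned package repositories, the file looks similar to: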
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl
If you see a gpgkey URL similar to the one above, you're using the Kubernetes package
repositories and this guide applies to you.
Otherwise, it's strongly recommended to migrate to the Kubernetes package repositories as
described in the official announcement.
Print the contents of the file that defines the Kubernetes zypper repository:
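cat /etc/zypp/repos.d/kubernetes.repo
With the community-owned package repositories, the file looks similar to: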
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl
If you see a gpgkey URL similar to the one above, you're using the Kubernetes package
repositories and this guide applies to you.
Otherwise, it's strongly recommended to migrate to the Kubernetes package repositories as
described in the official announcement.
Note:
The URL used for the Kubernetes package repositories is not limited to pkgs.k8s.io, it can also
be one of:
• pkgs.k8s.io
• pkgs.kubernetes.io
• packages.kubernetes.io
1. Open the file that defines the Kubernetes apt repository using a text editor of your
choice:
nano /etc/apt/sources.list.d/kubernetes.list
You should see a single line with the URL that contains your current Kubernetes minor
version. For example, if you're using v1.27, you should see this:
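deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.27/deb/ /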
2. Change the version in the URL to the next available minor release, for example:
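deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /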
3. Save the file and exit your text editor. Continue following the relevant upgrade
instructions.
1. Open the file that defines the Kubernetes yum repository using a text editor of your
choice:
nano /etc/yum.repos.d/kubernetes.repo
You should see a file with two URLs that contain your current Kubernetes minor version.
For example, if you're using v1.27, you should see this:
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.27/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
2. Change the version in these URLs to the next available minor release, for example:
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.28/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
3. Save the file and exit your text editor. Continue following the relevant upgrade
instructions.
What's next
• See how to Upgrade Linux nodes.
• See how to Upgrade Windows nodes.
Since the announcement of dockershim deprecation in Kubernetes 1.20, there were questions
on how this will affect various workloads and Kubernetes installations. Our Dockershim
Removal FAQ is there to help you to understand the problem better.
Dockershim was removed from Kubernetes with the release of v1.24. If you use Docker Engine
via dockershim as your container runtime and wish to upgrade to v1.24, it is recommended that
you either migrate to another runtime or find an alternative means to obtain Docker Engine
support. Check out the container runtimes section to know your options.
The version of Kubernetes with dockershim (1.23) is out of support, and v1.24 will soon be out
of support as well. Make sure to report issues you encounter with the migration so that they
can be fixed in a timely manner and your cluster is ready for dockershim removal. After v1.24
runs out of support, you will need to contact your Kubernetes provider for support or upgrade
multiple versions at a time if there are critical issues affecting your cluster.
Your cluster might have more than one kind of node, although this is not a common
configuration.
What's next
• Check out container runtimes to understand your options for an alternative.
• If you find a defect or other technical concern relating to migrating away from
dockershim, you can report an issue to the Kubernetes project.
Install containerd. For more information see containerd's installation documentation and for
specific prerequisite follow the containerd guide.
Replace <node-to-drain> with the name of your node you are draining.
Install Containerd
Follow the guide for detailed steps to install containerd.
• Linux
• Windows (PowerShell)
1. Install the containerd.io package from the official Docker repositories. Instructions for
setting up the Docker repository for your respective Linux distribution and installing the
containerd.io package can be found at Getting started with containerd.
2. Configure containerd:
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
3. Restart containerd:
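sudo systemctl restart containerd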
Start a PowerShell session, set $Version to the desired version (for example: $Version="1.4.3"),
and then run the following commands:
1. Download containerd:
curl.exe -L https://github.com/containerd/containerd/releases/download/v$Version/containerd-$Version-windows-amd64.tar.gz -o containerd-windows-amd64.tar.gz
2. Extract the archive:
tar.exe xvf .\containerd-windows-amd64.tar.gz
3. Start containerd:
.\containerd.exe --register-service
Start-Service containerd
Users using kubeadm should be aware that the kubeadm tool stores the CRI socket for each
host as an annotation in the Node object for that host. To change it you can execute the
following command on a machine that has the kubeadm /etc/kubernetes/admin.conf file.
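kubectl edit no <node-name>
This will start a text editor where you can edit the CRI socket.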
Note that new CRI socket paths must be prefixed with unix:// ideally.
• Save the changes in the text editor, which will update the Node object.
• CentOS
• Debian
• Fedora
• Ubuntu
Caution: Docker's instructions for uninstalling Docker Engine create a risk of deleting
containerd. Be careful when executing commands.
Uncordon the node
kubectl uncordon <node-to-uncordon>
Replace <node-to-uncordon> with the name of your node you previously drained.
This page shows you how to migrate your Docker Engine nodes to use cri-dockerd instead of
dockershim. You should follow these steps in these scenarios:
• You want to switch away from dockershim and still use Docker Engine to run containers
in Kubernetes.
• You want to upgrade to Kubernetes v1.28 and your existing cluster relies on dockershim,
in which case you must migrate from dockershim and cri-dockerd is one of your options.
To learn more about the removal of dockershim, read the FAQ page.
What is cri-dockerd?
In Kubernetes 1.23 and earlier, you could use Docker Engine with Kubernetes, relying on a
built-in component of Kubernetes named dockershim. The dockershim component was removed
in the Kubernetes 1.24 release; however, a third-party replacement, cri-dockerd, is available.
The cri-dockerd adapter lets you use Docker Engine through the Container Runtime Interface.
Note: If you already use cri-dockerd, you aren't affected by the dockershim removal. Before you
begin, check whether your nodes use the dockershim.
If you want to migrate to cri-dockerd so that you can continue using Docker Engine as your
container runtime, you should do the following for each affected node:
1. Install cri-dockerd.
2. Cordon and drain the node.
3. Configure the kubelet to use cri-dockerd.
4. Restart the kubelet.
5. Verify that the node is healthy.
You should perform the following steps for each node that you want to migrate to cri-dockerd.
The kubeadm tool stores the node's socket as an annotation on the Node object in the control
plane. To modify this socket for each affected node:
What's next
• Read the dockershim removal FAQ.
• Learn how to migrate from Docker Engine with dockershim to containerd.
Depending on the way you run your cluster, the container runtime for the nodes may have
been pre-configured or you need to configure it. If you're using a managed Kubernetes service,
there might be vendor-specific ways to check what container runtime is configured for the
nodes. The method described on this page should work whenever the execution of kubectl is
allowed.
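To find out which container runtime is used on a node, run:
kubectl get nodes -o wide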
The output is similar to the following. The column CONTAINER-RUNTIME outputs the runtime
and its version.
If your runtime shows as Docker Engine, you still might not be affected by the removal of
dockershim in Kubernetes v1.24. Check the runtime endpoint to see if you use dockershim. If
you don't use dockershim, you aren't affected.
Find out more information about container runtimes on Container Runtimes page.
Note: If you currently use Docker Engine in your nodes with cri-dockerd, you aren't affected by
the dockershim removal.
You can check which socket you use by checking the kubelet configuration on your nodes.
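1. On each node, print the kubelet's command line, for example:
tr \\0 ' ' < /proc/"$(pgrep kubelet)"/cmdline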
If you don't have tr or pgrep, check the command line for the kubelet process manually.
2. In the output, look for the --container-runtime flag and the --container-runtime-endpoint
flag.
◦ If your nodes use Kubernetes v1.23 and earlier and these flags aren't present or if
the --container-runtime flag is not remote, you use the dockershim socket with
Docker Engine. The --container-runtime command line argument is not available in
Kubernetes v1.27 and later.
◦ If the --container-runtime-endpoint flag is present, check the socket name to find
out which runtime you use. For example, unix:///run/containerd/containerd.sock is
the containerd endpoint.
If you want to change the Container Runtime on a Node from Docker Engine to containerd,
you can find out more information on migrating from Docker Engine to containerd, or, if you
want to continue using Docker Engine in Kubernetes v1.24 and later, migrate to a CRI-
compatible adapter like cri-dockerd.
With containerd v1.6.0-v1.6.3, if you do not upgrade the CNI plugins and/or declare the CNI
config version, you might encounter the following "Incompatible CNI versions" or "Failed to
destroy network for sandbox" error conditions.
If the version of your CNI plugin does not correctly match the plugin version in the config
because the config version is later than the plugin version, the containerd log will likely show
an error message on startup of a pod similar to:
incompatible CNI versions; config is \"1.0.0\", plugin supports [\"0.1.0\" \"0.2.0\" \"0.3.0\" \"0.3.1\"
\"0.4.0\"]"
To fix this issue, update your CNI plugins and CNI config files.
If the version of the plugin is missing in the CNI plugin config, the pod may run. However,
stopping the pod generates an error similar to:
This error leaves the pod in the not-ready state with a network namespace still attached. To
recover from this problem, edit the CNI config file to add the missing version information. The
next attempt to stop the pod should be successful.
If you're using containerd v1.6.0-v1.6.3 and encountered "Incompatible CNI versions" or "Failed
to destroy network for sandbox" errors, consider updating your CNI plugins and editing the
CNI config files.
1. Bring the node back into your cluster by restarting your container runtime and kubelet.
Uncordon the node (kubectl uncordon <nodename>).
Please see the documentation from your plugin and networking provider for further
instructions on configuring your system.
On Kubernetes, containerd runtime adds a loopback interface, lo, to pods as a default behavior.
The containerd runtime configures the loopback interface via a CNI plugin, loopback. The
loopback plugin is distributed as part of the containerd release packages that have the cni
designation. containerd v1.6.0 and later includes a CNI v1.0.0-compatible loopback plugin as
well as other default CNI plugins. The configuration for the loopback plugin is done internally
by containerd, and is set to use CNI v1.0.0. This also means that the version of the loopback
plugin must be v1.0.0 or later when this newer version containerd is started.
The following bash command generates an example CNI config. Here, the 1.0.0 value for the
config version is assigned to the cniVersion field for use when containerd invokes the CNI
bridge plugin.
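A sketch of such a command, adapted from containerd's documentation (the subnets and the
file name are illustrative placeholders; adjust them to your environment):
cat << EOF | sudo tee /etc/cni/net.d/10-containerd-net.conflist
{
  "cniVersion": "1.0.0",
  "name": "containerd-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "promiscMode": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [{ "subnet": "10.88.0.0/16" }],
          [{ "subnet": "2001:db8:4860::/64" }]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" },
          { "dst": "::/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
EOF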
Update the IP address ranges in the preceding example with ones that are based on your use
case and network addressing plan.
This page explains how your cluster could be using Docker as a container runtime, provides
details on the role that dockershim plays when in use, and shows steps you can take to check
whether any workloads could be affected by dockershim removal.
When an alternative container runtime is used, executing Docker commands may either not
work or yield unexpected output. This is how you can find whether you have a dependency on
Docker:
1. Make sure no privileged Pods execute Docker commands (like docker ps), restart the
Docker service (commands such as systemctl restart docker.service), or modify Docker-
specific files such as /etc/docker/daemon.json.
2. Check for any private registries or image mirror settings in the Docker configuration file
(like /etc/docker/daemon.json). Those typically need to be reconfigured for another
container runtime.
3. Check that scripts and apps running on nodes outside of your Kubernetes infrastructure
do not execute Docker commands. It might be:
◦ SSH to nodes to troubleshoot;
◦ Node startup scripts;
◦ Monitoring and security agents installed on nodes directly.
4. Check for third-party tools that perform the privileged operations mentioned above. See
Migrating telemetry and security agents from dockershim for more information.
5. Make sure there are no indirect dependencies on dockershim behavior. This is an edge
case and unlikely to affect your application. Some tooling may be configured to react to
Docker-specific behaviors, for example, raise alert on specific metrics or search for a
specific log message as part of troubleshooting instructions. If you have such tooling
configured, test the behavior on a test cluster before migration.
In its earliest releases, Kubernetes offered compatibility with one container runtime: Docker.
Later in the Kubernetes project's history, cluster operators wanted to adopt additional container
runtimes. The CRI was designed to allow this kind of flexibility - and the kubelet began
supporting CRI. However, because Docker existed before the CRI specification was invented,
the Kubernetes project created an adapter component, dockershim. The dockershim adapter
allows the kubelet to interact with Docker as if Docker were a CRI compatible runtime.
You can read about it in Kubernetes Containerd integration goes GA blog post.
Switching to containerd as a container runtime eliminates the middleman. All the same
containers can be run by container runtimes like containerd as before. But now, because
containers are scheduled directly with the container runtime, they are not visible to Docker. So
any Docker tooling or fancy UI you might have used before to check on these containers is no
longer available.
You cannot get container information using docker ps or docker inspect commands. As you
cannot list containers, you cannot get logs, stop containers, or execute something inside a
container using docker exec.
Note: If you're running workloads via Kubernetes, the best way to stop a container is through
the Kubernetes API rather than directly through the container runtime (this advice applies for
all container runtimes, not only Docker).
You can still pull images or build them using the docker build command. However, images built
or pulled by Docker are not visible to the container runtime or to Kubernetes. They need to be
pushed to a registry before Kubernetes can use them.
Known issues
Some filesystem metrics are missing and the metrics format is different
container_fs_inodes_free
container_fs_inodes_total
container_fs_io_current
container_fs_io_time_seconds_total
container_fs_io_time_weighted_seconds_total
container_fs_limit_bytes
container_fs_read_seconds_total
container_fs_reads_merged_total
container_fs_sector_reads_total
container_fs_sector_writes_total
container_fs_usage_bytes
container_fs_write_seconds_total
container_fs_writes_merged_total
Workaround
1. Find the latest cAdvisor release with the name pattern vX.Y.Z-containerd-cri (for
example, v0.42.0-containerd-cri).
2. Follow the steps in cAdvisor Kubernetes Daemonset to create the daemonset.
3. Point the installed metrics collector to use the cAdvisor /metrics endpoint which provides
the full set of Prometheus container metrics.
Alternatives:
What's next
• Read Migrating from dockershim to understand your next steps
• Read the dockershim deprecation FAQ article for more information.
Migrating telemetry and security agents
from dockershim
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
Kubernetes' support for direct integration with Docker Engine is deprecated and has been
removed. Most apps do not have a direct dependency on runtime hosting containers. However,
there are still a lot of telemetry and monitoring agents that have a dependency on Docker to
collect container metadata, logs, and metrics. This document aggregates information on how to
detect these dependencies as well as links on how to migrate these agents to use generic tools
or alternative runtimes.
Historically, Kubernetes was written to work specifically with Docker Engine. Kubernetes took
care of networking and scheduling, relying on Docker Engine for launching and running
containers (within Pods) on a node. Some information that is relevant to telemetry, such as a
pod name, is only available from Kubernetes components. Other data, such as container
metrics, is not the responsibility of the container runtime. Early telemetry agents needed to
query the container runtime and Kubernetes to report an accurate picture. Over time,
Kubernetes gained the ability to support multiple runtimes, and now supports any runtime that
is compatible with the container runtime interface.
Some telemetry agents rely specifically on Docker Engine tooling. For example, an agent might
run a command such as docker ps or docker top to list containers and processes or docker logs
to receive streamed logs. If nodes in your existing cluster use Docker Engine, and you switch to
a different container runtime, these commands will not work any longer.
If a pod wants to make calls to the dockerd running on the node, the pod must either:
• mount the filesystem containing the Docker daemon's privileged socket, as a volume; or
• mount the specific path of the Docker daemon's privileged socket directly, also as a
volume.
For example: on COS images, Docker exposes its Unix domain socket at /var/run/docker.sock
This means that the pod spec will include a hostPath volume mount of /var/run/docker.sock.
Here's a sample shell script to find Pods that have a mount directly mapping the Docker socket.
This script outputs the namespace and name of the pod. You can remove the grep '/var/run/
docker.sock' to review other mounts.
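The script itself is not included above; a minimal sketch of what it could look like, using
kubectl with a JSONPath template to print each Pod's namespace, name, and hostPath volume paths:
kubectl get pods --all-namespaces \
-o=jsonpath='{range .items[*]}{"\n"}{.metadata.namespace}{":\t"}{.metadata.name}{":\t"}{range .spec.volumes[*]}{.hostPath.path}{", "}{end}{end}' \
| sort \
| grep '/var/run/docker.sock'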
Note: There are alternative ways for a pod to access Docker on the host. For instance, the
parent directory /var/run may be mounted instead of the full path (like in this example). The
script above only detects the most common uses.
If your cluster nodes are customized and install additional security and telemetry agents on the
node, check with the agent vendor to verify whether it has any dependency on Docker.
This section is intended to aggregate information about various telemetry and security agents
that may have a dependency on container runtimes.
We keep a work-in-progress version of the migration instructions for various telemetry and
security agent vendors in a Google doc. Please contact the vendor to get up-to-date instructions
for migrating from dockershim.
No changes are needed: everything should work seamlessly on the runtime switch.
Datadog
How to migrate: Docker deprecation in Kubernetes. The pod that accesses Docker Engine may
have a name containing any of:
• datadog-agent
• datadog
• dd-agent
Dynatrace
CRI-O support announcement: Get automated full-stack visibility into your CRI-O Kubernetes
containers (Beta)
The pod accessing Docker may have a name containing:
• dynatrace-oneagent
Falco
How to migrate:
Migrate Falco from dockershim Falco supports any CRI-compatible runtime (containerd is used
in the default configuration); the documentation explains all details. The pod accessing Docker
may have a name containing:
• falco
Check the documentation for Prisma Cloud, under the "Install Prisma Cloud on a CRI (non-Docker)
cluster" section. The pod accessing Docker may be named like:
• twistlock-defender-ds
SignalFx (Splunk)
The SignalFx Smart Agent (deprecated) uses several different monitors for Kubernetes including
kubernetes-cluster, kubelet-stats/kubelet-metrics, and docker-container-stats. The kubelet-stats
monitor was previously deprecated by the vendor, in favor of kubelet-metrics. The docker-
container-stats monitor is the one affected by dockershim removal. Do not use the docker-
container-stats monitor with container runtimes other than Docker Engine.
1. Remove docker-container-stats from the list of configured monitors. Note that keeping this
monitor enabled with a non-dockershim runtime results in incorrect metrics being reported when
Docker is installed on the node, and no metrics when Docker is not installed.
2. Enable and configure kubelet-metrics monitor.
Note: The set of collected metrics will change. Review your alerting rules and dashboards.
• signalfx-agent
Flame does not support container runtimes other than Docker. See https://github.com/yahoo/kubectl-flame/issues/51
2. Generate a new certificate authority (CA). --batch sets automatic mode; --req-cn specifies
the Common Name (CN) for the CA's new root certificate.
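The commands for this step are not shown above; a sketch of what they could look like with
easyrsa3, assuming the tool has already been downloaded and MASTER_IP is set for your cluster:
./easyrsa init-pki
./easyrsa --batch "--req-cn=${MASTER_IP}@`date +%s`" build-ca nopass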
The argument --subject-alt-name sets the possible IPs and DNS names the API server will
be accessed with. The MASTER_CLUSTER_IP is usually the first IP from the service CIDR
that is specified as the --service-cluster-ip-range argument for both the API server and
the controller manager component. The argument --days is used to set the number of
days after which the certificate expires. The sample below also assumes that you are using
cluster.local as the default DNS domain name.
./easyrsa --subject-alt-name="IP:${MASTER_IP},"\
"IP:${MASTER_CLUSTER_IP},"\
"DNS:kubernetes,"\
"DNS:kubernetes.default,"\
"DNS:kubernetes.default.svc,"\
"DNS:kubernetes.default.svc.cluster,"\
"DNS:kubernetes.default.svc.cluster.local" \
--days=10000 \
build-server-full server nopass
5. Fill in and add the following parameters into the API server start parameters:
--client-ca-file=/yourdirectory/ca.crt
--tls-cert-file=/yourdirectory/server.crt
--tls-private-key-file=/yourdirectory/server.key
openssl
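1. The first step (generating the CA key) is not shown above; a sketch, producing the ca.key
file that step 2 refers to:
openssl genrsa -out ca.key 2048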
2. Use the ca.key to generate a ca.crt (use the -days argument to set the certificate validity
period):
openssl req -x509 -new -nodes -key ca.key -subj "/CN=${MASTER_IP}" -days 10000 -out ca.crt
Be sure to substitute the values marked with angle brackets (e.g. <MASTER_IP>) with
real values before saving this to a file (e.g. csr.conf). Note that the value for
MASTER_CLUSTER_IP is the service cluster IP for the API server as described in
previous subsection. The sample below also assumes that you are using cluster.local as
the default DNS domain name.
[ req ]
default_bits = 2048
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn
[ dn ]
C = <country>
ST = <state>
L = <city>
O = <organization>
OU = <organization unit>
CN = <MASTER_IP>
[ req_ext ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster
DNS.5 = kubernetes.default.svc.cluster.local
IP.1 = <MASTER_IP>
IP.2 = <MASTER_CLUSTER_IP>
[ v3_ext ]
authorityKeyIdentifier=keyid,issuer:always
basicConstraints=CA:FALSE
keyUsage=keyEncipherment,dataEncipherment
extendedKeyUsage=serverAuth,clientAuth
subjectAltName=@alt_names
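The intermediate steps that produce server.key and server.csr are not shown above; a sketch,
using the csr.conf file defined above and the filenames referenced in step 6:
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr -config csr.conf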
6. Generate the server certificate using the ca.key, ca.crt and server.csr:
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out server.crt -days 10000 \
-extensions v3_ext -extfile csr.conf -sha256
Finally, add the same parameters into the API server start parameters.
cfssl
1. Download, unpack and prepare the command line tools as shown below.
Note that you may need to adapt the sample commands based on the hardware
architecture and cfssl version you are using.
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssl_1.5.0_linux_amd64 -o cfssl
chmod +x cfssl
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssljson_1.5.0_linux_amd64 -o cfssljson
chmod +x cfssljson
curl -L https://github.com/cloudflare/cfssl/releases/download/v1.5.0/cfssl-certinfo_1.5.0_linux_amd64 -o cfssl-certinfo
chmod +x cfssl-certinfo
2. Create a directory to hold the artifacts and initialize cfssl:
mkdir cert
cd cert
../cfssl print-defaults config > config.json
../cfssl print-defaults csr > csr.json
3. Create a JSON config file for generating the CA file, for example, ca-config.json:
{
"signing": {
"default": {
"expiry": "8760h"
},
"profiles": {
"kubernetes": {
"usages": [
"signing",
"key encipherment",
"server auth",
"client auth"
],
"expiry": "8760h"
}
}
}
}
4. Create a JSON config file for CA certificate signing request (CSR), for example, ca-
csr.json. Be sure to replace the values marked with angle brackets with real values you
want to use.
{
"CN": "kubernetes",
"key": {
"algo": "rsa",
"size": 2048
},
"names":[{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
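5. The step that generates the CA key and certificate is not shown above; a sketch of the usual
cfssl invocation, producing the ca.pem and ca-key.pem files used in the later steps:
../cfssl gencert -initca ca-csr.json | ../cfssljson -bare ca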
6. Create a JSON config file for generating keys and certificates for the API server, for
example, server-csr.json. Be sure to replace the values in angle brackets with real values
you want to use. The <MASTER_CLUSTER_IP> is the service cluster IP for the API server
as described in previous subsection. The sample below also assumes that you are using
cluster.local as the default DNS domain name.
{
"CN": "kubernetes",
"hosts": [
"127.0.0.1",
"<MASTER_IP>",
"<MASTER_CLUSTER_IP>",
"kubernetes",
"kubernetes.default",
"kubernetes.default.svc",
"kubernetes.default.svc.cluster",
"kubernetes.default.svc.cluster.local"
],
"key": {
"algo": "rsa",
"size": 2048
},
"names": [{
"C": "<country>",
"ST": "<state>",
"L": "<city>",
"O": "<organization>",
"OU": "<organization unit>"
}]
}
7. Generate the key and certificate for the API server, which are by default saved into the
files server-key.pem and server.pem respectively:
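The command for this step is not shown above; a sketch, assuming the ca.pem, ca-key.pem,
ca-config.json and server-csr.json files created in the earlier steps:
../cfssl gencert -ca=ca.pem -ca-key=ca-key.pem \
--config=ca-config.json -profile=kubernetes \
server-csr.json | ../cfssljson -bare server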
Certificates API
You can use the certificates.k8s.io API to provision x509 certificates to use for authentication as
documented in the Managing TLS in a cluster task page.
Define a default memory resource limit for a namespace, so that every new Pod in that
namespace has a memory resource limit configured.
Define a default CPU resource limit for a namespace, so that every new Pod in that namespace
has a CPU resource limit configured.
Configure Minimum and Maximum Memory Constraints for a Namespace
Define a range of valid memory resource limits for a namespace, so that every new Pod in that
namespace falls within the range you configure.
Define a range of valid CPU resource limits for a namespace, so that every new Pod in that
namespace falls within the range you configure.
This page shows how to configure default memory requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. Once you have a namespace that has a
default memory limit, and you then try to create a Pod with a container that does not specify its
own memory limit, then the control plane assigns the default memory limit to that container.
Kubernetes assigns a default memory request under certain conditions that are explained later
in this topic.
• Killercoda
• Play with Kubernetes
admin/resource/memory-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: mem-limit-range
spec:
limits:
- default:
memory: 512Mi
defaultRequest:
memory: 256Mi
type: Container
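The commands that create the namespace and the LimitRange are not shown above; they could look
like the following. The https://k8s.io/examples/ URL is the usual location for the manifest
named above; substitute a local file path if you saved it yourself.
kubectl create namespace default-mem-example
kubectl apply -f https://k8s.io/examples/admin/resource/memory-defaults.yaml --namespace=default-mem-example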
Now if you create a Pod in the default-mem-example namespace, and any container within that
Pod does not specify its own values for memory request and memory limit, then the control
plane applies default values: a memory request of 256MiB and a memory limit of 512MiB.
Here's an example manifest for a Pod that has one container. The container does not specify a
memory request and limit.
admin/resource/memory-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo
spec:
containers:
- name: default-mem-demo-ctr
image: nginx
The output shows that the Pod's container has a memory request of 256 MiB and a memory
limit of 512 MiB. These are the default values specified by the LimitRange.
containers:
- image: nginx
imagePullPolicy: Always
name: default-mem-demo-ctr
resources:
limits:
memory: 512Mi
requests:
memory: 256Mi
admin/resource/memory-defaults-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo-2
spec:
containers:
- name: default-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1Gi"
The output shows that the container's memory request is set to match its memory limit. Notice
that the container was not assigned the default memory request value of 256Mi.
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
admin/resource/memory-defaults-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-mem-demo-3
spec:
containers:
- name: default-mem-demo-3-ctr
image: nginx
resources:
requests:
memory: "128Mi"
The output shows that the container's memory request is set to the value specified in the
container's manifest. The container is limited to use no more than 512MiB of memory, which
matches the default memory limit for the namespace.
resources:
limits:
memory: 512Mi
requests:
memory: 128Mi
Note: A LimitRange does not check the consistency of the default values it applies. This means
that a default value for the limit that is set by LimitRange may be less than the request value
specified for the container in the spec that a client submits to the API server. If that happens,
the final Pod will not be schedulable. See Constraints on resource limits and requests for more
details.
Motivation for default memory limits and requests
If your namespace has a memory resource quota configured, it is helpful to have a default value
in place for memory limit. Here are three of the restrictions that a resource quota imposes on a
namespace:
• For every Pod that runs in the namespace, the Pod and each of its containers must have a
memory limit. (If you specify a memory limit for every container in a Pod, Kubernetes
can infer the Pod-level memory limit by adding up the limits for its containers).
• Memory limits apply a resource reservation on the node where the Pod in question is
scheduled. The total amount of memory reserved for all Pods in the namespace must not
exceed a specified limit.
• The total amount of memory actually used by all Pods in the namespace must also not
exceed a specified limit.
If any Pod in that namespace that includes a container does not specify its own memory limit,
the control plane applies the default memory limit to that container, and the Pod can be allowed
to run in a namespace that is restricted by a memory ResourceQuota.
Clean up
Delete your namespace:
What's next
For cluster administrators
This page shows how to configure default CPU requests and limits for a namespace.
A Kubernetes cluster can be divided into namespaces. If you create a Pod within a namespace
that has a default CPU limit, and any container in that Pod does not specify its own CPU limit,
then the control plane assigns the default CPU limit to that container.
Kubernetes assigns a default CPU request, but only under certain conditions that are explained
later in this page.
• Killercoda
• Play with Kubernetes
If you're not already familiar with what Kubernetes means by 1.0 CPU, read meaning of CPU.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
admin/resource/cpu-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-limit-range
spec:
limits:
- default:
cpu: 1
defaultRequest:
cpu: 0.5
type: Container
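As in the memory example, the commands that create the namespace and the LimitRange are not
shown above; a sketch (substitute a local file path for the manifest if needed):
kubectl create namespace default-cpu-example
kubectl apply -f https://k8s.io/examples/admin/resource/cpu-defaults.yaml --namespace=default-cpu-example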
Now if you create a Pod in the default-cpu-example namespace, and any container in that Pod
does not specify its own values for CPU request and CPU limit, then the control plane applies
default values: a CPU request of 0.5 and a default CPU limit of 1.
Here's a manifest for a Pod that has one container. The container does not specify a CPU
request and limit.
admin/resource/cpu-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo
spec:
containers:
- name: default-cpu-demo-ctr
image: nginx
The output shows that the Pod's only container has a CPU request of 500m cpu (which you can
read as “500 millicpu”), and a CPU limit of 1 cpu. These are the default values specified by the
LimitRange.
containers:
- image: nginx
imagePullPolicy: Always
name: default-cpu-demo-ctr
resources:
limits:
cpu: "1"
requests:
cpu: 500m
What if you specify a container's limit, but not its
request?
Here's a manifest for a Pod that has one container. The container specifies a CPU limit, but not
a request:
admin/resource/cpu-defaults-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo-2
spec:
containers:
- name: default-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1"
The output shows that the container's CPU request is set to match its CPU limit. Notice that the
container was not assigned the default CPU request value of 0.5 cpu:
resources:
limits:
cpu: "1"
requests:
cpu: "1"
admin/resource/cpu-defaults-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: default-cpu-demo-3
spec:
containers:
- name: default-cpu-demo-3-ctr
image: nginx
resources:
requests:
cpu: "0.75"
The output shows that the container's CPU request is set to the value you specified at the time
you created the Pod (in other words: it matches the manifest). However, the same container's
CPU limit is set to 1 cpu, which is the default CPU limit for that namespace.
resources:
limits:
cpu: "1"
requests:
cpu: 750m
• For every Pod that runs in the namespace, each of its containers must have a CPU limit.
• CPU limits apply a resource reservation on the node where the Pod in question is
scheduled. The total amount of CPU that is reserved for use by all Pods in the namespace
must not exceed a specified limit.
If any Pod in that namespace that includes a container does not specify its own CPU limit, the
control plane applies the default CPU limit to that container, and the Pod can be allowed to run
in a namespace that is restricted by a CPU ResourceQuota.
Clean up
Delete your namespace:
This page shows how to set minimum and maximum values for memory used by containers
running in a namespace. You specify minimum and maximum memory values in a LimitRange
object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot be created
in the namespace.
• Killercoda
• Play with Kubernetes
Each node in your cluster must have at least 1 GiB of memory available for Pods.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
admin/resource/memory-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: mem-min-max-demo-lr
spec:
limits:
- max:
memory: 1Gi
min:
memory: 500Mi
type: Container
The output shows the minimum and maximum memory constraints as expected. But notice that
even though you didn't specify default values in the configuration file for the LimitRange, they
were created automatically.
limits:
- default:
memory: 1Gi
defaultRequest:
memory: 1Gi
max:
memory: 1Gi
min:
memory: 500Mi
type: Container
Now whenever you define a Pod within the constraints-mem-example namespace, Kubernetes
performs these steps:
• If any container in that Pod does not specify its own memory request and limit, the
control plane assigns the default memory request and limit to that container.
• Verify that every container in that Pod requests at least 500 MiB of memory.
• Verify that every container in that Pod requests no more than 1024 MiB (1 GiB) of
memory.
Here's a manifest for a Pod that has one container. Within the Pod spec, the sole container
specifies a memory request of 600 MiB and a memory limit of 800 MiB. These satisfy the
minimum and maximum memory constraints imposed by the LimitRange.
admin/resource/memory-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo
spec:
containers:
- name: constraints-mem-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "600Mi"
Verify that the Pod is running and that its container is healthy:
The output shows that the container within that Pod has a memory request of 600 MiB and a
memory limit of 800 MiB. These satisfy the constraints imposed by the LimitRange for this
namespace:
resources:
limits:
memory: 800Mi
requests:
memory: 600Mi
admin/resource/memory-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-2
spec:
containers:
- name: constraints-mem-demo-2-ctr
image: nginx
resources:
limits:
memory: "1.5Gi"
requests:
memory: "800Mi"
The output shows that the Pod does not get created, because it defines a container that requests
more memory than is allowed:
admin/resource/memory-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-3
spec:
containers:
- name: constraints-mem-demo-3-ctr
image: nginx
resources:
limits:
memory: "800Mi"
requests:
memory: "100Mi"
The output shows that the Pod does not get created, because it defines a container that requests
less memory than the enforced minimum:
admin/resource/memory-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-mem-demo-4
spec:
containers:
- name: constraints-mem-demo-4-ctr
image: nginx
The output shows that the Pod's only container has a memory request of 1 GiB and a memory
limit of 1 GiB. How did that container get those values?
resources:
limits:
memory: 1Gi
requests:
memory: 1Gi
Because your Pod did not define any memory request and limit for that container, the cluster
applied a default memory request and limit from the LimitRange.
This means that the definition of that Pod shows those values. You can check it using kubectl
describe:
At this point, your Pod might be running or it might not be running. Recall that a prerequisite
for this task is that your Nodes have at least 1 GiB of memory. If each of your Nodes has only 1
GiB of memory, then there is not enough allocatable memory on any Node to accommodate a
memory request of 1 GiB. If you happen to be using Nodes with 2 GiB of memory, then you
probably have enough space to accommodate the 1 GiB request.
• Each Node in a cluster has 2 GiB of memory. You do not want to accept any Pod that
requests more than 2 GiB of memory, because no Node in the cluster can support the
request.
• A cluster is shared by your production and development departments. You want to allow
production workloads to consume up to 8 GiB of memory, but you want development
workloads to be limited to 512 MiB. You create separate namespaces for production and
development, and you apply memory constraints to each namespace.
Clean up
Delete your namespace:
kubectl delete namespace constraints-mem-example
What's next
For cluster administrators
This page shows how to set minimum and maximum values for the CPU resources used by
containers and Pods in a namespace. You specify minimum and maximum CPU values in a
LimitRange object. If a Pod does not meet the constraints imposed by the LimitRange, it cannot
be created in the namespace.
• Killercoda
• Play with Kubernetes
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
admin/resource/cpu-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
name: cpu-min-max-demo-lr
spec:
limits:
- max:
cpu: "800m"
min:
cpu: "200m"
type: Container
The output shows the minimum and maximum CPU constraints as expected. But notice that
even though you didn't specify default values in the configuration file for the LimitRange, they
were created automatically.
limits:
- default:
cpu: 800m
defaultRequest:
cpu: 800m
max:
cpu: 800m
min:
cpu: 200m
type: Container
Now whenever you create a Pod in the constraints-cpu-example namespace (or some other
client of the Kubernetes API creates an equivalent Pod), Kubernetes performs these steps:
• If any container in that Pod does not specify its own CPU request and limit, the control
plane assigns the default CPU request and limit to that container.
• Verify that every container in that Pod specifies a CPU request that is greater than or
equal to 200 millicpu.
• Verify that every container in that Pod specifies a CPU limit that is less than or equal to
800 millicpu.
Note: When creating a LimitRange object, you can specify limits on huge-pages or GPUs as
well. However, when both default and defaultRequest are specified on these resources, the two
values must be the same.
Here's a manifest for a Pod that has one container. The container manifest specifies a CPU
request of 500 millicpu and a CPU limit of 800 millicpu. These satisfy the minimum and
maximum CPU constraints imposed by the LimitRange for this namespace.
admin/resource/cpu-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo
spec:
containers:
- name: constraints-cpu-demo-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "500m"
Verify that the Pod is running and that its container is healthy:
The output shows that the Pod's only container has a CPU request of 500 millicpu and CPU
limit of 800 millicpu. These satisfy the constraints imposed by the LimitRange.
resources:
limits:
cpu: 800m
requests:
cpu: 500m
admin/resource/cpu-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-2
spec:
containers:
- name: constraints-cpu-demo-2-ctr
image: nginx
resources:
limits:
cpu: "1.5"
requests:
cpu: "500m"
The output shows that the Pod does not get created, because it defines an unacceptable
container. That container is not acceptable because it specifies a CPU limit that is too large:
admin/resource/cpu-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-3
spec:
containers:
- name: constraints-cpu-demo-3-ctr
image: nginx
resources:
limits:
cpu: "800m"
requests:
cpu: "100m"
The output shows that the Pod does not get created, because it defines an unacceptable
container. That container is not acceptable because it specifies a CPU request that is lower than
the enforced minimum:
admin/resource/cpu-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-demo-4
spec:
containers:
- name: constraints-cpu-demo-4-ctr
image: vish/stress
resources:
limits:
cpu: 800m
requests:
cpu: 800m
Because that container did not specify its own CPU request and limit, the control plane applied
the default CPU request and limit from the LimitRange for this namespace.
At this point, your Pod may or may not be running. Recall that a prerequisite for this task is
that your Nodes must have at least 1 CPU available for use. If each of your Nodes has only 1
CPU, then there might not be enough allocatable CPU on any Node to accommodate a request
of 800 millicpu. If you happen to be using Nodes with 2 CPU, then you probably have enough
CPU to accommodate the 800 millicpu request.
• Each Node in a cluster has 2 CPU. You do not want to accept any Pod that requests more
than 2 CPU, because no Node in the cluster can support the request.
• A cluster is shared by your production and development departments. You want to allow
production workloads to consume up to 3 CPU, but you want development workloads to
be limited to 1 CPU. You create separate namespaces for production and development,
and you apply CPU constraints to each namespace.
Clean up
Delete your namespace:
This page shows how to set quotas for the total amount of memory and CPU that can be used by
all Pods running in a namespace. You specify quotas in a ResourceQuota object.
• Killercoda
• Play with Kubernetes
Create a ResourceQuota
Here is a manifest for an example ResourceQuota:
admin/resource/quota-mem-cpu.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: mem-cpu-demo
spec:
hard:
requests.cpu: "1"
requests.memory: 1Gi
limits.cpu: "2"
limits.memory: 2Gi
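The commands that create a namespace and the ResourceQuota are not shown above; a sketch, using
quota-mem-cpu-example as an illustrative namespace name:
kubectl create namespace quota-mem-cpu-example
kubectl apply -f https://k8s.io/examples/admin/resource/quota-mem-cpu.yaml --namespace=quota-mem-cpu-example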
The ResourceQuota places these requirements on the namespace:
• For every Pod in the namespace, each container must have a memory request, memory
limit, cpu request, and cpu limit.
• The memory request total for all Pods in that namespace must not exceed 1 GiB.
• The memory limit total for all Pods in that namespace must not exceed 2 GiB.
• The CPU request total for all Pods in that namespace must not exceed 1 cpu.
• The CPU limit total for all Pods in that namespace must not exceed 2 cpu.
Create a Pod
Here is a manifest for an example Pod:
admin/resource/quota-mem-cpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: quota-mem-cpu-demo
spec:
containers:
- name: quota-mem-cpu-demo-ctr
image: nginx
resources:
limits:
memory: "800Mi"
cpu: "800m"
requests:
memory: "600Mi"
cpu: "400m"
Verify that the Pod is running and that its (only) container is healthy:
The output shows the quota along with how much of the quota has been used. You can see that
the memory and CPU requests and limits for your Pod do not exceed the quota.
status:
hard:
limits.cpu: "2"
limits.memory: 2Gi
requests.cpu: "1"
requests.memory: 1Gi
used:
limits.cpu: 800m
limits.memory: 800Mi
requests.cpu: 400m
requests.memory: 600Mi
If you have the jq tool, you can also query (using JSONPath) for just the used values, and
pretty-print that part of the output. For example:
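A sketch of such a command; mem-cpu-demo is the ResourceQuota defined above, and the namespace
name is the illustrative one used earlier:
kubectl get resourcequota mem-cpu-demo --namespace=quota-mem-cpu-example -o jsonpath='{ .status.used }' | jq .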
admin/resource/quota-mem-cpu-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
name: quota-mem-cpu-demo-2
spec:
containers:
- name: quota-mem-cpu-demo-2-ctr
image: redis
resources:
limits:
memory: "1Gi"
cpu: "800m"
requests:
memory: "700Mi"
cpu: "400m"
In the manifest, you can see that the Pod has a memory request of 700 MiB. Notice that the sum
of the used memory request and this new memory request exceeds the memory request quota:
600 MiB + 700 MiB > 1 GiB.
The second Pod does not get created. The output shows that creating the second Pod would
cause the memory request total to exceed the memory request quota.
Discussion
As you have seen in this exercise, you can use a ResourceQuota to restrict the memory request
total for all Pods running in a namespace. You can also restrict the totals for memory limit, cpu
request, and cpu limit.
Instead of managing total resource use within a namespace, you might want to restrict
individual Pods, or the containers in those Pods. To achieve that kind of limiting, use a
LimitRange.
Clean up
Delete your namespace:
What's next
For cluster administrators
This page shows how to set a quota for the total number of Pods that can run in a Namespace.
You specify quotas in a ResourceQuota object.
• Killercoda
• Play with Kubernetes
Create a ResourceQuota
Here is an example manifest for a ResourceQuota:
admin/resource/quota-pod.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
name: pod-demo
spec:
hard:
pods: "2"
The output shows that the namespace has a quota of two Pods, and that currently there are no
Pods; that is, none of the quota is used.
spec:
hard:
pods: "2"
status:
hard:
pods: "2"
used:
pods: "0"
admin/resource/quota-pod-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: pod-quota-demo
spec:
selector:
matchLabels:
purpose: quota-demo
replicas: 3
template:
metadata:
labels:
purpose: quota-demo
spec:
containers:
- name: pod-quota-demo
image: nginx
In that manifest, replicas: 3 tells Kubernetes to attempt to create three new Pods, all running the
same application.
The output shows that even though the Deployment specifies three replicas, only two Pods
were created because of the quota you defined earlier:
spec:
...
replicas: 3
...
status:
availableReplicas: 2
...
lastUpdateTime: 2021-04-02T20:57:05Z
message: 'unable to create pods: pods "pod-quota-demo-1650323038-" is forbidden:
exceeded quota: pod-demo, requested: pods=1, used: pods=2, limited: pods=2'
Choice of resource
In this task you have defined a ResourceQuota that limited the total number of Pods, but you
could also limit the total number of other kinds of objects. For example, you might decide to
limit how many CronJobs can exist in a single namespace.
Clean up
Delete your namespace:
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes
NetworkPolicy.
Syntax
Example
The Calico pods begin with calico. Check to make sure each one has a status of Running.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes
NetworkPolicy.
Use Cilium for NetworkPolicy
This page shows how to use Cilium for NetworkPolicy.
• Killercoda
• Play with Kubernetes
To start minikube (minimal version required: v1.5.2 or later), first check your minikube version:
minikube version
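Then start minikube with CNI networking enabled so that Cilium can manage Pod networking (the
exact flag may vary with your minikube version):
minikube start --network-plugin=cni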
For minikube you can install Cilium using its CLI tool. To do so, first download the latest
version of the CLI with the following command:
Then extract the downloaded file to your /usr/local/bin directory with the following command:
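Neither command is reproduced above; a sketch of both, based on the Cilium quick-installation
instructions. The release URL and asset name are illustrative and assume an amd64 Linux host;
check the Cilium documentation for the current commands.
# download the latest stable release of the Cilium CLI
curl -LO https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz
# extract the binary into /usr/local/bin
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin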
After running the above commands, you can now install Cilium with the following command:
cilium install
Cilium will then automatically detect the cluster configuration and create and install the
appropriate components for a successful installation. The components are:
• Certificate Authority (CA) in Secret cilium-ca and certificates for Hubble (Cilium's
observability layer).
• Service accounts.
• Cluster roles.
• ConfigMap.
• Agent DaemonSet and an Operator Deployment.
After the installation, you can view the overall status of the Cilium deployment with the cilium
status command. See the expected output of the status command here.
The remainder of the Getting Started Guide explains how to enforce both L3/L4 (i.e., IP address
+ port) security policies, as well as L7 (e.g., HTTP) security policies using an example
application.
A cilium Pod runs on each node in your cluster and enforces network policy on the traffic to/
from Pods on that node using Linux BPF.
What's next
Once your cluster is running, you can follow the Declare Network Policy to try out Kubernetes
NetworkPolicy with Cilium. Have fun, and if you have questions, contact us using the Cilium
Slack Channel.
What's next
Once you have installed the Kube-router addon, you can follow the Declare Network Policy to
try out Kubernetes NetworkPolicy.
What's next
Once you have installed Romana, you can follow the Declare Network Policy to try out
Kubernetes NetworkPolicy.
The Weave Net addon for Kubernetes comes with a Network Policy Controller that
automatically monitors Kubernetes for any NetworkPolicy annotations on all namespaces and
configures iptables rules to allow or block traffic as directed by the policies.
Each Node has a weave Pod, and all Pods are Running and 2/2 READY. (2/2 means that each
Pod has weave and weave-npc.)
What's next
Once you have installed the Weave Net addon, you can follow the Declare Network Policy to
try out Kubernetes NetworkPolicy. If you have any question, contact us at #weave-community
on Slack or Weave User Group.
• Killercoda
• Play with Kubernetes
When accessing the Kubernetes API for the first time, use the Kubernetes command-line tool,
kubectl.
To access a cluster, you need to know the location of the cluster and have credentials to access
it. Typically, this is automatically set up when you work through a Getting started guide, or
someone else set up the cluster and provided you with credentials and a location.
Check the location and credentials that kubectl knows about with this command:
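The command is not shown above; it is the standard kubectl subcommand for inspecting your
kubeconfig:
kubectl config view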
kubectl handles locating and authenticating to the API server. If you want to directly access the
REST API with an http client like curl or wget, or a browser, there are multiple ways you can
locate and authenticate against the API server:
1. Run kubectl in proxy mode (recommended). This method is recommended, since it uses
the stored apiserver location and verifies the identity of the API server using a self-signed
cert. No man-in-the-middle (MITM) attack is possible using this method.
2. Alternatively, you can provide the location and credentials directly to the http client. This
works with client code that is confused by proxies. To protect against man in the middle
attacks, you'll need to import a root cert into your browser.
Using the Go or Python client libraries gives you access in a way that is equivalent to kubectl in proxy mode.
The following command runs kubectl in a mode where it acts as a reverse proxy. It handles
locating the API server and authenticating.
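The command itself is not shown above; using the port that the curl example below expects, it
could look like:
kubectl proxy --port=8080 &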
Then you can explore the API with curl, wget, or a browser, like so:
curl http://localhost:8080/api/
{
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "10.0.1.149:443"
}
]
}
It is possible to avoid using kubectl proxy by passing an authentication token directly to the
API server, like this:
# Check all possible clusters, as your .KUBECONFIG may have multiple contexts:
kubectl config view -o jsonpath='{"Cluster name\tServer\n"}{range .clusters[*]}{.name}{"\t"}
{.cluster.server}{"\n"}{end}'
# Select name of cluster you want to interact with from above output:
export CLUSTER_NAME="some_server_name"
# Wait for the token controller to populate the secret with a token:
while ! kubectl describe secret default-token | grep -E '^token' >/dev/null; do
echo "waiting for token..." >&2
sleep 1
done
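The remaining steps of this example are not shown above; a sketch of how it could continue,
deriving the API server address from the selected cluster and reading the token from the
default-token Secret that the loop above waits for (the Secret itself is assumed to have been
created earlier in the example):
# Point to the API server of the selected cluster
APISERVER=$(kubectl config view -o jsonpath="{.clusters[?(@.name==\"$CLUSTER_NAME\")].cluster.server}")
# Get the token value
TOKEN=$(kubectl get secret default-token -o jsonpath='{.data.token}' | base64 --decode)
# Explore the API with the token
curl -X GET $APISERVER/api --header "Authorization: Bearer $TOKEN" --insecure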
The above example uses the --insecure flag. This leaves it subject to MITM attacks. When
kubectl accesses the cluster it uses a stored root certificate and client certificates to access the
server. (These are installed in the ~/.kube directory). Since cluster certificates are typically self-
signed, it may take special configuration to get your http client to use the root certificate.
On some clusters, the API server does not require authentication; it may serve on localhost, or
be protected by a firewall. There is not a standard for this. Controlling Access to the Kubernetes
API describes how you can configure this as a cluster administrator.
Kubernetes officially supports client libraries for Go, Python, Java, dotnet, JavaScript, and
Haskell. There are other client libraries that are provided and maintained by their authors, not
the Kubernetes team. See client libraries for accessing the API from other languages and how
they authenticate.
Go client
Note: client-go defines its own API objects, so if needed, import API definitions from client-go
rather than from the main repository. For example, import "k8s.io/client-go/kubernetes" is
correct.
The Go client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
package main
import (
"context"
"fmt"
"k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/clientcmd"
)
func main() {
// uses the current context in kubeconfig
// path-to-kubeconfig -- for example, /root/.kube/config
config, _ := clientcmd.BuildConfigFromFlags("", "<path-to-kubeconfig>")
// creates the clientset
clientset, _ := kubernetes.NewForConfig(config)
// access the API to list pods
pods, _ := clientset.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{})
fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
}
If the application is deployed as a Pod in the cluster, see Accessing the API from within a Pod.
Python client
To use Python client, run the following command: pip install kubernetes. See Python Client
Library page for more installation options.
The Python client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
config.load_kube_config()
v1=client.CoreV1Api()
print("Listing pods with their IPs:")
ret = v1.list_pod_for_all_namespaces(watch=False)
for i in ret.items:
print("%s\t%s\t%s" % (i.status.pod_ip, i.metadata.namespace, i.metadata.name))
Java client
The Java client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
package io.kubernetes.client.examples;
import io.kubernetes.client.ApiClient;
import io.kubernetes.client.ApiException;
import io.kubernetes.client.Configuration;
import io.kubernetes.client.apis.CoreV1Api;
import io.kubernetes.client.models.V1Pod;
import io.kubernetes.client.models.V1PodList;
import io.kubernetes.client.util.ClientBuilder;
import io.kubernetes.client.util.KubeConfig;
import java.io.FileReader;
import java.io.IOException;
/**
* A simple example of how to use the Java API from an application outside a kubernetes cluster
*
* <p>Easiest way to run this: mvn exec:java
* -Dexec.mainClass="io.kubernetes.client.examples.KubeConfigFileClientExample"
*
*/
public class KubeConfigFileClientExample {
public static void main(String[] args) throws IOException, ApiException {

// file path to your KubeConfig
String kubeConfigPath = System.getenv("HOME") + "/.kube/config";

// load the out-of-cluster config, a kubeconfig from the file path above
ApiClient client =
ClientBuilder.kubeconfig(KubeConfig.loadKubeConfig(new FileReader(kubeConfigPath))).build();

// set the global default api-client to the one loaded from the kubeconfig above
Configuration.setDefaultApiClient(client);

// the CoreV1Api loads the default api-client from the global configuration;
// from here, call api.listPodForAllNamespaces(...) as in the upstream example
CoreV1Api api = new CoreV1Api();
}
}
dotnet client
To use the dotnet client, run the following command: dotnet add package KubernetesClient
--version 1.6.1. See the dotnet Client Library page for more installation options. See
https://github.com/kubernetes-client/csharp/releases to see which versions are supported.
The dotnet client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
using System;
using k8s;
namespace simple
{
internal class PodList
{
private static void Main(string[] args)
{
var config = KubernetesClientConfiguration.BuildDefaultConfig();
IKubernetes client = new Kubernetes(config);
Console.WriteLine("Starting Request!");
JavaScript client
To install JavaScript client, run the following command: npm install @kubernetes/client-node.
See https://github.com/kubernetes-client/javascript/releases to see which versions are
supported.
The JavaScript client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
k8sApi.listNamespacedPod('default').then((res) => {
console.log(res.body);
});
Haskell client
The Haskell client can use the same kubeconfig file as the kubectl CLI does to locate and
authenticate to the API server. See this example:
exampleWithKubeConfig :: IO ()
exampleWithKubeConfig = do
oidcCache <- atomically $ newTVar $ Map.fromList []
(mgr, kcfg) <- mkKubeClientConfig oidcCache $ KubeConfigFile "/path/to/kubeconfig"
dispatchMime
mgr
kcfg
(CoreV1.listPodForAllNamespaces (Accept MimeJSON))
>>= print
What's next
• Accessing the Kubernetes API from a Pod
• Killercoda
• Play with Kubernetes
[
{
"op": "add",
"path": "/status/capacity/example.com~1dongle",
"value": "4"
}
]
Note that Kubernetes does not need to know what a dongle is or what a dongle is for. The
preceding PATCH request tells Kubernetes that your Node has four things that you call
dongles.
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request. Replace <your-node-name>
with the name of your Node:
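The request is not shown above; a sketch, assuming kubectl proxy is listening on its default
port 8001:
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status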
Note: In the preceding request, ~1 is the encoding for the character / in the patch path. The
operation path value in JSON-Patch is interpreted as a JSON-Pointer. For more details, see IETF
RFC 6901, section 3.
"capacity": {
"cpu": "2",
"memory": "2049008Ki",
"example.com/dongle": "4",
Capacity:
cpu: 2
memory: 2049008Ki
example.com/dongle: 4
Now, application developers can create Pods that request a certain number of dongles. See
Assign Extended Resources to a Container.
Discussion
Extended resources are similar to memory and CPU resources. For example, just as a Node has
a certain amount of memory and CPU to be shared by all components running on the Node, it
can have a certain number of dongles to be shared by all components running on the Node.
And just as application developers can create Pods that request a certain amount of memory
and CPU, they can create Pods that request a certain number of dongles.
Extended resources are opaque to Kubernetes; Kubernetes does not know anything about what
they are. Kubernetes knows only that a Node has a certain number of them. Extended resources
must be advertised in integer amounts. For example, a Node can advertise four dongles, but not
4.5 dongles.
Storage example
Suppose a Node has 800 GiB of a special kind of disk storage. You could create a name for the
special storage, say example.com/special-storage. Then you could advertise it in chunks of a
certain size, say 100 GiB. In that case, your Node would advertise that it has eight resources of
type example.com/special-storage.
Capacity:
...
example.com/special-storage: 8
If you want to allow arbitrary requests for special storage, you could advertise special storage
in chunks of size 1 byte. In that case, you would advertise 800Gi resources of type example.com/
special-storage.
Capacity:
...
example.com/special-storage: 800Gi
Then a Container could request any number of bytes of special storage, up to 800Gi.
Clean up
Here is a PATCH request that removes the dongle advertisement from a Node.
[
{
"op": "remove",
"path": "/status/capacity/example.com~1dongle",
}
]
Start a proxy, so that you can easily send requests to the Kubernetes API server:
kubectl proxy
In another command window, send the HTTP PATCH request. Replace <your-node-name>
with the name of your Node:
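As before, the request itself is not shown; it mirrors the earlier PATCH, with the remove
operation and the default kubectl proxy port 8001:
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "remove", "path": "/status/capacity/example.com~1dongle"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status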
What's next
For application developers
◦ Killercoda
◦ Play with Kubernetes
To check the version, enter kubectl version.
• This guide assumes your nodes use the AMD64 or Intel 64 CPU architecture.
If you see "dns-autoscaler" in the output, DNS horizontal autoscaling is already enabled, and
you can skip to Tuning autoscaling parameters.
If you don't see a Deployment for DNS services, you can also look for it by name:
Deployment/<your-deployment-name>
where <your-deployment-name> is the name of your DNS Deployment. For example, if the
name of your Deployment for DNS is coredns, your scale target is Deployment/coredns.
Note: CoreDNS is the default DNS service for Kubernetes. CoreDNS sets the label k8s-
app=kube-dns so that it can work in clusters that originally used kube-dns.
admin/dns/dns-horizontal-autoscaler.yaml
kind: ServiceAccount
apiVersion: v1
metadata:
name: kube-dns-autoscaler
namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: system:kube-dns-autoscaler
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["list", "watch"]
- apiGroups: [""]
resources: ["replicationcontrollers/scale"]
verbs: ["get", "update"]
- apiGroups: ["apps"]
resources: ["deployments/scale", "replicasets/scale"]
verbs: ["get", "update"]
# Remove the configmaps rule once below issue is fixed:
# kubernetes-incubator/cluster-proportional-autoscaler#16
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: system:kube-dns-autoscaler
subjects:
- kind: ServiceAccount
name: kube-dns-autoscaler
namespace: kube-system
roleRef:
kind: ClusterRole
name: system:kube-dns-autoscaler
apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-dns-autoscaler
namespace: kube-system
labels:
k8s-app: kube-dns-autoscaler
kubernetes.io/cluster-service: "true"
spec:
selector:
matchLabels:
k8s-app: kube-dns-autoscaler
template:
metadata:
labels:
k8s-app: kube-dns-autoscaler
spec:
priorityClassName: system-cluster-critical
securityContext:
seccompProfile:
type: RuntimeDefault
supplementalGroups: [ 65534 ]
fsGroup: 65534
nodeSelector:
kubernetes.io/os: linux
containers:
- name: autoscaler
image: registry.k8s.io/cpa/cluster-proportional-autoscaler:1.8.4
resources:
requests:
cpu: "20m"
memory: "10Mi"
command:
- /cluster-proportional-autoscaler
- --namespace=kube-system
- --configmap=kube-dns-autoscaler
# Should keep target in sync with cluster/addons/dns/kube-dns.yaml.base
- --target=<SCALE_TARGET>
# When cluster is using large nodes(with more cores), "coresPerReplica" should dominate.
# If using small nodes, "nodesPerReplica" should dominate.
- --default-params={"linear":{"coresPerReplica":256,"nodesPerReplica":
16,"preventSinglePointFailure":true,"includeUnschedulableNodes":true}}
- --logtostderr=true
- --v=2
tolerations:
- key: "CriticalAddonsOnly"
operator: "Exists"
serviceAccountName: kube-dns-autoscaler
Go to the directory that contains your configuration file, and enter this command to create the
Deployment:
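The command itself is not shown above; assuming the manifest was saved under the filename shown
earlier, it could look like:
kubectl apply -f dns-horizontal-autoscaler.yaml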
deployment.apps/dns-autoscaler created
linear: '{"coresPerReplica":256,"min":1,"nodesPerReplica":16}'
Modify the fields according to your needs. The "min" field indicates the minimal number of DNS
backends. The actual number of backends is calculated using this equation:
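The equation is not reproduced above; for the linear mode of cluster-proportional-autoscaler it
is:
replicas = max( ceil( cores * 1/coresPerReplica ), ceil( nodes * 1/nodesPerReplica ) )
The result is then kept at or above the configured "min" value.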
Note that the values of both coresPerReplica and nodesPerReplica are floats.
The idea is that when a cluster is using nodes that have many cores, coresPerReplica dominates.
When a cluster is using nodes that have fewer cores, nodesPerReplica dominates.
There are other supported scaling patterns. For details, see cluster-proportional-autoscaler.
deployment.apps/dns-autoscaler scaled
This option works if dns-autoscaler is under your own control, which means no one will re-
create it:
Option 3: Delete the dns-autoscaler manifest file from the master node
This option works if dns-autoscaler is under control of the (deprecated) Addon Manager, and
you have write access to the master node.
Sign in to the master node and delete the corresponding manifest file. The common path for this
dns-autoscaler is:
/etc/kubernetes/addons/dns-horizontal-autoscaler/dns-horizontal-autoscaler.yaml
After the manifest file is deleted, the Addon Manager will delete the dns-autoscaler
Deployment.
• An autoscaler Pod runs a client that polls the Kubernetes API server for the number of
nodes and cores in the cluster.
• A desired replica count is calculated and applied to the DNS backends based on the
current schedulable nodes and cores and the given scaling parameters.
• The scaling parameters and data points are provided via a ConfigMap to the autoscaler,
and it refreshes its parameters table every poll interval to be up to date with the latest
desired scaling parameters.
• Changes to the scaling parameters are allowed without rebuilding or restarting the
autoscaler Pod.
• The autoscaler provides a controller interface to support two control patterns: linear and
ladder.
What's next
• Read about Guaranteed Scheduling For Critical Add-On Pods.
• Learn more about the implementation of cluster-proportional-autoscaler.
• Killercoda
• Play with Kubernetes
The pre-installed default StorageClass may not fit well with your expected workload; for
example, it might provision storage that is too expensive. If this is the case, you can either
change the default StorageClass or disable it completely to avoid dynamic provisioning of
storage.
Deleting the default StorageClass may not work, as it may be re-created automatically by the
addon manager running in your cluster. Please consult the docs for your installation for details
about addon manager and how to disable individual addons.
Please note that at most one StorageClass can be marked as default. If two or more of
them are marked as default, a PersistentVolumeClaim without storageClassName
explicitly specified cannot be created.
What's next
• Learn more about PersistentVolumes.
This page shows how to migrate nodes to use event based updates for container status. The
event-based implementation reduces node resource consumption by the kubelet, compared to
the legacy approach that relies on polling. You may know this feature as evented Pod lifecycle
event generator (PLEG). That's the name used internally within the Kubernetes project for a key
implementation detail.
3. Start the container runtime with the container event generation enabled.
◦ Containerd
◦ CRI-O
Version 1.7+
Version 1.26+
Check whether CRI-O is already configured to emit CRI events by verifying that its
configuration contains:
enable_pod_events = true
To enable it, start the CRI-O daemon with the flag --enable-pod-events=true or use a
drop-in config with the following lines:
[crio.runtime]
enable_pod_events = true
Your Kubernetes server must be at or later than version 1.26. To check the version, enter
kubectl version.
4. Verify that the kubelet is using event-based container stage change monitoring. To check,
look for the term EventedPLEG in the kubelet logs.
If you have set --v to 4 and above, you might see more entries that indicate that the
kubelet is using event-based container state monitoring.
I0314 11:12:42.009542 1110177 evented.go:238] "Evented PLEG: Generated pod status from
the received event" podUID=3b2c6172-b112-447a-ba96-94e7022912dc
I0314 11:12:44.623326 1110177 evented.go:238] "Evented PLEG: Generated pod status from
the received event" podUID=b3fba5ea-a8c5-4b76-8f43-481e17e8ec40
I0314 11:12:44.714564 1110177 evented.go:238] "Evented PLEG: Generated pod status from
the received event" podUID=b3fba5ea-a8c5-4b76-8f43-481e17e8ec40
What's next
• Learn more about the design in the Kubernetes Enhancement Proposal (KEP): Kubelet
Evented PLEG for Better Performance.
• Killercoda
• Play with Kubernetes
kubectl get pv
This list also includes the name of the claims that are bound to each volume for easier
identification of dynamically provisioned volumes.
Note:
On Windows, you must double quote any JSONPath template that contains spaces (not
single quote as shown above for bash). This in turn means that you must use a single
quote or escaped double quote around any literals in the template. For example:
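A sketch of such a command, assuming the reclaim-policy patch used in this task and a placeholder volume name:
kubectl patch pv <your-pv-name> -p "{\"spec\":{\"persistentVolumeReclaimPolicy\":\"Retain\"}}"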
Afterwards, verify that your chosen PersistentVolume has the right policy:
kubectl get pv
In the output, you can see that the volume bound to claim default/claim3 now has reclaim
policy Retain. It will not be automatically deleted when a user deletes claim
default/claim3.
What's next
• Learn more about PersistentVolumes.
• Learn more about PersistentVolumeClaims.
References
• PersistentVolume
◦ Pay attention to the .spec.persistentVolumeReclaimPolicy field of PersistentVolume.
• PersistentVolumeClaim
Since cloud providers develop and release at a different pace compared to the Kubernetes
project, abstracting the provider-specific code to the cloud-controller-manager binary allows
cloud vendors to evolve independently from the core Kubernetes code.
Administration
Requirements
Every cloud has its own set of requirements for running its cloud provider integration;
these should not be too different from the requirements for running kube-controller-manager.
As a general rule of thumb you'll need:
• Cloud authentication/authorization: your cloud may require a token or IAM rules to allow
access to its APIs.
• Kubernetes authentication/authorization: cloud-controller-manager may need RBAC rules
set to speak to the Kubernetes API server.
• High availability: like kube-controller-manager, you may want a highly available setup for
the cloud controller manager using leader election (on by default).
Running cloud-controller-manager
Keep in mind that setting up your cluster to use the cloud controller manager will change your
cluster behaviour in a few ways:
• Node controller - responsible for updating Kubernetes Nodes using cloud APIs and
deleting Kubernetes Nodes that were deleted on your cloud.
• Service controller - responsible for configuring load balancers on your cloud for Services
of type LoadBalancer.
• Route controller - responsible for setting up network routes on your cloud.
• Any other features you would like to implement if you are running an out-of-tree
provider.
Examples
If you are using a cloud that is currently supported in Kubernetes core and would like to adopt
cloud controller manager, see the cloud controller manager in kubernetes core.
For cloud controller managers not in Kubernetes core, you can find the respective projects in
repositories maintained by cloud vendors or by SIGs.
For providers already in Kubernetes core, you can run the in-tree cloud controller manager as a
DaemonSet in your cluster, use the following as a guideline:
admin/cloud/ccm-example.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:cloud-controller-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: cloud-controller-manager
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: cloud-controller-manager
  name: cloud-controller-manager
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cloud-controller-manager
  template:
    metadata:
      labels:
        k8s-app: cloud-controller-manager
    spec:
      serviceAccountName: cloud-controller-manager
      containers:
      - name: cloud-controller-manager
        # for in-tree providers we use registry.k8s.io/cloud-controller-manager
        # this can be replaced with any other image for out-of-tree providers
        image: registry.k8s.io/cloud-controller-manager:v1.8.0
        command:
        - /usr/local/bin/cloud-controller-manager
        - --cloud-provider=[YOUR_CLOUD_PROVIDER] # Add your own cloud provider here!
        - --leader-elect=true
        - --use-service-account-credentials
        # these flags will vary for every cloud provider
        - --allocate-node-cidrs=true
        - --configure-cloud-routes=true
        - --cluster-cidr=172.17.0.0/16
      tolerations:
      # this is required so CCM can bootstrap itself
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      # these tolerations are to have the daemonset runnable on control plane nodes
      # remove them if your control plane nodes should not run pods
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      # this is to restrict CCM to only run on master nodes
      # the node selector may vary depending on your cluster setup
      nodeSelector:
        node-role.kubernetes.io/master: ""
Limitations
Running the cloud controller manager comes with a few possible limitations. Although these
limitations are being addressed in upcoming releases, it's important to be aware of them for
production workloads.
Cloud controller manager does not implement any of the volume controllers found in kube-
controller-manager as the volume integrations also require coordination with kubelets. As we
evolve CSI (container storage interface) and add stronger support for flex volume plugins,
necessary support will be added to cloud controller manager so that clouds can fully integrate
with volumes. Learn more about out-of-tree CSI volume plugins here.
Scalability
The cloud-controller-manager queries your cloud provider's APIs to retrieve information for all
nodes. For very large clusters, consider possible bottlenecks such as resource requirements and
API rate limiting.
The goal of the cloud controller manager project is to decouple development of cloud features
from the core Kubernetes project. Unfortunately, many aspects of the Kubernetes project make
assumptions that cloud provider features are tightly integrated into the project. As a result,
adopting this new architecture can create several situations where a request is made for
information from a cloud provider, but the cloud controller manager may not be able to return
that information until the original request is complete.
A good example of this is the TLS bootstrapping feature in the Kubelet. TLS bootstrapping
assumes that the Kubelet has the ability to ask the cloud provider (or a local metadata service)
for all its address types (private, public, etc) but cloud controller manager cannot set a node's
address types without being initialized in the first place which requires that the kubelet has TLS
certificates to communicate with the apiserver.
As this initiative evolves, changes will be made to address these issues in upcoming releases.
What's next
To build and develop your own cloud controller manager, read Developing Cloud Controller
Manager.
You may be interested in using this capability if any of the below are true:
• API calls to a cloud provider service are required to retrieve authentication information
for a registry.
• Credentials have short expiration times and requesting new credentials frequently is
required.
• Storing registry credentials on disk or in imagePullSecrets is not acceptable.
This guide demonstrates how to configure the kubelet's image credential provider plugin
mechanism.
Your Kubernetes server must be at or later than version v1.26. To check the version, enter
kubectl version.
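The kubelet is pointed at the credential provider configuration file and at the directory holding the provider binaries via two flags that are referenced throughout this page; a sketch of the relevant kubelet arguments (the file paths are illustrative):
kubelet --image-credential-provider-config=/etc/kubernetes/credential-provider-config.yaml \
  --image-credential-provider-bin-dir=/usr/local/bin/credential-providers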
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
# providers is a list of credential provider helper plugins that will be enabled by the kubelet.
# Multiple providers may match against a single image, in which case credentials
# from all providers will be returned to the kubelet. If multiple providers are called
# for a single image, the results are combined. If providers return overlapping
# auth keys, the value from the provider earlier in this list is used.
providers:
  # name is the required name of the credential provider. It must match the name of the
  # provider executable as seen by the kubelet. The executable must be in the kubelet's
  # bin directory (set by the --image-credential-provider-bin-dir flag).
  - name: ecr-credential-provider
    # matchImages is a required list of strings used to match against images in order to
    # determine if this provider should be invoked. If one of the strings matches the
    # requested image from the kubelet, the plugin will be invoked and given a chance
    # to provide credentials. Images are expected to contain the registry domain
    # and URL path.
    #
    # Each entry in matchImages is a pattern which can optionally contain a port and a path.
    # Globs can be used in the domain, but not in the port or the path. Globs are supported
    # as subdomains like '*.k8s.io' or 'k8s.*.io', and top-level-domains such as 'k8s.*'.
    # Matching partial subdomains like 'app*.k8s.io' is also supported. Each glob can only match
    # a single subdomain segment, so `*.io` does **not** match `*.k8s.io`.
    #
    # A match exists between an image and a matchImage when all of the below are true:
    # - Both contain the same number of domain parts and each part matches.
    # - The URL path of a matchImages entry must be a prefix of the target image URL path.
    # - If the matchImages contains a port, then the port must match in the image as well.
    #
    # Example values of matchImages:
    #   - 123456789.dkr.ecr.us-east-1.amazonaws.com
    #   - *.azurecr.io
    #   - gcr.io
    #   - *.*.registry.io
    #   - registry.io:8080/path
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
      - "*.dkr.ecr.*.amazonaws.com.cn"
      - "*.dkr.ecr-fips.*.amazonaws.com"
      - "*.dkr.ecr.us-iso-east-1.c2s.ic.gov"
      - "*.dkr.ecr.us-isob-east-1.sc2s.sgov.gov"
    # defaultCacheDuration is the default duration the plugin will cache credentials in-memory
    # if a cache duration is not provided in the plugin response. This field is required.
    defaultCacheDuration: "12h"
    # Required input version of the exec CredentialProviderRequest. The returned CredentialProviderResponse
    # MUST use the same encoding version as the input. Current supported values are:
    # - credentialprovider.kubelet.k8s.io/v1
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    # Arguments to pass to the command when executing it.
    # +optional
    # args:
    #   - --example-argument
    # Env defines additional environment variables to expose to the process. These
    # are unioned with the host's environment, as well as variables client-go uses
    # to pass argument to the plugin.
    # +optional
    env:
      - name: AWS_PROFILE
        value: example_profile
The providers field is a list of enabled plugins used by the kubelet. Each entry has a few
required fields:
• name: the name of the plugin which MUST match the name of the executable binary that
exists in the directory passed into --image-credential-provider-bin-dir.
• matchImages: a list of strings used to match against images in order to determine if this
provider should be invoked. More on this below.
• defaultCacheDuration: the default duration the kubelet will cache credentials in-memory
if a cache duration was not specified by the plugin.
• apiVersion: the API version that the kubelet and the exec plugin will use when
communicating.
Each credential provider can also be given optional args and environment variables. Consult
the plugin implementors to determine what set of arguments and environment variables are
required for a given plugin.
The matchImages field for each credential provider is used by the kubelet to determine whether
a plugin should be invoked for a given image that a Pod is using. Each entry in matchImages is
an image pattern which can optionally contain a port and a path. Globs can be used in the
domain, but not in the port or the path. Globs are supported as subdomains like *.k8s.io or
k8s.*.io, and top-level domains such as k8s.*. Matching partial subdomains like app*.k8s.io is also
supported. Each glob can only match a single subdomain segment, so *.io does NOT match
*.k8s.io.
A match exists between an image name and a matchImages entry when all of the below are true:
• Both contain the same number of domain parts and each part matches.
• The URL path of the matchImages entry must be a prefix of the target image URL path.
• If the matchImages entry contains a port, then the port must match in the image as well.
Some example values of matchImages patterns are:
• 123456789.dkr.ecr.us-east-1.amazonaws.com
• *.azurecr.io
• gcr.io
• *.*.registry.io
• foo.registry.io:8080/path
What's next
• Read the details about CredentialProviderConfig in the kubelet configuration API (v1)
reference.
• Read the kubelet credential provider API reference (v1).
Configure Quotas for API Objects
This page shows how to configure quotas for API objects, including PersistentVolumeClaims
and Services. A quota restricts the number of objects, of a particular type, that can be created in
a namespace. You specify quotas in a ResourceQuota object.
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
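For example (the namespace name quota-object-example is illustrative and is reused in the commands later in this task):
kubectl create namespace quota-object-example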
Create a ResourceQuota
Here is the configuration file for a ResourceQuota object:
admin/resource/quota-objects.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-quota-demo
spec:
  hard:
    persistentvolumeclaims: "1"
    services.loadbalancers: "2"
    services.nodeports: "0"
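To create the ResourceQuota and then view its detailed state, commands along these lines can be used (the manifest URL and namespace follow the illustrative example file path and namespace name above):
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects.yaml --namespace=quota-object-example
kubectl get resourcequota object-quota-demo --namespace=quota-object-example --output=yaml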
The output shows the hard limits together with the current usage, which is initially zero:
status:
  hard:
    persistentvolumeclaims: "1"
    services.loadbalancers: "2"
    services.nodeports: "0"
  used:
    persistentvolumeclaims: "0"
    services.loadbalancers: "0"
    services.nodeports: "0"
Create a PersistentVolumeClaim
Here is the configuration file for a PersistentVolumeClaim object:
admin/resource/quota-objects-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-quota-demo
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
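Create the PersistentVolumeClaim and check its status; the commands likely look like this (again using the illustrative example path and namespace):
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects-pvc.yaml --namespace=quota-object-example
kubectl get persistentvolumeclaims --namespace=quota-object-example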
The output shows that the PersistentVolumeClaim exists and has status Pending:
NAME STATUS
pvc-quota-demo Pending
Here is the configuration file for a second PersistentVolumeClaim:
admin/resource/quota-objects-pvc-2.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-quota-demo-2
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
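Attempt to create the second claim with a command along these lines (illustrative path and namespace):
kubectl apply -f https://k8s.io/examples/admin/resource/quota-objects-pvc-2.yaml --namespace=quota-object-example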
The output shows that the second PersistentVolumeClaim was not created, because it would
have exceeded the quota for the namespace.
Notes
These are the strings used to identify API resources that can be constrained by quotas:
Clean up
Delete your namespace:
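Assuming the illustrative namespace name used earlier in this task:
kubectl delete namespace quota-object-example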
What's next
For cluster administrators
Kubernetes keeps many aspects of how pods execute on nodes abstracted from the user. This is
by design. However, some workloads require stronger guarantees in terms of latency and/or
performance in order to operate acceptably. The kubelet provides methods to enable more
complex workload placement policies while keeping the abstraction free from explicit
placement directives.
For detailed information on resource management, please refer to the Resource Management
for Pods and Containers documentation.
Your Kubernetes server must be at or later than version v1.26. To check the version, enter
kubectl version.
If you are running an older version of Kubernetes, please look at the documentation for the
version you are actually running.
CPU Management Policies
By default, the kubelet uses CFS quota to enforce pod CPU limits. When the node runs many
CPU-bound pods, the workload can move to different CPU cores depending on whether the pod
is throttled and which CPU cores are available at scheduling time. Many workloads are not
sensitive to this migration and thus work fine without any intervention.
However, in workloads where CPU cache affinity and scheduling latency significantly affect
workload performance, the kubelet allows alternative CPU management policies to determine
some placement preferences on the node.
Configuration
The CPU Manager policy is set with the --cpu-manager-policy kubelet flag or the
cpuManagerPolicy field in KubeletConfiguration. There are two supported policies: none (the
default) and static.
The CPU manager periodically writes resource updates through the CRI in order to reconcile
in-memory CPU assignments with cgroupfs. The reconcile frequency is set through a new
Kubelet configuration value --cpu-manager-reconcile-period. If not specified, it defaults to the
same duration as --node-status-update-frequency.
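A minimal KubeletConfiguration sketch that selects the static policy and sets an explicit reconcile period (the 10s value is only an example; when using static, the kubelet also requires a non-zero CPU reservation, as noted later in this section):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s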
The behavior of the static policy can be fine-tuned using the --cpu-manager-policy-options flag.
The flag takes a comma-separated list of key=value policy options. If you disable the
CPUManagerPolicyOptions feature gate then you cannot fine-tune CPU manager policies. In
that case, the CPU manager operates only using its default settings.
In addition to the top-level CPUManagerPolicyOptions feature gate, the policy options are split
into two groups: alpha quality (hidden by default) and beta quality (visible by default). The
groups are guarded respectively by the CPUManagerPolicyAlphaOptions and
CPUManagerPolicyBetaOptions feature gates. Diverging from the Kubernetes standard, these
feature gates guard groups of options, because it would have been too cumbersome to add a
feature gate for each individual option.
Since the CPU manager policy can only be applied when the kubelet spawns new pods, simply
changing from "none" to "static" won't apply to existing pods. So in order to properly change
the CPU manager policy on a node, perform the following steps:
1. Drain the node.
2. Stop the kubelet.
3. Remove the old CPU manager state file (by default /var/lib/kubelet/cpu_manager_state).
4. Edit the kubelet configuration to change the CPU manager policy to the desired value.
5. Start the kubelet.
Repeat this process for every node that needs its CPU manager policy changed. Skipping this
process will result in the kubelet crashlooping with the following error:
could not restore state from checkpoint: configured policy "static" differs from state checkpoint
policy "none", please drain this node and delete the CPU manager checkpoint file
"/var/lib/kubelet/cpu_manager_state" before restarting Kubelet
None policy
The none policy explicitly enables the existing default CPU affinity scheme, providing no
affinity beyond what the OS scheduler does automatically. Limits on CPU usage for
Guaranteed pods and Burstable pods are enforced using CFS quota.
Static policy
The static policy allows containers in Guaranteed pods with integer CPU requests access to
exclusive CPUs on the node. This exclusivity is enforced using the cpuset cgroup controller.
Note: System services such as the container runtime and the kubelet itself can continue to run
on these exclusive CPUs. The exclusivity only extends to other pods.
Note: CPU Manager doesn't support offlining and onlining of CPUs at runtime. Also, if the set
of online CPUs changes on the node, the node must be drained and CPU manager manually
reset by deleting the state file cpu_manager_state in the kubelet root directory.
This policy manages a shared pool of CPUs that initially contains all CPUs in the node. The
amount of exclusively allocatable CPUs is equal to the total number of CPUs in the node minus
any CPU reservations by the kubelet --kube-reserved or --system-reserved options. From 1.17,
the CPU reservation list can be specified explicitly by kubelet --reserved-cpus option. The
explicit CPU list specified by --reserved-cpus takes precedence over the CPU reservation
specified by --kube-reserved and --system-reserved. CPUs reserved by these options are taken,
in integer quantity, from the initial shared pool in ascending order by physical core ID. This
shared pool is the set of CPUs on which any containers in BestEffort and Burstable pods run.
Containers in Guaranteed pods with fractional CPU requests also run on CPUs in the shared
pool. Only containers that are both part of a Guaranteed pod and have integer CPU requests are
assigned exclusive CPUs.
Note: The kubelet requires a CPU reservation greater than zero be made using either --kube-
reserved and/or --system-reserved or --reserved-cpus when the static policy is enabled. This is
because zero CPU reservation would allow the shared pool to become empty.
As Guaranteed pods whose containers fit the requirements for being statically assigned are
scheduled to the node, CPUs are removed from the shared pool and placed in the cpuset for the
container. CFS quota is not used to bound the CPU usage of these containers as their usage is
bound by the scheduling domain itself. In other words, the number of CPUs in the container
cpuset is equal to the integer CPU limit specified in the pod spec. This static assignment
increases CPU affinity and decreases context switches due to throttling for the CPU-bound
workload.
spec:
  containers:
  - name: nginx
    image: nginx
This pod runs in the BestEffort QoS class because no resource requests or limits are specified. It
runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
This pod runs in the Burstable QoS class because resource requests do not equal limits and the
cpu quantity is not specified. It runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
This pod runs in the Burstable QoS class because resource requests do not equal limits. It runs
in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
This pod runs in the Guaranteed QoS class because requests are equal to limits. And the
container's resource limit for the CPU resource is an integer greater than or equal to one. The
nginx container is granted 2 exclusive CPUs.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"
This pod runs in the Guaranteed QoS class because requests are equal to limits. But the
container's resource limit for the CPU resource is a fraction. It runs in the shared pool.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
This pod runs in the Guaranteed QoS class because only limits are specified and requests are set
equal to limits when not explicitly specified. And the container's resource limit for the CPU
resource is an integer greater than or equal to one. The nginx container is granted 2 exclusive
CPUs.
You can toggle groups of options on and off based upon their maturity level using the following
feature gates:
The following policy options exist for the static CPUManager policy:
If the full-pcpus-only policy option is specified, the static policy will always allocate full
physical cores. By default, without this option, the static policy allocates CPUs using a
topology-aware best-fit allocation. On SMT enabled systems, the policy can allocate individual
virtual cores, which correspond to hardware threads. This can lead to different containers
sharing the same physical cores; this behaviour in turn contributes to the noisy neighbours
problem. With the option enabled, the pod will be admitted by the kubelet only if the CPU
request of all its containers can be fulfilled by allocating full physical cores. If the pod does not
pass the admission, it will be put in Failed state with the message SMTAlignmentError.
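The policy options are passed to the kubelet as a comma-separated list of key=value pairs; for example, enabling this option might look like:
--cpu-manager-policy=static --cpu-manager-policy-options=full-pcpus-only=true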
If the align-by-socket policy option is specified, CPUs will be considered aligned at the socket
boundary when deciding how to allocate CPUs to a container. By default, the CPUManager
aligns CPU allocations at the NUMA boundary, which could result in performance degradation
if CPUs need to be pulled from more than one NUMA node to satisfy the allocation. Although it
tries to ensure that all CPUs are allocated from the minimum number of NUMA nodes, there is
no guarantee that those NUMA nodes will be on the same socket. By directing the
CPUManager to explicitly align CPUs at the socket boundary rather than the NUMA boundary,
we are able to avoid such issues. Note, this policy option is not compatible with the
TopologyManager single-numa-node policy and does not apply to hardware where the number
of sockets is greater than the number of NUMA nodes.
In order to extract the best performance, optimizations related to CPU isolation, memory and
device locality are required. However, in Kubernetes, these optimizations are handled by a
disjoint set of components.
Topology Manager is a Kubelet component that aims to coordinate the set of components that
are responsible for these optimizations.
Your Kubernetes server must be at or later than version v1.18. To check the version, enter
kubectl version.
The Topology Manager is a Kubelet component, which acts as a source of truth so that other
Kubelet components can make topology aligned resource allocation choices.
The Topology Manager provides an interface for components, called Hint Providers, to send and
receive topology information. Topology Manager has a set of node level policies which are
explained below.
The Topology Manager receives topology information from the Hint Providers as a bitmask
denoting the NUMA Nodes available and a preferred allocation indication. The Topology
Manager policies perform a set of operations on the hints provided and converge on the hint
determined by the policy to give the optimal result; if an undesirable hint is stored, the
preferred field for the hint is set to false. In the current policies, preferred is the narrowest
preferred mask.
The selected hint is stored as part of the Topology Manager. Depending on the policy
configured the pod can be accepted or rejected from the node based on the selected hint. The
hint is then stored in the Topology Manager for use by the Hint Providers when making the
resource allocation decisions.
If these conditions are met, the Topology Manager will align the requested resources.
In order to customise how this alignment is carried out, the Topology Manager provides two
distinct knobs: scope and policy.
The scope defines the granularity at which you would like resource alignment to be performed
(e.g. at the pod or container level). And the policy defines the actual strategy used to carry out
the alignment (e.g. best-effort, restricted, single-numa-node, etc.). Details on the various scopes
and policies available today can be found below.
Note: To align CPU resources with other requested resources in a Pod spec, the CPU Manager
should be enabled and a proper CPU Manager policy should be configured on a Node. See
CPU Management Policies.
Note: To align memory (and hugepages) resources with other requested resources in a Pod
Spec, the Memory Manager should be enabled and proper Memory Manager policy should be
configured on a Node. Examine Memory Manager documentation.
Topology Manager Scopes
The Topology Manager can deal with the alignment of resources in a couple of distinct scopes:
• container (default)
• pod
Either option can be selected at kubelet startup, with the --topology-manager-scope
flag.
container scope
Within this scope, the Topology Manager performs a number of sequential resource alignments,
i.e., for each container (in a pod) a separate alignment is computed. In other words, there is no
notion of grouping the containers to a specific set of NUMA nodes, for this particular scope. In
effect, the Topology Manager performs an arbitrary alignment of individual containers to
NUMA nodes.
The notion of grouping containers to a common set of NUMA nodes is, by design, provided by
the next scope: the pod scope.
pod scope
To select the pod scope, start the kubelet with the command line option --topology-manager-
scope=pod.
This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is,
the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all
containers) to either a single NUMA node or a common set of NUMA nodes. The following
examples illustrate the alignments produced by the Topology Manager on different occasions:
The total amount of a particular resource demanded for the entire pod is calculated according
to the effective requests/limits formula, and thus, for each resource, this total value is equal to
the maximum of:
• the sum of all app container requests, and
• the maximum of the init container requests.
Using the pod scope in tandem with single-numa-node Topology Manager policy is specifically
valuable for workloads that are latency sensitive or for high-throughput applications that
perform IPC. By combining both options, you are able to place all containers in a pod onto a
single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for
that pod.
In the case of single-numa-node policy, a pod is accepted only if a suitable set of NUMA nodes
is present among possible allocations. Reconsider the example above:
• a set containing only a single NUMA node - it leads to pod being admitted,
• whereas a set containing more NUMA nodes - it results in pod rejection (because instead
of one NUMA node, two or more NUMA nodes are required to satisfy the allocation).
To recap, Topology Manager first computes a set of NUMA nodes and then tests it against
Topology Manager policy, which either leads to the rejection or admission of the pod.
Topology Manager supports four allocation policies. You can set a policy via the kubelet flag
--topology-manager-policy. The four supported policies are:
• none (default)
• best-effort
• restricted
• single-numa-node
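For illustration, the scope and policy can also be set through KubeletConfiguration fields rather than flags; a minimal sketch (the values shown are examples only):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod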
Note: If the Topology Manager is configured with the pod scope, the container considered by
the policy reflects the requirements of the entire pod, and thus each container from the pod
will receive the same topology alignment decision.
none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a Pod, the kubelet, with best-effort topology management policy, calls
each Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, Topology Manager will store this and admit the pod to the node anyway.
The Hint Providers can then use this information when making the resource allocation decision.
restricted policy
For each container in a Pod, the kubelet, with restricted topology management policy, calls each
Hint Provider to discover their resource availability. Using this information, the Topology
Manager stores the preferred NUMA Node affinity for that container. If the affinity is not
preferred, Topology Manager will reject this pod from the node. This will result in a pod in a
Terminated state with a pod admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule
the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod.
An external control loop could be also implemented to trigger a redeployment of pods that have
the Topology Affinity error.
If the pod is admitted, the Hint Providers can then use this information when making the
resource allocation decision.
single-numa-node policy
For each container in a Pod, the kubelet, with single-numa-node topology management policy,
calls each Hint Provider to discover their resource availability. Using this information, the
Topology Manager determines if a single NUMA Node affinity is possible. If it is, Topology
Manager will store this and the Hint Providers can then use this information when making the
resource allocation decision. If, however, this is not possible then the Topology Manager will
reject the pod from the node. This will result in a pod in a Terminated state with a pod
admission failure.
Once the pod is in a Terminated state, the Kubernetes scheduler will not attempt to reschedule
the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the
Pod. An external control loop could also be implemented to trigger a redeployment of pods that
have the Topology Affinity error.
You can toggle groups of options on and off based upon their maturity level using the following
feature gates:
You will still have to enable each option using the TopologyManagerPolicyOptions kubelet
option.
If the prefer-closest-numa-nodes policy option is specified, the best-effort and restricted policies
will favor sets of NUMA nodes with shorter distance between them when making admission
decisions. You can enable this option by adding prefer-closest-numa-nodes=true to the
Topology Manager policy options. By default, without this option, Topology Manager aligns
resources on either a single NUMA node or the minimum number of NUMA nodes (in cases
where more than one NUMA node is required). However, the TopologyManager is not aware of
NUMA distances and does not take them into account when making admission decisions. This
limitation surfaces in multi-socket, as well as single-socket multi NUMA systems, and can cause
significant performance degradation in latency-critical execution and high-throughput
applications if the Topology Manager decides to align resources on non-adjacent NUMA nodes.
spec:
  containers:
  - name: nginx
    image: nginx
This pod runs in the BestEffort QoS class because no resource requests or limits are specified.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
This pod runs in the Burstable QoS class because requests are less than limits.
If the selected policy is anything other than none, the Topology Manager would consider these
Pod specifications. The Topology Manager would consult the Hint Providers to get topology
hints. In the case of the static policy, the CPU Manager would return the default topology hint,
because these Pods do not explicitly request CPU resources.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
This pod with integer CPU request runs in the Guaranteed QoS class because requests are equal
to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
This pod with sharing CPU request runs in the Guaranteed QoS class because requests are
equal to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
      requests:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
This pod runs in the BestEffort QoS class because there are no CPU and memory requests.
The Topology Manager would consider the above pods. The Topology Manager would consult
the Hint Providers, which are CPU and Device Manager to get topology hints for the pods.
In the case of the Guaranteed pod with integer CPU request, the static CPU Manager policy
would return topology hints relating to the exclusive CPU and the Device Manager would send
back hints for the requested device.
In the case of the Guaranteed pod with sharing CPU request, the static CPU Manager policy
would return default topology hint as there is no exclusive CPU request and the Device
Manager would send back hints for the requested device.
In the above two cases of the Guaranteed pod, the none CPU Manager policy would return
default topology hint.
In the case of the BestEffort pod, the static CPU Manager policy would send back the default
topology hint as there is no CPU request and the Device Manager would send back the hints for
each of the requested devices.
Using this information the Topology Manager calculates the optimal hint for the pod and stores
this information, which will be used by the Hint Providers when they are making their resource
assignments.
Known Limitations
1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more
than 8 NUMA nodes there will be a state explosion when trying to enumerate the
possible NUMA affinities and generating their hints.
Your Kubernetes server must be at or later than version v1.12. To check the version, enter
kubectl version.
Introduction
DNS is a built-in Kubernetes service launched automatically using the addon manager cluster
add-on.
DNS names also need domains. You configure the local domain in the kubelet with the flag --
cluster-domain=<default-local-domain>.
The DNS server supports forward lookups (A and AAAA records), port lookups (SRV records),
reverse IP address lookups (PTR records), and more. For more information, see DNS for Services
and Pods.
If a Pod's dnsPolicy is set to default, it inherits the name resolution configuration from the node
that the Pod runs on. The Pod's DNS resolution should behave the same as the node. But see
Known issues.
If you don't want this, or if you want a different DNS config for pods, you can use the kubelet's
--resolv-conf flag. Set this flag to "" to prevent Pods from inheriting DNS. Set it to a valid file
path to specify a file other than /etc/resolv.conf for DNS inheritance.
CoreDNS
CoreDNS is a general-purpose authoritative DNS server that can serve as cluster DNS,
complying with the DNS specifications.
CoreDNS ConfigMap options
CoreDNS is a DNS server that is modular and pluggable, with plugins adding new
functionalities. The CoreDNS server can be configured by maintaining a Corefile, which is the
CoreDNS configuration file. As a cluster administrator, you can modify the ConfigMap for the
CoreDNS Corefile to change how DNS service discovery behaves for that cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
You can modify the default CoreDNS behavior by modifying the ConfigMap.
CoreDNS has the ability to configure stub-domains and upstream nameservers using the
forward plugin.
Example
Suppose a cluster operator has a Consul domain server located at "10.150.0.1", and all Consul
names have the suffix ".consul.local". To configure this in CoreDNS, the cluster administrator
creates the following stanza in the CoreDNS ConfigMap.
consul.local:53 {
    errors
    cache 30
    forward . 10.150.0.1
}
To explicitly force all non-cluster DNS lookups to go through a specific nameserver at
172.16.0.1, point the forward directive to the nameserver instead of /etc/resolv.conf:
forward . 172.16.0.1
The final ConfigMap along with the default Corefile configuration looks like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 172.16.0.1
        cache 30
        loop
        reload
        loadbalance
    }
    consul.local:53 {
        errors
        cache 30
        forward . 10.150.0.1
    }
Note: CoreDNS does not support FQDNs for stub-domains and nameservers (eg: "ns.foo.com").
During translation, all FQDN nameservers will be omitted from the CoreDNS config.
What's next
• Read Debugging DNS Resolution
Your cluster must be configured to use the CoreDNS addon or its precursor, kube-dns.
Your Kubernetes server must be at or later than version v1.6. To check the version, enter
kubectl version.
admin/dns/dnsutils.yaml
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3
    command:
      - sleep
      - "infinity"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
Note: This example creates a pod in the default namespace. DNS name resolution for services
depends on the namespace of the pod. For more information, review DNS for Services and Pods.
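Create that Pod; using the example manifest path shown above, the command is likely:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml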
pod/dnsutils created
Once that Pod is running, you can exec nslookup in that environment. If you see something like
the following, DNS is working correctly.
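For example, querying the kubernetes Service in the default namespace:
kubectl exec -i -t dnsutils -- nslookup kubernetes.default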
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
Take a look inside the resolv.conf file. (See Customizing DNS Service and Known issues below
for more information)
Verify that the search path and name server are set up like the following (note that search path
may vary for different cloud providers):
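You can read the file from the same Pod; in a typical cluster the output might look similar to the sketch below, where the nameserver should match the cluster DNS Service IP (10.0.0.10 in the examples in this section):
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5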
Errors such as the following indicate a problem with the CoreDNS (or kube-dns) add-on or with
associated Services:
or
Server: 10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Use the kubectl get pods command to verify that the DNS pod is running.
Note: The value for label k8s-app is kube-dns for both CoreDNS and kube-dns deployments.
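For example, using the label selector from the note above:
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns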
If you see that no CoreDNS Pod is running or that the Pod has failed/completed, the DNS add-
on may not be deployed by default in your current environment and you will have to deploy it
manually.
Use the kubectl logs command to see logs for the DNS containers.
For CoreDNS:
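A command along these lines retrieves the logs using the same label selector:
kubectl logs --namespace=kube-system -l k8s-app=kube-dns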
.:53
2018/08/15 14:37:17 [INFO] CoreDNS-1.2.2
2018/08/15 14:37:17 [INFO] linux/amd64, go1.10.3, 2e322f6
CoreDNS-1.2.2
linux/amd64, go1.10.3, 2e322f6
2018/08/15 14:37:17 [INFO] plugin/reload: Running configuration MD5 =
24e6c59e83ce706f07bcc82c31b1ea1c
Verify that the DNS service is up by using the kubectl get service command.
Note: The service name is kube-dns for both CoreDNS and kube-dns deployments.
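For example:
kubectl get svc --namespace=kube-system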
If you have created the Service, or if it should have been created by default but does not
appear, see debugging Services for more information.
You can verify that DNS endpoints are exposed by using the kubectl get endpoints command.
If you do not see the endpoints, see the endpoints section in the debugging Services
documentation.
For additional Kubernetes DNS examples, see the cluster-dns examples in the Kubernetes
GitHub repository.
You can verify if queries are being received by CoreDNS by adding the log plugin to the
CoreDNS configuration (aka Corefile). The CoreDNS Corefile is held in a ConfigMap named
coredns. To edit it, use the command:
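In a standard deployment this is likely:
kubectl -n kube-system edit configmap coredns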
Then add log in the Corefile section per the example below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        log
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            upstream
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
After saving the changes, it may take up to a minute or two for Kubernetes to propagate these
changes to the CoreDNS pods.
Next, make some queries and view the logs per the sections above in this document. If
CoreDNS pods are receiving the queries, you should see them in the logs.
.:53
2018/08/15 14:37:15 [INFO] CoreDNS-1.2.0
2018/08/15 14:37:15 [INFO] linux/amd64, go1.10.3, 2e322f6
CoreDNS-1.2.0
linux/amd64, go1.10.3, 2e322f6
2018/09/07 15:29:04 [INFO] plugin/reload: Running configuration MD5 =
162475cdf272d8aa601e6fe67a6ad42f
2018/09/07 15:29:04 [INFO] Reloading complete
172.17.0.18:41675 - [07/Sep/2018:15:29:11 +0000] 59925 "A IN kubernetes.default.svc.cluster.local.
udp 54 false 512" NOERROR qr,aa,rd,ra 106 0.000066649s
CoreDNS must be able to list service and endpoint related resources to properly resolve service
names.
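You can inspect the permissions granted to CoreDNS by describing its ClusterRole, which is commonly named system:coredns (the name may differ in your installation):
kubectl describe clusterrole system:coredns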
Expected output:
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [list watch]
namespaces [] [] [list watch]
pods [] [] [list watch]
services [] [] [list watch]
endpointslices.discovery.k8s.io [] [] [list watch]
If any permissions are missing, edit the ClusterRole to add them:
...
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - list
  - watch
...
DNS queries that don't specify a namespace are limited to the pod's namespace.
If the namespace of the pod and service differ, the DNS query must include the namespace of
the service.
To learn more about name resolution, see DNS for Services and Pods.
Known issues
Some Linux distributions (e.g. Ubuntu) use a local DNS resolver by default (systemd-resolved).
Systemd-resolved moves and replaces /etc/resolv.conf with a stub file that can cause a fatal
forwarding loop when resolving names in upstream servers. This can be fixed manually by
using kubelet's --resolv-conf flag to point to the correct resolv.conf (With systemd-resolved, this
is /run/systemd/resolve/resolv.conf). kubeadm automatically detects systemd-resolved, and
adjusts the kubelet flags accordingly.
Kubernetes installs do not configure the nodes' resolv.conf files to use the cluster DNS by
default, because that process is inherently distribution-specific. This should probably be
implemented eventually.
Linux's libc (a.k.a. glibc) has a limit of 3 DNS nameserver records by default, and Kubernetes
needs to consume 1 nameserver record. This means that if a local installation already uses 3
nameservers, some of those entries will be lost. To work around this limit, the node can run
dnsmasq, which will provide more nameserver entries. You can also use the kubelet's
--resolv-conf flag.
If you are using Alpine version 3.17 or earlier as your base image, DNS may not work properly
due to a design issue with Alpine: before musl version 1.2.4, musl did not include TCP fallback
to the DNS stub resolver, meaning any DNS response larger than 512 bytes would fail. Please
upgrade your images to Alpine version 3.18 or above.
What's next
• See Autoscaling the DNS Service in a Cluster.
• Read DNS for Services and Pods
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
Your Kubernetes server must be at or later than version v1.8. To check the version, enter
kubectl version.
Make sure you've configured a network provider with network policy support. There are a
number of network providers that support NetworkPolicy, including:
• Antrea
• Calico
• Cilium
• Kube-router
• Romana
• Weave Net
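Create an nginx Deployment; the command that produces the output below is likely:
kubectl create deployment nginx --image=nginx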
deployment.apps/nginx created
Expose the Deployment through a Service called nginx.
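For example:
kubectl expose deployment nginx --port=80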
service/nginx exposed
The above commands create a Deployment with an nginx Pod and expose the Deployment
through a Service named nginx. The nginx Pod and Deployment are found in the default
namespace.
service/networking/nginx-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: access-nginx
spec:
  podSelector:
    matchLabels:
      app: nginx
  ingress:
  - from:
    - podSelector:
        matchLabels:
          access: "true"
The name of a NetworkPolicy object must be a valid DNS subdomain name.
Note: NetworkPolicy includes a podSelector which selects the grouping of Pods to which the
policy applies. You can see this policy selects Pods with the label app=nginx. The label was
automatically added to the Pod in the nginx Deployment. An empty podSelector selects all pods
in the namespace.
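Apply the policy to produce the output below; with the example manifest path shown above, the command is likely:
kubectl apply -f https://k8s.io/examples/service/networking/nginx-policy.yaml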
networkpolicy.networking.k8s.io/access-nginx created
Background
Since cloud providers develop and release at a different pace compared to the Kubernetes
project, abstracting the provider-specific code to the cloud-controller-manager binary allows
cloud vendors to evolve independently from the core Kubernetes code.
Developing
Out of tree
Many cloud providers publish their controller manager code as open source. If you are creating
a new cloud-controller-manager from scratch, you could take an existing out-of-tree cloud
controller manager as your starting point.
In tree
For in-tree cloud providers, you can run the in-tree cloud controller manager as a DaemonSet in
your cluster. See Cloud Controller Manager Administration for more details.
For example, to turn off all API versions except v1, pass --runtime-config=api/all=false,api/
v1=true to the kube-apiserver.
What's next
Read the full documentation for the kube-apiserver component.
This page shows how to enable and configure encryption of API data at rest.
Note:
This task covers encryption for resource data stored using the Kubernetes API. For example,
you can encrypt Secret objects, including the key-value data they contain.
If you want to encrypt data in filesystems that are mounted into containers, you instead need to
either:
• This task assumes that you are running the Kubernetes API server as a static pod on each
control plane node.
• Your cluster's control plane must use etcd v3.x (major version 3, any minor version).
• To encrypt a custom resource, your cluster must be running Kubernetes v1.26 or newer.
• To use a wildcard to match resources, your cluster must be running Kubernetes v1.27 or
newer.
Caution: IMPORTANT: For high-availability configurations (with two or more control plane
nodes), the encryption configuration file must be the same! Otherwise, the kube-apiserver
component cannot decrypt data stored in etcd.
Each resources array item is a separate config and contains a complete configuration. The
resources.resources field is an array of Kubernetes resource names (resource or resource.group)
that should be encrypted, such as Secrets, ConfigMaps, or other resources.
If custom resources are added to EncryptionConfiguration and the cluster version is 1.26 or
newer, any newly created custom resources mentioned in the EncryptionConfiguration will be
encrypted. Any custom resources that existed in etcd prior to that version and configuration
will be unencrypted until they are next written to storage. This is the same behavior as built-in
resources. See the Ensure all secrets are encrypted section.
The providers array is an ordered list of the possible encryption providers to use for the APIs
that you listed. Each provider supports multiple keys - the keys are tried in order for
decryption, and if the provider is the first provider, the first key is used for encryption.
Only one provider type may be specified per entry (identity or aescbc may be provided, but not
both in the same item). The first provider in the list is used to encrypt resources written into the
storage. When reading resources from storage, each provider that matches the stored data
attempts in order to decrypt the data. If no provider can read the stored data due to a mismatch
in format or secret key, an error is returned which prevents clients from accessing that
resource.
EncryptionConfiguration supports the use of wildcards to specify the resources that should be
encrypted. Use '*.<group>' to encrypt all resources within a group (for example '*.apps' in the
example above) or '*.*' to encrypt all resources. '*.' can be used to encrypt all resources in the
core group. '*.*' will encrypt all resources, even custom resources that are added after API
server start.
Note: Use of wildcards that overlap within the same resource list or across multiple entries is
not allowed, since part of the configuration would be ineffective. The resources list's processing
order and precedence are determined by the order it's listed in the configuration.
Opting out of encryption for specific resources while wildcard is enabled can be achieved by
adding a new resources array item with the resource name, followed by the providers array
item with the identity provider. For example, if '*.*' is enabled and you want to opt-out
encryption for the events resource, add a new item to the resources array with events as the
resource name, followed by the providers array item with identity. The new item should look
like this:
  - resources:
      - events
    providers:
      - identity: {}
Ensure that the new item is listed before the wildcard '*.*' item in the resources array to give it
precedence.
For more detailed information about the EncryptionConfiguration struct, please refer to the
encryption configuration API.
Caution: If any resource is not readable via the encryption config (because keys were
changed), the only recourse is to delete that key from the underlying etcd directly. Calls that
attempt to read that resource will fail until it is deleted or a valid decryption key is provided.
Available providers
Before you configure encryption-at-rest for data in your cluster's Kubernetes API, you need to
select which provider(s) you will use.
The identity provider is the default if you do not specify otherwise. The identity provider
does not encrypt stored data and provides no additional confidentiality protection.
Key storage
Encrypting secret data with a locally managed key protects against an etcd compromise, but it
fails to protect against a host compromise. Since the encryption keys are stored on the host in
the EncryptionConfiguration YAML file, a skilled attacker can access that file and extract the
encryption keys.
The KMS provider uses envelope encryption: Kubernetes encrypts resources using a data key,
and then encrypts that data key using the managed encryption service. Kubernetes generates a
unique data key for each resource. The API server stores an encrypted version of the data key in
etcd alongside the ciphertext; when reading the resource, the API server calls the managed
encryption service and provides both the ciphertext and the (encrypted) data key. Within the
managed encryption service, the provider uses a key encryption key to decipher the data key,
and then recovers the plain text. Communication between the control plane and the KMS
requires in-transit protection, such as TLS.
Using envelope encryption creates dependence on the key encryption key, which is not stored
in Kubernetes. In the KMS case, an attacker who intends to get unauthorised access to the
plaintext values would need to compromise etcd and the third-party KMS provider.
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - aescbc:
          keys:
            - name: key1
              # See the following text for more details about the secret value
              secret: <BASE 64 ENCODED SECRET>
      - identity: {} # this fallback allows reading unencrypted secrets;
                     # for example, during initial migration
1. Generate a 32-byte random key and base64 encode it. If you're on Linux or macOS, run
the following command:
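A typical way to do this is:
head -c 32 /dev/urandom | base64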
You will need to mount the new encryption config file into the kube-apiserver static Pod.
Here is an example of how to do that:
---
#
# This is a fragment of a manifest for a static Pod.
# Check whether this is correct for your cluster and for your API server.
#
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 10.20.30.40:443
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    ...
    - --encryption-provider-config=/etc/kubernetes/enc/enc.yaml  # add this line
    volumeMounts:
    ...
    - name: enc                           # add this line
      mountPath: /etc/kubernetes/enc      # add this line
      readOnly: true                      # add this line
    ...
  volumes:
  ...
  - name: enc                             # add this line
    hostPath:                             # add this line
      path: /etc/kubernetes/enc           # add this line
      type: DirectoryOrCreate             # add this line
  ...
Caution: Your config file contains keys that can decrypt the contents in etcd, so you must
properly restrict permissions on your control-plane nodes so only the user who runs the kube-
apiserver can read it.
If you have multiple API servers in your cluster, you should deploy the changes in turn to each
API server.
Make sure that you use the same encryption configuration on each control plane host.
Data is encrypted when written to etcd. After restarting your kube-apiserver, any newly
created or updated Secret (or other resource kinds configured in EncryptionConfiguration)
should be encrypted when stored.
To check this, you can use the etcdctl command line program to retrieve the contents of your
secret data.
This example shows how to check this for encrypting the Secret API.
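First, create a Secret to inspect; a command consistent with the example key and value shown later in this section is:
kubectl create secret generic secret1 -n default --from-literal=mykey=mydata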
2. Using the etcdctl command line tool, read that Secret out of etcd:
where [...] must be the additional arguments for connecting to the etcd server.
For example:
ETCDCTL_API=3 etcdctl \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get /registry/secrets/default/secret1 | hexdump -C
00000000 2f 72 65 67 69 73 74 72 79 2f 73 65 63 72 65 74 |/registry/secret|
00000010 73 2f 64 65 66 61 75 6c 74 2f 73 65 63 72 65 74 |s/default/secret|
00000020 31 0a 6b 38 73 3a 65 6e 63 3a 61 65 73 63 62 63 |1.k8s:enc:aescbc|
00000030 3a 76 31 3a 6b 65 79 31 3a c7 6c e7 d3 09 bc 06 |:v1:key1:.l.....|
00000040 25 51 91 e4 e0 6c e5 b1 4d 7a 8b 3d b9 c2 7c 6e |%Q...l..Mz.=..|n|
00000050 b4 79 df 05 28 ae 0d 8e 5f 35 13 2c c0 18 99 3e |.y..(..._5.,...>|
[...]
00000110 23 3a 0d fc 28 ca 48 2d 6b 2d 46 cc 72 0b 70 4c |#:..(.H-k-F.r.pL|
00000120 a5 fc 35 43 12 4e 60 ef bf 6f fe cf df 0b ad 1f |..5C.N`..o......|
00000130 82 c4 88 53 02 da 3e 66 ff 0a |...S..>f..|
0000013a
3. Verify the stored Secret is prefixed with k8s:enc:aescbc:v1: which indicates the aescbc
provider has encrypted the resulting data. Confirm that the key name shown in etcd
matches the key name specified in the EncryptionConfiguration mentioned above. In this
example, you can see that the encryption key named key1 is used in etcd and in
EncryptionConfiguration.
4. Verify the Secret is correctly decrypted when retrieved via the API:
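For example, using the Secret name and namespace from the steps above:
kubectl get secret secret1 -n default -o yaml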
The output should contain mykey: bXlkYXRh, with contents of mydata encoded using
base64; read decoding a Secret to learn how to completely decode the Secret.
It's often not enough to make sure that new objects get encrypted: you also want that
encryption to apply to the objects that are already stored.
For this example, you have configured your cluster so that Secrets are encrypted on write.
Performing a replace operation on each Secret encrypts that content at rest, even though the
objects themselves are otherwise unchanged.
You can make this change across all Secrets in your cluster:
# Run this as an administrator that can read and write all Secrets
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
The command above reads all Secrets and then updates them with the same data, in order to
apply server side encryption.
Note:
If an error occurs due to a conflicting write, retry the command. It is safe to run that command
more than once.
For larger clusters, you may wish to subdivide the Secrets by namespace, or script an update.
1. Generate a new key and add it as the second key entry for the current provider on all
servers (see the configuration sketch after this list)
2. Restart all kube-apiserver processes to ensure each server can decrypt using the new key
3. Make the new key the first entry in the keys array so that it is used for encryption in the
config
4. Restart all kube-apiserver processes to ensure each server now encrypts using the new
key
5. Run kubectl get secrets --all-namespaces -o json | kubectl replace -f - to encrypt all
existing Secrets with the new key
6. Remove the old decryption key from the config after you have backed up etcd with the
new key in use and updated all Secrets
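As a sketch of what steps 1 through 3 produce, an EncryptionConfiguration during rotation might look like the following; the key names, placeholder secrets, and the trailing identity fallback are illustrative assumptions, not part of the original text:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key2   # new key, promoted to the first entry in step 3
              secret: <BASE64-ENCODED 32-BYTE KEY>
            - name: key1   # old key, kept so existing data can still be decrypted
              secret: <BASE64-ENCODED 32-BYTE KEY>
      - identity: {}       # illustrative fallback for reading unencrypted data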
To allow automatic reloading, configure the API server to run with: --encryption-provider-
config-automatic-reload=true
What's next
• Read about decrypting data that are already stored at rest
• Learn more about the EncryptionConfiguration configuration API (v1).
Note:
This task covers encryption for resource data stored using the Kubernetes API. For example,
you can encrypt Secret objects, including the key-value data they contain.
If you wanted to manage encryption for data in filesystems that are mounted into containers,
you instead need to either:
• This task assumes that you are running the Kubernetes API server as a static pod on each
control plane node.
• Your cluster's control plane must use etcd v3.x (major version 3, any minor version).
• To encrypt a custom resource, your cluster must be running Kubernetes v1.26 or newer.
• You should have some API data that are already encrypted.
The format of that configuration file is YAML, representing a configuration API kind named
EncryptionConfiguration. You can see an example configuration in Encryption at rest
configuration.
If --encryption-provider-config is set, check which resources (such as secrets) are configured for
encryption, and what provider is used. Make sure that the preferred provider for that resource
type is not identity; you only set identity (no encryption) as default when you want to disable
encryption at rest. Verify that the first-listed provider for a resource is something other than
identity, which means that any new information written to resources of that type will be
encrypted as configured. If you do see identity as the first-listed provider for any resource, this
means that those resources are being written out to etcd without encryption.
First, find the API server configuration files. On each control plane node, the static Pod manifest
for the kube-apiserver specifies a command line argument, --encryption-provider-config. You are
likely to find that this file is mounted into the static Pod using a hostPath volume mount. Once
you locate the volume you can find the file on the node filesystem and inspect it.
To disable encryption at rest, place the identity provider as the first entry in your encryption
configuration file.
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            # Do not use this (invalid) example key for encryption
            - name: example
              secret: 2KfZgdiq2K0g2YrYpyDYs9mF2LPZhQ==
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {} # add this line
      - aescbc:
          keys:
            - name: example
              secret: 2KfZgdiq2K0g2YrYpyDYs9mF2LPZhQ==
and restart the kube-apiserver Pod on this node.
If you have multiple API servers in your cluster, you should deploy the changes in turn to each
API server.
Make sure that you use the same encryption configuration on each control plane host.
Force decryption
Once you have replaced all existing encrypted resources with backing data that don't use
encryption, you can remove the encryption settings from the kube-apiserver.
• --encryption-provider-config
• --encryption-provider-config-automatic-reload
If you have multiple API servers in your cluster, you should again deploy the changes in turn to
each API server.
Make sure that you use the same encryption configuration on each control plane host.
What's next
• Learn more about the EncryptionConfiguration configuration API (v1).
Key Terms
masq/non-masq example
The agent configuration file must be written in YAML or JSON syntax, and may contain three
optional keys: nonMasqueradeCIDRs, masqLinkLocal, and resyncInterval.
Traffic to the 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 ranges will NOT be masqueraded; any
other traffic (assumed to be internet traffic) will be masqueraded. An example of a local
destination from a pod could be its node's IP address, another node's address, or one of the IP
addresses in the cluster's IP range. Any other traffic is masqueraded by default. The entries
below show the default set of rules that are applied by the ip-masq-agent:
By default, in GCE/Google Kubernetes Engine, if network policy is enabled or you are using a
cluster CIDR not in the 10.0.0.0/8 range, the ip-masq-agent will run in your cluster. If you are
running in another environment, you can add the ip-masq-agent DaemonSet to your cluster.
Create an ip-masq-agent
To create an ip-masq-agent, run the following kubectl command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/ip-masq-agent/master/ip-masq-agent.yaml
You must also apply the appropriate node label to any nodes in your cluster that you want the
agent to run on.
In most cases, the default set of rules should be sufficient; however, if this is not the case for
your cluster, you can create and apply a ConfigMap to customize the IP ranges that are affected.
For example, to allow only 10.0.0.0/8 to be considered by the ip-masq-agent, you can create the
following ConfigMap in a file called "config".
Note:
It is important that the file is called config since, by default, that will be used as the key for
lookup by the ip-masq-agent:
nonMasqueradeCIDRs:
- 10.0.0.0/8
resyncInterval: 60s
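The step that loads this file into the cluster is not shown in this extract. Assuming the DaemonSet mounts a ConfigMap named ip-masq-agent in the kube-system namespace, as the upstream example manifest does, the ConfigMap would be created from the file like this:

kubectl create configmap ip-masq-agent --from-file=config --namespace=kube-system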
This will update a file located at /etc/config/ip-masq-agent which is periodically checked every
resyncInterval and applied to the cluster node. After the resync interval has expired, you should
see the iptables rules reflect your changes:
By default, the link local range (169.254.0.0/16) is also handled by the ip-masq agent, which sets
up the appropriate iptables rules. To have the ip-masq-agent ignore link local, you can set
masqLinkLocal to true in the ConfigMap.
nonMasqueradeCIDRs:
- 10.0.0.0/8
resyncInterval: 60s
masqLinkLocal: true
Limit Storage Consumption
This example demonstrates how to limit the amount of storage consumed in a namespace.
The following resources are used in the demonstration: ResourceQuota, LimitRange, and
PersistentVolumeClaim.
To check the version, enter kubectl version.
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi
max.
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
spec:
  limits:
    - type: PersistentVolumeClaim
      max:
        storage: 2Gi
      min:
        storage: 1Gi
Minimum storage requests are used when the underlying storage provider requires certain
minimums. For example, AWS EBS volumes have a 1Gi minimum requirement.
In this example, a 6th PVC in the namespace would be rejected because it exceeds the
maximum count of 5. Alternatively, a 5Gi maximum quota when combined with the 2Gi max
limit above, cannot have 3 PVCs where each has 2Gi. That would be 6Gi requested for a
namespace capped at 5Gi.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: "5Gi"
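A minimal usage sketch; the file names and the quota-example namespace are illustrative assumptions, not from the original. Save the two manifests above, apply them to a namespace, and inspect the resulting quota:

kubectl apply -f storagelimits.yaml -f storagequota.yaml --namespace=quota-example
kubectl describe quota storagequota --namespace=quota-example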
Summary
A limit range can put a ceiling on how much storage is requested while a resource quota can
effectively cap the storage consumed by a namespace through claim counts and cumulative
storage capacity. This allows a cluster-admin to plan their cluster's storage budget without risk
of any one project going over its allotment.
By decoupling the interoperability logic between Kubernetes and the underlying cloud
infrastructure, the cloud-controller-manager component enables cloud providers to release
features at a different pace compared to the main Kubernetes project.
Background
As part of the cloud provider extraction effort, all cloud specific controllers must be moved out
of the kube-controller-manager. All existing clusters that run cloud controllers in the kube-
controller-manager must migrate to instead run the controllers in a cloud provider specific
cloud-controller-manager.
Leader Migration provides a mechanism in which HA clusters can safely migrate "cloud
specific" controllers between the kube-controller-manager and the cloud-controller-manager
via a shared resource lock between the two components while upgrading the replicated control
plane. For a single-node control plane, or if unavailability of controller managers can be
tolerated during the upgrade, Leader Migration is not needed and this guide can be ignored.
This guide walks you through the manual process of upgrading the control plane from kube-
controller-manager with built-in cloud provider to running both kube-controller-manager and
cloud-controller-manager. If you use a tool to deploy and manage the cluster, please refer to the
documentation of the tool and the cloud provider for specific instructions of the migration.
The control plane nodes should run kube-controller-manager with Leader Election enabled,
which is the default. As of version N, an in-tree cloud provider must be set with the --cloud-
provider flag, and cloud-controller-manager should not yet be deployed.
The out-of-tree cloud provider must have built a cloud-controller-manager with Leader
Migration implementation. If the cloud provider imports k8s.io/cloud-provider and k8s.io/
controller-manager of version v0.21.0 or later, Leader Migration will be available. However, for
version before v0.22.0, Leader Migration is alpha and requires feature gate
ControllerManagerLeaderMigration to be enabled in cloud-controller-manager.
This guide assumes that kubelet of each control plane node starts kube-controller-manager and
cloud-controller-manager as static pods defined by their manifests. If the components run in a
different setting, please adjust the steps accordingly.
For authorization, this guide assumes that the cluster uses RBAC. If another authorization mode
grants permissions to kube-controller-manager and cloud-controller-manager components,
please grant the needed access in a way that matches the mode.
The default permissions of the controller managers allow access only to their main Lease. In
order for the migration to work, access to an additional Lease is required.
You can grant kube-controller-manager full access to the leases API by modifying the
system::leader-locking-kube-controller-manager role. This task guide assumes that the name of
the migration lease is cloud-provider-extraction-migration.
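A sketch of the extra RBAC rule this implies, assuming the lease name above; the verb list and the exact edit are illustrative, not a literal upstream manifest:

# Additional rule for the system::leader-locking-kube-controller-manager Role
# (and the cloud-controller-manager counterpart) in the kube-system namespace.
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  resourceNames: ["cloud-provider-extraction-migration"]
  verbs: ["create", "list", "get", "update"]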
Leader Migration optionally takes a configuration file representing the state of controller-to-
manager assignment. At this moment, with in-tree cloud provider, kube-controller-manager
runs route, service, and cloud-node-lifecycle. The following example configuration shows the
assignment.
Leader Migration can be enabled without a configuration. Please see Default Configuration for
details.
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: kube-controller-manager
  - name: service
    component: kube-controller-manager
  - name: cloud-node-lifecycle
    component: kube-controller-manager
Alternatively, because the controllers can run under either controller manager, setting
component to * for both sides makes the configuration file consistent between both parties of
the migration.
# wildcard version
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: *
  - name: service
    component: *
  - name: cloud-node-lifecycle
    component: *
On each control plane node, save the content to /etc/leadermigration.conf, and update the
manifest of kube-controller-manager so that the file is mounted inside the container at the same
location. Also, update the same manifest to add the following arguments:
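The arguments themselves are not listed in this extract; based on the cloud-controller-manager instructions later in this section, they are presumably:

--enable-leader-migration
--leader-migration-config=/etc/leadermigration.conf

The LeaderMigrationConfiguration shown next is the version N + 1 variant, in which the migrated controllers are assigned to cloud-controller-manager: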
kind: LeaderMigrationConfiguration
apiVersion: controllermanager.config.k8s.io/v1
leaderName: cloud-provider-extraction-migration
controllerLeaders:
  - name: route
    component: cloud-controller-manager
  - name: service
    component: cloud-controller-manager
  - name: cloud-node-lifecycle
    component: cloud-controller-manager
When creating control plane nodes of version N + 1, the content should be deployed to /etc/
leadermigration.conf. The manifest of cloud-controller-manager should be updated to mount
the configuration file in the same manner as kube-controller-manager of version N. Similarly,
add --enable-leader-migration and --leader-migration-config=/etc/leadermigration.conf to the
arguments of cloud-controller-manager.
Create a new control plane node of version N + 1 with the updated cloud-controller-manager
manifest, and with the --cloud-provider flag set to external for kube-controller-manager. kube-
controller-manager of version N + 1 MUST NOT have Leader Migration enabled because, with
an external cloud provider, it does not run the migrated controllers anymore, and thus it is not
involved in the migration.
Please refer to Cloud Controller Manager Administration for more detail on how to deploy
cloud-controller-manager.
The control plane now contains nodes of both version N and N + 1. The nodes of version N run
kube-controller-manager only, and those of version N + 1 run both kube-controller-manager
and cloud-controller-manager. The migrated controllers, as specified in the configuration, are
running under either kube-controller-manager of version N or cloud-controller-manager of
version N + 1 depending on which controller manager holds the migration lease. No controller
will ever be running under both controller managers at any time.
In a rolling manner, create a new control plane node of version N + 1 and bring down one of
version N until the control plane contains only nodes of version N + 1. If a rollback from
version N + 1 to N is required, add nodes of version N with Leader Migration enabled for kube-
controller-manager back to the control plane, replacing one of version N + 1 each time until
there are only nodes of version N.
(Optional) Disable Leader Migration
Now that the control plane has been upgraded to run both kube-controller-manager and cloud-
controller-manager of version N + 1, Leader Migration has finished its job and can be safely
disabled to save one Lease resource. It is safe to re-enable Leader Migration for the rollback in
the future.
Default Configuration
Starting with Kubernetes 1.22, Leader Migration provides a default configuration suitable for the
default controller-to-manager assignment. The default configuration can be enabled by setting
--enable-leader-migration but without --leader-migration-config=.
For kube-controller-manager and cloud-controller-manager, if there are no flags that enable any
in-tree cloud provider or change ownership of controllers, the default configuration can be used
to avoid manual creation of the configuration file.
If your cloud provider provides an implementation of Node IPAM controller, you should switch
to the implementation in cloud-controller-manager. Disable Node IPAM controller in kube-
controller-manager of version N + 1 by adding --controllers=*,-nodeipam to its flags. Then add
nodeipam to the list of migrated controllers.
What's next
• Read the Controller Manager Leader Migration enhancement proposal.
Namespaces Walkthrough
Kubernetes namespaces help different projects, teams, or customers to share a Kubernetes
cluster.
This example demonstrates how to use Kubernetes namespaces to subdivide your cluster.
Prerequisites
This example assumes the following:
Assuming you have a fresh cluster, you can inspect the available namespaces by doing the
following:
The development team would like to maintain a space in the cluster where they can get a view
on the list of Pods, Services, and Deployments they use to build and run their application. In
this space, Kubernetes resources come and go, and the restrictions on who can or cannot
modify resources are relaxed to enable agile development.
The operations team would like to maintain a space in the cluster where they can enforce strict
procedures on who can or cannot manipulate the set of Pods, Services, and Deployments that
run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two
namespaces: development and production.
admin/namespace-dev.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    name: development
Save the following contents into file namespace-prod.yaml which describes a production
namespace:
admin/namespace-prod.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    name: production
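The commands that create these namespaces are not included in this extract; having saved both manifests locally, they would be created with something like:

kubectl create -f namespace-dev.yaml
kubectl create -f namespace-prod.yaml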
To be sure things are right, let's list all of the namespaces in our cluster.
Users interacting with one namespace do not see the content in another namespace.
To demonstrate this, let's spin up a simple Deployment and Pods in the development
namespace.
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://130.211.122.180
  name: lithe-cocoa-92103_kubernetes
contexts:
- context:
    cluster: lithe-cocoa-92103_kubernetes
    user: lithe-cocoa-92103_kubernetes
  name: lithe-cocoa-92103_kubernetes
current-context: lithe-cocoa-92103_kubernetes
kind: Config
preferences: {}
users:
- name: lithe-cocoa-92103_kubernetes
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
    token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
- name: lithe-cocoa-92103_kubernetes-basic-auth
  user:
    password: h5M0FtUUIflBSdI7
    username: admin
lithe-cocoa-92103_kubernetes
The next step is to define a context for the kubectl client to work in each namespace. The value
of "cluster" and "user" fields are copied from the current context.
By default, the above commands add two contexts that are saved into file .kube/config. You can
now view the contexts and alternate against the two new request contexts depending on which
namespace you wish to work against.
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://130.211.122.180
  name: lithe-cocoa-92103_kubernetes
contexts:
- context:
    cluster: lithe-cocoa-92103_kubernetes
    user: lithe-cocoa-92103_kubernetes
  name: lithe-cocoa-92103_kubernetes
- context:
    cluster: lithe-cocoa-92103_kubernetes
    namespace: development
    user: lithe-cocoa-92103_kubernetes
  name: dev
- context:
    cluster: lithe-cocoa-92103_kubernetes
    namespace: production
    user: lithe-cocoa-92103_kubernetes
  name: prod
current-context: lithe-cocoa-92103_kubernetes
kind: Config
preferences: {}
users:
- name: lithe-cocoa-92103_kubernetes
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
    token: 65rZW78y8HbwXXtSXuUw9DbP4FLjHi4b
- name: lithe-cocoa-92103_kubernetes-basic-auth
  user:
    password: h5M0FtUUIflBSdI7
    username: admin
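The context switch itself is not shown in this extract; it is presumably:

kubectl config use-context dev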
At this point, all requests we make to the Kubernetes cluster from the command line are scoped
to the development namespace.
admin/snowflake-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: snowflake
  name: snowflake
spec:
  replicas: 2
  selector:
    matchLabels:
      app: snowflake
  template:
    metadata:
      labels:
        app: snowflake
    spec:
      containers:
      - image: registry.k8s.io/serve_hostname
        imagePullPolicy: Always
        name: snowflake
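The command that applies this manifest is not shown here; assuming you saved it locally as snowflake-deployment.yaml (the file reference above suggests admin/snowflake-deployment.yaml in the examples repository), it would be something like:

kubectl apply -f snowflake-deployment.yaml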
We have created a deployment whose replica size is 2 that is running the pod called snowflake
with a basic container that serves the hostname.
And this is great, developers are able to do what they want, and they do not have to worry
about affecting content in the production namespace.
Let's switch to the production namespace and show how resources in one namespace are
hidden from the other.
At this point, it should be clear that the resources users create in one namespace are hidden
from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can
provide different authorization rules for each namespace.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for
the data.
You can find in-depth information about etcd in the official documentation.
Prerequisites
• Run etcd as a cluster of odd members.
• etcd is a leader-based distributed system. Ensure that the leader periodically sends
heartbeats on time to all followers to keep the cluster stable.
Performance and stability of the cluster is sensitive to network and disk I/O. Any
resource starvation can lead to heartbeat timeout, causing instability of the cluster. An
unstable etcd indicates that no leader is elected. Under such circumstances, a cluster
cannot make any changes to its current state, which implies no new pods can be
scheduled.
• Keeping etcd clusters stable is critical to the stability of Kubernetes clusters. Therefore,
run etcd clusters on dedicated machines or isolated environments for guaranteed
resource requirements.
• The minimum recommended etcd versions to run in production are 3.4.22+ and 3.5.6+.
Resource requirements
Operating etcd with limited resources is suitable only for testing purposes. For deploying in
production, advanced hardware configuration is required. Before deploying etcd in production,
see resource requirement reference.
etcd --listen-client-urls=http://$PRIVATE_IP:2379 \
--advertise-client-urls=http://$PRIVATE_IP:2379
For durability and high availability, run etcd as a multi-node cluster in production and back it
up periodically. A five-member cluster is recommended in production. For more information,
see FAQ documentation.
Configure an etcd cluster either by static member information or by dynamic discovery. For
more information on clustering, see etcd clustering documentation.
For an example, consider a five-member etcd cluster running with the following client URLs:
http://$IP1:2379, http://$IP2:2379, http://$IP3:2379, http://$IP4:2379, and http://$IP5:2379. To
start a Kubernetes API server:
etcd --listen-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379 \
  --advertise-client-urls=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379
Make sure the IP<n> variables are set to your client IP addresses.
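The sentence above mentions starting a Kubernetes API server, yet only the etcd invocation is shown. The corresponding API server setting would presumably be the --etcd-servers flag listing all five members, for example:

kube-apiserver --etcd-servers=http://$IP1:2379,http://$IP2:2379,http://$IP3:2379,http://$IP4:2379,http://$IP5:2379 [other flags]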
To secure etcd, either set up firewall rules or use the security features provided by etcd. etcd
security features depend on x509 Public Key Infrastructure (PKI). To begin, establish secure
communication channels by generating a key and certificate pair. For example, use key pairs
peer.key and peer.cert for securing communication between etcd members, and client.key and
client.cert for securing communication between etcd and its clients. See the example scripts
provided by the etcd project to generate key pairs and CA files for client authentication.
Securing communication
To configure etcd with secure peer communication, specify the flags --peer-key-file=peer.key and
--peer-cert-file=peer.cert, and use HTTPS as the URL scheme.
Similarly, to configure etcd with secure client communication, specify the flags
--key-file=k8sclient.key and --cert-file=k8sclient.cert, and use HTTPS as the URL scheme. Here is
an example of a client command that uses secure communication:
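The example command itself is missing from this extract; a sketch using etcdctl with the trust material described above (the endpoint and file names are illustrative):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=etcd.ca --cert=k8sclient.cert --key=k8sclient.key \
  member list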
After configuring secure communication, restrict the access of etcd cluster to only the
Kubernetes API servers. Use TLS authentication to do so.
For example, consider key pairs k8sclient.key and k8sclient.cert that are trusted by the CA
etcd.ca. When etcd is configured with --client-cert-auth along with TLS, it verifies the
certificates from clients by using system CAs or the CA passed in by --trusted-ca-file flag.
Specifying flags --client-cert-auth=true and --trusted-ca-file=etcd.ca will restrict the access to
clients with the certificate k8sclient.cert.
Once etcd is configured correctly, only clients with valid certificates can access it. To give
Kubernetes API servers the access, configure them with the flags --etcd-certfile=k8sclient.cert,
--etcd-keyfile=k8sclient.key and --etcd-cafile=ca.cert.
Note: etcd authentication is not currently supported by Kubernetes. For more information, see
the related issue Support Basic Auth for Etcd v2.
Though etcd keeps unique member IDs internally, it is recommended to use a unique name for
each member to avoid human errors. For example, consider a three-member etcd cluster. Let the
URLs be, member1=http://10.0.0.1, member2=http://10.0.0.2, and member3=http://10.0.0.3.
When member1 fails, replace it with member4=http://10.0.0.4.
1. If each Kubernetes API server is configured to communicate with all etcd members,
remove the failed member from the --etcd-servers flag, then restart each
Kubernetes API server.
2. If each Kubernetes API server communicates with a single etcd member, then stop
the Kubernetes API server that communicates with the failed etcd.
3. Stop the etcd server on the broken node. It is possible that other clients besides the
Kubernetes API server are causing traffic to etcd, and it is desirable to stop all traffic to
prevent writes to the data directory.
export ETCD_NAME="member4"
export ETCD_INITIAL_CLUSTER="member2=http://10.0.0.2:2380,member3=http://10.0.0.3:2380,member4=http://10.0.0.4:2380"
export ETCD_INITIAL_CLUSTER_STATE=existing
etcd [flags]
1. If each Kubernetes API server is configured to communicate with all etcd members,
add the newly added member to the --etcd-servers flag, then restart each
Kubernetes API server.
2. If each Kubernetes API server communicates with a single etcd member, start the
Kubernetes API server that was stopped in step 2. Then configure Kubernetes API
server clients to again route requests to the Kubernetes API server that was
stopped. This can often be done by configuring a load balancer.
Backing up an etcd cluster can be accomplished in two ways: etcd built-in snapshot and volume
snapshot.
Built-in snapshot
etcd supports built-in snapshot. A snapshot may either be taken from a live member with the
etcdctl snapshot save command or by copying the member/snap/db file from an etcd data
directory that is not currently used by an etcd process. Taking the snapshot will not affect the
performance of the member.
Below is an example for taking a snapshot of the keyspace served by $ENDPOINT to the file
snapshot.db:
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db
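The table that follows appears to be the output of checking the snapshot status; the command itself is missing from this extract, but it is typically something like:

ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db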
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 | 10 | 7 | 2.1 MB |
+----------+----------+------------+------------+
Volume snapshot
If etcd is running on a storage volume that supports backup, such as Amazon Elastic Block
Store, back up etcd data by taking a snapshot of the storage volume.
You can also take the snapshot using the various options supported by etcdctl. For example,
ETCDCTL_API=3 etcdctl -h
lists the options available from etcdctl. You can take a snapshot by specifying the endpoint,
certificates, and so on, as shown below:
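The command itself is not included in this extract; a sketch consistent with the placeholders described in the next sentence (the endpoint is illustrative):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=<trusted-ca-file> --cert=<cert-file> --key=<key-file> \
  snapshot save <backup-file-location>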
where trusted-ca-file, cert-file and key-file can be obtained from the description of the etcd Pod.
Before starting the restore operation, a snapshot file must be present. It can either be a snapshot
file from a previous backup operation, or from a remaining data directory.
Here is an example:
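The example command is missing from this extract; based on the variant shown just below, it was presumably the same restore invocation with the environment variable set inline:

ETCDCTL_API=3 etcdctl --data-dir <data-dir-location> snapshot restore snapshot.db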
where <data-dir-location> is a directory that will be created during the restore process.
Yet another example would be to first export the ETCDCTL_API environment variable:
export ETCDCTL_API=3
etcdctl --data-dir <data-dir-location> snapshot restore snapshot.db
For more information and examples on restoring a cluster from a snapshot file, see etcd disaster
recovery documentation.
If the access URLs of the restored cluster are changed from the previous cluster, the Kubernetes
API server must be reconfigured accordingly. In this case, restart Kubernetes API servers with
the flag --etcd-servers=$NEW_ETCD_CLUSTER instead of the flag --etcd-
servers=$OLD_ETCD_CLUSTER. Replace $NEW_ETCD_CLUSTER and
$OLD_ETCD_CLUSTER with the respective IP addresses. If a load balancer is used in front of
an etcd cluster, you might need to update the load balancer instead.
If the majority of etcd members have permanently failed, the etcd cluster is considered failed. In
this scenario, Kubernetes cannot make any changes to its current state. Although the scheduled
pods might continue to run, no new pods can be scheduled. In such cases, recover the etcd
cluster and potentially reconfigure Kubernetes API servers to fix the issue.
Note:
If any API servers are running in your cluster, you should not attempt to restore instances of
etcd. Instead, follow these steps to restore etcd:
Note: Before you start an upgrade, please back up your etcd cluster first.
You can also run the defragmentation tool as a Kubernetes CronJob, to make sure that
defragmentation happens regularly. See etcd-defrag-cronjob.yaml for details.
The kubelet exposes a feature named 'Node Allocatable' that helps to reserve compute resources
for system daemons. Kubernetes recommends that cluster administrators configure 'Node
Allocatable' based on their workload density on each node.
Your Kubernetes server must be at or later than version 1.8. To check the version, enter kubectl
version. Your Kubernetes server must be at or later than version 1.17 to use the kubelet
command line option --reserved-cpus to set an explicitly reserved CPU list.
Node Allocatable
[Figure: node capacity]
'Allocatable' on a Kubernetes node is defined as the amount of compute resources that are
available for pods. The scheduler does not over-subscribe 'Allocatable'. 'CPU', 'memory' and
'ephemeral-storage' are supported as of now.
Node Allocatable is exposed as part of v1.Node object in the API and as part of kubectl describe
node in the CLI.
Resources can be reserved for two categories of system daemons in the kubelet.
To properly enforce node allocatable constraints on the node, you must enable the new cgroup
hierarchy via the --cgroups-per-qos flag. This flag is enabled by default. When enabled, the
kubelet will parent all end-user pods under a cgroup hierarchy managed by the kubelet.
The kubelet supports manipulation of the cgroup hierarchy on the host using a cgroup driver.
The driver is configured via the --cgroup-driver flag.
• cgroupfs is the default driver that performs direct manipulation of the cgroup filesystem
on the host in order to manage cgroup sandboxes.
• systemd is an alternative driver that manages cgroup sandboxes using transient slices for
resources that are supported by that init system.
Depending on the configuration of the associated container runtime, operators may have to
choose a particular cgroup driver to ensure proper system behavior. For example, if operators
use the systemd cgroup driver provided by the containerd runtime, the kubelet must be
configured to use the systemd cgroup driver.
Kube Reserved
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the
kubelet, container runtime, node problem detector, etc. It is not meant to reserve resources for
system daemons that are run as pods. kube-reserved is typically a function of pod density on
the nodes.
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the
specified number of process IDs for kubernetes system daemons.
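As an illustration of how these reservations are expressed on the kubelet command line (the values here are arbitrary examples, not recommendations):

--kube-reserved=cpu=100m,memory=100Mi,ephemeral-storage=1Gi,pid=1000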
To optionally enforce kube-reserved on kubernetes system daemons, specify the parent control
group for kube daemons as the value for --kube-reserved-cgroup kubelet flag.
It is recommended that the kubernetes system daemons are placed under a top level control
group (runtime.slice on systemd machines for example). Each system daemon should ideally
run within its own child control group. Refer to the design proposal for more details on
recommended control group hierarchy.
Note that Kubelet does not create --kube-reserved-cgroup if it doesn't exist. The kubelet will
fail to start if an invalid cgroup is specified. With systemd cgroup driver, you should follow a
specific pattern for the name of the cgroup you define: the name should be the value you set for
--kube-reserved-cgroup, with .slice appended.
System Reserved
system-reserved is meant to capture resource reservation for OS system daemons like sshd,
udev, etc. system-reserved should reserve memory for the kernel too since kernel memory is
not accounted to pods in Kubernetes at this time. Reserving resources for user login sessions is
also recommended (user.slice in systemd world).
In addition to cpu, memory, and ephemeral-storage, pid may be specified to reserve the
specified number of process IDs for OS system daemons.
To optionally enforce system-reserved on system daemons, specify the parent control group for
OS system daemons as the value for --system-reserved-cgroup kubelet flag.
It is recommended that the OS system daemons are placed under a top level control group
(system.slice on systemd machines for example).
Note that kubelet does not create --system-reserved-cgroup if it doesn't exist. kubelet will fail
if an invalid cgroup is specified. With systemd cgroup driver, you should follow a specific
pattern for the name of the cgroup you define: the name should be the value you set for --
system-reserved-cgroup, with .slice appended.
reserved-cpus is meant to define an explicit CPU set for OS system daemons and kubernetes
system daemons. reserved-cpus is for systems that do not intend to define separate top level
cgroups for OS system daemons and kubernetes system daemons with regard to cpuset
resource. If the Kubelet does not have --system-reserved-cgroup and --kube-reserved-cgroup,
the explicit cpuset provided by reserved-cpus will take precedence over the CPUs defined by --
kube-reserved and --system-reserved options.
This option is specifically designed for Telco/NFV use cases where uncontrolled interrupts or
timers may impact workload performance. You can use this option to define an explicit cpuset
for the system and Kubernetes daemons as well as for interrupts and timers, so that the
remaining CPUs on the system can be used exclusively for workloads, with less impact from
uncontrolled interrupts or timers. Moving the system daemons, Kubernetes daemons, and
interrupts or timers onto the explicit cpuset defined by this option requires a mechanism
outside Kubernetes. For example, on CentOS you can do this using the tuned toolset.
Eviction Thresholds
Memory pressure at the node level leads to System OOMs which affects the entire node and all
pods running on it. Nodes can go offline temporarily until memory has been reclaimed. To
avoid (or reduce the probability of) system OOMs kubelet provides out of resource
management. Evictions are supported for memory and ephemeral-storage only. By reserving
some memory via --eviction-hard flag, the kubelet attempts to evict pods whenever memory
availability on the node drops below the reserved value. Hypothetically, if system daemons did
not exist on a node, pods cannot use more than capacity - eviction-hard. For this reason,
resources reserved for evictions are not available for pods.
The kubelet enforces 'Allocatable' across pods by default. Enforcement is performed by evicting
pods whenever the overall usage across all pods exceeds 'Allocatable'. More details on the
eviction policy can be found on the node-pressure eviction page. This enforcement is controlled
by specifying the pods value to the kubelet flag --enforce-node-allocatable.
General Guidelines
System daemons are expected to be treated similar to Guaranteed pods. System daemons can
burst within their bounding control groups and this behavior needs to be managed as part of
kubernetes deployments. For example, kubelet should have its own control group and share
kube-reserved resources with the container runtime. However, Kubelet cannot burst and use up
all available Node resources if kube-reserved is enforced.
Be extra careful while enforcing system-reserved reservation since it can lead to critical system
services being CPU starved, OOM killed, or unable to fork on the node. The recommendation is
to enforce system-reserved only if a user has profiled their nodes exhaustively to come up with
precise estimates and is confident in their ability to recover if any process in that group is oom-
killed.
The resource requirements of kube system daemons may grow over time as more and more
features are added. Over time, the Kubernetes project will attempt to bring down the utilization
of node system daemons, but that is not a priority as of now. So expect a drop in Allocatable
capacity in future releases.
Example Scenario
Here is an example to illustrate Node Allocatable computation:
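The node and reservation figures for this scenario are not reproduced in this extract, but they can be reconstructed from the results quoted below: a node with 16 CPUs, 32Gi of memory, and 100Gi of local storage, with kubelet flags roughly like the following (a reconstruction, not the original listing):

--kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi
--system-reserved=cpu=500m,memory=1Gi,ephemeral-storage=1Gi
--eviction-hard=memory.available<500Mi,nodefs.available<10%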
Under this scenario, 'Allocatable' will be 14.5 CPUs, 28.5Gi of memory and 88Gi of local storage.
Scheduler ensures that the total memory requests across all pods on this node does not exceed
28.5Gi and storage doesn't exceed 88Gi. Kubelet evicts pods whenever the overall memory
usage across pods exceeds 28.5Gi, or if overall disk usage exceeds 88Gi. If all processes on the
node consume as much CPU as they can, pods together cannot consume more than 14.5 CPUs.
If kube-reserved and/or system-reserved is not enforced and system daemons exceed their
reservation, kubelet evicts pods whenever the overall node memory usage is higher than 31.5Gi
or storage is greater than 90Gi.
This document describes how to run Kubernetes Node components such as kubelet, CRI, OCI,
and CNI without root privileges, by using a user namespace.
Note:
This document describes how to run Kubernetes Node components (and hence pods) as a non-
root user.
If you are just looking for how to run a pod as a non-root user, see SecurityContext.
• Enable Cgroup v2
• Enable systemd with user session
• Configure several sysctl values, depending on host Linux distribution
• Ensure that your unprivileged user is listed in /etc/subuid and /etc/subgid
• Enable the KubeletInUserNamespace feature gate
minikube also supports running Kubernetes inside Rootless Docker or Rootless Podman.
• Rootless Docker
• Rootless Podman
sysbox
Sysbox is an open-source container runtime (similar to "runc") that supports running system-
level workloads such as Docker and Kubernetes inside unprivileged containers isolated with the
Linux user namespace.
Sysbox supports running Kubernetes inside unprivileged containers without requiring Cgroup
v2 and without the KubeletInUserNamespace feature gate. It does this by exposing specially
crafted /proc and /sys filesystems inside the container plus several other advanced OS
virtualization techniques.
K3s
Usernetes
Usernetes supports both containerd and CRI-O as CRI runtimes. Usernetes supports multi-node
clusters using Flannel (VXLAN).
Note: This section is intended to be read by developers of Kubernetes distributions, not by end
users.
If you are trying to run Kubernetes in a user-namespaced container such as Rootless Docker/
Podman or LXC/LXD, you are all set, and you can go to the next subsection.
Otherwise you have to create a user namespace by yourself, by calling unshare(2) with
CLONE_NEWUSER.
A user namespace can be also unshared by using command line tools such as:
• unshare(1)
• RootlessKit
• become-root
After unsharing the user namespace, you will also have to unshare other namespaces such as
mount namespace.
You do not need to call chroot() nor pivot_root() after unsharing the mount namespace,
however, you have to mount writable filesystems on several directories in the namespace.
At least, the following directories need to be writable in the namespace (not outside the
namespace):
• /etc
• /run
• /var/logs
• /var/lib/kubelet
• /var/lib/cni
• /var/lib/containerd (for containerd)
• /var/lib/containers (for CRI-O)
In addition to the user namespace, you also need to have a writable cgroup tree with cgroup v2.
Note: Kubernetes support for running Node components in user namespaces requires cgroup
v2. Cgroup v1 is not supported.
Otherwise you have to create a systemd unit with Delegate=yes property to delegate a cgroup
tree with writable permission.
On your node, systemd must already be configured to allow delegation; for more details, see
cgroup v2 in the Rootless Containers documentation.
Configuring network
Note: This section links to third party projects that provide functionality required by
Kubernetes. The Kubernetes project authors aren't responsible for these projects, which are
listed alphabetically. To add a project to this list, read the content guide before submitting a
change. More information.
The network namespace of the Node components has to have a non-loopback interface, which
can be for example configured with slirp4netns, VPNKit, or lxc-user-nic(1).
The network namespaces of the Pods can be configured with regular CNI plugins. For multi-
node networking, Flannel (VXLAN, 8472/UDP) is known to work.
Ports such as the kubelet port (10250/TCP) and NodePort service ports have to be exposed from
the Node network namespace to the host with an external port forwarder, such as RootlessKit,
slirp4netns, or socat(1).
You can use the port forwarder from K3s. See Running K3s in Rootless Mode for more details.
The implementation can be found in the pkg/rootlessports package of k3s.
Configuring CRI
The kubelet relies on a container runtime. You should deploy a container runtime such as
containerd or CRI-O and ensure that it is running within the user namespace before the kubelet
starts.
• containerd
• CRI-O
Running CRI plugin of containerd in a user namespace is supported since containerd 1.4.
version = 2

[plugins."io.containerd.grpc.v1.cri"]
  # Disable AppArmor
  disable_apparmor = true
  # Ignore an error during setting oom_score_adj
  restrict_oom_score_adj = true
  # Disable hugetlb cgroup v2 controller (because systemd does not support delegating hugetlb controller)
  disable_hugetlb_controller = true

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  snapshotter = "fuse-overlayfs"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  # We use cgroupfs that is delegated by systemd, so we do not use SystemdCgroup driver
  # (unless you run another systemd in the namespace)
  SystemdCgroup = false
The default path of the configuration file is /etc/containerd/config.toml. The path can be
specified with containerd -c /path/to/containerd/config.toml.
[crio]
  storage_driver = "overlay"
  # Using non-fuse overlayfs is also possible for kernel >= 5.11, but requires SELinux to be disabled
  storage_option = ["overlay.mount_program=/usr/local/bin/fuse-overlayfs"]

[crio.runtime]
  # We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
  # (unless you run another systemd in the namespace)
  cgroup_manager = "cgroupfs"
The default path of the configuration file is /etc/crio/crio.conf. The path can be specified with
crio --config /path/to/crio/crio.conf.
Configuring kubelet
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
# We use cgroupfs that is delegated by systemd, so we do not use "systemd" driver
# (unless you run another systemd in the namespace)
cgroupDriver: "cgroupfs"
When the KubeletInUserNamespace feature gate is enabled, the kubelet ignores errors that may
happen during setting the following sysctl values on the node.
• vm.overcommit_memory
• vm.panic_on_oom
• kernel.panic
• kernel.panic_on_oops
• kernel.keys.root_maxkeys
• kernel.keys.root_maxbytes.
Within a user namespace, the kubelet also ignores any error raised from trying to open /dev/
kmsg. This feature gate also allows kube-proxy to ignore an error during setting
RLIMIT_NOFILE.
The KubeletInUserNamespace feature gate was introduced in Kubernetes v1.22 with "alpha"
status.
Running kubelet in a user namespace without using this feature gate is also possible by
mounting a specially crafted proc filesystem (as done by Sysbox), but not officially supported.
Configuring kube-proxy
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables" # or "userspace"
conntrack:
  # Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close"
  tcpCloseWaitTimeout: 0s
Caveats
• Most of "non-local" volume drivers such as nfs and iscsi do not work. Local volumes like
local, hostPath, emptyDir, configMap, secret, and downwardAPI are known to work.
• Some CNI plugins may not work. Flannel (VXLAN) is known to work.
For more on this, see the Caveats and Future work page on the rootlesscontaine.rs website.
See Also
• rootlesscontaine.rs
• Rootless Containers 2020 (KubeCon NA 2020)
• Running kind with Rootless Docker
• Usernetes
• Running K3s with rootless mode
• KEP-2033: Kubelet-in-UserNS (aka Rootless mode)
1. You do not require your applications to be highly available during the node drain, or
2. You have read about the PodDisruptionBudget concept, and have configured
PodDisruptionBudgets for applications that need them.
If availability is important for any applications that run or could run on the node(s) that you are
draining, configure PodDisruptionBudgets first and then continue following this guide.
Note: By default kubectl drain ignores certain system pods on the node that cannot be killed;
see the kubectl drain documentation for more details.
When kubectl drain returns successfully, that indicates that all of the pods (except the ones
excluded as described in the previous paragraph) have been safely evicted (respecting the
desired graceful termination period, and respecting the PodDisruptionBudget you have
defined). It is then safe to bring down the node by powering down its physical machine or, if
running on a cloud platform, deleting its virtual machine.
Note:
If any new Pods tolerate the node.kubernetes.io/unschedulable taint, then those Pods might be
scheduled to the node you have drained. Avoid tolerating that taint other than for DaemonSets.
If you or another API user directly set the nodeName field for a Pod (bypassing the scheduler),
then the Pod is bound to the specified node and will run there, even though you have drained
that node and marked it unschedulable.
First, identify the name of the node you wish to drain. You can list all of the nodes in your
cluster with
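(The command is not shown in this extract; it is presumably the standard listing subcommand.)

kubectl get nodes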
If there are pods managed by a DaemonSet, you will need to specify --ignore-daemonsets with
kubectl to successfully drain the node. The kubectl drain subcommand on its own does not
actually drain a node of its DaemonSet pods: the DaemonSet controller (part of the control
plane) immediately replaces missing Pods with new equivalent Pods. The DaemonSet controller
also creates Pods that ignore unschedulable taints, which allows the new Pods to launch onto a
node that you are draining.
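The drain invocation itself is not shown in this extract; a typical form, with <node name> as a placeholder:

kubectl drain <node name> --ignore-daemonsets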
Once it returns (without giving an error), you can power down the node (or equivalently, if on a
cloud platform, delete the virtual machine backing the node). If you leave the node in the
cluster during the maintenance operation, you need to run
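(The command is missing from this extract; it is presumably the standard uncordon subcommand, with <node name> as a placeholder.)

kubectl uncordon <node name>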
afterwards to tell Kubernetes that it can resume scheduling new pods onto the node.
For example, if you have a StatefulSet with three replicas and have set a PodDisruptionBudget
for that set specifying minAvailable: 2, kubectl drain only evicts a pod from the StatefulSet if all
three replica pods are healthy; if you then issue multiple drain commands in parallel,
Kubernetes respects the PodDisruptionBudget and ensures that only 1 (calculated as replicas -
minAvailable) Pod is unavailable at any given time. Any drains that would cause the number of
healthy replicas to fall below the specified budget are blocked.
What's next
• Follow steps to protect your application by configuring a Pod Disruption Budget.
Securing a Cluster
This document covers topics related to protecting a cluster from accidental or malicious access
and provides recommendations on overall security.
To check the version, enter kubectl version.
Kubernetes expects that all API communication in the cluster is encrypted by default with TLS,
and the majority of installation methods will allow the necessary certificates to be created and
distributed to the cluster components. Note that some components and installation methods
may enable local ports over HTTP and administrators should familiarize themselves with the
settings of each component to identify potentially unsecured traffic.
API Authentication
Choose an authentication mechanism for the API servers to use that matches the common
access patterns when you install a cluster. For instance, small, single-user clusters may wish to
use a simple certificate or static Bearer token approach. Larger clusters may wish to integrate
an existing OIDC or LDAP server that allows users to be subdivided into groups.
All API clients must be authenticated, even those that are part of the infrastructure like nodes,
proxies, the scheduler, and volume plugins. These clients are typically service accounts or use
x509 client certificates, and they are created automatically at cluster startup or are set up as part
of the cluster installation.
API Authorization
Once authenticated, every API call is also expected to pass an authorization check. Kubernetes
ships an integrated Role-Based Access Control (RBAC) component that matches an incoming
user or group to a set of permissions bundled into roles. These permissions combine verbs (get,
create, delete) with resources (pods, services, nodes) and can be namespace-scoped or cluster-
scoped. A set of out-of-the-box roles are provided that offer reasonable default separation of
responsibility depending on what actions a client might want to perform. It is recommended
that you use the Node and RBAC authorizers together, in combination with the NodeRestriction
admission plugin.
As with authentication, simple and broad roles may be appropriate for smaller clusters, but as
more users interact with the cluster, it may become necessary to separate teams into separate
namespaces with more limited roles.
With authorization, it is important to understand how updates on one object may cause actions
in other places. For instance, a user may not be able to create pods directly, but allowing them
to create a deployment, which creates pods on their behalf, will let them create those pods
indirectly. Likewise, deleting a node from the API will result in the pods scheduled to that node
being terminated and recreated on other nodes. The out-of-the box roles represent a balance
between flexibility and common use cases, but more limited roles should be carefully reviewed
to prevent accidental escalation. You can make roles specific to your use case if the out-of-box
ones don't meet your needs.
Resource quota limits the number or capacity of resources granted to a namespace. This is most
often used to limit the amount of CPU, memory, or persistent disk a namespace can allocate,
but can also control how many pods, services, or volumes exist in each namespace.
Limit ranges restrict the maximum or minimum size of some of the resources above, to prevent
users from requesting unreasonably high or low values for commonly reserved resources like
memory, or to provide default limits when none are specified.
A pod definition contains a security context that allows it to request access to run as a specific
Linux user on a node (like root), access to run privileged or access the host network, and other
controls that would otherwise allow it to run unfettered on a hosting node.
You can configure Pod security admission to enforce use of a particular Pod Security Standard
in a namespace, or to detect breaches.
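A minimal sketch of what that looks like, assuming a hypothetical namespace name; the pod-security.kubernetes.io labels select the Pod Security Standard and the admission mode:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app   # hypothetical namespace name
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted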
Generally, most application workloads need limited access to host resources so they can
successfully run as a root process (uid 0) without access to host information. However,
considering the privileges associated with the root user, you should write application containers
to run as a non-root user. Similarly, administrators who wish to prevent client applications from
escaping their containers should apply the Baseline or Restricted Pod Security Standard.
Preventing containers from loading unwanted kernel modules
The Linux kernel automatically loads kernel modules from disk if needed in certain
circumstances, such as when a piece of hardware is attached or a filesystem is mounted. Of
particular relevance to Kubernetes, even unprivileged processes can cause certain network-
protocol-related kernel modules to be loaded, just by creating a socket of the appropriate type.
This may allow an attacker to exploit a security hole in a kernel module that the administrator
assumed was not in use.
To prevent specific modules from being automatically loaded, you can uninstall them from the
node, or add rules to block them. On most Linux distributions, you can do that by creating a file
such as /etc/modprobe.d/kubernetes-blacklist.conf with contents like:
# SCTP is not used in most Kubernetes clusters, and has also had
# vulnerabilities in the past.
blacklist sctp
To block module loading more generically, you can use a Linux Security Module (such as
SELinux) to completely deny the module_request permission to containers, preventing the
kernel from loading modules for containers under any circumstances. (Pods would still be able
to use modules that had been loaded manually, or modules that were loaded by the kernel on
behalf of some more-privileged process.)
The network policies for a namespace allow application authors to restrict which pods in other
namespaces may access pods and ports within their namespaces. Many of the supported
Kubernetes networking providers now respect network policy.
Quota and limit ranges can also be used to control whether users may request node ports or
load-balanced services, which on many clusters can control whether those users' applications
are visible outside of the cluster.
Additional protections may be available that control network rules on a per-plugin or per-
environment basis, such as per-node firewalls, physically separating cluster nodes to prevent
cross talk, or advanced networking policy.
Cloud platforms (AWS, Azure, GCE, etc.) often expose metadata services locally to instances. By
default these APIs are accessible by pods running on an instance and can contain cloud
credentials for that node, or provisioning data such as kubelet credentials. These credentials can
be used to escalate within the cluster or to other cloud services under the same account.
When running Kubernetes on a cloud platform, limit permissions given to instance credentials,
use network policies to restrict pod access to the metadata API, and avoid using provisioning
data to deliver secrets.
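As a sketch of the second recommendation, a NetworkPolicy can allow general egress while excluding the metadata endpoint. The namespace and the 169.254.169.254 address are assumptions (that address is the conventional metadata endpoint on the major clouds, but verify it for your platform), and the policy only takes effect if your network plugin enforces NetworkPolicy:
kubectl apply --namespace=development -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cloud-metadata
spec:
  podSelector: {}        # applies to every Pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32   # cloud metadata endpoint
EOF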
Controlling which nodes pods may access
By default, there are no restrictions on which nodes may run a pod. Kubernetes offers a rich set
of policies for controlling placement of pods onto nodes and the taint-based pod placement and
eviction that are available to end users. For many clusters use of these policies to separate
workloads can be a convention that authors adopt or enforce via tooling.
As an administrator, a beta admission plugin PodNodeSelector can be used to force pods within
a namespace to default or require a specific node selector, and if end users cannot alter
namespaces, this can strongly limit the placement of all of the pods in a specific workload.
Write access to the etcd backend for the API is equivalent to gaining root on the entire cluster,
and read access can be used to escalate fairly quickly. Administrators should always use strong
credentials from the API servers to their etcd server, such as mutual auth via TLS client
certificates, and it is often recommended to isolate the etcd servers behind a firewall that only
the API servers may access.
Caution: Allowing other components within the cluster to access the master etcd instance with
read or write access to the full keyspace is equivalent to granting cluster-admin access. Using
separate etcd instances for non-master components or using etcd ACLs to restrict read and
write access to a subset of the keyspace is strongly recommended.
The audit logger is a beta feature that records actions taken by the API for later analysis in the
event of a compromise. It is recommended to enable audit logging and archive the audit file on
a secure server.
Alpha and beta Kubernetes features are in active development and may have limitations or bugs
that result in security vulnerabilities. Always assess the value an alpha or beta feature may
provide against the possible risk to your security posture. When in doubt, disable features you
do not use.
The shorter the lifetime of a secret or credential the harder it is for an attacker to make use of
that credential. Set short lifetimes on certificates and automate their rotation. Use an
authentication provider that can control how long issued tokens are available and use short
lifetimes where possible. If you use service-account tokens in external integrations, plan to
rotate those tokens frequently. For example, once the bootstrap phase is complete, a bootstrap
token used for setting up nodes should be revoked or its authorization removed.
Review third party integrations before enabling them
Many third party integrations to Kubernetes may alter the security profile of your cluster.
When enabling an integration, always review the permissions that an extension requests before
granting it access. For example, many security integrations may request access to view all
secrets on your cluster which is effectively making that component a cluster admin. When in
doubt, restrict the integration to functioning in a single namespace if possible.
Components that create pods may also be unexpectedly powerful if they can do so inside
namespaces like the kube-system namespace, because those pods can gain access to service
account secrets or run with elevated permissions if those service accounts are granted access to
permissive PodSecurityPolicies.
If you use Pod Security admission and allow any component to create Pods within a namespace
that permits privileged Pods, those Pods may be able to escape their containers and use this
widened access to elevate their privileges.
You should not allow untrusted components to create Pods in any system namespace (those
with names that start with kube-) nor in any namespace where that access grant allows the
possibility of privilege escalation.
In general, the etcd database will contain any information accessible via the Kubernetes API
and may grant an attacker significant visibility into the state of your cluster. Always encrypt
your backups using a well reviewed backup and encryption solution, and consider using full
disk encryption where possible.
Kubernetes supports optional encryption at rest for information in the Kubernetes API. This
lets you ensure that when Kubernetes stores data for objects (for example, Secret or ConfigMap
objects), the API server writes an encrypted representation of the object. That encryption
means that even someone who has access to etcd backup data is unable to view the content of
those objects. In Kubernetes 1.28 you can also encrypt custom resources; encryption-at-rest for
extension APIs defined in CustomResourceDefinitions was added to Kubernetes as part of the
v1.26 release.
Join the kubernetes-announce group for emails about security announcements. See the security
reporting page for more on how to report vulnerabilities.
What's next
• Security Checklist for additional information on Kubernetes security guidance.
The configuration file must be a JSON or YAML representation of the parameters in this struct.
Make sure the kubelet has read permissions on the file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
Note: In this example, because the default value of only one evictionHard parameter is changed,
the default values of the other evictionHard parameters are not inherited; they are set to zero.
To provide custom values, you should provide all of the threshold values.
The imagefs is an optional filesystem that container runtimes use to store container images and
container writable layers.
Start the kubelet with the --config flag set to the path of the kubelet's config file. The kubelet
will then load its config from this file.
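A minimal sketch; the path is an assumption, so point --config at wherever you actually saved the file:
kubelet --config=/var/lib/kubelet/config.yaml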
Note that command line flags which target the same value as a config file will override that
value. This helps ensure backwards compatibility with the command-line API.
Note that relative file paths in the kubelet config file are resolved relative to the location of the
kubelet config file, whereas relative paths in command line flags are resolved relative to the
kubelet's current working directory.
Note that some default values differ between command-line flags and the kubelet config file. If
--config is provided and the values are not specified via the command line, the defaults for the
KubeletConfiguration version apply. In the above example, this version is kubelet.config.k8s.io/
v1beta1.
You can only set --config-dir if you set the environment variable
KUBELET_CONFIG_DROPIN_DIR_ALPHA for the kubelet process (the value of that variable
does not matter). For Kubernetes v1.28, the kubelet returns an error if you specify --config-dir
without that variable set, and startup fails. You cannot specify the drop-in configuration
directory using the kubelet configuration file; only the CLI argument --config-dir can set it.
One can use the kubelet configuration directory in a similar way to the kubelet config file.
Note: The suffix of a valid kubelet drop-in configuration file must be .conf. For instance: 99-
kubelet-address.conf
For instance, you may want a baseline kubelet configuration for all nodes, but you may want to
customize the address field. This can be done as follows:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
port: 20250
serializeImagePulls: false
evictionHard:
  memory.available: "200Mi"

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: "192.168.0.8"
This produces the same outcome as if you used the single configuration file used in the earlier
example.
What's next
• Learn more about kubelet configuration by checking the KubeletConfiguration reference.
Viewing namespaces
List the current namespaces in a cluster using kubectl get namespaces. You can also see the details of a specific namespace with kubectl describe namespace <name>; for the default namespace the output looks similar to:
Name: default
Labels: <none>
Annotations: <none>
Status: Active
No resource quota.
Resource Limits
Type Resource Min Max Default
---- -------- --- --- ---
Container cpu - - 100m
Note that these details show both resource quota (if present) as well as resource limit ranges.
Resource quota tracks aggregate usage of resources in the Namespace and allows cluster
operators to define Hard resource usage limits that a Namespace may consume.
A limit range defines min/max constraints on the amount of resources a single entity can
consume in a Namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: <insert-namespace-name-here>
Then run kubectl create -f ./my-namespace.yaml, substituting the name of the file you saved.
There's an optional field finalizers, which allows observers to purge resources whenever the
namespace is deleted. Keep in mind that if you specify a nonexistent finalizer, the namespace
will be created but will get stuck in the Terminating state if the user tries to delete it.
This delete is asynchronous, so for a time you will see the namespace in the Terminating state.
Assuming you have a fresh cluster, you can introspect the available namespaces by running
kubectl get namespaces.
For this exercise, we will create two additional Kubernetes namespaces to hold our content.
In a scenario where an organization is using a shared Kubernetes cluster for development and
production use cases:
• The development team would like to maintain a space in the cluster where they can get a
view on the list of Pods, Services, and Deployments they use to build and run their
application. In this space, Kubernetes resources come and go, and the restrictions on who
can or cannot modify resources are relaxed to enable agile development.
• The operations team would like to maintain a space in the cluster where they can enforce
strict procedures on who can or cannot manipulate the set of Pods, Services, and
Deployments that run the production site.
One pattern this organization could follow is to partition the Kubernetes cluster into two
namespaces: development and production. Let's create two new namespaces to hold our work.
To be sure things are right, list all of the namespaces in our cluster.
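A sketch of those steps using the imperative commands (the namespaces could equally be created from YAML manifests as shown earlier):
kubectl create namespace development
kubectl create namespace production
kubectl get namespaces --show-labels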
A Kubernetes namespace provides the scope for Pods, Services, and Deployments in the cluster.
Users interacting with one namespace do not see the content in another namespace. To
demonstrate this, let's spin up a simple Deployment and Pods in the development namespace.
We have created a deployment whose replica size is 2 that is running the pod called snowflake
with a basic container that serves the hostname.
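A hedged sketch of creating that Deployment; the serve_hostname image name is an assumption taken from the usual Kubernetes examples:
kubectl create deployment snowflake \
  --image=registry.k8s.io/serve_hostname \
  --replicas=2 \
  -n development
kubectl get deployment,pods -n development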
And this is great, developers are able to do what they want, and they do not have to worry
about affecting content in the production namespace.
Let's switch to the production namespace and show how resources in one namespace are
hidden from the other. The production namespace should be empty, and the following
commands should return nothing.
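For example (a sketch, assuming the production namespace created earlier):
kubectl get deployment -n production
kubectl get pods -n production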
At this point, it should be clear that the resources users create in one namespace are hidden
from the other namespace.
As the policy support in Kubernetes evolves, we will extend this scenario to show how you can
provide different authorization rules for each namespace.
Each user community wants to be able to work in isolation from other communities, with its own
resources, policies, and constraints.
A cluster operator may create a Namespace for each unique user community.
What's next
• Learn more about setting the namespace preference.
• Learn more about setting the namespace for a request
• See namespaces design.
Upgrade A Cluster
This page provides an overview of the steps you should follow to upgrade a Kubernetes cluster.
The way that you upgrade a cluster depends on how you initially deployed it and on any
subsequent changes.
Upgrade approaches
kubeadm
If your cluster was deployed using the kubeadm tool, refer to Upgrading kubeadm clusters for
detailed information on how to upgrade the cluster.
Once you have upgraded the cluster, remember to install the latest version of kubectl.
Manual deployments
Caution: These steps do not account for third-party extensions such as network and storage
plugins.
You should manually update the control plane following this sequence: etcd (all instances), kube-apiserver (all control plane hosts), kube-controller-manager, kube-scheduler, and the cloud controller manager (if you use one).
For each node in your cluster, drain that node and then either replace it with a new node that
uses the 1.28 kubelet, or upgrade the kubelet on that node and bring the node back into service.
Caution: Draining nodes before upgrading kubelet ensures that pods are re-admitted and
containers are re-created, which may be necessary to resolve some security issues or other
important bugs.
Other deployments
Refer to the documentation for your cluster deployment tool to learn the recommended set up
steps for maintenance.
Post-upgrade tasks
Switch your cluster's storage API version
The objects that are serialized into etcd for a cluster's internal representation of the Kubernetes
resources active in the cluster are written using a particular version of the API.
When the supported API changes, these objects may need to be rewritten in the newer API.
Failure to do this will eventually result in resources that are no longer decodable or usable by
the Kubernetes API server.
For each affected object, fetch it using the latest supported API and then write it back also using
the latest supported API.
Update manifests
You can use kubectl convert command to convert manifests between different API versions. For
example:
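A hedged sketch; it assumes the separate kubectl convert plugin is installed and that a local pod.yaml currently uses an older API version:
kubectl convert -f pod.yaml --output-version v1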
The kubectl tool replaces the contents of pod.yaml with a manifest that sets kind to Pod
(unchanged), but with a revised apiVersion.
Device Plugins
If your cluster is running device plugins and the node needs to be upgraded to a Kubernetes
release with a newer device plugin API version, device plugins must be upgraded to support
both versions before the node is upgraded, in order to guarantee that device allocations continue
to complete successfully during the upgrade.
Refer to API compatibility and Kubelet Device Manager API Versions for more details.
• Killercoda
• Play with Kubernetes
You also need to create a sample Deployment to experiment with the different types of
cascading deletion. You will need to recreate the Deployment for each type.
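A sketch of creating that sample Deployment, assuming the standard nginx Deployment example published with the Kubernetes documentation:
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
kubectl get pods -l app=nginx -o yaml   # each Pod lists its owning ReplicaSet, as below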
apiVersion: v1
...
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nginx-deployment-6b474476c4
    uid: 4fdcd81c-bd5d-41f7-97af-3a3b759af9a7
...
Using kubectl
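A hedged sketch, assuming the sample Deployment above is named nginx-deployment:
kubectl delete deployment nginx-deployment --cascade=foreground
# in a second terminal, inspect the object while it is terminating:
kubectl get deployment nginx-deployment -o json
While the delete is in progress, the object carries the foregroundDeletion finalizer, similar to: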
"kind": "Deployment",
"apiVersion": "apps/v1",
"metadata": {
"name": "nginx-deployment",
"namespace": "default",
"uid": "d1ce1b02-cae8-4288-8a53-30e84d8fa505",
"resourceVersion": "1363097",
"creationTimestamp": "2021-07-08T20:24:37Z",
"deletionTimestamp": "2021-07-08T20:27:39Z",
"finalizers": [
"foregroundDeletion"
]
...
You can delete objects using background cascading deletion using kubectl or the Kubernetes
API.
Kubernetes uses background cascading deletion by default, and does so even if you run the
following commands without the --cascade flag or the propagationPolicy argument.
Using kubectl
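A hedged sketch, again assuming a Deployment named nginx-deployment (because background is the default, the --cascade flag can also be omitted):
kubectl delete deployment nginx-deployment --cascade=background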
"kind": "Status",
"apiVersion": "v1",
...
"status": "Success",
"details": {
"name": "nginx-deployment",
"group": "apps",
"kind": "deployments",
"uid": "cc9eefb9-2d49-4445-b1c1-d261c9396456"
}
Using kubectl
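A hedged sketch of orphaning the dependents, assuming the same Deployment name:
kubectl delete deployment nginx-deployment --cascade=orphan
While the delete is in progress, the object carries the orphan finalizer, similar to: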
"kind": "Deployment",
"apiVersion": "apps/v1",
"namespace": "default",
"uid": "6f577034-42a0-479d-be21-78018c466f1f",
"creationTimestamp": "2021-07-09T16:46:37Z",
"deletionTimestamp": "2021-07-09T16:47:08Z",
"deletionGracePeriodSeconds": 0,
"finalizers": [
"orphan"
],
...
You can check that the Pods managed by the Deployment are still running:
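For example (a sketch; the app=nginx label is taken from the sample Deployment assumed above):
kubectl get pods -l app=nginx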
What's next
• Learn about owners and dependents in Kubernetes.
• Learn about Kubernetes finalizers.
• Learn about garbage collection.
• Killercoda
• Play with Kubernetes
The version of Kubernetes that you need depends on which KMS API version you have selected.
Kubernetes recommends using KMS v2.
• If you selected KMS API v2, you should use Kubernetes v1.28 (if you are running a
different version of Kubernetes that also supports the v2 KMS API, switch to the
documentation for that version of Kubernetes).
• If you selected KMS API v1 to support clusters prior to version v1.27 or if you have a
legacy KMS plugin that only supports KMS v1, any supported Kubernetes version will
work. This API is deprecated as of Kubernetes v1.28. Kubernetes does not recommend the
use of this API.
KMS v1
KMS v2
• For version 1.25 and 1.26, enabling the feature via kube-apiserver feature gate is required.
Set --feature-gates=KMSv2=true to configure a KMS v2 provider. For environments where
all API servers are running version 1.28 or later, and you do not require the ability to
downgrade to Kubernetes v1.27, you can enable the KMSv2KDF feature gate (a beta
feature) for more robust data encryption key generation. The Kubernetes project
recommends enabling KMS v2 KDF if those preconditions are met.
Caution:
The KMS v2 API and implementation changed in incompatible ways in-between the alpha
release in v1.25 and the beta release in v1.27. Attempting to upgrade from old versions with the
alpha feature enabled will result in data loss.
Running mixed API server versions with some servers at v1.27, and others at v1.28 with the
KMSv2KDF feature gate enabled is not supported - and is likely to result in data loss.
The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The
data is encrypted using a data encryption key (DEK). The DEKs are encrypted with a key
encryption key (KEK) that is stored and managed in a remote KMS.
With KMS v2, there are two ways for the API server to generate a DEK. Kubernetes defaults to
generating a new DEK at API server startup, which is then reused for resource encryption.
However, if you use KMS v2 and enable the KMSv2KDF feature gate, then Kubernetes instead
generates a new DEK per encryption: the API server uses a key derivation function to generate
single use data encryption keys from a secret seed combined with some random data.
Whichever approach you configure, the DEK or seed is also rotated whenever the KEK is
rotated (see Understanding key_id and Key Rotation section below for more details).
The KMS provider uses gRPC to communicate with a specific KMS plugin over a UNIX domain
socket. The KMS plugin, which is implemented as a gRPC server and deployed on the same
host(s) as the Kubernetes control plane, is responsible for all communication with the remote
KMS.
Caution:
If you are running virtual machine (VM) based nodes that leverage VM state store with this
feature, using KMS v2 is insecure and an information security risk unless you also explicitly
enable the KMSv2KDF feature gate.
With KMS v2, the API server uses AES-GCM with a 12 byte nonce (8 byte atomic counter and 4
bytes random data) for encryption. The following issues could occur if the VM is saved and
restored:
1. The counter value may be lost or corrupted if the VM is saved in an inconsistent state or
restored improperly. This can lead to a situation where the same counter value is used
twice, resulting in the same nonce being used for two different messages.
2. If the VM is restored to a previous state, the counter value may be set back to its previous
value, resulting in the same nonce being used again.
Although both of these cases are partially mitigated by the 4 byte random nonce, this can
compromise the security of the encryption.
If you have enabled the KMSv2KDF feature gate and are using KMS v2 (not KMS v1), the API
server generates single use data encryption keys from a secret seed. This eliminates the need for
a counter based nonce while avoiding nonce collision concerns. It also removes any specific
concerns with using KMS v2 and VM state store.
KMS v1
• apiVersion: API Version for KMS provider. Leave this value empty or set it to v1.
• name: Display name of the KMS plugin. Cannot be changed once set.
• endpoint: Listen address of the gRPC server (KMS plugin). The endpoint is a UNIX
domain socket.
• cachesize: Number of data encryption keys (DEKs) to be cached in the clear. When
cached, DEKs can be used without another call to the KMS; whereas DEKs that are not
cached require a call to the KMS to unwrap.
• timeout: How long should kube-apiserver wait for kms-plugin to respond before
returning an error (default is 3 seconds).
KMS v2
Refer to your cloud provider for instructions on enabling the cloud provider-specific KMS
plugin.
You can develop a KMS plugin gRPC server using a stub file available for Go. For other
languages, you use a proto file to create a stub file that you can use to develop the gRPC server
code.
KMS v1
• Using Go: Use the functions and data structures in the stub file: api.pb.go to develop the
gRPC server code
• Using languages other than Go: Use the protoc compiler with the proto file: api.proto to
generate a stub file for the specific language
KMS v2
• Using Go: A high level library is provided to make the process easier. Low level
implementations can use the functions and data structures in the stub file: api.pb.go to
develop the gRPC server code
• Using languages other than Go: Use the protoc compiler with the proto file: api.proto to
generate a stub file for the specific language
Then use the functions and data structures in the stub file to develop the server code.
Notes
KMS v1
In response to procedure call Version, a compatible KMS plugin should return v1beta1 as
VersionResponse.version.
• message version: v1beta1
• All messages from KMS provider have the version field set to v1beta1.
The plugin is implemented as a gRPC server that listens at UNIX domain socket. The
plugin deployment should create a file on the file system to run the gRPC unix domain
socket connection. The API server (gRPC client) is configured with the KMS provider
(gRPC server) unix domain socket endpoint in order to communicate with it. An abstract
Linux socket may be used by starting the endpoint with /@, i.e. unix:///@foo. Care must
be taken when using this type of socket as they do not have concept of ACL (unlike
traditional file based sockets). However, they are subject to Linux networking namespace,
so will only be accessible to containers within the same pod unless host networking is
used.
KMS v2
In response to procedure call Status, a compatible KMS plugin should return v2beta1 as
StatusResponse.version, "ok" as StatusResponse.healthz and a key_id (remote KMS KEK
ID) as StatusResponse.key_id.
The API server polls the Status procedure call approximately every minute when
everything is healthy, and every 10 seconds when the plugin is not healthy. Plugins must
take care to optimize this call as it will be under constant load.
• Encryption
The EncryptRequest procedure call provides the plaintext and a UID for logging purposes.
The response must include the ciphertext, the key_id for the KEK used, and, optionally,
any metadata that the KMS plugin needs to aid in future DecryptRequest calls (via the
annotations field). The plugin must guarantee that any distinct plaintext results in a
distinct response (ciphertext, key_id, annotations).
If the plugin returns a non-empty annotations map, all map keys must be fully qualified
domain names such as example.com. An example use case of annotation is
{"kms.example.io/remote-kms-auditid":"<audit ID used by the remote KMS>"}
The API server does not perform the EncryptRequest procedure call at a high rate. Plugin
implementations should still aim to keep each request's latency at under 100 milliseconds.
• Decryption
The DecryptRequest procedure call provides the (ciphertext, key_id, annotations) from
EncryptRequest and a UID for logging purposes. As expected, it is the inverse of the
EncryptRequest call. Plugins must verify that the key_id is one that they understand -
they must not attempt to decrypt data unless they are sure that it was encrypted by them
at an earlier time.
The API server may perform thousands of DecryptRequest procedure calls on startup to
fill its watch cache. Thus plugin implementations must perform these calls as quickly as
possible, and should aim to keep each request's latency at under 10 milliseconds.
Understanding key_id and Key Rotation
• The key_id is the public, non-secret name of the remote KMS KEK that is currently in use.
It may be logged during regular operation of the API server, and thus must not contain
any private data. Plugin implementations are encouraged to use a hash to avoid leaking
any data. The KMS v2 metrics take care to hash this value before exposing it via the /
metrics endpoint.
The API server considers the key_id returned from the Status procedure call to be
authoritative. Thus, a change to this value signals to the API server that the remote KEK
has changed, and data encrypted with the old KEK should be marked stale when a no-op
write is performed (as described below). If an EncryptRequest procedure call returns a
key_id that is different from Status, the response is thrown away and the plugin is
considered unhealthy. Thus implementations must guarantee that the key_id returned
from Status will be the same as the one returned by EncryptRequest. Furthermore,
plugins must ensure that the key_id is stable and does not flip-flop between values (i.e.
during a remote KEK rotation).
Plugins must not re-use key_ids, even in situations where a previously used remote KEK
has been reinstated. For example, if a plugin was using key_id=A, switched to key_id=B,
and then went back to key_id=A - instead of reporting key_id=A the plugin should report
some derivative value such as key_id=A_001 or use a new value such as key_id=C.
Since the API server polls Status about every minute, key_id rotation is not immediate.
Furthermore, the API server will coast on the last valid state for about three minutes.
Thus if a user wants to take a passive approach to storage migration (i.e. by waiting), they
must schedule a migration to occur at 3 + N + M minutes after the remote KEK has been
rotated (N is how long it takes the plugin to observe the key_id change and M is the
desired buffer to allow config changes to be processed - a minimum M of five minutes is
recommended). Note that no API server restart is required to perform KEK rotation.
Caution: Because you don't control the number of writes performed with the DEK, the
Kubernetes project recommends rotating the KEK at least every 90 days.
The plugin is implemented as a gRPC server that listens at UNIX domain socket. The
plugin deployment should create a file on the file system to run the gRPC unix domain
socket connection. The API server (gRPC client) is configured with the KMS provider
(gRPC server) unix domain socket endpoint in order to communicate with it. An abstract
Linux socket may be used by starting the endpoint with /@, i.e. unix:///@foo. Care must
be taken when using this type of socket as they do not have concept of ACL (unlike
traditional file based sockets). However, they are subject to Linux networking namespace,
so will only be accessible to containers within the same pod unless host networking is
used.
The KMS plugin can communicate with the remote KMS using any protocol supported by the
KMS. All configuration data, including authentication credentials the KMS plugin uses to
communicate with the remote KMS, are stored and managed by the KMS plugin independently.
The KMS plugin can encode the ciphertext with additional metadata that may be required
before sending it to the KMS for decryption (KMS v2 makes this process easier by providing a
dedicated annotations field).
Ensure that the KMS plugin runs on the same host(s) as the Kubernetes API server(s).
1. Create a new EncryptionConfiguration file using the appropriate properties for the kms
provider to encrypt resources like Secrets and ConfigMaps. If you want to encrypt an
extension API that is defined in a CustomResourceDefinition, your cluster must be
running Kubernetes v1.26 or newer.
KMS v1
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - kms:
          name: myKmsPluginFoo
          endpoint: unix:///tmp/socketfile.sock
          cachesize: 100
          timeout: 3s
      - kms:
          name: myKmsPluginBar
          endpoint: unix:///tmp/socketfile.sock
          cachesize: 100
          timeout: 3s
KMS v2
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
      - configmaps
      - pandas.awesome.bears.example
    providers:
      - kms:
          apiVersion: v2
          name: myKmsPluginFoo
          endpoint: unix:///tmp/socketfile.sock
          timeout: 3s
      - kms:
          apiVersion: v2
          name: myKmsPluginBar
          endpoint: unix:///tmp/socketfile.sock
          timeout: 3s
The following table summarizes the health check endpoints for each KMS version:
Single Healthcheck means that the only health check endpoint is /healthz/kms-providers.
Individual Healthchecks means that each KMS plugin has an associated health check endpoint
based on its location in the encryption config: /healthz/kms-provider-0, /healthz/kms-
provider-1 etc.
These healthcheck endpoint paths are hard coded and generated/controlled by the server. The
indices for individual healthchecks correspond to the order in which the KMS encryption
config is processed.
At a high level, restarting an API server when a KMS plugin is unhealthy is unlikely to make
the situation better. It can make the situation significantly worse by throwing away the API
server's DEK cache. Thus the general recommendation is to ignore the API server KMS healthz
checks for liveness purposes, i.e. /livez?exclude=kms-providers.
Until the steps defined in Ensuring all secrets are encrypted are performed, the providers list
should end with the identity: {} provider to allow unencrypted data to be read. Once all
resources are encrypted, the identity provider should be removed to prevent the API server
from honoring unencrypted data.
For details about the EncryptionConfiguration format, please check the API server encryption
API reference.
Verifying that the data is encrypted
When encryption at rest is correctly configured, resources are encrypted on write. After
restarting your kube-apiserver, any newly created or updated Secret or other resource types
configured in EncryptionConfiguration should be encrypted when stored. To verify, you can
use the etcdctl command line program to retrieve the contents of your secret data.
2. Using the etcdctl command line, read that secret out of etcd:
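A hedged sketch, assuming a Secret named secret1 was created in the default namespace for this test (for example with kubectl create secret generic secret1 -n default --from-literal=mykey=mydata):
ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret1 [...] | hexdump -C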
where [...] contains the additional arguments for connecting to the etcd server.
3. Verify the stored secret is prefixed with k8s:enc:kms:v1: for KMS v1 or prefixed with
k8s:enc:kms:v2: for KMS v2, which indicates that the kms provider has encrypted the
resulting data.
4. Verify that the secret is correctly decrypted when retrieved via the API:
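For example (a sketch, using the same assumed secret1):
kubectl get secret secret1 -n default -o yaml
The output should contain mykey: bXlkYXRh, which is mydata encoded in base64.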
The following command reads all secrets and then updates them to apply server side
encryption. If an error occurs due to a conflicting write, retry the command. For larger clusters,
you may wish to subdivide the secrets by namespace or script an update.
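A sketch of such an update for all namespaces:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -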
1. Add the kms provider as the first entry in the configuration file as shown in the following
example.
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - kms:
          apiVersion: v2
          name: myKmsPlugin
          endpoint: unix:///tmp/socketfile.sock
      - aescbc:
          keys:
            - name: key1
              secret: <BASE 64 ENCODED SECRET>
3. Run the following command to force all secrets to be re-encrypted using the kms
provider.
1. Place the identity provider as the first entry in the configuration file:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - identity: {}
      - kms:
          apiVersion: v2
          name: myKmsPlugin
          endpoint: unix:///tmp/socketfile.sock
• Killercoda
• Play with Kubernetes
Your Kubernetes server must be at or later than version v1.9. To check the version, enter
kubectl version.
About CoreDNS
CoreDNS is a flexible, extensible DNS server that can serve as the Kubernetes cluster DNS. Like
Kubernetes, the CoreDNS project is hosted by the CNCF.
You can use CoreDNS instead of kube-dns in your cluster by replacing kube-dns in an existing
deployment, or by using tools like kubeadm that will deploy and upgrade the cluster for you.
Installing CoreDNS
For manual deployment or replacement of kube-dns, see the documentation at the CoreDNS
GitHub project.
Migrating to CoreDNS
Upgrading an existing cluster with kubeadm
In Kubernetes version 1.21, kubeadm removed its support for kube-dns as a DNS application.
For kubeadm v1.28, the only supported cluster DNS application is CoreDNS.
You can move to CoreDNS when you use kubeadm to upgrade a cluster that is using kube-dns.
In this case, kubeadm generates the CoreDNS configuration ("Corefile") based upon the kube-
dns ConfigMap, preserving configurations for stub domains, and upstream name server.
Upgrading CoreDNS
You can check the version of CoreDNS that kubeadm installs for each version of Kubernetes in
the page CoreDNS version in Kubernetes.
CoreDNS can be upgraded manually in case you want to only upgrade CoreDNS or use your
own custom image. There is a helpful guideline and walkthrough available to ensure a smooth
upgrade. Make sure the existing CoreDNS configuration ("Corefile") is retained when upgrading
your cluster.
If you are upgrading your cluster using the kubeadm tool, kubeadm can take care of retaining
the existing CoreDNS configuration automatically.
Tuning CoreDNS
When resource utilisation is a concern, it may be useful to tune the configuration of CoreDNS.
For more details, check out the documentation on scaling CoreDNS.
What's next
You can configure CoreDNS to support many more use cases than kube-dns does by modifying
the CoreDNS configuration ("Corefile"). For more information, see the documentation for the
kubernetes CoreDNS plugin, or read the Custom DNS Entries for Kubernetes. in the CoreDNS
blog.
• Killercoda
• Play with Kubernetes
Introduction
NodeLocal DNSCache improves Cluster DNS performance by running a DNS caching agent on
cluster nodes as a DaemonSet. In today's architecture, Pods in 'ClusterFirst' DNS mode reach
out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint
via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the
DNS caching agent running on the same node, thereby avoiding iptables DNAT rules and
connection tracking. The local caching agent will query kube-dns service for cache misses of
cluster hostnames ("cluster.local" suffix by default).
Motivation
• With the current DNS architecture, it is possible that Pods with the highest DNS QPS
have to reach out to a different node, if there is no local kube-dns/CoreDNS instance.
Having a local cache will help improve the latency in such scenarios.
• Skipping iptables DNAT and connection tracking will help reduce conntrack races and
avoid UDP DNS entries filling up conntrack table.
• Connections from the local caching agent to kube-dns service can be upgraded to TCP.
TCP conntrack entries will be removed on connection close in contrast with UDP entries
that have to timeout (default nf_conntrack_udp_timeout is 30 seconds)
• Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped
UDP packets and DNS timeouts usually up to 30s (3 retries + 10s timeout). Since the
nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
• Negative caching can be re-enabled, thereby reducing the number of queries for the kube-
dns service.
Architecture Diagram
This is the path followed by DNS Queries after NodeLocal DNSCache is enabled:
Configuration
Note: The local listen IP address for NodeLocal DNSCache can be any address that can be
guaranteed to not collide with any existing IP in your cluster. It's recommended to use an
address with a local scope, for example, from the 'link-local' range '169.254.0.0/16' for IPv4 or
from the 'Unique Local Address' range in IPv6 'fd00::/8'.
• If using IPv6, the CoreDNS configuration file needs to enclose all the IPv6 addresses into
square brackets if used in 'IP:Port' format. If you are using the sample manifest from the
previous point, this will require you to modify the configuration line L70 like this:
"health [__PILLAR__LOCAL__DNS__]:8080"
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/
__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kube
dns/g" nodelocaldns.yaml
sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/
__PILLAR__DNS__DOMAIN__/$domain/g; s/,__PILLAR__DNS__SERVER__//g; s/
__PILLAR__CLUSTER__DNS__/$kubedns/g" nodelocaldns.yaml
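After the substitution, apply the manifest (a sketch; it assumes you are working from the upstream nodelocaldns.yaml example manifest):
kubectl create -f nodelocaldns.yaml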
Once enabled, the node-local-dns Pods will run in the kube-system namespace on each of the
cluster nodes. This Pod runs CoreDNS in cache mode, so all CoreDNS metrics exposed by the
different plugins will be available on a per-node basis.
You can disable this feature by removing the DaemonSet, using kubectl delete -f <manifest>.
You should also revert any changes you made to the kubelet configuration.
The default cache size is 10000 entries, which uses about 30 MB when completely filled.
This would be the memory usage for each server block (if the cache gets completely filled).
Memory usage can be reduced by specifying smaller cache sizes.
The number of concurrent queries is linked to the memory demand, because each extra
goroutine used for handling a query requires an amount of memory. You can set an upper limit
using the max_concurrent option in the forward plugin.
If a node-local-dns Pod attempts to use more memory than is available (because of total system
resources, or because of a configured resource limit), the operating system may shut down that
pod's container. If this happens, the container that is terminated (“OOMKilled”) does not clean
up the custom packet filtering rules that it previously added during startup. The node-local-dns
container should get restarted (since managed as part of a DaemonSet), but this will lead to a
brief DNS downtime each time that the container fails: the packet filtering rules direct DNS
queries to a local Pod that is unhealthy.
You can determine a suitable memory limit by running node-local-dns pods without a limit and
measuring the peak usage. You can also set up and use a VerticalPodAutoscaler in recommender
mode, and then check its recommendations.
This document describes how to configure and use kernel parameters within a Kubernetes
cluster using the sysctl interface.
Note: Starting from Kubernetes version 1.23, the kubelet supports the use of either / or . as
separators for sysctl names. Starting from Kubernetes version 1.25, setting Sysctls for a Pod
supports setting sysctls with slashes. For example, you can represent the same sysctl name as
kernel.shm_rmid_forced using a period as the separator, or as kernel/shm_rmid_forced using a
slash as a separator. For more sysctl parameter conversion method details, please refer to the
page sysctl.d(5) from the Linux man-pages project.
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured
to communicate with your cluster. It is recommended to run this tutorial on a cluster with at
least two nodes that are not acting as control plane hosts. If you do not already have a cluster,
you can create one by using minikube or you can use one of these Kubernetes playgrounds:
• Killercoda
• Play with Kubernetes
For some steps, you also need to be able to reconfigure the command line options for the
kubelets running on your cluster.
Listing all Sysctl Parameters
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime.
Parameters are available via the /proc/sys/ virtual process file system. The parameters cover
various subsystems such as the kernel (common prefix: kernel.), networking (common prefix: net.),
virtual memory (common prefix: vm.), and MDADM (common prefix: dev.). To list all parameters, you can run:
sudo sysctl -a
In addition to being properly namespaced, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod:
• must not have any influence on any other pod on the node
• must not allow harming the node's health
• must not allow gaining CPU or memory resources outside of the resource limits of a pod.
By far, most of the namespaced sysctls are not necessarily considered safe. The following sysctls
are supported in the safe set:
• kernel.shm_rmid_forced,
• net.ipv4.ip_local_port_range,
• net.ipv4.tcp_syncookies,
• net.ipv4.ping_group_range (since Kubernetes 1.18),
• net.ipv4.ip_unprivileged_port_start (since Kubernetes 1.22).
Note:
• The net.* sysctls are not allowed with host networking enabled.
• The net.ipv4.tcp_syncookies sysctl is not namespaced on Linux kernel version 4.4 or
lower.
This list will be extended in future Kubernetes versions when the kubelet supports better
isolation mechanisms.
All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on
a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.
With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very
special situations such as high-performance or real-time application tuning. Unsafe sysctls are
enabled on a node-by-node basis with a flag of the kubelet; for example:
kubelet --allowed-unsafe-sysctls \
'kernel.msg*,net.core.somaxconn' ...
The following sysctls are known to be namespaced. This list could change in future versions of
the Linux kernel.
• kernel.shm*,
• kernel.msg*,
• kernel.sem,
• fs.mqueue.*,
• Those net.* that can be set in container networking namespace. However, there are
exceptions (e.g., net.netfilter.nf_conntrack_max and
net.netfilter.nf_conntrack_expect_max can be set in container networking namespace but
are unnamespaced before Linux 5.12.2).
Sysctls with no namespace are called node-level sysctls. If you need to set them, you must
configure them manually on each node's operating system, or use a DaemonSet with
privileged containers to do so.
Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all
containers in the same pod.
This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two
unsafe sysctls net.core.somaxconn and kernel.msgmax. There is no distinction between safe and
unsafe sysctls in the specification.
Warning: Only modify sysctl parameters after you understand their effects, to avoid
destabilizing your operating system.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"
  ...
Warning: Due to their nature of being unsafe, the use of unsafe sysctls is at-your-own-risk and
can lead to severe problems like wrong behavior of containers, resource shortage or complete
breakage of a node.
It is good practice to consider nodes with special sysctl settings as tainted within a cluster, and
only schedule pods onto them which need those sysctl settings. It is suggested to use the
Kubernetes taints and toleration feature to implement this.
A pod with the unsafe sysctls will fail to launch on any node which has not enabled those two
unsafe sysctls explicitly. As with node-level sysctls, it is recommended to use the taints and
tolerations feature, or taints on nodes, to schedule those pods onto the right nodes.
The Kubernetes Memory Manager enables the feature of guaranteed memory (and hugepages)
allocation for pods in the Guaranteed QoS class.
The Memory Manager employs hint generation protocol to yield the most suitable NUMA
affinity for a pod. The Memory Manager feeds the central manager (Topology Manager) with
these affinity hints. Based on both the hints and Topology Manager policy, the pod is rejected
or admitted to the node.
Moreover, the Memory Manager ensures that the memory which a pod requests is allocated
from a minimum number of NUMA nodes.
• Killercoda
• Play with Kubernetes
Your Kubernetes server must be at or later than version v1.21. To check the version, enter
kubectl version.
To align memory resources with other requested resources in a Pod spec:
• the CPU Manager should be enabled and proper CPU Manager policy should be
configured on a Node. See control CPU Management Policies;
• the Topology Manager should be enabled and proper Topology Manager policy should be
configured on a Node. See control Topology Management Policies.
Starting from v1.22, the Memory Manager is enabled by default through MemoryManager
feature gate.
Preceding v1.22, the kubelet must be started with the following flag:
--feature-gates=MemoryManager=true
The Memory Manager is a Hint Provider, and it provides topology hints for the Topology
Manager which then aligns the requested resources according to these topology hints. It also
enforces cgroups (i.e. cpuset.mems) for pods. The complete flow diagram concerning pod
admission and deployment process is illustrated in Memory Manager KEP: Design Overview
and below:
During this process, the Memory Manager updates its internal counters stored in Node Map
and Memory Maps to manage guaranteed memory allocation.
The Memory Manager updates the Node Map during the startup and runtime as follows.
Startup
This occurs once a node administrator employs --reserved-memory (section Reserved memory
flag). In this case, the Node Map becomes updated to reflect this reservation as illustrated in
Memory Manager KEP: Memory Maps at start-up (with examples).
The administrator must provide --reserved-memory flag when Static policy is configured.
Runtime
Reference Memory Manager KEP: Memory Maps at runtime (with examples) illustrates how a
successful pod deployment affects the Node Map, and it also relates to how potential Out-of-
Memory (OOM) situations are handled further by Kubernetes or operating system.
An important topic in the context of Memory Manager operation is the management of NUMA
groups. Each time a pod's memory request exceeds the capacity of a single NUMA node, the
Memory Manager attempts to create a group that comprises several NUMA nodes and offers
extended memory capacity. The problem is solved as elaborated in Memory Manager KEP:
How to enable the guaranteed memory allocation over many NUMA nodes?. Also,
Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)
illustrates how the management of groups occurs.
Policies
Memory Manager supports two policies. You can select a policy via a kubelet flag --memory-
manager-policy:
• None (default)
• Static
None policy
This is the default policy and does not affect the memory allocation in any way. It acts the same
as if the Memory Manager is not present at all.
The None policy returns default topology hint. This special hint denotes that Hint Provider
(Memory Manager in this case) has no preference for NUMA affinity with any resource.
Static policy
In the case of the Guaranteed pod, the Static Memory Manager policy returns topology hints
relating to the set of NUMA nodes where the memory can be guaranteed, and reserves the
memory through updating the internal NodeMap object.
In the case of the BestEffort or Burstable pod, the Static Memory Manager policy sends back the
default topology hint as there is no request for the guaranteed memory, and does not reserve
the memory in the internal NodeMap object.
The Node Allocatable mechanism is commonly used by node administrators to reserve K8S
node system resources for the kubelet or operating system processes in order to enhance the
node stability. A dedicated set of flags can be used for this purpose to set the total amount of
reserved memory for a node. This pre-configured value is subsequently utilized to calculate the
real amount of node's "allocatable" memory available to pods.
The Kubernetes scheduler incorporates "allocatable" to optimise pod scheduling process. The
foregoing flags include --kube-reserved, --system-reserved and --eviction-threshold. The sum of
their values will account for the total amount of reserved memory.
A new --reserved-memory flag was added to Memory Manager to allow for this total reserved
memory to be split (by a node administrator) and accordingly reserved across many NUMA
nodes.
The flag specifies a comma-separated list of memory reservations of different memory types per
NUMA node. Memory reservations across multiple NUMA nodes can be specified using
semicolon as separator. This parameter is only useful in the context of the Memory Manager
feature. The Memory Manager will not use this reserved memory for the allocation of container
workloads.
For example, if you have a NUMA node "NUMA0" with 10Gi of memory available, and the --
reserved-memory was specified to reserve 1Gi of memory at "NUMA0", the Memory Manager
assumes that only 9Gi is available for containers.
You can omit this parameter, however, you should be aware that the quantity of reserved
memory from all NUMA nodes should be equal to the quantity of memory specified by the
Node Allocatable feature. If at least one node allocatable parameter is non-zero, you will need
to specify --reserved-memory for at least one NUMA node. In fact, eviction-hard threshold
value is equal to 100Mi by default, so if Static policy is used, --reserved-memory is obligatory.
Also, avoid the following configurations:
1. duplicates, i.e. the same NUMA node or memory type, but with a different value;
2. setting zero limit for any of memory types;
3. NUMA node IDs that do not exist in the machine hardware;
4. memory type names different than memory or hugepages-<size> (hugepages of particular
<size> should also exist).
Syntax:
--reserved-memory N:memory-type1=value1,memory-type2=value2,...
Example usage:
--reserved-memory 0:memory=1Gi,hugepages-1Gi=2Gi
or
--reserved-memory '0:memory=1Gi;1:memory=2Gi'
When you specify values for the --reserved-memory flag, you must comply with the settings that
you previously provided via the Node Allocatable feature flags. That is, the following rule must be
obeyed for each memory type: the sum of the --reserved-memory values across all NUMA nodes
must equal kube-reserved + system-reserved + eviction-threshold for that memory type.
If you do not follow this rule, the Memory Manager will show an error on startup.
In other words, the example above illustrates that for the conventional memory
(type=memory), we reserve 3Gi in total, i.e.:
• --kube-reserved=cpu=500m,memory=50Mi
• --system-reserved=cpu=123m,memory=333Mi
• --eviction-hard=memory.available<500Mi
Note: The default hard eviction threshold is 100MiB, and not zero. Remember to increase the
quantity of memory that you reserve by setting --reserved-memory by that hard eviction
threshold. Otherwise, the kubelet will not start Memory Manager and display an error.
Here is an example of a correct configuration:
--feature-gates=MemoryManager=true
--kube-reserved=cpu=4,memory=4Gi
--system-reserved=cpu=1,memory=1Gi
--memory-manager-policy=Static
--reserved-memory '0:memory=3Gi;1:memory=2148Mi'
The following excerpts from pod manifests assign a pod to the Guaranteed QoS class.
Pod with integer CPU(s) runs in the Guaranteed QoS class, when requests are equal to limits:
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
Also, a pod sharing CPU(s) runs in the Guaranteed QoS class, when requests are equal to limits.
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
Notice that both CPU and memory requests must be specified (and equal to the corresponding
limits) for a Pod to be placed in the Guaranteed QoS class.
Troubleshooting
The following means can be used to troubleshoot the reason why a pod could not be deployed
or became rejected at a node:
• a node has not enough resources available to satisfy the pod's request
• the pod's request is rejected due to particular Topology Manager policy constraints
Use kubectl describe pod <id> or kubectl get events to obtain detailed error message:
Warning TopologyAffinityError 10m kubelet, dell8 Resources cannot be allocated with
Topology locality
System logs
The set of hints that Memory Manager generated for the pod can be found in the logs. Also, the
set of hints generated by CPU Manager should be present in the logs.
Topology Manager merges these hints to calculate a single best hint. The best hint should be
also present in the logs.
The best hint indicates where to allocate all the resources. Topology Manager tests this hint
against its current policy, and based on the verdict, it either admits the pod to the node or
rejects it.
Also, search the logs for occurrences associated with the Memory Manager, e.g. to find out
information about cgroups and cpuset.mems updates.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed
spec:
  containers:
  - name: guaranteed
    image: consumer
    imagePullPolicy: Never
    resources:
      limits:
        cpu: "2"
        memory: 150Gi
      requests:
        cpu: "2"
        memory: 150Gi
    command: ["sleep", "infinity"]
Next, let us log into the node where it was deployed and examine the state file in /var/lib/
kubelet/memory_manager_state:
{
"policyName":"Static",
"machineState":{
"0":{
"numberOfAssignments":1,
"memoryMap":{
"hugepages-1Gi":{
"total":0,
"systemReserved":0,
"allocatable":0,
"reserved":0,
"free":0
},
"memory":{
"total":134987354112,
"systemReserved":3221225472,
"allocatable":131766128640,
"reserved":131766128640,
"free":0
}
},
"nodes":[
0,
1
]
},
"1":{
"numberOfAssignments":1,
"memoryMap":{
"hugepages-1Gi":{
"total":0,
"systemReserved":0,
"allocatable":0,
"reserved":0,
"free":0
},
"memory":{
"total":135286722560,
"systemReserved":2252341248,
"allocatable":133034381312,
"reserved":29295144960,
"free":103739236352
}
},
"nodes":[
0,
1
]
}
},
"entries":{
"fa9bdd38-6df9-4cf9-aa67-8c4814da37a8":{
"guaranteed":[
{
"numaAffinity":[
0,
1
],
"type":"memory",
"size":161061273600
}
]
}
},
"checksum":4142013182
}
It can be deduced from the state file that the pod was pinned to both NUMA nodes, i.e.:
"numaAffinity":[
0,
1
],
The term "pinned" means that the pod's memory consumption is constrained (through the
cgroups configuration) to these NUMA nodes.
This automatically implies that Memory Manager instantiated a new group that comprises
these two NUMA nodes, i.e. 0 and 1 indexed NUMA nodes.
Notice that the management of groups is handled in a relatively complex manner; further
elaboration is provided in the relevant sections of the Memory Manager KEP.
For example, the total amount of free "conventional" memory in the group can be computed by
adding up the free memory available at every NUMA node in the group, i.e., in the "memory"
section of NUMA node 0 ("free":0) and NUMA node 1 ("free":103739236352). So, the total
amount of free "conventional" memory in this group is equal to 0 + 103739236352 bytes.
The line "systemReserved":3221225472 indicates that the administrator of this node reserved
3221225472 bytes (i.e. 3Gi) to serve kubelet and system processes at NUMA node 0, by using --
reserved-memory flag.
The kubelet provides a PodResourceLister gRPC service to enable discovery of resources and associated metadata. By using its List gRPC endpoint, information about reserved memory for each container can be retrieved; it is contained in the protobuf ContainerMemory message. This information can only be retrieved for pods in the Guaranteed QoS class.
What's next
• Memory Manager KEP: Design Overview
• Memory Manager KEP: Memory Maps at start-up (with examples)
• Memory Manager KEP: Memory Maps at runtime (with examples)
• Memory Manager KEP: Simulation - how the Memory Manager works? (by examples)
• Memory Manager KEP: The Concept of Node Map and Memory Maps
• Memory Manager KEP: How to enable the guaranteed memory allocation over many
NUMA nodes?
Verify Signed Kubernetes Artifacts
FEATURE STATE: Kubernetes v1.26 [beta]
URL=https://dl.k8s.io/release/v1.28.4/bin/linux/amd64
BINARY=kubectl
FILES=(
"$BINARY"
"$BINARY.sig"
"$BINARY.cert"
)
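A hedged sketch of downloading and verifying these files with cosign (the certificate identity and OIDC issuer shown are assumptions; consult the Kubernetes release documentation for the authoritative values):
for FILE in "${FILES[@]}"; do
  curl -fsSLO "$URL/$FILE"
done
cosign verify-blob "$BINARY" \
  --signature "$BINARY".sig \
  --certificate "$BINARY".cert \
  --certificate-identity krel-staging@k8s-releng-prod.iam.gserviceaccount.com \
  --certificate-oidc-issuer https://accounts.google.com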
Note:
Pick one image from this list and verify its signature using the cosign verify command:
To verify all signed control plane images for the latest stable version (v1.28.4), please run the
following commands:
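A hedged sketch of such a verification loop (the image list, certificate identity, and OIDC issuer here are assumptions; check the release documentation for the authoritative values):
for image in kube-apiserver kube-controller-manager kube-proxy kube-scheduler; do
  cosign verify "registry.k8s.io/${image}:v1.28.4" \
    --certificate-identity krel-trust@k8s-releng-prod.iam.gserviceaccount.com \
    --certificate-oidc-issuer https://accounts.google.com
done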
Once you have verified an image, you can specify the image by its digest in your Pod manifests
as per this example:
registry-url/image-name@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
For more information, please refer to the Image Pull Policy section.
• Installation
• Configuration Options
• Killercoda
• Play with Kubernetes
To check the version, enter kubectl version.
Each node in your cluster must have at least 300 MiB of memory.
A few of the steps on this page require you to run the metrics-server service in your cluster. If
you have the metrics-server running, you can skip those steps.
If you are running Minikube, run the following command to enable the metrics-server:
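For example, with the minikube addons command:
minikube addons enable metrics-server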
To see whether the metrics-server is running, or another provider of the resource metrics API
(metrics.k8s.io), run the following command:
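For example:
kubectl get apiservices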
If the resource metrics API is available, the output includes a reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a namespace so that the resources you create in this exercise are isolated from the rest of
your cluster.
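For example (mem-example is the namespace used by the manifests on this page):
kubectl create namespace mem-example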
In this exercise, you create a Pod that has one Container. The Container has a memory request
of 100 MiB and a memory limit of 200 MiB. Here's the configuration file for the Pod:
pods/resource/memory-request-limit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "200Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
The args section in the configuration file provides arguments for the Container when it starts.
The "--vm-bytes", "150M" arguments tell the Container to attempt to allocate 150 MiB of
memory.
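Create the Pod in the mem-example namespace and then inspect it, applying the example manifest from k8s.io/examples at the path shown above:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit.yaml --namespace=mem-example
kubectl get pod memory-demo --namespace=mem-example
kubectl get pod memory-demo --output=yaml --namespace=mem-example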
The output shows that the one Container in the Pod has a memory request of 100 MiB and a
memory limit of 200 MiB.
...
resources:
  requests:
    memory: 100Mi
  limits:
    memory: 200Mi
...
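Fetch the Pod's metrics with kubectl top (this relies on the metrics-server from the prerequisites):
kubectl top pod memory-demo --namespace=mem-example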
The output shows that the Pod is using about 162,900,000 bytes of memory, which is about 150
MiB. This is greater than the Pod's 100 MiB request, but within the Pod's 200 MiB limit.
pods/resource/memory-request-limit-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-2
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-2-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
In the args section of the configuration file, you can see that the Container will attempt to
allocate 250 MiB of memory, which is well above the 100 MiB limit.
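Create the Pod and check on it, following the same pattern as before:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-2.yaml --namespace=mem-example
kubectl get pod memory-demo-2 --namespace=mem-example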
At this point, the Container might be running or killed. Repeat the preceding command until
the Container is killed:
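For a more detailed view of the Container status, request the Pod as YAML:
kubectl get pod memory-demo-2 --output=yaml --namespace=mem-example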
The output shows that the Container was killed because it is out of memory (OOM):
lastState:
  terminated:
    containerID: 65183c1877aaec2e8427bc95609cc52677a454b56fcb24340dbd22917c23b10f
    exitCode: 137
    finishedAt: 2017-06-20T20:52:19Z
    reason: OOMKilled
    startedAt: null
The Container in this exercise can be restarted, so the kubelet restarts it. Repeat this command
several times to see that the Container is repeatedly killed and restarted:
The output shows that the Container is killed, restarted, killed again, restarted again, and so on:
The output shows that the Container starts and fails repeatedly:
The output includes a record of the Container being killed because of an out-of-memory
condition:
Warning OOMKilling Memory cgroup out of memory: Kill process 4481 (stress) score 1994 or
sacrifice child
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node has
enough available memory to satisfy the Pod's memory request.
In this exercise, you create a Pod that has a memory request so big that it exceeds the capacity
of any Node in your cluster. Here is the configuration file for a Pod that has one Container with
a request for 1000 GiB of memory, which likely exceeds the capacity of any Node in your
cluster.
pods/resource/memory-request-limit-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-3
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-3-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "1000Gi"
      limits:
        memory: "1000Gi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
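Create the Pod and check its status:
kubectl apply -f https://k8s.io/examples/pods/resource/memory-request-limit-3.yaml --namespace=mem-example
kubectl get pod memory-demo-3 --namespace=mem-example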
The output shows that the Pod status is PENDING. That is, the Pod is not scheduled to run on
any Node, and it will remain in the PENDING state indefinitely:
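To see why, view detailed information about the Pod, including events:
kubectl describe pod memory-demo-3 --namespace=mem-example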
The output shows that the Container cannot be scheduled because of insufficient memory on
the Nodes:
Events:
... Reason Message
------ -------
... FailedScheduling No nodes are available that match all of the following predicates::
Insufficient memory (3).
Memory units
The memory resource is measured in bytes. You can express memory as a plain integer or a
fixed-point integer with one of these suffixes: E, P, T, G, M, K, Ei, Pi, Ti, Gi, Mi, Ki. For example,
the following represent approximately the same value:
128974848, 129e6, 129M, 123Mi
• The Container has no upper bound on the amount of memory it uses. The Container
could use all of the memory available on the Node where it is running which in turn
could invoke the OOM Killer. Further, in case of an OOM Kill, a container with no
resource limits will have a greater chance of being killed.
• The Container is running in a namespace that has a default memory limit, and the
Container is automatically assigned the default limit. Cluster administrators can use a
LimitRange to specify a default value for the memory limit.
• The Pod can have bursts of activity where it makes use of memory that happens to be
available.
• The amount of memory a Pod can use during a burst is limited to some reasonable
amount.
Clean up
Delete your namespace. This deletes all the Pods that you created for this task:
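For example:
kubectl delete namespace mem-example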
What's next
For app developers
• Killercoda
• Play with Kubernetes
Your cluster must have at least 1 CPU available for use to run the task examples.
A few of the steps on this page require you to run the metrics-server service in your cluster. If
you have the metrics-server running, you can skip those steps.
If you are running Minikube, run the following command to enable metrics-server:
To see whether metrics-server (or another provider of the resource metrics API, metrics.k8s.io)
is running, type the following command:
If the resource metrics API is available, the output will include a reference to metrics.k8s.io.
NAME
v1beta1.metrics.k8s.io
Create a namespace
Create a Namespace so that the resources you create in this exercise are isolated from the rest
of your cluster.
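For example (cpu-example is the namespace used by the manifests on this page):
kubectl create namespace cpu-example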
In this exercise, you create a Pod that has one container. The container has a request of 0.5 CPU
and a limit of 1 CPU. Here is the configuration file for the Pod:
pods/resource/cpu-request-limit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"
The args section of the configuration file provides arguments for the container when it starts.
The -cpus "2" argument tells the Container to attempt to use 2 CPUs.
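Create the Pod in the cpu-example namespace and then inspect it, applying the example manifest from k8s.io/examples at the path shown above:
kubectl apply -f https://k8s.io/examples/pods/resource/cpu-request-limit.yaml --namespace=cpu-example
kubectl get pod cpu-demo --namespace=cpu-example
kubectl get pod cpu-demo --output=yaml --namespace=cpu-example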
The output shows that the one container in the Pod has a CPU request of 500 milliCPU and a
CPU limit of 1 CPU.
resources:
  limits:
    cpu: "1"
  requests:
    cpu: 500m
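Fetch the Pod's metrics:
kubectl top pod cpu-demo --namespace=cpu-example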
This example output shows that the Pod is using 974 milliCPU, which is slightly less than the
limit of 1 CPU specified in the Pod configuration.
Recall that by setting -cpus "2", you configured the Container to attempt to use 2 CPUs, but the
Container is only being allowed to use about 1 CPU. The container's CPU use is being throttled,
because the container is attempting to use more CPU resources than its limit.
Note: Another possible explanation for the CPU use being below 1.0 is that the Node might not
have enough CPU resources available. Recall that the prerequisites for this exercise require
your cluster to have at least 1 CPU available for use. If your Container runs on a Node that has
only 1 CPU, the Container cannot use more than 1 CPU regardless of the CPU limit specified
for the Container.
CPU units
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
• 1 AWS vCPU
• 1 GCP Core
• 1 Azure vCore
• 1 Hyperthread on a bare-metal Intel processor with Hyperthreading
Fractional values are allowed. A Container that requests 0.5 CPU is guaranteed half as much
CPU as a Container that requests 1 CPU. You can use the suffix m to mean milli. For example
100m CPU, 100 milliCPU, and 0.1 CPU are all the same. Precision finer than 1m is not allowed.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same
amount of CPU on a single-core, dual-core, or 48-core machine.
In this exercise, you create a Pod that has a CPU request so big that it exceeds the capacity of
any Node in your cluster. Here is the configuration file for a Pod that has one Container. The
Container requests 100 CPU, which is likely to exceed the capacity of any Node in your cluster.
pods/resource/cpu-request-limit-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo-2
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr-2
    image: vish/stress
    resources:
      limits:
        cpu: "100"
      requests:
        cpu: "100"
    args:
    - -cpus
    - "2"
The output shows that the Pod status is Pending. That is, the Pod has not been scheduled to run
on any Node, and it will remain in the Pending state indefinitely:
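To see why, view detailed information about the Pod, including events:
kubectl describe pod cpu-demo-2 --namespace=cpu-example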
The output shows that the Container cannot be scheduled because of insufficient CPU resources
on the Nodes:
Events:
Reason Message
------ -------
FailedScheduling No nodes are available that match all of the following predicates::
Insufficient cpu (3).
Delete your Pod:
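For example:
kubectl delete pod cpu-demo-2 --namespace=cpu-example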
• The Container has no upper bound on the CPU resources it can use. The Container could
use all of the CPU resources available on the Node where it is running.
• The Container is running in a namespace that has a default CPU limit, and the Container
is automatically assigned the default limit. Cluster administrators can use a LimitRange to
specify a default value for the CPU limit.
• The Pod can have bursts of activity where it makes use of CPU resources that happen to
be available.
• The amount of CPU resources a Pod can use during a burst is limited to some reasonable
amount.
Clean up
Delete your namespace:
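For example:
kubectl delete namespace cpu-example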
What's next
For app developers
This page shows how to configure Group Managed Service Accounts (GMSA) for Pods and
containers that will run on Windows nodes. Group Managed Service Accounts are a specific
type of Active Directory account that provides automatic password management, simplified
service principal name (SPN) management, and the ability to delegate the management to other
administrators across multiple servers.
Two webhooks need to be configured on the Kubernetes cluster to populate and validate GMSA
credential spec references at the Pod or container level:
1. A mutating webhook that expands references to GMSAs (by name from a Pod
specification) into the full credential spec in JSON form within the Pod spec.
2. A validating webhook that ensures all references to GMSAs are authorized to be used by the Pod's service account.
Installing the above webhooks and associated objects requires the steps below:
1. Create a certificate key pair (that will be used to allow the webhook container to
communicate to the cluster)
4. Create the validating and mutating webhook configurations referring to the deployment.
A script can be used to deploy and configure the GMSA webhooks and associated objects
mentioned above. The script can be run with a --dry-run=server option to allow you to review
the changes that would be made to your cluster.
The YAML template used by the script may also be used to deploy the webhooks and associated objects manually (with appropriate substitutions for the parameters).
The following are the steps for generating a GMSA credential spec manually, first in JSON format and then converting it to YAML:
4. Convert the credspec file from JSON to YAML format and apply the necessary header
fields apiVersion, kind, metadata and credspec to make it a GMSACredentialSpec custom
resource that can be configured in Kubernetes.
The following YAML configuration describes a GMSA credential spec named gmsa-WebApp1:
apiVersion: windows.k8s.io/v1
kind: GMSACredentialSpec
metadata:
  name: gmsa-WebApp1   # This is an arbitrary name but it will be used as a reference
credspec:
  ActiveDirectoryConfig:
    GroupManagedServiceAccounts:
    - Name: WebApp1    # Username of the GMSA account
      Scope: CONTOSO   # NETBIOS Domain Name
    - Name: WebApp1    # Username of the GMSA account
      Scope: contoso.com # DNS Domain Name
  CmsPlugins:
  - ActiveDirectory
  DomainJoinConfig:
    DnsName: contoso.com # DNS Domain Name
    DnsTreeName: contoso.com # DNS Domain Name Root
    Guid: 244818ae-87ac-4fcd-92ec-e79e5252348a # GUID
    MachineAccountName: WebApp1 # Username of the GMSA account
    NetBiosName: CONTOSO # NETBIOS Domain Name
    Sid: S-1-5-21-2126449477-2524075714-3094792973 # SID of GMSA
The above credential spec resource may be saved as gmsa-Webapp1-credspec.yaml and applied to the cluster using: kubectl apply -f gmsa-Webapp1-credspec.yaml
To authorize specific service accounts to use this credential spec, bind them to a ClusterRole (named webapp1-role here) that grants the use verb on it. For example, the following RoleBinding grants that access to the default service account in the default namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-default-svc-account-read-on-gmsa-WebApp1
  namespace: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: ClusterRole
  name: webapp1-role
  apiGroup: rbac.authorization.k8s.io
To use a credential spec for all containers in a Pod, set the Pod-level securityContext.windowsOptions.gmsaCredentialSpecName field, as in the following Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: with-creds
  name: with-creds
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: with-creds
  template:
    metadata:
      labels:
        run: with-creds
    spec:
      securityContext:
        windowsOptions:
          gmsaCredentialSpecName: gmsa-webapp1
      containers:
      - image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        imagePullPolicy: Always
        name: iis
      nodeSelector:
        kubernetes.io/os: windows
Individual containers in a Pod spec can also specify the desired GMSA credspec using a per-
container securityContext.windowsOptions.gmsaCredentialSpecName field. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    run: with-creds
  name: with-creds
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      run: with-creds
  template:
    metadata:
      labels:
        run: with-creds
    spec:
      containers:
      - image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
        imagePullPolicy: Always
        name: iis
        securityContext:
          windowsOptions:
            gmsaCredentialSpecName: gmsa-Webapp1
      nodeSelector:
        kubernetes.io/os: windows
As Pod specs with GMSA fields populated (as described above) are applied in a cluster, the following sequence of events takes place:
1. The mutating webhook resolves and expands all references to GMSA credential spec
resources to the contents of the GMSA credential spec.
2. The validating webhook ensures the service account associated with the Pod is
authorized for the use verb on the specified GMSA credential spec.
3. The container runtime configures each Windows container with the specified GMSA
credential spec so that the container can assume the identity of the GMSA in Active
Directory and access services in the domain using that identity.
Authenticating to network shares using hostname or
FQDN
If you are experiencing issues connecting to SMB shares from Pods using hostname or FQDN,
but are able to access the shares via their IPv4 address then make sure the following registry
key is set on the Windows nodes.
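A sketch of the key in question, run from an elevated prompt on each Windows node (the exact key and value here are an assumption; confirm them against the reference linked below):
reg add "HKLM\SYSTEM\CurrentControlSet\Services\hns\State" /v EnableCompartmentNamespace /t REG_DWORD /d 1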
Running Pods will then need to be recreated to pick up the behavior changes. More information on how this registry key is used can be found here.
Troubleshooting
If you are having difficulties getting GMSA to work in your environment, there are a few
troubleshooting steps you can take.
First, make sure the credspec has been passed to the Pod. To do this you will need to exec into
one of your Pods and check the output of the nltest.exe /parentdomain command.
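For example (the Pod name is a placeholder for one of the Pods created above):
kubectl exec -it <pod-name> -- nltest.exe /parentdomain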
In the example below the Pod did not get the credspec correctly:
If your Pod did get the credspec correctly, then next check communication with the domain.
First, from inside of your Pod, quickly do an nslookup to find the root of your domain.
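For example, from inside the Pod (substitute your own domain name):
nslookup domain.example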
If the DNS and communication test passes, next you will need to check if the Pod has
established secure channel communication with the domain. To do this, again, exec into your
Pod and run the nltest.exe /query command.
nltest.exe /query
This tells us that, for some reason, the Pod was unable to log on to the domain using the account specified in the credspec. You can try to repair the secure channel by running the following:
nltest /sc_reset:domain.example
If the command is successful you will see output similar to this:
If the above corrects the error, you can automate the step by adding the following lifecycle
hook to your Pod spec. If it did not correct the error, you will need to examine your credspec
again and confirm that it is correct and complete.
image: registry.domain.example/iis-auth:1809v1
lifecycle:
  postStart:
    exec:
      command: ["powershell.exe","-command","do { Restart-Service -Name netlogon } while ( $($Result = (nltest.exe /query); if ($Result -like '*0x0 NERR_Success*') {return $true} else {return $false}) -eq $false)"]
imagePullPolicy: IfNotPresent
If you add the lifecycle section shown above to your Pod spec, the Pod will execute the commands listed to restart the netlogon service until the nltest.exe /query command exits without error.
This page assumes that you are familiar with Quality of Service for Kubernetes Pods.
This page shows how to resize CPU and memory resources assigned to containers of a running
pod without restarting the pod or its containers. A Kubernetes node allocates resources for a
pod based on its requests, and restricts the pod's resource usage based on the limits specified in
the pod's containers.
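As a hedged sketch of what a resize request looks like (the namespace, Pod, and container names below are placeholders, and the InPlacePodVerticalScaling feature gate must be enabled on your cluster), you can patch a running Pod's container resources directly:
kubectl -n <namespace> patch pod <pod-name> --patch \
  '{"spec":{"containers":[{"name":"<container-name>", "resources":{"requests":{"cpu":"800m"}, "limits":{"cpu":"800m"}}}]}}'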
• Container's resource requests and limits are mutable for CPU and memory resources.
• allocatedResources field in containerStatuses of the Pod's status reflects the resources
allocated to the pod's containers.
• resources field in containerStatuses of the Pod's status reflects the actual resource
requests and limits that are configured on the running containers as reported by the
container runtime.
• resize field in the Pod's status shows the status of the last requested pending resize. It can
have the following values:
◦ Proposed: This value indicates an acknowledgement of the requested resize and
that the request was validated and recorded.
◦ InProgress: This value indicates that the node has accepted the resize request and is
in the process of applying it to the pod's containers.
◦ Deferred: This value means that the requested resize cannot be granted at this time,