Kubernetes Concepts
The Concepts section helps you learn about the parts of the Kubernetes system and the
abstractions Kubernetes uses to represent your cluster, and helps you obtain a deeper
understanding of how Kubernetes works.
Overview
Cluster Architecture
Containers
Workloads
Understand Pods, the smallest deployable compute object in Kubernetes, and the higher-level
abstractions that help you to run them.
Storage
Ways to provide both long-term and temporary storage to Pods in your cluster.
Configuration
Security
Policies
Scheduling, Preemption and Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the
kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that
Pods with higher Priority can schedule on Nodes. Eviction is the process of proactively
terminating one or more Pods on resource-starved Nodes.
Cluster Administration
Windows in Kubernetes
Extending Kubernetes
Overview
Kubernetes is a portable, extensible, open source platform for managing containerized
workloads and services that facilitates both declarative configuration and automation. It has a
large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.
The name Kubernetes originates from Greek, meaning helmsman or pilot. K8s as an
abbreviation results from counting the eight letters between the "K" and the "s". Google open-
sourced the Kubernetes project in 2014. Kubernetes combines over 15 years of Google's
experience running production workloads at scale with best-of-breed ideas and practices from
the community.
Deployment evolution
Traditional deployment era: Early on, organizations ran applications on physical servers.
There was no way to define resource boundaries for applications in a physical server, and this
caused resource allocation issues. For example, if multiple applications run on a physical server,
there can be instances where one application would take up most of the resources, and as a
result, the other applications would underperform. A solution for this would be to run each
application on a different physical server. But this did not scale as resources were underutilized,
and it was expensive for organizations to maintain many physical servers.
Virtualized deployment era: As a solution, virtualization was introduced. It allows you to run
multiple Virtual Machines (VMs) on a single physical server's CPU. Virtualization allows
applications to be isolated between VMs and allows better utilization of resources on a physical
server. Each VM is a full machine running all the components, including its own operating
system, on top of the virtualized hardware.
Container deployment era: Containers are similar to VMs, but they have relaxed isolation
properties to share the Operating System (OS) among the applications. Therefore, containers
are considered lightweight. Similar to a VM, a container has its own filesystem, share of CPU,
memory, process space, and more. As they are decoupled from the underlying infrastructure,
they are portable across clouds and OS distributions.
Containers have become popular because they provide extra benefits, such as:
• Agile application creation and deployment: increased ease and efficiency of container
image creation compared to VM image use.
• Continuous development, integration, and deployment: provides for reliable and frequent
container image build and deployment with quick and efficient rollbacks (due to image
immutability).
• Dev and Ops separation of concerns: create application container images at build/release
time rather than deployment time, thereby decoupling applications from infrastructure.
• Observability: not only surfaces OS-level information and metrics, but also application
health and other signals.
• Environmental consistency across development, testing, and production: runs the same
on a laptop as it does in the cloud.
• Cloud and OS distribution portability: runs on Ubuntu, RHEL, CoreOS, on-premises, on
major public clouds, and anywhere else.
• Application-centric management: raises the level of abstraction from running an OS on
virtual hardware to running an application on an OS using logical resources.
• Loosely coupled, distributed, elastic, liberated micro-services: applications are broken
into smaller, independent pieces and can be deployed and managed dynamically – not a
monolithic stack running on one big single-purpose machine.
• Resource isolation: predictable application performance.
• Resource utilization: high efficiency and density.
Containers are a good way to bundle and run your applications, but in a production
environment you need to manage the containers that run the applications and ensure that there
is no downtime; for example, if a container goes down, another container needs to start. That's
how Kubernetes comes to the rescue! Kubernetes provides you with a framework to run
distributed systems resiliently. It takes care of scaling and failover for your application, provides
deployment patterns, and more. For example, Kubernetes can easily manage a canary
deployment for your system.
Kubernetes provides you with:
• Service discovery and load balancing Kubernetes can expose a container using the
DNS name or using their own IP address. If traffic to a container is high, Kubernetes is
able to load balance and distribute the network traffic so that the deployment is stable.
• Storage orchestration Kubernetes allows you to automatically mount a storage system
of your choice, such as local storages, public cloud providers, and more.
• Automated rollouts and rollbacks You can describe the desired state for your deployed
containers using Kubernetes, and it can change the actual state to the desired state at a
controlled rate. For example, you can automate Kubernetes to create new containers for
your deployment, remove existing containers and adopt all their resources to the new
container.
• Automatic bin packing You provide Kubernetes with a cluster of nodes that it can use
to run containerized tasks. You tell Kubernetes how much CPU and memory (RAM) each
container needs. Kubernetes can fit containers onto your nodes to make the best use of
your resources.
• Self-healing Kubernetes restarts containers that fail, replaces containers, kills containers
that don't respond to your user-defined health check, and doesn't advertise them to
clients until they are ready to serve.
• Secret and configuration management Kubernetes lets you store and manage
sensitive information, such as passwords, OAuth tokens, and SSH keys. You can deploy
and update secrets and application configuration without rebuilding your container
images, and without exposing secrets in your stack configuration.
• Batch execution In addition to services, Kubernetes can manage your batch and CI
workloads, replacing containers that fail, if desired.
• Horizontal scaling Scale your application up and down with a simple command, with a
UI, or automatically based on CPU usage.
• IPv4/IPv6 dual-stack Allocation of IPv4 and IPv6 addresses to Pods and Services.
• Designed for extensibility Add features to your Kubernetes cluster without changing
upstream source code.
What Kubernetes is not
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. Kubernetes:
• Does not limit the types of applications supported. Kubernetes aims to support an
extremely diverse variety of workloads, including stateless, stateful, and data-processing
workloads. If an application can run in a container, it should run great on Kubernetes.
• Does not deploy source code and does not build your application. Continuous Integration,
Delivery, and Deployment (CI/CD) workflows are determined by organization cultures
and preferences as well as technical requirements.
• Does not provide application-level services, such as middleware (for example, message
buses), data-processing frameworks (for example, Spark), databases (for example,
MySQL), caches, nor cluster storage systems (for example, Ceph) as built-in services.
Such components can run on Kubernetes, and/or can be accessed by applications running
on Kubernetes through portable mechanisms, such as the Open Service Broker.
• Does not dictate logging, monitoring, or alerting solutions. It provides some integrations
as proof of concept, and mechanisms to collect and export metrics.
• Does not provide nor mandate a configuration language/system (for example, Jsonnet). It
provides a declarative API that may be targeted by arbitrary forms of declarative
specifications.
• Does not provide nor adopt any comprehensive machine configuration, maintenance,
management, or self-healing systems.
• Additionally, Kubernetes is not a mere orchestration system. In fact, it eliminates the
need for orchestration. The technical definition of orchestration is execution of a defined
workflow: first do A, then B, then C. In contrast, Kubernetes comprises a set of
independent, composable control processes that continuously drive the current state
towards the provided desired state. It shouldn't matter how you get from A to C.
Centralized control is also not required. This results in a system that is easier to use and
more powerful, robust, resilient, and extensible.
What's next
• Take a look at the Kubernetes Components
• Take a look at The Kubernetes API
• Take a look at the Cluster Architecture
• Ready to Get Started?
Objects In Kubernetes
Kubernetes objects are persistent entities in the Kubernetes system. Kubernetes uses these
entities to represent the state of your cluster. Learn about the Kubernetes object model and how
to work with these objects.
This page explains how Kubernetes objects are represented in the Kubernetes API, and how you
can express them in .yaml format.
A Kubernetes object is a "record of intent"--once you create the object, the Kubernetes system
will constantly work to ensure that object exists. By creating an object, you're effectively telling
the Kubernetes system what you want your cluster's workload to look like; this is your cluster's
desired state.
To work with Kubernetes objects—whether to create, modify, or delete them—you'll need to use
the Kubernetes API. When you use the kubectl command-line interface, for example, the CLI
makes the necessary Kubernetes API calls for you. You can also use the Kubernetes API directly
in your own programs using one of the Client Libraries.
Almost every Kubernetes object includes two nested object fields that govern the object's
configuration: the object spec and the object status. For objects that have a spec, you have to set
this when you create the object, providing a description of the characteristics you want the
resource to have: its desired state.
The status describes the current state of the object, supplied and updated by the Kubernetes
system and its components. The Kubernetes control plane continually and actively manages
every object's actual state to match the desired state you supplied.
For more information on the object spec, status, and metadata, see the Kubernetes API
Conventions.
When you create an object in Kubernetes, you must provide the object spec that describes its
desired state, as well as some basic information about the object (such as a name). When you
use the Kubernetes API to create the object (either directly or via kubectl), that API request
must include that information as JSON in the request body. Most often, you provide the
information to kubectl in a file known as a manifest. By convention, manifests are YAML (you
could also use JSON format). Tools such as kubectl convert the information from a manifest into
JSON or another supported serialization format when making the API request over HTTP.
Here's an example manifest that shows the required fields and object spec for a Kubernetes
Deployment:
application/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
One way to create a Deployment using a manifest file like the one above is to use the kubectl
apply command in the kubectl command-line interface, passing the .yaml file as an argument.
Here's an example:
kubectl apply -f application/deployment.yaml
The output is similar to this:
deployment.apps/nginx-deployment created
Required fields
In the manifest (YAML or JSON file) for the Kubernetes object you want to create, you'll need to
set values for the following fields:
• apiVersion - Which version of the Kubernetes API you're using to create this object
• kind - What kind of object you want to create
• metadata - Data that helps uniquely identify the object, including a name string, UID, and
optional namespace
• spec - What state you desire for the object
The precise format of the object spec is different for every Kubernetes object, and contains
nested fields specific to that object. The Kubernetes API Reference can help you find the spec
format for all of the objects you can create using Kubernetes.
For example, see the spec field for the Pod API reference. For each Pod, the .spec field specifies
the pod and its desired state (such as the container image name for each container within that
pod). Another example of an object specification is the spec field for the StatefulSet API. For
StatefulSet, the .spec field specifies the StatefulSet and its desired state. Within the .spec of a
StatefulSet is a template for Pod objects. That template describes Pods that the StatefulSet
controller will create in order to satisfy the StatefulSet specification. Different kinds of object
can also have different .status; again, the API reference pages detail the structure of that .status
field, and its content for each different type of object.
Note: See Configuration Best Practices for additional information on writing YAML
configuration files.
The kubectl tool uses the --validate flag to set the level of field validation. It accepts the values
ignore, warn, and strict while also accepting the values true (equivalent to strict) and false
(equivalent to ignore). The default validation setting for kubectl is --validate=true.
Strict
Strict field validation, errors on validation failure
Warn
Field validation is performed, but errors are exposed as warnings rather than failing the
request
Ignore
No server side field validation is performed
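As a sketch of how these validation levels are selected (assuming a manifest file named deployment.yaml, which is illustrative):

```shell
# Reject the request if the manifest contains unknown or duplicate fields
kubectl apply -f deployment.yaml --validate=strict

# Perform validation, but report problems as warnings instead of failing
kubectl apply -f deployment.yaml --validate=warn

# Skip server-side field validation entirely
kubectl apply -f deployment.yaml --validate=ignore
```

These commands require access to a running cluster, so treat them as usage sketches rather than something to run in isolation.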
When kubectl cannot connect to an API server that supports field validation it will fall back to
using client-side validation. Kubernetes 1.27 and later versions always offer field validation;
older Kubernetes releases might not. If your cluster is older than v1.27, check the
documentation for your version of Kubernetes.
What's next
If you're new to Kubernetes, read more about the following:
Kubernetes Object Management explains how to use kubectl to manage objects. You might need
to install kubectl if you don't already have it available.
To learn about objects in Kubernetes in more depth, read other pages in this section:
Management techniques
Warning: A Kubernetes object should be managed using only one technique. Mixing and
matching techniques for the same object results in undefined behavior.
Management technique               Operates on            Recommended environment   Supported writers   Learning curve
Imperative commands                Live objects           Development projects      1+                  Lowest
Imperative object configuration    Individual files       Production projects       1                   Moderate
Declarative object configuration   Directories of files   Production projects       1+                  Highest
Imperative commands
When using imperative commands, a user operates directly on live objects in a cluster. The user
provides operations to the kubectl command as arguments or flags.
This is the recommended way to get started or to run a one-off task in a cluster. Because this
technique operates directly on live objects, it provides no history of previous configurations.
Examples
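As a minimal sketch of an imperative command (the deployment name and image are illustrative):

```shell
# Run an instance of the nginx container by creating a Deployment object
kubectl create deployment nginx --image nginx
```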
Trade-offs
Imperative object configuration
In imperative object configuration, the kubectl command specifies the operation (create,
replace, and so on), optional flags, and at least one file name. The file specified must contain a
full definition of the object in YAML or JSON format.
Warning: The imperative replace command replaces the existing spec with the newly provided
one, dropping all changes to the object missing from the configuration file. This approach
should not be used with resource types whose specs are updated independently of the
configuration file. Services of type LoadBalancer, for example, have their externalIPs field
updated independently from the configuration by the cluster.
Examples
Update the objects defined in a configuration file by overwriting the live configuration:
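A sketch of the corresponding commands (nginx.yaml is a placeholder for your own configuration file):

```shell
# Create the objects defined in a configuration file
kubectl create -f nginx.yaml

# Overwrite the live configuration with the contents of the file
kubectl replace -f nginx.yaml
```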
Trade-offs
Declarative object configuration
When using declarative object configuration, a user operates on object configuration files
stored locally; however, the user does not define the operations to be taken on the files. Create,
update, and delete operations are automatically detected per-object by kubectl apply.
Note: Declarative object configuration retains changes made by other writers, even if the
changes are not merged back to the object configuration file. This is possible by using the patch
API operation to write only observed differences, instead of using the replace API operation to
replace the entire object configuration.
Examples
Process all object configuration files in the configs directory, and create or patch the live
objects. You can first diff to see what changes are going to be made, and then apply:
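A sketch of that workflow (configs/ is the directory of configuration files mentioned above):

```shell
# Preview the changes that would be made to live objects
kubectl diff -f configs/

# Create or patch the live objects to match the files
kubectl apply -f configs/
```

Adding -R to both commands processes the directory recursively.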
Trade-offs
• Changes made directly to live objects are retained, even if they are not merged back into
the configuration files.
• Declarative object configuration has better support for operating on directories and
automatically detecting operation types (create, patch, delete) per-object.
• Declarative object configuration is harder to debug and understand results when they are
unexpected.
• Partial updates using diffs create complex merge and patch operations.
What's next
• Managing Kubernetes Objects Using Imperative Commands
• Imperative Management of Kubernetes Objects Using Configuration Files
• Declarative Management of Kubernetes Objects Using Configuration Files
• Declarative Management of Kubernetes Objects Using Kustomize
• Kubectl Command Reference
• Kubectl Book
• Kubernetes API Reference
Object Names and IDs
Each object in your cluster has a Name that is unique for that type of resource. Every
Kubernetes object also has a UID that is unique across your whole cluster.
For example, you can only have one Pod named myapp-1234 within the same namespace, but
you can have one Pod and one Deployment that are each named myapp-1234.
Only one object of a given kind can have a given name at a time. However, if you delete the
object, you can make a new object with the same name.
Names must be unique across all API versions of the same resource. API resources are
distinguished by their API group, resource type, namespace (for namespaced
resources), and name. In other words, API version is irrelevant in this context.
Note: In cases when objects represent a physical entity, like a Node representing a physical
host, when the host is re-created under the same name without deleting and re-creating the
Node, Kubernetes treats the new host as the old one, which may lead to inconsistencies.
Below are four types of commonly used name constraints for resources.
Most resource types require a name that can be used as a DNS subdomain name as defined in
RFC 1123. This means the name must:
• contain no more than 253 characters
• contain only lowercase alphanumeric characters, '-' or '.'
• start with an alphanumeric character
• end with an alphanumeric character
Some resource types require their names to follow the DNS label standard as defined in RFC
1123. This means the name must:
• contain at most 63 characters
• contain only lowercase alphanumeric characters or '-'
• start with an alphanumeric character
• end with an alphanumeric character
Some resource types require their names to follow the DNS label standard as defined in RFC
1035. This means the name must:
• contain at most 63 characters
• contain only lowercase alphanumeric characters or '-'
• start with a lowercase alphabetic character
• end with an alphanumeric character
Some resource types require their names to be able to be safely encoded as a path segment. In
other words, the name may not be "." or ".." and the name may not contain "/" or "%".
Here's an example manifest for a Pod named nginx-demo.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-demo
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
UIDs
A Kubernetes system-generated string to uniquely identify objects.
Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID. It is
intended to distinguish between historical occurrences of similar entities.
Kubernetes UIDs are universally unique identifiers (also known as UUIDs). UUIDs are
standardized as ISO/IEC 9834-8 and as ITU-T X.667.
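You can inspect an object's UID with kubectl. A sketch, assuming the nginx-demo Pod from the earlier manifest exists in the current namespace:

```shell
# Print only the system-generated UID of the Pod
kubectl get pod nginx-demo -o jsonpath='{.metadata.uid}'
```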
What's next
• Read about labels and annotations in Kubernetes.
• See the Identifiers and Names in Kubernetes design document.
"metadata": {
"labels": {
"key1" : "value1",
"key2" : "value2"
}
}
Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-
identifying information should be recorded using annotations.
Motivation
Labels enable users to map their own organizational structures onto system objects in a loosely
coupled fashion, without requiring clients to store these mappings.
Service deployments and batch processing pipelines are often multi-dimensional entities (e.g.,
multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-
services per tier). Management often requires cross-cutting operations, which breaks
encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by
the infrastructure rather than by users.
Example labels:
• "release" : "stable", "release" : "canary"
• "environment" : "dev", "environment" : "qa", "environment" : "production"
• "tier" : "frontend", "tier" : "backend", "tier" : "cache"
• "partition" : "customerA", "partition" : "customerB"
• "track" : "daily", "track" : "weekly"
These are examples of commonly used labels; you are free to develop your own conventions.
Keep in mind that a label key must be unique for a given object.
Valid label keys have two segments: an optional prefix and a name, separated by a slash (/). If
the prefix is omitted, the label key is presumed to be private to the user. Automated system
components (e.g. kube-scheduler, kube-controller-manager, kube-apiserver, kubectl, or other
third-party automation) which add labels to end-user objects must specify a prefix.
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core components.
For example, here's a manifest for a Pod that has two labels environment: production and app:
nginx:
apiVersion: v1
kind: Pod
metadata:
  name: label-demo
  labels:
    environment: production
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80
Label selectors
Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects
to carry the same label(s).
Via a label selector, the client/user can identify a set of objects. The label selector is the core
grouping primitive in Kubernetes.
The API currently supports two types of selectors: equality-based and set-based. A label selector
can be made of multiple requirements which are comma-separated. In the case of multiple
requirements, all must be satisfied so the comma separator acts as a logical AND (&&) operator.
The semantics of empty or non-specified selectors are dependent on the context, and API types
that use selectors should document the validity and meaning of them.
Note: For some API types, such as ReplicaSets, the label selectors of two instances must not
overlap within a namespace, or the controller can see that as conflicting instructions and fail to
determine how many replicas should be present.
Caution: For both equality-based and set-based conditions there is no logical OR (||) operator.
Ensure your filter statements are structured accordingly.
Equality-based requirement
Equality- or inequality-based requirements allow filtering by label keys and values. Matching
objects must satisfy all of the specified label constraints, though they may have additional labels
as well. Three kinds of operators are admitted: =, ==, and !=. The first two represent equality
(and are synonyms), while the latter represents inequality. For example:
environment = production
tier != frontend
The former selects all resources with key equal to environment and value equal to production.
The latter selects all resources with key equal to tier and value distinct from frontend, and all
resources with no labels with the tier key. One could filter for resources in production
excluding frontend using the comma operator: environment=production,tier!=frontend
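The same requirements can be used for filtering on the command line, a sketch:

```shell
# List pods in production, excluding the frontend tier
kubectl get pods -l environment=production,tier!=frontend
```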
One usage scenario for equality-based label requirement is for Pods to specify node selection
criteria. For example, the sample Pod below selects nodes with the label "accelerator=nvidia-
tesla-p100".
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  containers:
  - name: cuda-test
    image: "registry.k8s.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100
Set-based requirement
Set-based label requirements allow filtering keys according to a set of values. Three kinds of
operators are supported: in, notin and exists (only the key identifier). For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
• The first example selects all resources with key equal to environment and value equal to
production or qa.
• The second example selects all resources with key equal to tier and values other than
frontend and backend, and all resources with no labels with the tier key.
• The third example selects all resources including a label with key partition; no values are
checked.
• The fourth example selects all resources without a label with key partition; no values are
checked.
Similarly the comma separator acts as an AND operator. So filtering resources with a partition
key (no matter the value) and with environment different than qa can be achieved using
partition,environment notin (qa). The set-based label selector is a general form of equality since
environment=production is equivalent to environment in (production); similarly for != and
notin.
Set-based requirements can be mixed with equality-based requirements. For example: partition
in (customerA, customerB),environment!=qa.
API
LIST and WATCH filtering
LIST and WATCH operations may specify label selectors to filter the sets of objects returned
using a query parameter. Both requirements are permitted (presented here as they would
appear in a URL query string):
• equality-based requirements: ?labelSelector=environment%3Dproduction,tier%3Dfrontend
• set-based requirements: ?labelSelector=environment+in+%28production%2Cqa%29%2Ctier+in+%28frontend%29
Both label selector styles can be used to list or watch resources via a REST client. For example,
targeting apiserver with kubectl and using equality-based requirements one may write:
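A sketch of such a query:

```shell
# Equality-based selector passed via the -l (--selector) flag
kubectl get pods -l environment=production,tier=frontend
```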
As already mentioned set-based requirements are more expressive. For instance, they can
implement the OR operator on values:
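A sketch of set-based queries (the quotes keep the shell from interpreting the parentheses):

```shell
# 'in' acts as an OR over the listed values
kubectl get pods -l 'environment in (production, qa)'

# Key existence combined with a notin requirement (logical AND)
kubectl get pods -l 'partition,environment notin (qa)'
```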
Some Kubernetes objects, such as services and replicationcontrollers, also use label selectors to
specify sets of other resources, such as pods.
The set of pods that a service targets is defined with a label selector. Similarly, the population of
pods that a replicationcontroller should manage is also defined with a label selector.
Label selectors for both objects are defined in json or yaml files using maps, and only equality-
based requirement selectors are supported:
"selector": {
"component" : "redis",
}
or
selector:
component: redis
Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based
requirements as well.
selector:
  matchLabels:
    component: redis
  matchExpressions:
  - { key: tier, operator: In, values: [cache] }
  - { key: environment, operator: NotIn, values: [dev] }
One use case for selecting over labels is to constrain the set of nodes onto which a pod can
schedule. See the documentation on node selection for more information.
For instance, different applications would use different values for the app label, but a multi-tier
application, such as the guestbook example, would additionally need to distinguish each tier.
The frontend could carry the following labels:
labels:
  app: guestbook
  tier: frontend
while the Redis master and replica would have different tier labels, and perhaps even an
additional role label:
labels:
  app: guestbook
  tier: backend
  role: master
and
labels:
  app: guestbook
  tier: backend
  role: replica
The labels allow for slicing and dicing the resources along any dimension specified by a label:
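For instance, assuming the guestbook labels above, a sketch of slicing by label:

```shell
# Show the app, tier, and role labels as extra columns for all pods
kubectl get pods -Lapp -Ltier -Lrole

# List only the frontend tier of the guestbook app
kubectl get pods -l app=guestbook,tier=frontend
```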
Updating labels
Sometimes you may want to relabel existing pods and other resources before creating new
resources. This can be done with kubectl label. For example, if you want to label all your
NGINX Pods as frontend tier, run:
kubectl label pods -l app=nginx tier=fe
pod/my-nginx-2035384211-j5fhi labeled
pod/my-nginx-2035384211-u2c7e labeled
pod/my-nginx-2035384211-u3t6x labeled
This first filters all pods with the label "app=nginx", and then labels them with "tier=fe". To see
the pods you labeled, run:
kubectl get pods -l app=nginx -L tier
This outputs all "app=nginx" pods, with an additional label column of the pods' tier (specified
with -L or --label-columns).
What's next
• Learn how to add a label to a node
• Find Well-known labels, Annotations and Taints
• See Recommended labels
• Enforce Pod Security Standards with Namespace Labels
• Read a blog on Writing a Controller for Pod Labels
Namespaces
In Kubernetes, namespaces provide a mechanism for isolating groups of resources within a
single cluster. Names of resources need to be unique within a namespace, but not across
namespaces. Namespace-based scoping is applicable only for namespaced objects (e.g.
Deployments, Services, etc) and not for cluster-wide objects (e.g. StorageClass, Nodes,
PersistentVolumes, etc).
When to Use Multiple Namespaces
Namespaces are intended for use in environments with many users spread across multiple
teams, or projects. For clusters with a few to tens of users, you should not need to create or
think about namespaces at all. Start using namespaces when you need the features they
provide.
Namespaces provide a scope for names. Names of resources need to be unique within a
namespace, but not across namespaces. Namespaces cannot be nested inside one another and
each Kubernetes resource can only be in one namespace.
Namespaces are a way to divide cluster resources between multiple users (via resource quota).
It is not necessary to use multiple namespaces to separate slightly different resources, such as
different versions of the same software: use labels to distinguish resources within the same
namespace.
Note: For a production cluster, consider not using the default namespace. Instead, make other
namespaces and use those.
Initial namespaces
Kubernetes starts with four initial namespaces:
default
Kubernetes includes this namespace so that you can start using your new cluster without
first creating a namespace.
kube-node-lease
This namespace holds Lease objects associated with each node. Node leases allow the
kubelet to send heartbeats so that the control plane can detect node failure.
kube-public
This namespace is readable by all clients (including those not authenticated). This
namespace is mostly reserved for cluster usage, in case that some resources should be
visible and readable publicly throughout the whole cluster. The public aspect of this
namespace is only a convention, not a requirement.
kube-system
The namespace for objects created by the Kubernetes system.
Note: Avoid creating namespaces with the prefix kube-, since it is reserved for Kubernetes
system namespaces.
Viewing namespaces
You can list the current namespaces in a cluster using:
kubectl get namespace
Setting the namespace for a request
To set the namespace for a current request, use the --namespace flag.
For example:
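A sketch (the namespace name is a placeholder you substitute yourself):

```shell
# Run a pod and list pods, scoped to a specific namespace
kubectl run nginx --image=nginx --namespace=<insert-namespace-name-here>
kubectl get pods --namespace=<insert-namespace-name-here>
```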
You can permanently save the namespace for all subsequent kubectl commands in that context.
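A sketch of saving that preference (again, the namespace name is a placeholder):

```shell
# Set the default namespace for the current kubectl context
kubectl config set-context --current --namespace=<insert-namespace-name-here>

# Validate the setting
kubectl config view --minify | grep namespace:
```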
Namespaces and DNS
When you create a Service, it creates a corresponding DNS entry. This entry is of the form
<service-name>.<namespace-name>.svc.cluster.local, which means that if a container only uses
<service-name>, it will resolve to the service which is local to the namespace. This is useful for
using the same configuration across multiple namespaces such as Development, Staging, and
Production. If you want to reach across namespaces, you need to use the fully qualified domain
name (FQDN). As a result, all namespace names must be valid RFC 1123 DNS labels.
Warning:
By creating namespaces with the same name as public top-level domains, Services in these
namespaces can have short DNS names that overlap with public DNS records. Workloads from
any namespace performing a DNS lookup without a trailing dot will be redirected to those
services, taking precedence over public DNS.
To mitigate this, limit privileges for creating namespaces to trusted users. If required, you could
additionally configure third-party security controls, such as admission webhooks, to block
creating any namespace with the name of public TLDs.
Not all objects are in a namespace
Most Kubernetes resources (e.g. pods, services, replication controllers, and others) are in some
namespace. However, namespace resources are not themselves in a namespace, and low-level
resources, such as nodes and persistentVolumes, are not in any namespace. To see which
Kubernetes resources are and aren't in a namespace:
# In a namespace
kubectl api-resources --namespaced=true
# Not in a namespace
kubectl api-resources --namespaced=false
Automatic labelling
FEATURE STATE: Kubernetes 1.22 [stable]
The Kubernetes control plane sets an immutable label kubernetes.io/metadata.name on all
namespaces. The value of the label is the namespace name.
What's next
• Learn more about creating a new namespace.
• Learn more about deleting a namespace.
Annotations
You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects.
Clients such as tools and libraries can retrieve this metadata.
"metadata": {
"annotations": {
"key1" : "value1",
"key2" : "value2"
}
}
Note: The keys and the values in the map must be strings. In other words, you cannot use
numeric, boolean, list or other types for either the keys or the values.
Here are some examples of information that could be recorded in annotations:
• Client library or tool information that can be used for debugging purposes: for example,
name, version, and build information.
• User or tool/system provenance information, such as URLs of related objects from other
ecosystem components.
• Phone or pager numbers of persons responsible, or directory entries that specify where
that information can be found, such as a team web site.
• Directives from the end-user to the implementations to modify behavior or engage non-
standard features.
Instead of using annotations, you could store this type of information in an external database or
directory, but that would make it much harder to produce shared client libraries and tools for
deployment, management, introspection, and the like.
Valid annotation keys have two segments: an optional prefix and a name, separated by a slash
(/). If the prefix is omitted, the annotation key is presumed to be private to the user. Automated
system components (e.g. kube-scheduler, kube-controller-manager, kube-apiserver, kubectl, or
other third-party automation) which add annotations to end-user objects must specify a prefix.
The kubernetes.io/ and k8s.io/ prefixes are reserved for Kubernetes core components.
For example, here's a manifest for a Pod that has the annotation imageregistry: https://hub.docker.com/ :
apiVersion: v1
kind: Pod
metadata:
name: annotations-demo
annotations:
imageregistry: "https://hub.docker.com/"
spec:
containers:
- name: nginx
image: nginx:1.14.2
ports:
- containerPort: 80
What's next
• Learn more about Labels and Selectors.
• Find Well-known labels, Annotations and Taints
Field Selectors
Field selectors let you select Kubernetes objects based on the value of one or more resource
fields. Here are some examples of field selector queries:
• metadata.name=my-service
• metadata.namespace!=default
• status.phase=Pending
This kubectl command selects all Pods for which the value of the status.phase field is Running:
kubectl get pods --field-selector status.phase=Running
Note: Field selectors are essentially resource filters. By default, no selectors/filters are applied,
meaning that all resources of the specified type are selected. This makes the kubectl queries
kubectl get pods and kubectl get pods --field-selector "" equivalent.
Supported fields
Supported field selectors vary by Kubernetes resource type. All resource types support the
metadata.name and metadata.namespace fields. Using unsupported field selectors produces an
error. For example:
Error from server (BadRequest): Unable to find "ingresses" that match label selector "", field
selector "foo.bar=baz": "foo.bar" is not a known field selector: only "metadata.name",
"metadata.namespace"
Supported operators
You can use the =, ==, and != operators with field selectors (= and == mean the same thing).
This kubectl command, for example, selects all Kubernetes Services that aren't in the default
namespace:
kubectl get services --all-namespaces --field-selector metadata.namespace!=default
Note: Set-based operators (in, notin, exists) are not supported for field selectors.
Chained selectors
As with label and other selectors, field selectors can be chained together as a comma-separated
list. This kubectl command selects all Pods for which the status.phase does not equal Running
and the spec.restartPolicy field equals Always:
kubectl get pods --field-selector=status.phase!=Running,spec.restartPolicy=Always
Finalizers
Finalizers are namespaced keys that tell Kubernetes to wait until specific conditions are met
before it fully deletes resources marked for deletion. Finalizers alert controllers to clean up
resources the deleted object owned.
When you tell Kubernetes to delete an object that has finalizers specified for it, the Kubernetes
API marks the object for deletion by populating .metadata.deletionTimestamp, and returns a
202 status code (HTTP "Accepted"). The target object remains in a terminating state while the
control plane, or other components, take the actions defined by the finalizers. After these
actions are complete, the controller removes the relevant finalizers from the target object.
When the metadata.finalizers field is empty, Kubernetes considers the deletion complete and
deletes the object.
You can use finalizers to control garbage collection of objects by alerting controllers to perform
specific cleanup tasks before deleting the target resource. For example, you can define a
finalizer to clean up related resources or infrastructure before the controller deletes the target
resource.
Finalizers don't usually specify the code to execute. Instead, they are typically lists of keys on a
specific resource similar to annotations. Kubernetes specifies some finalizers automatically, but
you can also specify your own.
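As a sketch, an object with a custom finalizer might look like the following (the controller.example.com/cleanup key is hypothetical; a matching controller would have to remove it before Kubernetes completes deletion):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: finalizer-demo
  finalizers:
    # Hypothetical custom finalizer key. While it is present, the object
    # stays in a terminating state after a delete request.
    - controller.example.com/cleanup
data:
  key: value
```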
How finalizers work
When you attempt to delete a resource that has finalizers, the API server handling the delete
request:
• Modifies the object to add a metadata.deletionTimestamp field with the time you started
the deletion.
• Prevents the object from being removed until all items are removed from its
metadata.finalizers field
• Returns a 202 status code (HTTP "Accepted")
The controller managing that finalizer notices the update to the object setting the
metadata.deletionTimestamp, indicating deletion of the object has been requested. The
controller then attempts to satisfy the requirements of the finalizers specified for that resource.
Each time a finalizer condition is satisfied, the controller removes that key from the resource's
finalizers field. When the finalizers field is emptied, an object with a deletionTimestamp field
set is automatically deleted. You can also use finalizers to prevent deletion of unmanaged
resources.
Note:
• When you DELETE an object, Kubernetes adds the deletion timestamp for that object and
then immediately starts to restrict changes to the .metadata.finalizers field for the object
that is now pending deletion. You can remove existing finalizers (deleting an entry from
the finalizers list) but you cannot add a new finalizer. You also cannot modify the
deletionTimestamp for an object once it is set.
• After the deletion is requested, you cannot resurrect this object. The only way forward is to
let it be deleted and create a new, similar object.
For example, when a Job creates one or more Pods, the Job controller adds labels to those
Pods. The Job controller also adds owner references to those Pods, pointing at the Job that
created the Pods. If you delete the Job while these Pods are running, Kubernetes uses the owner
references (not labels) to determine which Pods in the cluster need cleanup.
Kubernetes also processes finalizers when it identifies owner references on a resource targeted
for deletion.
In some situations, finalizers can block the deletion of dependent objects, which can cause the
targeted owner object to remain for longer than expected without being fully deleted. In these
situations, you should check finalizers and owner references on the target owner and
dependent objects to troubleshoot the cause.
Note: In cases where objects are stuck in a deleting state, avoid manually removing finalizers to
allow deletion to continue. Finalizers are usually added to resources for a reason, so forcefully
removing them can lead to issues in your cluster. This should only be done when the purpose of
the finalizer is understood and is accomplished in another way (for example, manually cleaning
up some dependent object).
What's next
• Read Using Finalizers to Control Deletion on the Kubernetes blog.
Owners and Dependents
In Kubernetes, some objects are owners of other objects. For example, a ReplicaSet is the owner
of a set of Pods. These owned objects are dependents of their owner.
Ownership is different from the labels and selectors mechanism that some resources also use.
For example, consider a Service that creates EndpointSlice objects. The Service uses labels to
allow the control plane to determine which EndpointSlice objects are used for that Service. In
addition to the labels, each EndpointSlice that is managed on behalf of a Service has an owner
reference. Owner references help different parts of Kubernetes avoid interfering with objects
they don’t control.
A Kubernetes admission controller controls user access to change the blockOwnerDeletion
field in owner references on dependent resources, based on the delete permissions of the
owner. This control prevents unauthorized users from delaying owner object deletion.
Note:
Kubernetes also adds finalizers to an owner resource when you use either foreground or orphan
cascading deletion. In foreground deletion, it adds the foreground finalizer so that the controller
must delete dependent resources that also have ownerReferences.blockOwnerDeletion=true
before it deletes the owner. If you specify an orphan deletion policy, Kubernetes adds the
orphan finalizer so that the controller ignores dependent resources after it deletes the owner
object.
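For illustration, the cascading deletion policies can be selected from kubectl (a sketch; the Deployment name nginx-deployment is an assumption, and the commands require a configured cluster):

```shell
# Foreground cascading deletion: dependents are deleted first,
# then the owner (the foreground finalizer blocks until then).
kubectl delete deployment nginx-deployment --cascade=foreground

# Orphan policy: the owner is deleted but its dependents are kept.
kubectl delete deployment nginx-deployment --cascade=orphan
```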
What's next
• Learn more about Kubernetes finalizers.
• Learn about garbage collection.
• Read the API reference for object metadata.
Recommended Labels
You can visualize and manage Kubernetes objects with more tools than kubectl and the
dashboard. A common set of labels allows tools to work interoperably, describing objects in a
common manner that all tools can understand.
In addition to supporting tooling, the recommended labels describe applications in a way that
can be queried.
The metadata is organized around the concept of an application. Kubernetes is not a platform as
a service (PaaS) and doesn't have or enforce a formal notion of an application. Instead,
applications are informal and described with metadata. The definition of what an application
contains is loose.
Note: These are recommended labels. They make it easier to manage applications but aren't
required for any core tooling.
Shared labels and annotations share a common prefix: app.kubernetes.io. Labels without a
prefix are private to users. The shared prefix ensures that shared labels do not interfere with
custom user labels.
Labels
In order to take full advantage of using these labels, they should be applied on every resource
object.
Key                           Description                                                     Example       Type
app.kubernetes.io/name        The name of the application                                     mysql         string
app.kubernetes.io/instance    A unique name identifying the instance of an application        mysql-abcxzy  string
app.kubernetes.io/version     The current version of the application (e.g., a SemVer 1.0,     5.7.21        string
                              revision hash, etc.)
app.kubernetes.io/component   The component within the architecture                           database      string
app.kubernetes.io/part-of     The name of a higher level application this one is part of      wordpress     string
app.kubernetes.io/managed-by  The tool being used to manage the operation of an application   helm          string
# This is an excerpt
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
app.kubernetes.io/managed-by: helm
The name of an application and the instance name are recorded separately. For example,
WordPress has an app.kubernetes.io/name of wordpress while it has an instance name,
represented as app.kubernetes.io/instance with a value of wordpress-abcxzy. This enables the
application and instance of the application to be identifiable. Every instance of an application
must have a unique name.
Examples
To illustrate different ways to use these labels, the following examples have varying complexity.
Consider the case for a simple stateless service deployed using Deployment and Service objects.
The following two snippets represent how the labels could be used in their simplest form.
The Deployment is used to oversee the pods running the application itself.
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: myservice
app.kubernetes.io/instance: myservice-abcxzy
...
The following Service is used to expose the application:
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: myservice
app.kubernetes.io/instance: myservice-abcxzy
...
Consider the case of a slightly more complicated application: a web application (WordPress)
using a database (MySQL), installed using Helm. The following snippets illustrate the start of
the objects used to deploy this application. The following Deployment is used for WordPress:
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
...
The following Service is used to expose WordPress:
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: wordpress
app.kubernetes.io/instance: wordpress-abcxzy
app.kubernetes.io/version: "4.9.4"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: server
app.kubernetes.io/part-of: wordpress
...
MySQL is exposed as a StatefulSet with metadata for both it and the larger application it
belongs to:
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
...
apiVersion: v1
kind: Service
metadata:
labels:
app.kubernetes.io/name: mysql
app.kubernetes.io/instance: mysql-abcxzy
app.kubernetes.io/version: "5.7.21"
app.kubernetes.io/managed-by: helm
app.kubernetes.io/component: database
app.kubernetes.io/part-of: wordpress
...
With the MySQL StatefulSet and Service you'll notice that information about both MySQL and
WordPress, the broader application, is included.
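Because the recommended labels are applied consistently, tools can query across an application. For example (a sketch reusing the label values from the snippets above; requires a configured cluster):

```shell
# Select the Pods belonging to one MySQL instance
kubectl get pods -l app.kubernetes.io/name=mysql,app.kubernetes.io/instance=mysql-abcxzy

# Select everything that is part of the wordpress application
kubectl get deployments,statefulsets,services -l app.kubernetes.io/part-of=wordpress
```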
Kubernetes Components
A Kubernetes cluster consists of the components that are a part of the control plane and a set of
machines called nodes.
A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized
applications. Every cluster has at least one worker node.
The worker node(s) host the Pods that are the components of the application workload. The
control plane manages the worker nodes and the Pods in the cluster. In production
environments, the control plane usually runs across multiple computers and a cluster usually
runs multiple nodes, providing fault-tolerance and high availability.
This document outlines the various components you need to have for a complete and working
Kubernetes cluster.
Components of Kubernetes
Control plane components can be run on any machine in the cluster. However, for simplicity,
set up scripts typically start all control plane components on the same machine, and do not run
user containers on this machine. See Creating Highly Available clusters with kubeadm for an
example control plane setup that runs across multiple machines.
kube-apiserver
The API server is a component of the Kubernetes control plane that exposes the Kubernetes
API. The API server is the front end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes' backing store for all cluster
data.
If your Kubernetes cluster uses etcd as its backing store, make sure you have a back up plan for
the data.
You can find in-depth information about etcd in the official documentation.
kube-scheduler
Control plane component that watches for newly created Pods with no assigned node, and
selects a node for them to run on.
Factors taken into account for scheduling decisions include: individual and collective resource
requirements, hardware/software/policy constraints, affinity and anti-affinity specifications,
data locality, inter-workload interference, and deadlines.
kube-controller-manager
Control plane component that runs controller processes.
Logically, each controller is a separate process, but to reduce complexity, they are all compiled
into a single binary and run in a single process.
There are many different types of controllers. Some examples of them are:
• Node controller: Responsible for noticing and responding when nodes go down.
• Job controller: Watches for Job objects that represent one-off tasks, then creates Pods to
run those tasks to completion.
• EndpointSlice controller: Populates EndpointSlice objects (to provide a link between
Services and Pods).
• ServiceAccount controller: Creates default ServiceAccounts for new namespaces.
cloud-controller-manager
A Kubernetes control plane component that embeds cloud-specific control logic. The cloud
controller manager lets you link your cluster into your cloud provider's API, and separates out
the components that interact with that cloud platform from components that only interact with
your cluster.
The cloud-controller-manager only runs controllers that are specific to your cloud provider. If
you are running Kubernetes on your own premises, or in a learning environment inside your
own PC, the cluster does not have a cloud controller manager.
• Node controller: For checking the cloud provider to determine if a node has been deleted
in the cloud after it stops responding
• Route controller: For setting up routes in the underlying cloud infrastructure
• Service controller: For creating, updating and deleting cloud provider load balancers
Node Components
Node components run on every node, maintaining running pods and providing the Kubernetes
runtime environment.
kubelet
An agent that runs on each node in the cluster. It makes sure that containers are running in a
Pod.
The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures
that the containers described in those PodSpecs are running and healthy. The kubelet doesn't
manage containers which were not created by Kubernetes.
kube-proxy
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the
Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network
communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available.
Otherwise, kube-proxy forwards the traffic itself.
Container runtime
Kubernetes supports container runtimes such as containerd, CRI-O, and any other
implementation of the Kubernetes CRI (Container Runtime Interface).
Addons
Addons use Kubernetes resources (DaemonSet, Deployment, etc) to implement cluster features.
Because addons provide cluster-level features, namespaced resources for addons belong
within the kube-system namespace.
Selected addons are described below; for an extended list of available addons, please see
Addons.
DNS
While the other addons are not strictly required, all Kubernetes clusters should have cluster
DNS, as many examples rely on it.
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your environment, which
serves DNS records for Kubernetes services.
Containers started by Kubernetes automatically include this DNS server in their DNS searches.
Web UI (Dashboard)
Cluster-level Logging
A cluster-level logging mechanism is responsible for saving container logs to a central log store
with search/browsing interface.
Network Plugins
Network plugins are software components that implement the container network interface
(CNI) specification. They are responsible for allocating IP addresses to pods and enabling them
to communicate with each other within the cluster.
What's next
Learn more about the following:
The Kubernetes API
The core of Kubernetes' control plane is the API server. The API server exposes an HTTP API
that lets end users, different parts of your cluster, and external components communicate with
one another.
The Kubernetes API lets you query and manipulate the state of API objects in Kubernetes (for
example: Pods, Namespaces, ConfigMaps, and Events).
Most operations can be performed through the kubectl command-line interface or other
command-line tools, such as kubeadm, which in turn use the API. However, you can also access
the API directly using REST calls.
Consider using one of the client libraries if you are writing an application using the Kubernetes
API.
OpenAPI specification
Complete API details are documented using OpenAPI.
OpenAPI V2
The Kubernetes API server serves an aggregated OpenAPI v2 spec via the /openapi/v2
endpoint. You can request the response format using request headers as follows:
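One way to fetch the v2 spec is through kubectl's raw API access (a sketch; requires a configured cluster):

```shell
# Fetch the aggregated OpenAPI v2 document as JSON
kubectl get --raw /openapi/v2 > openapi-v2.json
```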
OpenAPI V3
A discovery endpoint /openapi/v3 is provided to see a list of all group/versions available. This
endpoint only returns JSON. These group/versions are provided in the following format:
{
    "paths": {
        ...,
        "api/v1": {
            "serverRelativeURL": "/openapi/v3/api/v1?hash=CC0E9BFD992D8C59AEC98A1E2336F899E8318D3CF4C68944C3DEC640AF5AB52D864AC50DAA8D145B3494F75FA3CFF939FCBDDA431DAD3CA79738B297795818CF"
        },
        "apis/admissionregistration.k8s.io/v1": {
            "serverRelativeURL": "/openapi/v3/apis/admissionregistration.k8s.io/v1?hash=E19CC93A116982CE5422FC42B590A8AFAD92CDE9AE4D59B5CAAD568F083AD07946E6CB5817531680BCE6E215C16973CD39003B0425F3477CFD854E89A9DB6597"
        },
        ....
    }
}
The relative URLs are pointing to immutable OpenAPI descriptions, in order to improve client-
side caching. The proper HTTP caching headers are also set by the API server for that purpose
(Expires to 1 year in the future, and Cache-Control to immutable). When an obsolete URL is
used, the API server returns a redirect to the newest URL.
The Kubernetes API server publishes an OpenAPI v3 spec per Kubernetes group version at the /
openapi/v3/apis/<group>/<version>?hash=<hash> endpoint.
API Discovery
A list of all group versions supported by a cluster is published at the /api and /apis endpoints.
Each group version also advertises the list of resources supported via /apis/<group>/<version>
(for example: /apis/rbac.authorization.k8s.io/v1alpha1). These endpoints are used by kubectl to
fetch the list of resources supported by a cluster.
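For example, these discovery endpoints can be inspected directly (a sketch; requires cluster access):

```shell
# List all API group versions supported by the cluster
kubectl get --raw /apis

# List the resources served under one group version
kubectl get --raw /apis/rbac.authorization.k8s.io/v1
```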
Aggregated Discovery
Kubernetes offers beta support for aggregated discovery, publishing all resources supported by
a cluster through two endpoints (/api and /apis) compared to one for every group version.
Requesting these endpoints drastically reduces the number of requests sent to fetch the
discovery data for the average Kubernetes cluster. The data may be accessed by requesting the
respective endpoints
with an Accept header indicating the aggregated discovery resource: Accept: application/
json;v=v2beta1;g=apidiscovery.k8s.io;as=APIGroupDiscoveryList.
Versioning is done at the API level rather than at the resource or field level to ensure that the
API presents a clear, consistent view of system resources and behavior, and to enable
controlling access to end-of-life and/or experimental APIs.
To make it easier to evolve and to extend its API, Kubernetes implements API groups that can
be enabled or disabled.
API resources are distinguished by their API group, resource type, namespace (for namespaced
resources), and name. The API server handles the conversion between API versions
transparently: all the different versions are actually representations of the same persisted data.
The API server may serve the same underlying data through multiple API versions.
For example, suppose there are two API versions, v1 and v1beta1, for the same resource. If you
originally created an object using the v1beta1 version of its API, you can later read, update, or
delete that object using either the v1beta1 or the v1 API version, until the v1beta1 version is
deprecated and removed. At that point you can continue accessing and modifying the object
using the v1 API.
API changes
Any system that is successful needs to grow and change as new use cases emerge or existing
ones change. Therefore, the Kubernetes API is designed to continuously change and grow. The
Kubernetes project aims to not break compatibility with existing clients, and to
maintain that compatibility for a length of time so that other projects have an opportunity to
adapt.
In general, new API resources and new resource fields can be added often and frequently.
Elimination of resources or fields requires following the API deprecation policy.
Kubernetes makes a strong commitment to maintain compatibility for official Kubernetes APIs
once they reach general availability (GA), typically at API version v1. Additionally, Kubernetes
maintains compatibility with data persisted via beta API versions of official Kubernetes APIs,
and ensures that data can be converted and accessed via GA API versions when the feature
goes stable.
If you adopt a beta API version, you will need to transition to a subsequent beta or stable API
version once the API graduates. The best time to do this is while the beta API is in its
deprecation period, since objects are simultaneously accessible via both API versions. Once the
beta API completes its deprecation period and is no longer served, the replacement API version
must be used.
Note: Although Kubernetes also aims to maintain compatibility for alpha API versions, in
some circumstances this is not possible. If you use any alpha API versions, check the release
notes for Kubernetes when upgrading your cluster, in case the API did change in incompatible
ways that require deleting all existing alpha objects prior to upgrade.
Refer to API versions reference for more details on the API version level definitions.
API Extension
The Kubernetes API can be extended in one of two ways:
1. Custom resources let you declaratively define how the API server should provide your
chosen resource API.
2. You can also extend the Kubernetes API by implementing an aggregation layer.
What's next
• Learn how to extend the Kubernetes API by adding your own CustomResourceDefinition.
• Controlling Access To The Kubernetes API describes how the cluster manages
authentication and authorization for API access.
• Learn about API endpoints, resource types and samples by reading API Reference.
• Learn about what constitutes a compatible change, and how to change the API, from API
changes.
Cluster Architecture
The architectural concepts behind Kubernetes.
Components of Kubernetes
Nodes
Controllers
Leases
About cgroup v2
Garbage Collection
Nodes
Kubernetes runs your workload by placing containers into Pods to run on Nodes. A node may
be a virtual or physical machine, depending on the cluster. Each node is managed by the control
plane and contains the services necessary to run Pods.
The components on a node include the kubelet, a container runtime, and the kube-proxy.
Management
There are two main ways to have Nodes added to the API server:
1. The kubelet on a node self-registers to the control plane
2. You (or another human user) manually add a Node object
After you create a Node object, or the kubelet on a node self-registers, the control plane checks
whether the new Node object is valid. For example, if you try to create a Node from the
following JSON manifest:
{
"kind": "Node",
"apiVersion": "v1",
"metadata": {
"name": "10.240.79.157",
"labels": {
"name": "my-first-k8s-node"
}
}
}
Kubernetes creates a Node object internally (the representation). Kubernetes checks that a
kubelet has registered to the API server that matches the metadata.name field of the Node. If
the node is healthy (i.e. all necessary services are running), then it is eligible to run a Pod.
Otherwise, that node is ignored for any cluster activity until it becomes healthy.
Note:
Kubernetes keeps the object for the invalid Node and continues checking to see whether it
becomes healthy.
You, or a controller, must explicitly delete the Node object to stop that health checking.
The name identifies a Node. Two Nodes cannot have the same name at the same time.
Kubernetes also assumes that a resource with the same name is the same object. In case of a
Node, it is implicitly assumed that an instance using the same name will have the same state
(e.g. network settings, root disk contents) and attributes like node labels. This may lead to
inconsistencies if an instance was modified without changing its name. If the Node needs to be
replaced or updated significantly, the existing Node object needs to be removed from the API
server first and re-added after the update.
Self-registration of Nodes
When the kubelet flag --register-node is true (the default), the kubelet will attempt to register
itself with the API server. This is the preferred pattern, used by most distros. For
self-registration, the kubelet is started with the following options:
• --register-with-taints - Register the node with the given list of taints (comma separated
<key>=<value>:<effect>).
• --node-ip - Optional comma-separated list of the IP addresses for the node. You can only
specify a single address for each address family. For example, in a single-stack IPv4
cluster, you set this value to be the IPv4 address that the kubelet should use for the node.
See configure IPv4/IPv6 dual stack for details of running a dual-stack cluster.
If you don't provide this argument, the kubelet uses the node's default IPv4 address, if
any; if the node has no IPv4 addresses then the kubelet uses the node's default IPv6
address.
• --node-labels - Labels to add when registering the node in the cluster (see label
restrictions enforced by the NodeRestriction admission plugin).
• --node-status-update-frequency - Specifies how often kubelet posts its node status to the
API server.
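Put together, a self-registering kubelet invocation might look like the following sketch (all paths and values are illustrative placeholders, not a recommended configuration):

```shell
kubelet --kubeconfig=/var/lib/kubelet/kubeconfig \
  --register-node=true \
  --register-with-taints="dedicated=gpu:NoSchedule" \
  --node-labels="topology.kubernetes.io/zone=zone-a" \
  --node-ip=10.240.79.157
```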
When the Node authorization mode and NodeRestriction admission plugin are enabled,
kubelets are only authorized to create/modify their own Node resource.
Note:
As mentioned in the Node name uniqueness section, when a Node's configuration needs to be
updated, it is a good practice to re-register the node with the API server. For example, if the
kubelet is restarted with a new set of --node-labels but the same Node name, the change will
not take effect, because labels are only set at Node registration.
Pods already scheduled on the Node may misbehave or cause issues if the Node configuration
is changed on kubelet restart. For example, an already running Pod may be tainted against
the new labels assigned to the Node, while other Pods that are incompatible with that Pod will
be scheduled based on the new label. Node re-registration ensures all Pods will be drained and
properly re-scheduled.
When you want to create Node objects manually, set the kubelet flag --register-node=false.
You can modify Node objects regardless of the setting of --register-node. For example, you can
set labels on an existing Node or mark it unschedulable.
You can use labels on Nodes in conjunction with node selectors on Pods to control scheduling.
For example, you can constrain a Pod to only be eligible to run on a subset of the available
nodes.
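For example, a Pod can be constrained with a node selector that matches a Node label (a minimal sketch; the disktype=ssd label is assumed to have been applied to at least one Node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd   # only schedule onto Nodes labelled disktype=ssd
  containers:
  - name: nginx
    image: nginx:1.14.2
```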
Marking a node as unschedulable prevents the scheduler from placing new pods onto that Node
but does not affect existing Pods on the Node. This is useful as a preparatory step before a node
reboot or other maintenance.
Note: Pods that are part of a DaemonSet tolerate being run on an unschedulable Node.
DaemonSets typically provide node-local services that should run on the Node even if it is
being drained of workload applications.
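Marking a Node unschedulable and reverting it are single kubectl commands (a sketch; substitute your Node's name):

```shell
# Prevent new Pods from being scheduled onto the node
kubectl cordon <node-name>

# Allow scheduling onto the node again after maintenance
kubectl uncordon <node-name>
```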
Node status
A Node's status contains the following information:
• Addresses
• Conditions
• Capacity and Allocatable
• Info
You can use kubectl to view a Node's status and other details:
kubectl describe node <insert-node-name-here>
Node heartbeats
Heartbeats, sent by Kubernetes nodes, help your cluster determine the availability of each node,
and to take action when failures are detected.
Node controller
The node controller is a Kubernetes control plane component that manages various aspects of
nodes.
The node controller has multiple roles in a node's life. The first is assigning a CIDR block to the
node when it is registered (if CIDR assignment is turned on).
The second is keeping the node controller's internal list of nodes up to date with the cloud
provider's list of available machines. When running in a cloud environment and whenever a
node is unhealthy, the node controller asks the cloud provider if the VM for that node is still
available. If not, the node controller deletes the node from its list of nodes.
The third is monitoring the nodes' health. The node controller is responsible for:
• In the case that a node becomes unreachable, updating the Ready condition in the
Node's .status field. In this case the node controller sets the Ready condition to Unknown.
• If a node remains unreachable: triggering API-initiated eviction for all of the Pods on the
unreachable node. By default, the node controller waits 5 minutes between marking the
node as Unknown and submitting the first eviction request.
By default, the node controller checks the state of each node every 5 seconds. This period can
be configured using the --node-monitor-period flag on the kube-controller-manager
component.
Rate limits on eviction
In most cases, the node controller limits the eviction rate to --node-eviction-rate (default 0.1)
per second, meaning it won't evict pods from more than 1 node per 10 seconds.
The node eviction behavior changes when a node in a given availability zone becomes
unhealthy. The node controller checks what percentage of nodes in the zone are unhealthy (the
Ready condition is Unknown or False) at the same time:
The reason these policies are implemented per availability zone is because one availability zone
might become partitioned from the control plane while the others remain connected. If your
cluster does not span multiple cloud provider availability zones, then the eviction mechanism
does not take per-zone unavailability into account.
A key reason for spreading your nodes across availability zones is so that the workload can be
shifted to healthy zones when one entire zone goes down. Therefore, if all nodes in a zone are
unhealthy, then the node controller evicts at the normal rate of --node-eviction-rate. The corner
case is when all zones are completely unhealthy (none of the nodes in the cluster are healthy).
In such a case, the node controller assumes that there is some problem with connectivity
between the control plane and the nodes, and doesn't perform any evictions. (If there has been
an outage and some nodes reappear, the node controller does evict pods from the remaining
nodes that are unhealthy or unreachable).
The node controller is also responsible for evicting pods running on nodes with NoExecute
taints, unless those pods tolerate that taint. The node controller also adds taints corresponding
to node problems like node unreachable or not ready. This means that the scheduler won't place
Pods onto unhealthy nodes.
The Kubernetes scheduler ensures that there are enough resources for all the Pods on a Node.
The scheduler checks that the sum of the requests of containers on the node is no greater than
the node's capacity. That sum of requests includes all containers managed by the kubelet, but
excludes any containers started directly by the container runtime, and also excludes any
processes running outside of the kubelet's control.
Note: If you want to explicitly reserve resources for non-Pod processes, see reserve resources
for system daemons.
Node topology
FEATURE STATE: Kubernetes v1.18 [beta]
If you have enabled the TopologyManager feature gate, then the kubelet can use topology hints
when making resource assignment decisions. See Control Topology Management Policies on a
Node for more information.
The kubelet attempts to detect node system shutdown and terminates pods running on the
node.
The kubelet ensures that pods follow the normal pod termination process during the node
shutdown. During node shutdown, the kubelet does not accept new Pods (even if those Pods are
already bound to the node).
The Graceful node shutdown feature depends on systemd since it takes advantage of systemd
inhibitor locks to delay the node shutdown with a given duration.
Graceful node shutdown is controlled with the GracefulNodeShutdown feature gate which is
enabled by default in 1.21.
Note that by default, both configuration options described below, shutdownGracePeriod and
shutdownGracePeriodCriticalPods are set to zero, thus not activating the graceful node
shutdown functionality. To activate the feature, the two kubelet config settings should be
configured appropriately and set to non-zero values.
Once systemd detects or notifies node shutdown, the kubelet sets a NotReady condition on the
Node, with the reason set to "node is shutting down". The kube-scheduler honors this condition
and does not schedule any Pods onto the affected node; other third-party schedulers are
expected to follow the same logic. This means that new Pods won't be scheduled onto that node
and therefore none will start.
The kubelet also rejects Pods during the PodAdmission phase if an ongoing node shutdown has
been detected, so that even Pods with a toleration for node.kubernetes.io/not-ready:NoSchedule
do not start there.
At the same time when kubelet is setting that condition on its Node via the API, the kubelet
also begins terminating any Pods that are running locally.
• shutdownGracePeriod:
◦ Specifies the total duration that the node should delay the shutdown by. This is the
total grace period for pod termination for both regular and critical pods.
• shutdownGracePeriodCriticalPods:
◦ Specifies the duration used to terminate critical pods during a node shutdown. This
value should be less than shutdownGracePeriod.
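For example, a kubelet configuration that activates graceful node shutdown might set both options as follows (the durations are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the node delays shutdown; covers both phases.
shutdownGracePeriod: 30s
# Portion of shutdownGracePeriod reserved for critical pods;
# must be less than shutdownGracePeriod.
shutdownGracePeriodCriticalPods: 10s
```

With these values, regular pods have the first 20 seconds to terminate, and critical pods have the final 10 seconds.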
Note: There are cases when Node termination was cancelled by the system (or perhaps
manually by an administrator). In either of those situations the Node will return to the Ready
state. However, Pods which already started the process of termination will not be restored by
kubelet and will need to be re-scheduled.
Note:
Pods evicted during a graceful node shutdown are marked as shut down. Running kubectl get
pods shows the status of the evicted pods as Terminated, and kubectl describe pod indicates
that the pod was evicted because of node shutdown:
Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
To provide more flexibility around the ordering of pods during a graceful node shutdown,
graceful node shutdown honors the PriorityClass of Pods, provided that you enabled this
feature in your cluster. The feature allows cluster administrators to explicitly define the
ordering of pods during graceful node shutdown based on priority classes.
The Graceful Node Shutdown feature, as described above, shuts down pods in two phases, non-
critical pods, followed by critical pods. If additional flexibility is needed to explicitly define the
ordering of pods during shutdown in a more granular way, pod priority based graceful
shutdown can be used.
When graceful node shutdown honors pod priorities, this makes it possible to do graceful node
shutdown in multiple phases, each phase shutting down a particular priority class of pods. The
kubelet can be configured with the exact phases and shutdown time per phase.
shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 180
  - priority: 1000
    shutdownGracePeriodSeconds: 120
  - priority: 0
    shutdownGracePeriodSeconds: 60
The above configuration means that any pod with priority value >= 100000 will get just 10
seconds to stop, any pod with value >= 10000 and < 100000 will get 180 seconds to stop, and
any pod with value >= 1000 and < 10000 will get 120 seconds to stop. Finally, all other pods
will get 60 seconds to stop.
One doesn't have to specify values corresponding to all of the classes. For example, you could
omit the entry for the custom-class-b priority value; pods with that priority then go into the
same bucket as custom-class-c for shutdown.
If there are no pods in a particular range, then the kubelet does not wait for pods in that
priority range. Instead, the kubelet immediately skips to the next priority class value range.
If this feature is enabled and no configuration is provided, then no ordering action will be
taken.
Note: The ability to take Pod priority into account during graceful node shutdown was
introduced as an Alpha feature in Kubernetes v1.23. In Kubernetes 1.28 the feature is Beta and
is enabled by default.
A node shutdown action may not be detected by kubelet's Node Shutdown Manager, either
because the command does not trigger the inhibitor locks mechanism used by kubelet or
because of a user error, i.e., the ShutdownGracePeriod and ShutdownGracePeriodCriticalPods
are not configured properly. Please refer to above section Graceful Node Shutdown for more
details.
When a node is shutdown but not detected by kubelet's Node Shutdown Manager, the pods that
are part of a StatefulSet will be stuck in terminating status on the shutdown node and cannot
move to a new running node. This is because kubelet on the shutdown node is not available to
delete the pods so the StatefulSet cannot create a new pod with the same name. If there are
volumes used by the pods, the VolumeAttachments will not be deleted from the original
shutdown node so the volumes used by these pods cannot be attached to a new running node.
As a result, the application running on the StatefulSet cannot function properly. If the original
shutdown node comes up, the pods will be deleted by kubelet and new pods will be created on a
different running node. If the original shutdown node does not come up, these pods will be
stuck in terminating status on the shutdown node forever.
To mitigate the above situation, a user can manually add the taint node.kubernetes.io/out-of-
service with either NoExecute or NoSchedule effect to a Node marking it out-of-service. If the
NodeOutOfServiceVolumeDetach feature gate is enabled on kube-controller-manager, and a
Node is marked out-of-service with this taint, the pods on the node will be forcefully deleted if
there are no matching tolerations on it and volume detach operations for the pods terminating
on the node will happen immediately. This allows the Pods on the out-of-service node to
recover quickly on a different node.
During a non-graceful shutdown, Pods are terminated in two phases:
1. Force delete the Pods that do not have matching out-of-service tolerations.
2. Immediately perform detach volume operations for such pods.
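For illustration, the out-of-service taint on a Node object looks like the following fragment (only the key and effect matter here; the value is arbitrary):

```yaml
# Fragment of a Node object that an administrator has marked out-of-service.
spec:
  taints:
  - key: node.kubernetes.io/out-of-service
    value: nodeshutdown   # illustrative value
    effect: NoExecute
```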
Note:
To enable swap on a node, the NodeSwap feature gate must be enabled on the kubelet, and the
--fail-swap-on command line flag or failSwapOn configuration setting must be set to false.
Warning: When the memory swap feature is turned on, Kubernetes data such as the content of
Secret objects that were written to tmpfs now could be swapped to disk.
A user can also optionally configure memorySwap.swapBehavior in order to specify how a
node will use swap memory. For example,
memorySwap:
  swapBehavior: UnlimitedSwap
• UnlimitedSwap (default): Kubernetes workloads can use as much swap memory as they
request, up to the system limit.
• LimitedSwap: The utilization of swap memory by Kubernetes workloads is subject to
limitations. Only Pods of Burstable QoS are permitted to employ swap.
If configuration for memorySwap is not specified and the feature gate is enabled, by default the
kubelet will apply the same behaviour as the UnlimitedSwap setting.
With LimitedSwap, Pods that do not fall under the Burstable QoS classification (i.e. BestEffort/
Guaranteed QoS Pods) are prohibited from utilizing swap memory. To maintain the
aforementioned security and node health guarantees, these Pods are not permitted to use swap
memory when LimitedSwap is in effect.
Prior to detailing the calculation of the swap limit, it is necessary to define the following terms:
It is important to note that, for containers within Burstable QoS Pods, it is possible to opt-out of
swap usage by specifying memory requests that are equal to memory limits. Containers
configured in this manner will not have access to swap memory.
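As a sketch, a Burstable Pod whose container opts out of swap could look like this (the name and image are illustrative); the memory request equals the memory limit, while the CPU request is below its limit, keeping the Pod in the Burstable QoS class:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-swap-example        # hypothetical name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: 250m
        memory: 512Mi          # equal to the limit: this container gets no swap
      limits:
        cpu: 500m
        memory: 512Mi
```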
For more information, and to assist with testing and provide feedback, please see the blog-post
about Kubernetes 1.28: NodeSwap graduates to Beta1, KEP-2400 and its design proposal.
What's next
Learn more about the following:
Nodes should be provisioned with the public root certificate for the cluster such that they can
connect securely to the API server along with valid client credentials. A good approach is that
the client credentials provided to the kubelet are in the form of a client certificate. See kubelet
TLS bootstrapping for automated provisioning of kubelet client certificates.
Pods that wish to connect to the API server can do so securely by leveraging a service account
so that Kubernetes will automatically inject the public root certificate and a valid bearer token
into the pod when it is instantiated. The kubernetes service (in default namespace) is configured
with a virtual IP address that is redirected (via kube-proxy) to the HTTPS endpoint on the API
server.
The control plane components also communicate with the API server over the secure port.
As a result, the default operating mode for connections from the nodes and pods running on the
nodes to the control plane is secured by default and can run over untrusted and/or public
networks.
The connections from the API server to the kubelet are used for:
To verify this connection, use the --kubelet-certificate-authority flag to provide the API server
with a root certificate bundle to use to verify the kubelet's serving certificate.
If that is not possible, use SSH tunneling between the API server and kubelet if required to
avoid connecting over an untrusted or public network.
Finally, Kubelet authentication and/or authorization should be enabled to secure the kubelet
API.
The connections from the API server to a node, pod, or service default to plain HTTP
connections and are therefore neither authenticated nor encrypted. They can be run over a
secure HTTPS connection by prefixing https: to the node, pod, or service name in the API URL,
but they will not validate the certificate provided by the HTTPS endpoint nor provide client
credentials. So while the connection will be encrypted, it will not provide any guarantees of
integrity. These connections are not currently safe to run over untrusted or public networks.
SSH tunnels
Kubernetes supports SSH tunnels to protect the control plane to nodes communication paths. In
this configuration, the API server initiates an SSH tunnel to each node in the cluster
(connecting to the SSH server listening on port 22) and passes all traffic destined for a kubelet,
node, pod, or service through the tunnel. This tunnel ensures that the traffic is not exposed
outside of the network in which the nodes are running.
Note: SSH tunnels are currently deprecated, so you shouldn't opt to use them unless you know
what you are doing. The Konnectivity service is a replacement for this communication channel.
Konnectivity service
As a replacement to the SSH tunnels, the Konnectivity service provides TCP level proxy for the
control plane to cluster communication. The Konnectivity service consists of two parts: the
Konnectivity server in the control plane network and the Konnectivity agents in the nodes
network. The Konnectivity agents initiate connections to the Konnectivity server and maintain
the network connections. After enabling the Konnectivity service, all control plane to nodes
traffic goes through these connections.
Follow the Konnectivity service task to set up the Konnectivity service in your cluster.
What's next
• Read about the Kubernetes control plane components
• Learn more about Hubs and Spoke model
• Learn how to Secure a Cluster
• Learn more about the Kubernetes API
• Set up Konnectivity service
• Use Port Forwarding to Access Applications in a Cluster
• Learn how to Fetch logs for Pods, use kubectl port-forward
Controllers
In robotics and automation, a control loop is a non-terminating loop that regulates the state of a
system.
When you set the temperature, that's telling the thermostat about your desired state. The actual
room temperature is the current state. The thermostat acts to bring the current state closer to
the desired state, by turning equipment on or off.
In Kubernetes, controllers are control loops that watch the state of your cluster, then make or
request changes where needed. Each controller tries to move the current cluster state closer to
the desired state.
Controller pattern
A controller tracks at least one Kubernetes resource type. These objects have a spec field that
represents the desired state. The controller(s) for that resource are responsible for making the
current state come closer to that desired state.
The controller might carry the action out itself; more commonly, in Kubernetes, a controller
will send messages to the API server that have useful side effects. You'll see examples of this
below.
The Job controller is an example of a Kubernetes built-in controller. Built-in controllers manage
state by interacting with the cluster API server.
Job is a Kubernetes resource that runs a Pod, or perhaps several Pods, to carry out a task and
then stop.
(Once scheduled, Pod objects become part of the desired state for a kubelet).
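For concreteness, a minimal Job manifest might look like the following (the name, image, and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: compute-pi             # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        # Computes pi to 2000 digits, then the Pod exits.
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
```

The Job controller watches objects like this one and asks the API server to create the Pod that runs the task.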
When the Job controller sees a new task it makes sure that, somewhere in your cluster, the
kubelets on a set of Nodes are running the right number of Pods to get the work done. The Job
controller does not run any Pods or containers itself. Instead, the Job controller tells the API
server to create or remove Pods. Other components in the control plane act on the new
information (there are new Pods to schedule and run), and eventually the work is done.
After you create a new Job, the desired state is for that Job to be completed. The Job controller
makes the current state for that Job be nearer to your desired state: creating Pods that do the
work you wanted for that Job, so that the Job is closer to completion.
Controllers also update the objects that configure them. For example: once the work is done for
a Job, the Job controller updates that Job object to mark it Finished.
(This is a bit like how some thermostats turn a light off to indicate that your room is now at the
temperature you set).
Direct control
In contrast with Job, some controllers need to make changes to things outside of your cluster.
For example, if you use a control loop to make sure there are enough Nodes in your cluster,
then that controller needs something outside the current cluster to set up new Nodes when
needed.
Controllers that interact with external state find their desired state from the API server, then
communicate directly with an external system to bring the current state closer in line.
(There actually is a controller that horizontally scales the nodes in your cluster.)
The important point here is that the controller makes some changes to bring about your desired
state, and then reports the current state back to your cluster's API server. Other control loops
can observe that reported data and take their own actions.
In the thermostat example, if the room is very cold then a different controller might also turn
on a frost protection heater. With Kubernetes clusters, the control plane indirectly works with
IP address management tools, storage services, cloud provider APIs, and other services by
extending Kubernetes to implement that.
Your cluster could be changing at any point as work happens and control loops automatically
fix failures. This means that, potentially, your cluster never reaches a stable state.
As long as the controllers for your cluster are running and able to make useful changes, it
doesn't matter if the overall state is stable or not.
Design
As a tenet of its design, Kubernetes uses lots of controllers that each manage a particular aspect
of cluster state. Most commonly, a particular control loop (controller) uses one kind of resource
as its desired state, and has a different kind of resource that it manages to make that desired
state happen. For example, a controller for Jobs tracks Job objects (to discover new work) and
Pod objects (to run the Jobs, and then to see when the work is finished). In this case something
else creates the Jobs, whereas the Job controller creates Pods.
It's useful to have simple controllers rather than one, monolithic set of control loops that are
interlinked. Controllers can fail, so Kubernetes is designed to allow for that.
Note:
There can be several controllers that create or update the same kind of object. Behind the
scenes, Kubernetes controllers make sure that they only pay attention to the resources linked to
their controlling resource.
For example, you can have Deployments and Jobs; these both create Pods. The Job controller
does not delete the Pods that your Deployment created, because there is information (labels) the
controllers can use to tell those Pods apart.
The Deployment controller and Job controller are examples of controllers that come as part of
Kubernetes itself ("built-in" controllers). Kubernetes lets you run a resilient control plane, so
that if any of the built-in controllers were to fail, another part of the control plane will take
over the work.
You can find controllers that run outside the control plane, to extend Kubernetes. Or, if you
want, you can write a new controller yourself. You can run your own controller as a set of Pods,
or externally to Kubernetes. What fits best will depend on what that particular controller does.
What's next
• Read about the Kubernetes control plane
• Discover some of the basic Kubernetes objects
• Learn more about the Kubernetes API
• If you want to write your own controller, see Extension Patterns in Extending
Kubernetes.
Leases
Distributed systems often have a need for leases, which provide a mechanism to lock shared
resources and coordinate activity between members of a set. In Kubernetes, the lease concept is
represented by Lease objects in the coordination.k8s.io API Group, which are used for system-
critical capabilities such as node heartbeats and component-level leader election.
Node heartbeats
Kubernetes uses the Lease API to communicate kubelet node heartbeats to the Kubernetes API
server. For every Node , there is a Lease object with a matching name in the kube-node-lease
namespace. Under the hood, every kubelet heartbeat is an update request to this Lease object,
updating the spec.renewTime field for the Lease. The Kubernetes control plane uses the time
stamp of this field to determine the availability of this Node.
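A node's Lease object might look roughly like this sketch (the name and timestamp are illustrative):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node-1                 # matches the Node name
  namespace: kube-node-lease
spec:
  holderIdentity: node-1
  leaseDurationSeconds: 40
  # Updated by the kubelet on every heartbeat; the control plane
  # reads this timestamp to judge the Node's availability.
  renewTime: "2023-07-04T21:58:48.065888Z"
```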
Leader election
Kubernetes also uses Leases to ensure only one instance of a component is running at any
given time. This is used by control plane components like kube-controller-manager and kube-
scheduler in HA configurations, where only one instance of the component should be actively
running while the other instances are on stand-by.
API server identity
FEATURE STATE: Kubernetes v1.26 [beta]
Starting in Kubernetes v1.26, each kube-apiserver uses the Lease API to publish its identity to
the rest of the system. While not particularly useful on its own, this provides a mechanism for
clients to discover how many instances of kube-apiserver are operating the Kubernetes control
plane. Existence of kube-apiserver leases enables future capabilities that may require
coordination between each kube-apiserver.
You can inspect Leases owned by each kube-apiserver by checking for lease objects in the kube-
system namespace with the name kube-apiserver-<sha256-hash>. Alternatively you can use the
label selector apiserver.kubernetes.io/identity=kube-apiserver:
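For example, a command along these lines lists those Leases (a sketch; output will vary per cluster):

```shell
kubectl -n kube-system get lease \
  -l apiserver.kubernetes.io/identity=kube-apiserver
```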
The SHA256 hash used in the lease name is based on the OS hostname as seen by that API
server. Each kube-apiserver should be configured to use a hostname that is unique within the
cluster. New instances of kube-apiserver that use the same hostname will take over existing
Leases using a new holder identity, as opposed to instantiating new Lease objects. You can
check the hostname used by kube-apiserver by checking the value of the
kubernetes.io/hostname label:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-07-02T13:16:48Z"
  labels:
    apiserver.kubernetes.io/identity: kube-apiserver
    kubernetes.io/hostname: master-1
  name: apiserver-07a5ea9b9b072c4a5f3d1c3702
  namespace: kube-system
  resourceVersion: "334899"
  uid: 90870ab5-1ba9-4523-b215-e4d4e662acb1
spec:
  holderIdentity: apiserver-07a5ea9b9b072c4a5f3d1c3702_0c8914f7-0f35-440e-8676-7844977d3a05
  leaseDurationSeconds: 3600
  renewTime: "2023-07-04T21:58:48.065888Z"
Expired leases from kube-apiservers that no longer exist are garbage collected by new kube-
apiservers after 1 hour.
You can disable API server identity leases by disabling the APIServerIdentity feature gate.
Workloads
Your own workload can define its own use of Leases. For example, you might run a custom
controller where a primary or leader member performs operations that its peers do not. You
define a Lease so that the controller replicas can select or elect a leader, using the Kubernetes
API for coordination. If you do use a Lease, it's a good practice to define a name for the Lease
that is obviously linked to the product or component. For example, if you have a component
named Example Foo, use a Lease named example-foo.
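Such a Lease might look like the following sketch (all names are hypothetical):

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: example-foo            # obviously linked to the component name
  namespace: default
spec:
  # Identity of the replica currently holding leadership (illustrative).
  holderIdentity: example-foo-controller-0
  leaseDurationSeconds: 15
```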
If a cluster operator or another end user could deploy multiple instances of a component, select
a name prefix and pick a mechanism (such as hash of the name of the Deployment) to avoid
name collisions for the Leases.
You can use another approach so long as it achieves the same outcome: different software
products do not conflict with one another.
Cloud infrastructure technologies let you run Kubernetes on public, private, and hybrid clouds.
Kubernetes believes in automated, API-driven infrastructure without tight coupling between
components.
By decoupling the interoperability logic between Kubernetes and the underlying cloud
infrastructure, the cloud-controller-manager component enables cloud providers to release
features at a different pace compared to the main Kubernetes project.
Design
Kubernetes components
The cloud controller manager runs in the control plane as a replicated set of processes (usually,
these are containers in Pods). Each cloud-controller-manager implements multiple controllers
in a single process.
Note: You can also run the cloud controller manager as a Kubernetes addon rather than as part
of the control plane.
The node controller is responsible for updating Node objects when new servers are created in
your cloud infrastructure. The node controller obtains information about the hosts running
inside your tenancy with the cloud provider. The node controller performs the following
functions:
1. Update a Node object with the corresponding server's unique identifier obtained from the
cloud provider API.
2. Annotate and label the Node object with cloud-specific information, such as the region
the node is deployed into and the resources (CPU, memory, etc) that it has available.
3. Obtain the node's hostname and network addresses.
4. Verify the node's health. In case a node becomes unresponsive, this controller checks
with your cloud provider's API to see if the server has been deactivated / deleted /
terminated. If the node has been deleted from the cloud, the controller deletes the Node
object from your Kubernetes cluster.
Some cloud provider implementations split this into a node controller and a separate node
lifecycle controller.
Route controller
The route controller is responsible for configuring routes in the cloud appropriately so that
containers on different nodes in your Kubernetes cluster can communicate with each other.
Depending on the cloud provider, the route controller might also allocate blocks of IP addresses
for the Pod network.
Service controller
Services integrate with cloud infrastructure components such as managed load balancers, IP
addresses, network packet filtering, and target health checking. The service controller interacts
with your cloud provider's APIs to set up load balancers and other infrastructure components
when you declare a Service resource that requires them.
Authorization
This section breaks down the access that the cloud controller manager requires on various API
objects, in order to perform its operations.
Node controller
The Node controller only works with Node objects. It requires full access to read and modify
Node objects.
v1/Node:
• get
• list
• create
• update
• patch
• watch
• delete
Route controller
The route controller listens to Node object creation and configures routes appropriately. It
requires Get access to Node objects.
v1/Node:
• get
Service controller
The service controller watches for Service object create, update and delete events and then
configures Endpoints for those Services appropriately (for EndpointSlices, the kube-controller-
manager manages these on demand).
To access Services, it requires list and watch access. To update Services, it requires patch and
update access.
To set up Endpoints resources for the Services, it requires access to create, list, get, watch,
and update.
v1/Service:
• list
• get
• watch
• patch
• update
Others
The implementation of the core of the cloud controller manager requires access to create Event
objects, and to ensure secure operation, it requires access to create ServiceAccounts.
v1/Event:
• create
• patch
• update
v1/ServiceAccount:
• create
The RBAC ClusterRole for the cloud controller manager looks like:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloud-controller-manager
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - persistentvolumes
  verbs:
  - get
  - list
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - create
  - get
  - list
  - watch
  - update
What's next
• Cloud Controller Manager Administration has instructions on running and managing the
cloud controller manager.
• To upgrade a HA control plane to use the cloud controller manager, see Migrate
Replicated Control Plane To Use Cloud Controller Manager.
• Want to know how to implement your own cloud controller manager, or extend an
existing project?
About cgroup v2
On Linux, control groups constrain resources that are allocated to processes.
The kubelet and the underlying container runtime need to interface with cgroups to enforce
resource management for pods and containers which includes cpu/memory requests and limits
for containerized workloads.
There are two versions of cgroups in Linux: cgroup v1 and cgroup v2. cgroup v2 is the new
generation of the cgroup API, providing a unified control system with enhanced resource
management capabilities.
cgroup v2 offers several improvements over cgroup v1, such as the following:
Some Kubernetes features exclusively use cgroup v2 for enhanced resource management and
isolation. For example, the MemoryQoS feature improves memory QoS and relies on cgroup v2
primitives.
Using cgroup v2
The recommended way to use cgroup v2 is to use a Linux distribution that enables and uses
cgroup v2 by default.
To check if your distribution uses cgroup v2, refer to Identify cgroup version on Linux nodes.
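One quick check (assuming a Linux node) is to inspect the filesystem type mounted at /sys/fs/cgroup:

```shell
# Prints "cgroup2fs" on a cgroup v2 node, "tmpfs" on a cgroup v1 node.
stat -fc %T /sys/fs/cgroup/
```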
Requirements
For a list of Linux distributions that use cgroup v2, refer to the cgroup v2 documentation.
To check if your distribution is using cgroup v2, refer to your distribution's documentation or
follow the instructions in Identify the cgroup version on Linux nodes.
You can also enable cgroup v2 manually on your Linux distribution by modifying the kernel
cmdline boot arguments. If your distribution uses GRUB, systemd.unified_cgroup_hierarchy=1
should be added in GRUB_CMDLINE_LINUX under /etc/default/grub, followed by sudo update-
grub. However, the recommended approach is to use a distribution that already enables cgroup
v2 by default.
Migrating to cgroup v2
To migrate to cgroup v2, ensure that you meet the requirements, then upgrade to a kernel
version that enables cgroup v2 by default.
The kubelet automatically detects that the OS is running on cgroup v2 and performs
accordingly with no additional configuration required.
There should not be any noticeable difference in the user experience when switching to cgroup
v2, unless users are accessing the cgroup file system directly, either on the node or from within
the containers.
cgroup v2 uses a different API than cgroup v1, so if there are any applications that directly
access the cgroup file system, they need to be updated to newer versions that support cgroup
v2. For example:
• Some third-party monitoring and security agents may depend on the cgroup filesystem.
Update these agents to versions that support cgroup v2.
• If you run cAdvisor as a stand-alone DaemonSet for monitoring pods and containers,
update it to v0.43.0 or later.
• If you deploy Java applications, prefer to use versions which fully support cgroup v2:
◦ OpenJDK / HotSpot: jdk8u372, 11.0.16, 15 and later
◦ IBM Semeru Runtimes: 8.0.382.0, 11.0.20.0, 17.0.8.0, and later
◦ IBM Java: 8.0.8.6 and later
• If you are using the uber-go/automaxprocs package, make sure the version you use is
v1.5.1 or higher.
What's next
• Learn more about cgroups
• Learn more about container runtime
• Learn more about cgroup drivers
You need a working container runtime on each Node in your cluster, so that the kubelet can
launch Pods and their containers.
The Container Runtime Interface (CRI) is the main protocol for the communication between
the kubelet and the container runtime: it defines the gRPC protocol used between these node
components.
The API
FEATURE STATE: Kubernetes v1.23 [stable]
The kubelet acts as a client when connecting to the container runtime via gRPC. The runtime
and image service endpoints have to be available in the container runtime, which can be
configured separately within the kubelet by using the --image-service-endpoint command line
flag.
For Kubernetes v1.28, the kubelet prefers to use CRI v1. If a container runtime does not support
v1 of the CRI, then the kubelet tries to negotiate any older supported version. The v1.28 kubelet
can also negotiate CRI v1alpha2, but this version is considered as deprecated. If the kubelet
cannot negotiate a supported CRI version, the kubelet gives up and doesn't register as a node.
Upgrading
When upgrading Kubernetes, the kubelet tries to automatically select the latest CRI version on
restart of the component. If that fails, then the fallback will take place as mentioned above. If a
gRPC re-dial was required because the container runtime has been upgraded, then the
container runtime must also support the initially selected version or the redial is expected to
fail. This requires a restart of the kubelet.
What's next
• Learn more about the CRI protocol definition
Garbage Collection
Garbage collection is a collective term for the various mechanisms Kubernetes uses to clean up
cluster resources. This allows the clean up of resources like the following:
• Terminated pods
• Completed Jobs
• Objects without owner references
• Unused containers and container images
• Dynamically provisioned PersistentVolumes with a StorageClass reclaim policy of Delete
• Stale or expired CertificateSigningRequests (CSRs)
• Nodes deleted in the following scenarios:
◦ On a cloud when the cluster uses a cloud controller manager
◦ On-premises when the cluster uses an addon similar to a cloud controller manager
• Node Lease objects
Cascading deletion
Kubernetes checks for and deletes objects that no longer have owner references, like the pods
left behind when you delete a ReplicaSet. When you delete an object, you can control whether
Kubernetes deletes the object's dependents automatically, in a process called cascading deletion.
There are two types of cascading deletion, as follows:
• Foreground cascading deletion
• Background cascading deletion
You can also control how and when garbage collection deletes resources that have owner
references using Kubernetes finalizers.
In foreground cascading deletion, the owner object you're deleting first enters a deletion in
progress state. In this state, the following happens to the owner object:
• The Kubernetes API server sets the object's metadata.deletionTimestamp field to the time
the object was marked for deletion.
• The Kubernetes API server also sets the metadata.finalizers field to foregroundDeletion.
• The object remains visible through the Kubernetes API until the deletion process is
complete.
After the owner object enters the deletion in progress state, the controller deletes the
dependents. After deleting all the dependent objects, the controller deletes the owner object. At
this point, the object is no longer visible in the Kubernetes API.
During foreground cascading deletion, the only dependents that block owner deletion are those
that have the ownerReference.blockOwnerDeletion=true field. See Use foreground cascading
deletion to learn more.
In background cascading deletion, the Kubernetes API server deletes the owner object
immediately and the controller cleans up the dependent objects in the background. By default,
Kubernetes uses background cascading deletion unless you manually use foreground deletion
or choose to orphan the dependent objects.
Orphaned dependents
When Kubernetes deletes an owner object, the dependents left behind are called orphan objects.
By default, Kubernetes deletes dependent objects. To learn how to override this behaviour, see
Delete owner objects and orphan dependents.
To configure options for unused container and image garbage collection, tune the kubelet using
a configuration file and change the parameters related to garbage collection using the
KubeletConfiguration resource type.
Kubernetes manages the lifecycle of all images through its image manager, which is part of the
kubelet, with the cooperation of cadvisor. The kubelet considers the following disk usage limits
when making garbage collection decisions:
• HighThresholdPercent
• LowThresholdPercent
Disk usage above the configured HighThresholdPercent value triggers garbage collection,
which deletes images in order based on the last time they were used, starting with the oldest
first. The kubelet deletes images until disk usage reaches the LowThresholdPercent value.
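As a sketch, these two thresholds map to the imageGCHighThresholdPercent and imageGCLowThresholdPercent fields of the KubeletConfiguration; the values below are illustrative (they happen to match the defaults):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Trigger image garbage collection when disk usage exceeds 85%...
imageGCHighThresholdPercent: 85
# ...and delete images until disk usage drops back to 80%.
imageGCLowThresholdPercent: 80
```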
The kubelet garbage collects unused containers based on the following variables, which you can
define:
• MinAge: the minimum age at which the kubelet can garbage collect a container. Disable
by setting to 0.
• MaxPerPodContainer: the maximum number of dead containers each Pod can have.
Disable by setting to less than 0.
• MaxContainers: the maximum number of dead containers the cluster can have. Disable
by setting to less than 0.
In addition to these variables, the kubelet garbage collects unidentified and deleted containers,
typically starting with the oldest first.
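These variables correspond to kubelet command line flags (deprecated in favor of the configuration file, but still recognized); the values shown below are illustrative:

```shell
# Illustrative kubelet invocation; these deprecated flags map to
# MinAge, MaxPerPodContainer, and MaxContainers respectively.
kubelet \
  --minimum-container-ttl-duration=1m \
  --maximum-dead-containers-per-container=2 \
  --maximum-dead-containers=240
```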
MaxPerPodContainer and MaxContainers may potentially conflict with each other in situations
where retaining the maximum number of containers per Pod (MaxPerPodContainer) would go
outside the allowable total of global dead containers (MaxContainers). In this situation, the
kubelet adjusts MaxPerPodContainer to address the conflict. A worst-case scenario would be to
downgrade MaxPerPodContainer to 1 and evict the oldest containers. Additionally, containers
owned by pods that have been deleted are removed once they are older than MinAge.
What's next
• Learn more about ownership of Kubernetes objects.
• Learn more about Kubernetes finalizers.
• Learn about the TTL controller that cleans up finished Jobs.
Kubernetes 1.28 includes an alpha feature that lets an API server proxy resource requests to
other peer API servers. This is useful when there are multiple API servers running different
versions of Kubernetes in one cluster (for example, during a long-lived rollout to a new release
of Kubernetes).
This enables cluster administrators to configure highly available clusters that can be upgraded
more safely, by directing resource requests (made during the upgrade) to the correct kube-
apiserver. That proxying prevents users from seeing unexpected 404 Not Found errors that stem
from the upgrade process.
• The source kube-apiserver reuses the existing APIserver client authentication flags --
proxy-client-cert-file and --proxy-client-key-file to present its identity that will be
verified by its peer (the destination kube-apiserver). The destination API server verifies
that peer connection based on the configuration you specify using the --requestheader-
client-ca-file command line argument.
• To authenticate the destination server's serving certs, you must configure a certificate
authority bundle by specifying the --peer-ca-file command line argument to the source
API server.
To set the network location of a kube-apiserver that peers will use to proxy requests, use the --
peer-advertise-ip and --peer-advertise-port command line arguments to kube-apiserver or
specify these fields in the API server configuration file. If these flags are unspecified, peers will
use the value from either --advertise-address or --bind-address command line argument to the
kube-apiserver. If those, too, are unset, the host's default interface is used.
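Putting the flags above together, a source kube-apiserver might be started with arguments along these lines; the paths and addresses are illustrative, and this alpha feature must be enabled via its feature gate:

```shell
kube-apiserver \
  --feature-gates=UnknownVersionInteroperabilityProxy=true \
  --proxy-client-cert-file=/etc/kubernetes/pki/proxy-client.crt \
  --proxy-client-key-file=/etc/kubernetes/pki/proxy-client.key \
  --peer-ca-file=/etc/kubernetes/pki/peer-ca.crt \
  --peer-advertise-ip=192.0.2.10 \
  --peer-advertise-port=6443
  # ...plus the other kube-apiserver flags for your cluster
```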
• When a resource request reaches an API server that cannot serve that API (either because
it is at a version pre-dating the introduction of the API or the API is turned off on the API
server) the API server attempts to send the request to a peer API server that can serve the
requested API. It does so by identifying API groups / versions / resources that the local
server doesn't recognise, and tries to proxy those requests to a peer API server that is
capable of handling the request.
• If the peer API server fails to respond, the source API server responds with 503 ("Service
Unavailable") error.
How it works under the hood
When an API Server receives a resource request, it first checks which API servers can serve the
requested resource. This check happens using the internal StorageVersion API.
• If the resource is known to the API server that received the request (for example, GET /
api/v1/pods/some-pod), the request is handled locally.
• If there is no internal StorageVersion object found for the requested resource (for
example, GET /my-api/v1/my-resource) and the configured APIService specifies proxying
to an extension API server, that proxying happens following the usual flow for extension
APIs.
• If a valid internal StorageVersion object is found for the requested resource (for example,
GET /batch/v1/jobs) and the API server trying to handle the request (the handling API
server) has the batch API disabled, then the handling API server fetches the peer API
servers that do serve the relevant API group / version / resource (api/v1/batch in this
case) using the information in the fetched StorageVersion object. The handling API server
then proxies the request to one of the matching peer kube-apiservers that are aware of
the requested resource.
◦ If there is no peer known for that API group / version / resource, the handling API
server passes the request to its own handler chain which should eventually return a
404 ("Not Found") response.
◦ If the handling API server has identified and selected a peer API server, but that
peer fails to respond (for reasons such as network connectivity issues, or a data
race between the request being received and a controller registering the peer's info
into the control plane), then the handling API server responds with a 503 ("Service
Unavailable") error.
Containers
Technology for packaging an application along with its runtime dependencies.
Each container that you run is repeatable; the standardization from having dependencies
included means that you get the same behavior wherever you run it.
Containers decouple applications from the underlying host infrastructure. This makes
deployment easier in different cloud or OS environments.
Each node in a Kubernetes cluster runs the containers that form the Pods assigned to that node.
Containers in a Pod are co-located and co-scheduled to run on the same node.
Container images
A container image is a ready-to-run software package containing everything needed to run an
application: the code and any runtime it requires, application and system libraries, and default
values for any essential settings.
Containers are intended to be stateless and immutable: you should not change the code of a
container that is already running. If you have a containerized application and want to make
changes, the correct process is to build a new image that includes the change, then recreate the
container to start from the updated image.
Container runtimes
A fundamental component that empowers Kubernetes to run containers effectively. It is
responsible for managing the execution and lifecycle of containers within the Kubernetes
environment.
Kubernetes supports container runtimes such as containerd, CRI-O, and any other
implementation of the Kubernetes CRI (Container Runtime Interface).
Usually, you can allow your cluster to pick the default container runtime for a Pod. If you need
to use more than one container runtime in your cluster, you can specify the RuntimeClass for a
Pod to make sure that Kubernetes runs those containers using a particular container runtime.
You can also use RuntimeClass to run different Pods with the same container runtime but with
different settings.
Container Environment
Images
A container image represents binary data that encapsulates an application and all its software
dependencies. Container images are executable software bundles that can run standalone and
that make very well defined assumptions about their runtime environment.
You typically create a container image of your application and push it to a registry before
referring to it in a Pod.
Note: If you are looking for the container images for a Kubernetes release (such as v1.28, the
latest minor release), visit Download Kubernetes.
Image names
Container images are usually given a name such as pause, example/mycontainer, or kube-
apiserver. Images can also include a registry hostname; for example: fictional.registry.example/
imagename, and possibly a port number as well; for example: fictional.registry.example:10443/
imagename.
If you don't specify a registry hostname, Kubernetes assumes that you mean the Docker public
registry.
After the image name part you can add a tag (in the same way you would when using
commands like docker or podman). Tags let you identify different versions of the same series of
images.
Image tags consist of lowercase and uppercase letters, digits, underscores (_), periods (.), and
dashes (-).
There are additional rules about where you can place the separator characters (_, -, and .) inside
an image tag.
If you don't specify a tag, Kubernetes assumes you mean the tag latest.
Updating images
When you first create a Deployment, StatefulSet, Pod, or other object that includes a Pod
template, then by default the pull policy of all containers in that pod will be set to IfNotPresent
if it is not explicitly specified. This policy causes the kubelet to skip pulling an image if it
already exists.
The imagePullPolicy for a container and the tag of the image affect when the kubelet attempts
to pull (download) the specified image.
Here's a list of the values you can set for imagePullPolicy and the effects these values have:
IfNotPresent
the image is pulled only if it is not already present locally.
Always
every time the kubelet launches a container, the kubelet queries the container image
registry to resolve the name to an image digest. If the kubelet has a container image with
that exact digest cached locally, the kubelet uses its cached image; otherwise, the kubelet
pulls the image with the resolved digest, and uses that image to launch the container.
Never
the kubelet does not try fetching the image. If the image is somehow already present
locally, the kubelet attempts to start the container; otherwise, startup fails. See pre-pulled
images for more details.
The caching semantics of the underlying image provider make even imagePullPolicy: Always
efficient, as long as the registry is reliably accessible. Your container runtime can notice that the
image layers already exist on the node so that they don't need to be downloaded again.
Note:
You should avoid using the :latest tag when deploying containers in production as it is harder to
track which version of the image is running and more difficult to roll back properly.
To make sure the Pod always uses the same version of a container image, you can specify the
image's digest; replace <image-name>:<tag> with <image-name>@<digest> (for example,
image@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2).
When using image tags, if the image registry were to change the code that the tag on that
image represents, you might end up with a mix of Pods running the old and new code. An
image digest uniquely identifies a specific version of the image, so Kubernetes runs the same
code every time it starts a container with that image name and digest specified. Specifying an
image by digest fixes the code that you run so that a change at the registry cannot lead to that
mix of versions.
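For example, a container pinned by digest might look like this (the image name is illustrative; the digest is the one shown above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: digest-pinned   # illustrative name
spec:
  containers:
  - name: app
    # Pinning by digest means a change at the registry cannot alter what runs.
    image: registry.example/app@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5cb2
```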
There are third-party admission controllers that mutate Pods (and pod templates) when they are
created, so that the running workload is defined based on an image digest rather than a tag.
That might be useful if you want to make sure that all your workload is running the same code
no matter what tag changes happen at the registry.
When you (or a controller) submit a new Pod to the API server, your cluster sets the
imagePullPolicy field when specific conditions are met:
• if you omit the imagePullPolicy field, and you specify the digest for the container image,
the imagePullPolicy is automatically set to IfNotPresent.
• if you omit the imagePullPolicy field, and the tag for the container image is :latest,
imagePullPolicy is automatically set to Always;
• if you omit the imagePullPolicy field, and you don't specify the tag for the container
image, imagePullPolicy is automatically set to Always;
• if you omit the imagePullPolicy field, and you specify the tag for the container image that
isn't :latest, the imagePullPolicy is automatically set to IfNotPresent.
Note:
The value of imagePullPolicy of the container is always set when the object is first created, and
is not updated if the image's tag or digest later changes.
For example, if you create a Deployment with an image whose tag is not :latest, and later
update that Deployment's image to a :latest tag, the imagePullPolicy field will not change to
Always. You must manually change the pull policy of any object after its initial creation.
If you would like to always force a pull, you can do one of the following:
• Set the imagePullPolicy of the container to Always.
• Omit the imagePullPolicy and use :latest as the tag for the image to use; Kubernetes will
set the policy to Always when you submit the Pod.
• Omit the imagePullPolicy and the tag for the image to use; Kubernetes will set the policy
to Always when you submit the Pod.
• Enable the AlwaysPullImages admission controller.
ImagePullBackOff
When a kubelet starts creating containers for a Pod using a container runtime, the container
might be in a Waiting state because of ImagePullBackOff.
The status ImagePullBackOff means that a container could not start because Kubernetes could
not pull a container image (for reasons such as invalid image name, or pulling from a private
registry without imagePullSecret). The BackOff part indicates that Kubernetes will keep trying
to pull the image, with an increasing back-off delay.
Kubernetes raises the delay between each attempt until it reaches a compiled-in limit, which is
300 seconds (5 minutes).
Serial and parallel image pulls
By default, kubelet pulls images serially. In other words, kubelet sends only one image pull
request to the image service at a time. Other image pull requests have to wait until the one
being processed is complete.
Nodes make image pull decisions in isolation. Even when you use serialized image pulls, two
different nodes can pull the same image in parallel.
If you would like to enable parallel image pulls, you can set the field serializeImagePulls to false
in the kubelet configuration. With serializeImagePulls set to false, image pull requests will be
sent to the image service immediately, and multiple images will be pulled at the same time.
When enabling parallel image pulls, please make sure the image service of your container
runtime can handle parallel image pulls.
The kubelet never pulls multiple images in parallel on behalf of one Pod. For example, if you
have a Pod that has an init container and an application container, the image pulls for the two
containers will not be parallelized. However, if you have two Pods that use different images, the
kubelet pulls the images in parallel on behalf of the two different Pods, when parallel image
pulls is enabled.
When serializeImagePulls is set to false, the kubelet defaults to no limit on the maximum
number of images being pulled at the same time. If you would like to limit the number of
parallel image pulls, you can set the field maxParallelImagePulls in kubelet configuration. With
maxParallelImagePulls set to n, only n images can be pulled at the same time, and any image
pull beyond n will have to wait until at least one ongoing image pull is complete.
Limiting the number of parallel image pulls can prevent image pulling from consuming too
much network bandwidth or disk I/O, when parallel image pulling is enabled.
You can set maxParallelImagePulls to a positive number that is greater than or equal to 1. If you
set maxParallelImagePulls to be greater than or equal to 2, you must set the serializeImagePulls
to false. The kubelet will fail to start with invalid maxParallelImagePulls settings.
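As a sketch, a KubeletConfiguration that enables parallel pulls with a cap of three concurrent pulls would look like this:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allow the kubelet to issue multiple image pull requests at once.
serializeImagePulls: false
# At most three images are pulled concurrently; further pulls wait.
maxParallelImagePulls: 3
```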
Kubernetes itself typically names container images with a suffix -$(ARCH). For backward
compatibility, please generate the older images with suffixes. The idea is to generate, say, a
pause image that has the manifest for all the architectures, and a pause-amd64 image that is
backwards compatible for older configurations or YAML files which may have hard-coded the
images with suffixes.
Using a private registry
Private registries may require keys to read images from them.
Credentials can be provided in several ways:
• Configuring nodes to authenticate to a private registry
• Using a kubelet credential provider to dynamically fetch credentials
• Pre-pulled images
• Specifying imagePullSecrets on a Pod
• Vendor-specific or local extensions
Specific instructions for setting credentials depends on the container runtime and registry you
chose to use. You should refer to your solution's documentation for the most accurate
information.
For an example of configuring a private container image registry, see the Pull an Image from a
Private Registry task. That example uses a private registry in Docker Hub.
Note: This approach is especially suitable when kubelet needs to fetch registry credentials
dynamically. Most commonly used for registries provided by cloud providers where auth tokens
are short-lived.
You can configure the kubelet to invoke a plugin binary to dynamically fetch registry
credentials for a container image. This is the most robust and versatile way to fetch credentials
for private registries, but also requires kubelet-level configuration to enable.
Interpretation of config.json
The interpretation of config.json varies between the original Docker implementation and the
Kubernetes interpretation. In Docker, the auths keys can only specify root URLs, whereas
Kubernetes allows glob URLs as well as prefix-matched paths. The only limitation is that glob
patterns (*) have to include the dot (.) for each subdomain. The amount of matched subdomains
has to be equal to the amount of glob patterns (*.), for example:
{
"auths": {
"my-registry.io/images": { "auth": "..." },
"*.my-registry.io/images": { "auth": "..." }
}
}
Image pull operations would now pass the credentials to the CRI container runtime for every
valid pattern. For example the following container image names would match successfully:
• my-registry.io/images
• my-registry.io/images/my-image
• my-registry.io/images/another-image
• sub.my-registry.io/images/my-image
But not:
• a.sub.my-registry.io/images/my-image
• a.b.sub.my-registry.io/images/my-image
The kubelet performs image pulls sequentially for every found credential. This means, that
multiple entries in config.json for different paths are possible, too:
{
"auths": {
"my-registry.io/images": {
"auth": "..."
},
"my-registry.io/images/subpath": {
"auth": "..."
}
}
}
Pre-pulled images
Note: This approach is suitable if you can control node configuration. It will not work reliably if
your cloud provider manages nodes and replaces them automatically.
By default, the kubelet tries to pull each image from the specified registry. However, if the
imagePullPolicy property of the container is set to IfNotPresent or Never, then a local image is
used (preferentially or exclusively, respectively).
If you want to rely on pre-pulled images as a substitute for registry authentication, you must
ensure all nodes in the cluster have the same pre-pulled images.
This can be used to preload certain images for speed or as an alternative to authenticating to a
private registry.
Note: This is the recommended approach to run containers based on images in private
registries.
Kubernetes supports specifying container image registry keys on a Pod. imagePullSecrets must
all be in the same namespace as the Pod. The referenced Secrets must be of type kubernetes.io/
dockercfg or kubernetes.io/dockerconfigjson.
You need to know the username, registry password and client email address for authenticating
to the registry, as well as its hostname. Run the following command, substituting the
appropriate uppercase values:
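A sketch of that command, using a hypothetical Secret name regcred and the uppercase placeholders to substitute:

```shell
kubectl create secret docker-registry regcred \
  --docker-server=DOCKER_REGISTRY_SERVER \
  --docker-username=DOCKER_USER \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL
```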
If you already have a Docker credentials file then, rather than using the above command, you
can import the credentials file as a Kubernetes Secrets.
Create a Secret based on existing Docker credentials explains how to set this up.
This is particularly useful if you are using multiple private container registries, as
kubectl create secret docker-registry creates a Secret that only works with a single private
registry.
Note: Pods can only reference image pull secrets in their own namespace, so this process needs
to be done one time per namespace.
Now, you can create pods which reference that secret by adding an imagePullSecrets section to
a Pod definition. Each item in the imagePullSecrets array can only reference a Secret in the
same namespace.
For example:
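A minimal Pod manifest referencing such a Secret might look like this (the names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
  - name: app
    image: registry.example/private/app:v1
  # Must reference a Secret in the same namespace as the Pod.
  imagePullSecrets:
  - name: regcred
```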
This needs to be done for each pod that is using a private registry.
You can use this in conjunction with a per-node .docker/config.json. The credentials will be
merged.
Use cases
There are a number of solutions for configuring private registries. Here are some common use
cases and suggested solutions.
1. Cluster running only non-proprietary (e.g. open-source) images. No need to hide images.
◦ Use public images from a public registry
▪ No configuration required.
▪ Some cloud providers automatically cache or mirror public images, which
improves availability and reduces the time to pull images.
2. Cluster running some proprietary images which should be hidden to those outside the
company, but visible to all cluster users.
◦ Use a hosted private registry
▪ Manual configuration may be required on the nodes that need to access the
private registry
◦ Or, run an internal private registry behind your firewall with open read access.
▪ No Kubernetes configuration is required.
◦ Use a hosted container image registry service that controls image access
▪ It will work better with cluster autoscaling than manual node configuration.
◦ Or, on a cluster where changing the node configuration is inconvenient, use
imagePullSecrets.
3. Cluster with proprietary images, a few of which require stricter access control.
◦ Ensure AlwaysPullImages admission controller is active. Otherwise, all Pods
potentially have access to all images.
◦ Move sensitive data into a "Secret" resource, instead of packaging it in an image.
4. A multi-tenant cluster where each tenant needs its own private registry.
◦ Ensure AlwaysPullImages admission controller is active. Otherwise, all Pods of all
tenants potentially have access to all images.
◦ Run a private registry with authorization required.
◦ Generate registry credential for each tenant, put into secret, and populate secret to
each tenant namespace.
◦ The tenant adds that secret to imagePullSecrets of each namespace.
If you need access to multiple registries, you can create one secret for each registry.
There were three built-in implementations of the kubelet credential provider integration: ACR
(Azure Container Registry), ECR (Elastic Container Registry), and GCR (Google Container
Registry).
For more information on the legacy mechanism, read the documentation for the version of
Kubernetes that you are using. Kubernetes v1.26 through to v1.28 do not include the legacy
mechanism, so you would need to either:
• run the kubelet image credential provider on each node; or
• specify image pull credentials using imagePullSecrets and at least one Secret.
What's next
• Read the OCI Image Manifest Specification.
• Learn about container image garbage collection.
• Learn more about pulling an Image from a Private Registry.
Container Environment
This page describes the resources available to Containers in the Container environment.
Container environment
The Kubernetes Container environment provides several important resources to Containers:
• A filesystem, which is a combination of an image and one or more volumes.
• Information about the Container itself.
• Information about other objects in the cluster.
Container information
The hostname of a Container is the name of the Pod in which the Container is running. It is
available through the hostname command or the gethostname function call in libc.
The Pod name and namespace are available as environment variables through the downward
API.
User defined environment variables from the Pod definition are also available to the Container,
as are any environment variables specified statically in the container image.
Cluster information
A list of all services that were running when a Container was created is available to that
Container as environment variables. This list is limited to services within the same namespace
as the new Container's Pod and Kubernetes control plane services.
For a service named foo that maps to a Container named bar, the following variables are
defined:
FOO_SERVICE_HOST=<the host the service is running on>
FOO_SERVICE_PORT=<the port the service is running on>
Services have dedicated IP addresses and are available to the Container via DNS, if DNS addon
is enabled.
What's next
• Learn more about Container lifecycle hooks.
• Get hands-on experience attaching handlers to Container lifecycle events.
Runtime Class
FEATURE STATE: Kubernetes v1.20 [stable]
This page describes the RuntimeClass resource and runtime selection mechanism.
RuntimeClass is a feature for selecting the container runtime configuration. The container
runtime configuration is used to run a Pod's containers.
Motivation
You can set a different RuntimeClass between different Pods to provide a balance of
performance versus security. For example, if part of your workload deserves a high level of
information security assurance, you might choose to schedule those Pods so that they run in a
container runtime that uses hardware virtualization. You'd then benefit from the extra isolation
of the alternative runtime, at the expense of some additional overhead.
You can also use RuntimeClass to run different Pods with the same container runtime but with
different settings.
Setup
1. Configure the CRI implementation on nodes (runtime dependent)
2. Create the corresponding RuntimeClass resources
The configurations available through RuntimeClass are Container Runtime Interface (CRI)
implementation dependent. See the corresponding documentation (below) for your CRI
implementation for how to configure.
Note: RuntimeClass assumes a homogeneous node configuration across the cluster by default
(which means that all nodes are configured the same way with respect to container runtimes).
To support heterogeneous node configurations, see Scheduling below.
The configurations have a corresponding handler name, referenced by the RuntimeClass. The
handler must be a valid DNS label name.
The configurations setup in step 1 should each have an associated handler name, which
identifies the configuration. For each handler, create a corresponding RuntimeClass object.
The RuntimeClass resource currently only has 2 significant fields: the RuntimeClass name
(metadata.name) and the handler (handler). The object definition looks like this:
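A sketch of such a RuntimeClass object (the handler name must match a CRI handler configured on the nodes):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  # The name the RuntimeClass will be referenced by.
  # RuntimeClass is a non-namespaced resource.
  name: myclass
# The name of the corresponding CRI configuration (handler).
handler: myconfiguration
```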
Usage
Once RuntimeClasses are configured for the cluster, you can specify a runtimeClassName in the
Pod spec to use it. For example:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  runtimeClassName: myclass
  # ...
This will instruct the kubelet to use the named RuntimeClass to run this pod. If the named
RuntimeClass does not exist, or the CRI cannot run the corresponding handler, the pod will
enter the Failed terminal phase. Look for a corresponding event for an error message.
containerd
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.${HANDLER_NAME}]
CRI-O
[crio.runtime.runtimes.${HANDLER_NAME}]
runtime_path = "${PATH_TO_BINARY}"
Scheduling
FEATURE STATE: Kubernetes v1.16 [beta]
By specifying the scheduling field for a RuntimeClass, you can set constraints to ensure that
Pods running with this RuntimeClass are scheduled to nodes that support it. If scheduling is not
set, this RuntimeClass is assumed to be supported by all nodes.
To ensure pods land on nodes supporting a specific RuntimeClass, that set of nodes should have
a common label which is then selected by the runtimeclass.scheduling.nodeSelector field. The
RuntimeClass's nodeSelector is merged with the pod's nodeSelector in admission, effectively
taking the intersection of the set of nodes selected by each. If there is a conflict, the pod will be
rejected.
If the supported nodes are tainted to prevent other RuntimeClass pods from running on the
node, you can add tolerations to the RuntimeClass. As with the nodeSelector, the tolerations are
merged with the pod's tolerations in admission, effectively taking the union of the set of nodes
tolerated by each.
To learn more about configuring the node selector and tolerations, see Assigning Pods to
Nodes.
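Sketched as a RuntimeClass manifest (the label key, taint key, and values are illustrative):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration
scheduling:
  # Only schedule Pods using this RuntimeClass onto matching nodes...
  nodeSelector:
    runtime-support/myclass: "true"
  # ...and let those Pods tolerate taints reserved for such nodes.
  tolerations:
  - key: dedicated-runtime
    operator: Exists
    effect: NoSchedule
```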
Pod Overhead
You can specify overhead resources that are associated with running a Pod. Declaring overhead
allows the cluster (including the scheduler) to account for it when making decisions about Pods
and resources.
Pod overhead is defined in RuntimeClass through the overhead field. Through the use of this
field, you can specify the overhead of running pods utilizing this RuntimeClass and ensure
these overheads are accounted for in Kubernetes.
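A minimal sketch of declaring overhead in a RuntimeClass (the name, handler, and resource values here are illustrative, not measured figures):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-example     # hypothetical name
handler: kata            # hypothetical handler
overhead:
  podFixed:              # per-Pod overhead added to resource accounting
    memory: "120Mi"
    cpu: "250m"
```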
What's next
• RuntimeClass Design
• RuntimeClass Scheduling Design
• Read about the Pod Overhead concept
• PodOverhead Feature Design
Overview
Analogous to many programming language frameworks that have component lifecycle hooks,
such as Angular, Kubernetes provides Containers with lifecycle hooks. The hooks enable
Containers to be aware of events in their management lifecycle and run code implemented in a
handler when the corresponding lifecycle hook is executed.
Container hooks
There are two hooks that are exposed to Containers:
PostStart
This hook is executed immediately after a container is created. However, there is no guarantee
that the hook will execute before the container ENTRYPOINT. No parameters are passed to the
handler.
PreStop
This hook is called immediately before a container is terminated due to an API request or a management event such as a liveness/startup probe failure, preemption, or resource contention. A call to the PreStop hook fails if the container is already in a terminated or completed state, and the hook must complete before the TERM signal to stop the container can be sent. The Pod's termination grace period countdown begins before the PreStop hook is
executed, so regardless of the outcome of the handler, the container will eventually terminate
within the Pod's termination grace period. No parameters are passed to the handler.
A more detailed description of the termination behavior can be found in Termination of Pods.
Hook handler implementations
Containers can access a hook by implementing and registering a handler for that hook. There
are two types of hook handlers that can be implemented for Containers:
• Exec - Executes a specific command, such as pre-stop.sh, inside the cgroups and
namespaces of the Container. Resources consumed by the command are counted against
the Container.
• HTTP - Executes an HTTP request against a specific endpoint on the Container.
When a Container lifecycle management hook is called, the Kubernetes management system executes the handler according to the hook action: httpGet and tcpSocket actions are executed by the kubelet process, and exec is executed in the container.
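A sketch of registering exec handlers for both hooks in a Pod spec (the image and commands are illustrative; the preStop command assumes an nginx container):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
    - name: lifecycle-demo-container
      image: nginx
      lifecycle:
        postStart:          # runs right after container creation
          exec:
            command: ["/bin/sh", "-c", "echo Hello from postStart > /usr/share/message"]
        preStop:            # runs before the TERM signal is sent
          exec:
            command: ["/bin/sh", "-c", "nginx -s quit; while killall -0 nginx; do sleep 1; done"]
```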
Hook handler calls are synchronous within the context of the Pod containing the Container.
This means that for a PostStart hook, the Container ENTRYPOINT and hook fire
asynchronously. However, if the hook takes too long to run or hangs, the Container cannot
reach a running state.
PreStop hooks are not executed asynchronously from the signal to stop the Container; the hook
must complete its execution before the TERM signal can be sent. If a PreStop hook hangs
during execution, the Pod's phase will be Terminating and remain there until the Pod is killed
after its terminationGracePeriodSeconds expires. This grace period applies to the total time it
takes for both the PreStop hook to execute and for the Container to stop normally. If, for
example, terminationGracePeriodSeconds is 60, and the hook takes 55 seconds to complete, and
the Container takes 10 seconds to stop normally after receiving the signal, then the Container
will be killed before it can stop normally, since terminationGracePeriodSeconds is less than the
total time (55+10) it takes for these two things to happen.
Users should make their hook handlers as lightweight as possible. There are cases, however,
when long running commands make sense, such as when saving state prior to stopping a
Container.
Hook delivery is intended to be at least once, which means that a hook may be called multiple
times for any given event, such as for PostStart or PreStop. It is up to the hook implementation
to handle this correctly.
Generally, only single deliveries are made. If, for example, an HTTP hook receiver is down and
is unable to take traffic, there is no attempt to resend. In some rare cases, however, double
delivery may occur. For instance, if a kubelet restarts in the middle of sending a hook, the hook
might be resent after the kubelet comes back up.
The logs for a Hook handler are not exposed in Pod events. If a handler fails for some reason, it
broadcasts an event. For PostStart, this is the FailedPostStartHook event, and for PreStop, this is
the FailedPreStopHook event. To generate a FailedPostStartHook event yourself, modify the lifecycle-events.yaml file to change the postStart command to "badcommand" and apply it.
Here is some example output of the resulting events you see from running
kubectl describe pod lifecycle-demo:
Events:
  Type     Reason               Age              From               Message
  ----     ------               ----             ----               -------
  Normal   Scheduled            7s               default-scheduler  Successfully assigned default/lifecycle-demo to ip-XXX-XXX-XX-XX.us-east-2...
  Normal   Pulled               6s               kubelet            Successfully pulled image "nginx" in 229.604315ms
  Normal   Pulling              4s (x2 over 6s)  kubelet            Pulling image "nginx"
  Normal   Created              4s (x2 over 5s)  kubelet            Created container lifecycle-demo-container
  Normal   Started              4s (x2 over 5s)  kubelet            Started container lifecycle-demo-container
  Warning  FailedPostStartHook  4s (x2 over 5s)  kubelet            Exec lifecycle hook ([badcommand]) for Container "lifecycle-demo-container" in Pod "lifecycle-demo_default(30229739-9651-4e5a-9a32-a8f1688862db)" failed - error: command 'badcommand' exited with 126: , message: "OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: exec: \"badcommand\": executable file not found in $PATH: unknown\r\n"
  Normal   Killing              4s (x2 over 5s)  kubelet            FailedPostStartHook
  Normal   Pulled               4s               kubelet            Successfully pulled image "nginx" in 215.66395ms
  Warning  BackOff              2s (x2 over 3s)  kubelet            Back-off restarting failed container
What's next
• Learn more about the Container environment.
• Get hands-on experience attaching handlers to Container lifecycle events.
Workloads
Understand Pods, the smallest deployable compute object in Kubernetes, and the higher-level
abstractions that help you to run them.
Kubernetes Pods have a defined lifecycle. For example, once a Pod is running in your cluster, a critical fault on the node where that Pod is running means that all the Pods on that node fail. Kubernetes treats that level of failure as final: you would need to create a new Pod to recover, even if the node later becomes healthy.
However, to make life considerably easier, you don't need to manage each Pod directly. Instead,
you can use workload resources that manage a set of pods on your behalf. These resources
configure controllers that make sure the right number of the right kind of pod are running, to
match the state you specified.
Kubernetes provides several built-in workload resources:
• Deployment and ReplicaSet
• StatefulSet
• DaemonSet
• Job and CronJob
In the wider Kubernetes ecosystem, you can find third-party workload resources that provide
additional behaviors. Using a custom resource definition, you can add in a third-party workload
resource if you want a specific behavior that's not part of Kubernetes' core. For example, if you
wanted to run a group of Pods for your application but stop work unless all the Pods are
available (perhaps for some high-throughput distributed task), then you can implement or
install an extension that does provide that feature.
What's next
As well as reading about each API kind for workload management, you can read how to do
specific tasks:
To learn about Kubernetes' mechanisms for separating code from configuration, visit
Configuration.
There are two supporting concepts that provide backgrounds about how Kubernetes manages
pods for applications:
• Garbage collection tidies up objects from your cluster after their owning resource has been
removed.
• The time-to-live after finished controller removes Jobs once a defined time has passed
since they completed.
Once your application is running, you might want to make it available on the internet as a Service or, for web applications only, using an Ingress.
Pods
Pods are the smallest deployable units of computing that you can create and manage in
Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more containers, with shared
storage and network resources, and a specification for how to run the containers. A Pod's
contents are always co-located and co-scheduled, and run in a shared context. A Pod models an
application-specific "logical host": it contains one or more application containers which are
relatively tightly coupled. In non-cloud contexts, applications executed on the same physical or
virtual machine are analogous to cloud applications executed on the same logical host.
As well as application containers, a Pod can contain init containers that run during Pod startup.
You can also inject ephemeral containers for debugging if your cluster offers this.
What is a Pod?
Note: You need to install a container runtime into each node in the cluster so that Pods can run
there.
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets
of isolation - the same things that isolate a container. Within a Pod's context, the individual
applications may have further sub-isolations applied.
A Pod is similar to a set of containers with shared namespaces and shared filesystem volumes.
Using Pods
The following is an example of a Pod which consists of a container running the image nginx:1.14.2.
pods/simple-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx:1.14.2
      ports:
        - containerPort: 80
Pods are generally not created directly; instead, they are created using workload resources. See Working with Pods for more information on how Pods are used with workload resources.
Workload resources for managing pods
Usually you don't need to create Pods directly, even singleton Pods. Instead, create them using
workload resources such as Deployment or Job. If your Pods need to track state, consider the
StatefulSet resource.
• Pods that run a single container. The "one-container-per-Pod" model is the most
common Kubernetes use case; in this case, you can think of a Pod as a wrapper around a
single container; Kubernetes manages Pods rather than managing the containers directly.
• Pods that run multiple containers that need to work together. A Pod can
encapsulate an application composed of multiple co-located containers that are tightly
coupled and need to share resources. These co-located containers form a single cohesive
unit of service—for example, one container serving data stored in a shared volume to the
public, while a separate sidecar container refreshes or updates those files. The Pod wraps
these containers, storage resources, and an ephemeral network identity together as a
single unit.
Each Pod is meant to run a single instance of a given application. If you want to scale your
application horizontally (to provide more overall resources by running more instances), you
should use multiple Pods, one for each instance. In Kubernetes, this is typically referred to as
replication. Replicated Pods are usually created and managed as a group by a workload resource
and its controller.
See Pods and controllers for more information on how Kubernetes uses workload resources,
and their controllers, to implement application scaling and auto-healing.
Pods are designed to support multiple cooperating processes (as containers) that form a
cohesive unit of service. The containers in a Pod are automatically co-located and co-scheduled
on the same physical or virtual machine in the cluster. The containers can share resources and
dependencies, communicate with one another, and coordinate when and how they are
terminated.
For example, you might have a container that acts as a web server for files in a shared volume,
and a separate "sidecar" container that updates those files from a remote source, as in the
following diagram:
Some Pods have init containers as well as app containers. By default, init containers run and
complete before the app containers are started.
Enabling the SidecarContainers feature gate allows you to specify restartPolicy: Always for init
containers. Setting the Always restart policy ensures that the init containers where you set it
are kept running during the entire lifetime of the Pod. See Sidecar containers and restartPolicy
for more details.
Pods natively provide two kinds of shared resources for their constituent containers:
networking and storage.
Note: Restarting a container in a Pod should not be confused with restarting a Pod. A Pod is
not a process, but an environment for running container(s). A Pod persists until it is deleted.
The name of a Pod must be a valid DNS subdomain value, but this can produce unexpected
results for the Pod hostname. For best compatibility, the name should follow the more
restrictive rules for a DNS label.
Pod OS
You should set the .spec.os.name field to either windows or linux to indicate the OS on which
you want the pod to run. These two are the only operating systems supported for now by
Kubernetes. In future, this list may be expanded.
In Kubernetes v1.28, the value you set for this field has no effect on scheduling of the pods.
Setting the .spec.os.name helps to identify the pod OS authoritatively and is used for validation.
The kubelet refuses to run a Pod where you have specified a Pod OS, if this isn't the same as the
operating system for the node where that kubelet is running. The Pod security standards also
use this field to avoid enforcing policies that aren't relevant to that operating system.
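As a minimal sketch, setting the Pod OS looks like this (the Pod and container names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: linux-pod
spec:
  os:
    name: linux        # must match the operating system of the node that runs the Pod
  containers:
    - name: app
      image: nginx:1.14.2
```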
You can use workload resources to create and manage multiple Pods for you. A controller for
the resource handles replication and rollout and automatic healing in case of Pod failure. For
example, if a Node fails, a controller notices that Pods on that Node have stopped working and
creates a replacement Pod. The scheduler places the replacement Pod onto a healthy Node.
Here are some examples of workload resources that manage one or more Pods:
• Deployment
• StatefulSet
• DaemonSet
Pod templates
Controllers for workload resources create Pods from a pod template and manage those Pods on
your behalf.
PodTemplates are specifications for creating Pods, and are included in workload resources such
as Deployments, Jobs, and DaemonSets.
Each controller for a workload resource uses the PodTemplate inside the workload object to
make actual Pods. The PodTemplate is part of the desired state of whatever workload resource
you used to run your app.
The sample below is a manifest for a simple Job with a template that starts one container. The
container in that Pod prints a message then pauses.
apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # This is the pod template
    spec:
      containers:
        - name: hello
          image: busybox:1.28
          command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
      restartPolicy: OnFailure
    # The pod template ends here
Modifying the pod template or switching to a new pod template has no direct effect on the Pods
that already exist. If you change the pod template for a workload resource, that resource needs
to create replacement Pods that use the updated template.
For example, the StatefulSet controller ensures that the running Pods match the current pod
template for each StatefulSet object. If you edit the StatefulSet to change its pod template, the
StatefulSet starts to create new Pods based on the updated template. Eventually, all of the old
Pods are replaced with new Pods, and the update is complete.
Each workload resource implements its own rules for handling changes to the Pod template. If
you want to read more about StatefulSet specifically, read Update strategy in the StatefulSet
Basics tutorial.
On Nodes, the kubelet does not directly observe or manage any of the details around pod
templates and updates; those details are abstracted away. That abstraction and separation of
concerns simplifies system semantics, and makes it feasible to extend the cluster's behavior
without changing existing code.
• Most of the metadata about a Pod is immutable. For example, you cannot change the namespace, name, uid, or creationTimestamp fields. The generation field is unique: it only accepts updates that increment the field's current value.
• When updating the spec.activeDeadlineSeconds field, two types of updates are allowed:
  1. setting the unset field to a positive number;
  2. updating the field from a positive number to a smaller, non-negative number.
Storage in Pods
A Pod can specify a set of shared storage volumes. All containers in the Pod can access the
shared volumes, allowing those containers to share data. Volumes also allow persistent data in a
Pod to survive in case one of the containers within needs to be restarted. See Storage for more
information on how Kubernetes implements shared storage and makes it available to Pods.
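A sketch of two containers sharing data through a Pod-level volume (the names, images, and mount paths are illustrative; emptyDir is one of several volume types):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-pod
spec:
  volumes:
    - name: shared-data
      emptyDir: {}                       # scratch volume that lives as long as the Pod
  containers:
    - name: web                          # serves the files written by the helper
      image: nginx:1.14.2
      volumeMounts:
        - name: shared-data
          mountPath: /usr/share/nginx/html
    - name: helper                       # writes into the same volume
      image: busybox:1.28
      command: ['sh', '-c', 'echo hello > /pod-data/index.html && sleep 3600']
      volumeMounts:
        - name: shared-data
          mountPath: /pod-data
```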
Pod networking
Each Pod is assigned a unique IP address for each address family. Every container in a Pod shares the network namespace, including the IP address and network ports. Inside a Pod (and only then), the containers that belong to the Pod can communicate with one another using localhost. When containers in a Pod communicate with entities outside the Pod, they must coordinate how they use the shared network resources (such as ports). The containers in a Pod can also communicate with each other using standard inter-process communications like SystemV semaphores or POSIX shared memory. Containers in different Pods have distinct IP addresses and cannot communicate by OS-level IPC without special configuration; containers that want to interact with a container running in a different Pod can use IP networking to communicate.
Containers within the Pod see the system hostname as being the same as the configured name
for the Pod. There's more about this in the networking section.
In Linux, any container in a Pod can enable privileged mode using the privileged (Linux) flag on
the security context of the container spec. This is useful for containers that want to use
operating system administrative capabilities such as manipulating the network stack or
accessing hardware devices.
Static Pods
Static Pods are managed directly by the kubelet daemon on a specific node, without the API
server observing them. Whereas most Pods are managed by the control plane (for example, a
Deployment), for static Pods, the kubelet directly supervises each static Pod (and restarts it if it
fails).
Static Pods are always bound to one Kubelet on a specific node. The main use for static Pods is
to run a self-hosted control plane: in other words, using the kubelet to supervise the individual
control plane components.
The kubelet automatically tries to create a mirror Pod on the Kubernetes API server for each
static Pod. This means that the Pods running on a node are visible on the API server, but cannot
be controlled from there. See the guide Create static Pods for more information.
Note: The spec of a static Pod cannot refer to other API objects (e.g., ServiceAccount,
ConfigMap, Secret, etc).
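As a sketch, a static Pod is defined by placing an ordinary Pod manifest in the kubelet's manifest directory (commonly /etc/kubernetes/manifests, though the path is configurable; the file and Pod names here are hypothetical):

```yaml
# /etc/kubernetes/manifests/static-web.yaml  (hypothetical path)
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
    - name: web
      image: nginx:1.14.2
      ports:
        - containerPort: 80
```

The kubelet watches this directory and starts, restarts, and stops the Pod directly, with no Deployment or other controller involved.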
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a
diagnostic, the kubelet can invoke different actions:
You can read more about probes in the Pod Lifecycle documentation.
What's next
• Learn about the lifecycle of a Pod.
• Learn about RuntimeClass and how you can use it to configure different Pods with
different container runtime configurations.
• Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
• Pod is a top-level resource in the Kubernetes REST API. The Pod object definition
describes the object in detail.
• The Distributed System Toolkit: Patterns for Composite Containers explains common
layouts for Pods with more than one container.
• Read about Pod topology spread constraints
To understand the context for why Kubernetes wraps a common Pod API in other resources
(such as StatefulSets or Deployments), you can read about the prior art, including:
• Aurora
• Borg
• Marathon
• Omega
• Tupperware.
Pod Lifecycle
This page describes the lifecycle of a Pod. Pods follow a defined lifecycle, starting in the
Pending phase, moving through Running if at least one of its primary containers starts OK, and
then through either the Succeeded or Failed phases depending on whether any container in the
Pod terminated in failure.
Whilst a Pod is running, the kubelet is able to restart containers to handle some kind of faults.
Within a Pod, Kubernetes tracks different container states and determines what action to take
to make the Pod healthy again.
In the Kubernetes API, Pods have both a specification and an actual status. The status for a Pod
object consists of a set of Pod conditions. You can also inject custom readiness information into
the condition data for a Pod, if that is useful to your application.
Pods are only scheduled once in their lifetime. Once a Pod is scheduled (assigned) to a Node,
the Pod runs on that Node until it stops or is terminated.
Pod lifetime
Like individual application containers, Pods are considered to be relatively ephemeral (rather
than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes
where they remain until termination (according to restart policy) or deletion. If a Node dies, the
Pods scheduled to that node are scheduled for deletion after a timeout period.
Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is
deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node
maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the
work of managing the relatively disposable Pod instances.
A given Pod (as defined by a UID) is never "rescheduled" to a different node; instead, that Pod
can be replaced by a new, near-identical Pod, with even the same name if desired, but with a
different UID.
When something is said to have the same lifetime as a Pod, such as a volume, that means that
the thing exists as long as that specific Pod (with that exact UID) exists. If that Pod is deleted for
any reason, and even if an identical replacement is created, the related thing (a volume, in this
example) is also destroyed and created anew.
Pod diagram
A multi-container Pod that contains a file puller and a web server that uses a persistent volume
for shared storage between the containers.
Pod phase
A Pod's status field is a PodStatus object, which has a phase field.
The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle. The
phase is not intended to be a comprehensive rollup of observations of container or Pod state,
nor is it intended to be a comprehensive state machine.
The number and meanings of Pod phase values are tightly guarded. Other than what is
documented here, nothing should be assumed about Pods that have a given phase value.
Value: Description
Pending: The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running: The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
Succeeded: All containers in the Pod have terminated in success, and will not be restarted.
Failed: All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.
Unknown: For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running.
Note: When a Pod is being deleted, it is shown as Terminating by some kubectl commands.
This Terminating status is not one of the Pod phases. A Pod is granted a term to terminate
gracefully, which defaults to 30 seconds. You can use the flag --force to terminate a Pod by
force.
Since Kubernetes 1.27, the kubelet transitions deleted Pods, except for static Pods and force-
deleted Pods without a finalizer, to a terminal phase (Failed or Succeeded depending on the exit
statuses of the pod containers) before their deletion from the API server.
If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy for
setting the phase of all Pods on the lost node to Failed.
Container states
As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a
Pod. You can use container lifecycle hooks to trigger events to run at certain points in a
container's lifecycle.
Once the scheduler assigns a Pod to a Node, the kubelet starts creating containers for that Pod
using a container runtime. There are three possible container states: Waiting, Running, and
Terminated.
To check the state of a Pod's containers, you can use kubectl describe pod <name-of-pod>. The
output shows the state for each container within that Pod.
Waiting
If a container is not in either the Running or Terminated state, it is Waiting. A container in the
Waiting state is still running the operations it requires in order to complete start up: for
example, pulling the container image from a container image registry, or applying Secret data.
When you use kubectl to query a Pod with a container that is Waiting, you also see a Reason
field to summarize why the container is in that state.
Running
The Running status indicates that a container is executing without issues. If there was a
postStart hook configured, it has already executed and finished. When you use kubectl to query
a Pod with a container that is Running, you also see information about when the container
entered the Running state.
Terminated
A container in the Terminated state began execution and then either ran to completion or failed
for some reason. When you use kubectl to query a Pod with a container that is Terminated, you
see a reason, an exit code, and the start and finish time for that container's period of execution.
If a container has a preStop hook configured, this hook runs before the container enters the
Terminated state.
Container restart policy
The spec of a Pod has a restartPolicy field with possible values Always, OnFailure, and Never; the default value is Always. The restartPolicy applies to all containers in the Pod, and only refers to restarts of the containers by the kubelet on the same node. After containers in a Pod exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, ...), that is capped at five minutes. Once a container has executed for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.
Pod conditions
A Pod has a PodStatus, which has an array of PodConditions through which the Pod has or has not passed. Kubelet manages the following PodConditions:
• PodScheduled: the Pod has been scheduled to a node.
• ContainersReady: all containers in the Pod are ready.
• Initialized: all init containers have completed successfully.
• Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
Pod readiness
Your application can inject extra feedback or signals into PodStatus: Pod readiness. To use this,
set readinessGates in the Pod's spec to specify a list of additional conditions that the kubelet
evaluates for Pod readiness.
Readiness gates are determined by the current state of status.condition fields for the Pod. If
Kubernetes cannot find such a condition in the status.conditions field of a Pod, the status of the
condition is defaulted to "False".
Here is an example:
kind: Pod
...
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready                         # a built in PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
    - type: "www.example.com/feature-1"   # an extra PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
    - containerID: docker://abcd...
      ready: true
...
The Pod conditions you add must have names that meet the Kubernetes label key format.
The kubectl patch command does not support patching object status. To set these
status.conditions for the Pod, applications and operators should use the PATCH action. You can
use a Kubernetes client library to write code that sets custom Pod conditions for Pod readiness.
For a Pod that uses custom conditions, that Pod is evaluated to be ready only when both the following statements apply:
• All containers in the Pod are ready.
• All conditions specified in readinessGates are True.
When a Pod's containers are Ready but at least one custom condition is missing or False, the kubelet sets the Pod's condition to ContainersReady.
After a Pod gets scheduled on a node, it needs to be admitted by the Kubelet and have any
volumes mounted. Once these phases are complete, the Kubelet works with a container runtime
(using Container runtime interface (CRI)) to set up a runtime sandbox and configure
networking for the Pod. If the PodReadyToStartContainersCondition feature gate is enabled,
Kubelet reports whether a pod has reached this initialization milestone through the
PodReadyToStartContainers condition in the status.conditions field of a Pod.
The PodReadyToStartContainers condition is set to False by the Kubelet when it detects a Pod
does not have a runtime sandbox with networking configured. This occurs in the following
scenarios:
• Early in the lifecycle of the Pod, when the kubelet has not yet begun to set up a sandbox
for the Pod using the container runtime.
• Later in the lifecycle of the Pod, when the Pod sandbox has been destroyed due to either:
◦ the node rebooting, without the Pod getting evicted
◦ for container runtimes that use virtual machines for isolation, the Pod sandbox
virtual machine rebooting, which then requires creating a new sandbox and fresh
container network configuration.
The PodReadyToStartContainers condition is set to True by the kubelet after the successful
completion of sandbox creation and network configuration for the Pod by the runtime plugin.
The kubelet can start pulling container images and create containers after
PodReadyToStartContainers condition has been set to True.
For a Pod with init containers, the kubelet sets the Initialized condition to True after the init
containers have successfully completed (which happens after successful sandbox creation and
network configuration by the runtime plugin). For a Pod without init containers, the kubelet
sets the Initialized condition to True before sandbox creation and network configuration starts.
Container probes
A probe is a diagnostic performed periodically by the kubelet on a container. To perform a
diagnostic, the kubelet either executes code within the container, or makes a network request.
Check mechanisms
There are four different ways to check a container using a probe. Each probe must define
exactly one of these four mechanisms:
exec
Executes a specified command inside the container. The diagnostic is considered
successful if the command exits with a status code of 0.
grpc
Performs a remote procedure call using gRPC. The target should implement gRPC health
checks. The diagnostic is considered successful if the status of the response is SERVING.
httpGet
Performs an HTTP GET request against the Pod's IP address on a specified port and path.
The diagnostic is considered successful if the response has a status code greater than or
equal to 200 and less than 400.
tcpSocket
Performs a TCP check against the Pod's IP address on a specified port. The diagnostic is
considered successful if the port is open. If the remote system (the container) closes the
connection immediately after it opens, this counts as healthy.
Caution: Unlike the other mechanisms, the exec probe's implementation involves creating/forking multiple processes each time it is executed. As a result, on clusters with higher pod densities and lower initialDelaySeconds and periodSeconds intervals, configuring any probe with the exec mechanism might introduce overhead on the CPU usage of the node. In such scenarios, consider using the alternative probe mechanisms to avoid the overhead.
Probe outcome
Success
The container passed the diagnostic.
Failure
The container failed the diagnostic.
Unknown
The diagnostic failed (no action should be taken, and the kubelet will make further
checks).
Types of probe
The kubelet can optionally perform and react to three kinds of probes on running containers:
livenessProbe
Indicates whether the container is running. If the liveness probe fails, the kubelet kills the
container, and the container is subjected to its restart policy. If a container does not
provide a liveness probe, the default state is Success.
readinessProbe
Indicates whether the container is ready to respond to requests. If the readiness probe
fails, the endpoints controller removes the Pod's IP address from the endpoints of all
Services that match the Pod. The default state of readiness before the initial delay is
Failure. If a container does not provide a readiness probe, the default state is Success.
startupProbe
Indicates whether the application within the container is started. All other probes are
disabled if a startup probe is provided, until it succeeds. If the startup probe fails, the
kubelet kills the container, and the container is subjected to its restart policy. If a
container does not provide a startup probe, the default state is Success.
For more information about how to set up a liveness, readiness, or startup probe, see Configure
Liveness, Readiness and Startup Probes.
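For instance, a container might pair a liveness probe with a separate readiness endpoint. This is a sketch only; the image name, paths, and timings are all illustrative:

```yaml
containers:
- name: myapp-container
  image: myapp:1.0              # hypothetical image
  livenessProbe:                # failure triggers a container restart
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
  readinessProbe:               # failure removes the Pod from Service endpoints
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```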
If the process in your container is able to crash on its own whenever it encounters an issue or
becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically
perform the correct action in accordance with the Pod's restartPolicy.
If you'd like your container to be killed and restarted if a probe fails, then specify a liveness
probe, and specify a restartPolicy of Always or OnFailure.
If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness
probe. In this case, the readiness probe might be the same as the liveness probe, but the
existence of the readiness probe in the spec means that the Pod will start without receiving any
traffic and only start receiving traffic after the probe starts succeeding.
If you want your container to be able to take itself down for maintenance, you can specify a
readiness probe that checks an endpoint specific to readiness that is different from the liveness
probe.
If your app has a strict dependency on back-end services, you can implement both a liveness
and a readiness probe. The liveness probe passes when the app itself is healthy, but the
readiness probe additionally checks that each required back-end service is available. This helps
you avoid directing traffic to Pods that can only respond with error messages.
If your container needs to work on loading large data, configuration files, or migrations during
startup, you can use a startup probe. However, if you want to detect the difference between an
app that has failed and an app that is still processing its startup data, you might prefer a
readiness probe.
Note: If you want to be able to drain requests when the Pod is deleted, you do not necessarily
need a readiness probe; on deletion, the Pod automatically puts itself into an unready state
regardless of whether the readiness probe exists. The Pod remains in the unready state while it
waits for the containers in the Pod to stop.
Startup probes are useful for Pods that have containers that take a long time to come into
service. Rather than set a long liveness interval, you can configure a separate configuration for
probing the container as it starts up, allowing a time longer than the liveness interval would
allow.
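For example, a startup probe such as the following sketch (values illustrative) gives the container up to failureThreshold × periodSeconds = 30 × 10 = 300 seconds to start; once it succeeds, the liveness probe takes over at its own, shorter cadence:

```yaml
startupProbe:
  httpGet:
    path: /healthz           # illustrative endpoint
    port: 8080
  failureThreshold: 30       # allows up to 30 * 10 = 300 seconds for startup
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10          # normal liveness cadence after startup succeeds
```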
Termination of Pods
Because Pods represent processes running on nodes in the cluster, it is important to allow those
processes to gracefully terminate when they are no longer needed (rather than being abruptly
stopped with a KILL signal and having no chance to clean up).
The design aim is for you to be able to request deletion and know when processes terminate,
but also be able to ensure that deletes eventually complete. When you request deletion of a Pod,
the cluster records and tracks the intended grace period before the Pod is allowed to be
forcefully killed. With that forceful shutdown tracking in place, the kubelet attempts graceful
shutdown.
Typically, with this graceful termination of the Pod, the kubelet makes requests to the container
runtime to stop the containers in the Pod by first sending a TERM (also known as SIGTERM)
signal, with a grace period timeout, to the main process in each container. The requests to stop
the containers are processed by the container runtime asynchronously. There is no guarantee to
the order of processing for these requests. Many container runtimes respect the STOPSIGNAL
value defined in the container image and, if different, send the container image configured
STOPSIGNAL instead of TERM. Once the grace period has expired, the KILL signal is sent to
any remaining processes, and the Pod is then deleted from the API Server. If the kubelet or the
container runtime's management service is restarted while waiting for processes to terminate,
the cluster retries from the start including the full original grace period.
An example flow:
1. You use the kubectl tool to manually delete a specific Pod, with the default grace period
(30 seconds).
2. The Pod in the API server is updated with the time beyond which the Pod is considered
"dead" along with the grace period. If you use kubectl describe to check the Pod you're
deleting, that Pod shows up as "Terminating". On the node where the Pod is running: as
soon as the kubelet sees that a Pod has been marked as terminating (a graceful shutdown
duration has been set), the kubelet begins the local Pod shutdown process.
1. If one of the Pod's containers has defined a preStop hook and the
terminationGracePeriodSeconds in the Pod spec is not set to 0, the kubelet runs
that hook inside of the container. The default terminationGracePeriodSeconds
setting is 30 seconds.
If the preStop hook is still running after the grace period expires, the kubelet
requests a small, one-off grace period extension of 2 seconds.
Note: If the preStop hook needs longer to complete than the default grace period
allows, you must modify terminationGracePeriodSeconds to suit this.
2. The kubelet triggers the container runtime to send a TERM signal to process 1
inside each container.
Note: The containers in the Pod receive the TERM signal at different times and in
an arbitrary order. If the order of shutdowns matters, consider using a preStop
hook to synchronize.
3. At the same time as the kubelet is starting graceful shutdown of the Pod, the control
plane evaluates whether to remove that shutting-down Pod from EndpointSlice (and
Endpoints) objects, where those objects represent a Service with a configured selector.
ReplicaSets and other workload resources no longer treat the shutting-down Pod as a
valid, in-service replica.
Pods that shut down slowly should not continue to serve regular traffic and should start
terminating and finish processing open connections. Some applications need to go
beyond finishing open connections and need more graceful termination, for example,
session draining and completion.
Any endpoints that represent the terminating Pods are not immediately removed from
EndpointSlices, and a status indicating terminating state is exposed from the
EndpointSlice API (and the legacy Endpoints API). Terminating endpoints always have
their ready status set to false (for backward compatibility with versions before 1.26), so load
balancers do not use them for regular traffic.
If you need to drain traffic from terminating Pods, the actual readiness can be checked via the
serving condition. You can find more details on how to implement connection draining in the
tutorial Pods And Endpoints Termination Flow.
Note: The behavior above is described for when the EndpointSliceTerminatingCondition feature
gate is enabled (the gate is on by default from Kubernetes 1.22, and locked to default in 1.26). If
you don't have that feature gate enabled in your cluster, the Kubernetes control plane removes a
Pod from any relevant EndpointSlices as soon as the Pod's termination grace period begins.
4. When the grace period expires, the kubelet triggers forcible shutdown. The container
runtime sends SIGKILL to any processes still running in any container in the Pod. The
kubelet also cleans up a hidden pause container if that container runtime uses one.
5. The kubelet transitions the Pod into a terminal phase (Failed or Succeeded depending on
the end state of its containers). This step is guaranteed since version 1.27.
6. The kubelet triggers forcible removal of the Pod object from the API server, by setting grace
period to 0 (immediate deletion).
7. The API server deletes the Pod's API object, which is then no longer visible from any
client.
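The preStop hook and grace period from the flow above are configured on the Pod itself; here is a minimal sketch (the Pod name and the hook command are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo        # hypothetical name
spec:
  terminationGracePeriodSeconds: 60   # default is 30 seconds
  containers:
  - name: app
    image: nginx:1.25
    lifecycle:
      preStop:
        exec:
          # Illustrative: pause briefly so in-flight requests can finish
          # before the runtime sends TERM to the main process.
          command: ['sh', '-c', 'sleep 10']
```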
Caution: Forced deletions can be potentially disruptive for some workloads and their Pods.
By default, all deletes are graceful within 30 seconds. The kubectl delete command supports the
--grace-period=<seconds> option which allows you to override the default and specify your
own value.
Setting the grace period to 0 forcibly and immediately deletes the Pod from the API server. If
the Pod was still running on a node, that forcible deletion triggers the kubelet to begin
immediate cleanup.
Note: You must specify an additional flag --force along with --grace-period=0 in order to
perform force deletions.
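For example, assuming a Pod named my-pod (a hypothetical name), a force deletion looks like:

```shell
kubectl delete pod my-pod --grace-period=0 --force
```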
When a force deletion is performed, the API server does not wait for confirmation from the
kubelet that the Pod has been terminated on the node it was running on. It removes the Pod in
the API immediately so a new Pod can be created with the same name. On the node, Pods that
are set to terminate immediately will still be given a small grace period before being force
killed.
Caution: Immediate deletion does not wait for confirmation that the running resource has
been terminated. The resource may continue to run on the cluster indefinitely.
If you need to force-delete Pods that are part of a StatefulSet, refer to the task documentation
for deleting Pods from a StatefulSet.
For failed Pods, the API objects remain in the cluster's API until a human or controller process
explicitly removes them.
The Pod garbage collector (PodGC), which is a controller in the control plane, cleans up
terminated Pods (with a phase of Succeeded or Failed), when the number of Pods exceeds the
configured threshold (determined by terminated-pod-gc-threshold in the kube-controller-
manager). This avoids a resource leak as Pods are created and terminated over time.
Additionally, PodGC cleans up any Pods which satisfy any of the following conditions:
When the PodDisruptionConditions feature gate is enabled, along with cleaning up the Pods,
PodGC will also mark them as failed if they are in a non-terminal phase. Also, PodGC adds a
Pod disruption condition when cleaning up an orphan Pod. See Pod disruption conditions for
more details.
What's next
• Get hands-on experience attaching handlers to container lifecycle events.
• For detailed information about Pod and container status in the API, see the API reference
documentation covering status for Pod.
Init Containers
This page provides an overview of init containers: specialized containers that run before app
containers in a Pod. Init containers can contain utilities or setup scripts not present in an app
image.
You can specify init containers in the Pod specification alongside the containers array (which
describes app containers).
If a Pod's init container fails, the kubelet repeatedly restarts that init container until it succeeds.
However, if the Pod has a restartPolicy of Never, and an init container fails during startup of
that Pod, Kubernetes treats the overall Pod as failed.
To specify an init container for a Pod, add the initContainers field into the Pod specification, as
an array of container items (similar to the app containers field and its contents). See Container
in the API reference for more details.
Init containers support all the fields and features of app containers, including resource limits,
volumes, and security settings. However, the resource requests and limits for an init container
are handled differently, as documented in Resource sharing within containers.
• Init containers can contain utilities or custom code for setup that are not present in an
app image. For example, there is no need to make an image FROM another image just to
use a tool like sed, awk, python, or dig during setup.
• The application image builder and deployer roles can work independently without the
need to jointly build a single app image.
• Init containers can run with a different view of the filesystem than app containers in the
same Pod. Consequently, they can be given access to Secrets that app containers cannot
access.
• Because init containers run to completion before any app containers start, init containers
offer a mechanism to block or delay app container startup until a set of preconditions are
met. Once preconditions are met, all of the app containers in a Pod can start in parallel.
• Init containers can securely run utilities or custom code that would otherwise make an
app container image less secure. By keeping unnecessary tools separate you can limit the
attack surface of your app container image.
Examples
• Wait for a Service to be created, using a shell one-line command like:
  for i in {1..100}; do sleep 1; if nslookup myservice; then exit 0; fi; done; exit 1
• Register this Pod with a remote server from the downward API with a command like:
• Wait for some time before starting the app container with a command like:
  sleep 60
• Place values into a configuration file and run a template tool to dynamically generate a
configuration file for the main app container. For example, place the POD_IP value in a
configuration and generate the main app configuration file using Jinja.
This example defines a simple Pod that has two init containers. The first waits for myservice,
and the second waits for mydb. Once both init containers complete, the Pod runs the app
container from its spec section.
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app.kubernetes.io/name: MyApp
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]
pod/myapp-pod created
Name:         myapp-pod
Namespace:    default
[...]
Labels:       app.kubernetes.io/name=MyApp
Status:       Pending
[...]
Init Containers:
  init-myservice:
    [...]
    State:    Running
    [...]
  init-mydb:
    [...]
    State:    Waiting
      Reason: PodInitializing
    Ready:    False
    [...]
Containers:
  myapp-container:
    [...]
    State:    Waiting
      Reason: PodInitializing
    Ready:    False
    [...]
Events:
  FirstSeen  LastSeen  Count  From                    SubObjectPath                        Type    Reason     Message
  ---------  --------  -----  ----                    -------------                        ----    ------     -------
  16s        16s       1      {default-scheduler }                                         Normal  Scheduled  Successfully assigned myapp-pod to 172.17.4.201
  16s        16s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Pulling    pulling image "busybox"
  13s        13s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Pulled     Successfully pulled image "busybox"
  13s        13s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Created    Created container init-myservice
  13s        13s       1      {kubelet 172.17.4.201}  spec.initContainers{init-myservice}  Normal  Started    Started container init-myservice
At this point, those init containers will be waiting to discover Services named mydb and
myservice.
---
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
---
apiVersion: v1
kind: Service
metadata:
  name: mydb
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9377
service/myservice created
service/mydb created
You'll then see that those init containers complete, and that the myapp-pod Pod moves into the
Running state:
This simple example should provide some inspiration for you to create your own init
containers. What's next contains a link to a more detailed example.
Detailed behavior
During Pod startup, the kubelet delays running init containers until the networking and storage
are ready. Then the kubelet runs the Pod's init containers in the order they appear in the Pod's
spec.
Each init container must exit successfully before the next container starts. If a container fails to
start due to the runtime or exits with failure, it is retried according to the Pod restartPolicy.
However, if the Pod restartPolicy is set to Always, the init containers use restartPolicy
OnFailure.
A Pod cannot be Ready until all init containers have succeeded. The ports on an init container
are not aggregated under a Service. A Pod that is initializing is in the Pending state but should
have a condition Initialized set to false.
If the Pod restarts, or is restarted, all init containers must execute again.
Changes to the init container spec are limited to the container image field. Altering an init
container image field is equivalent to restarting the Pod.
Because init containers can be restarted, retried, or re-executed, init container code should be
idempotent. In particular, code that writes to files on EmptyDirs should be prepared for the
possibility that an output file already exists.
Init containers have all of the fields of an app container. However, Kubernetes prohibits
readinessProbe from being used because init containers cannot define readiness distinct from
completion. This is enforced during validation.
Use activeDeadlineSeconds on the Pod to prevent init containers from failing forever. The active
deadline includes init containers. However, it is recommended to use activeDeadlineSeconds
only if teams deploy their application as a Job, because activeDeadlineSeconds has an effect
even after the init containers have finished: a Pod that is already running correctly would be
killed once the deadline passes.
The name of each app and init container in a Pod must be unique; a validation error is thrown
for any container sharing a name with another.
Starting with Kubernetes 1.28 in alpha, a feature gate named SidecarContainers allows you to
specify a restartPolicy for init containers which is independent of the Pod and other init
containers. Container probes can also be added to control their lifecycle.
If an init container is created with its restartPolicy set to Always, it will start and remain
running during the entire life of the Pod, which is useful for running supporting services
separated from the main application containers.
If a readinessProbe is specified for this init container, its result will be used to determine the
ready state of the Pod.
Since these containers are defined as init containers, they benefit from the same ordering and
sequential guarantees as other init containers, allowing them to be mixed with other init
containers into complex Pod initialization flows.
Compared to regular init containers, sidecar-style init containers continue to run and the next
init container can begin starting once the kubelet has set the started container status for the
sidecar-style init container to true. That status either becomes true because there is a process
running in the container and no startup probe defined, or as a result of its startupProbe
succeeding.
This feature can be used to implement the sidecar container pattern in a more robust way, as
the kubelet always restarts a sidecar container if it fails.
application/deployment-sidecar.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: alpine:latest
        command: ['sh', '-c', 'while true; do echo "logging" >> /opt/logs.txt; sleep 1; done']
        volumeMounts:
        - name: data
          mountPath: /opt
      initContainers:
      - name: logshipper
        image: alpine:latest
        restartPolicy: Always
        command: ['sh', '-c', 'tail -F /opt/logs.txt']
        volumeMounts:
        - name: data
          mountPath: /opt
      volumes:
      - name: data
        emptyDir: {}
This feature is also useful for running Jobs with sidecars, as the sidecar container will not
prevent the Job from completing after the main container has finished.
application/job/job-sidecar.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  template:
    spec:
      containers:
      - name: myjob
        image: alpine:latest
        command: ['sh', '-c', 'echo "logging" > /opt/logs.txt']
        volumeMounts:
        - name: data
          mountPath: /opt
      initContainers:
      - name: logshipper
        image: alpine:latest
        restartPolicy: Always
        command: ['sh', '-c', 'tail -F /opt/logs.txt']
        volumeMounts:
        - name: data
          mountPath: /opt
      restartPolicy: Never
      volumes:
      - name: data
        emptyDir: {}
Given the ordering and execution for init containers, the following rules for resource usage
apply:
• The highest of any particular resource request or limit defined on all init containers is the
effective init request/limit. If any resource has no limit specified, this is considered the
highest limit.
• The Pod's effective request/limit for a resource is the higher of:
◦ the sum of all app containers request/limit for a resource
◦ the effective init request/limit for a resource
• Scheduling is done based on effective requests/limits, which means init containers can
reserve resources for initialization that are not used during the life of the Pod.
• The Pod's effective QoS (quality of service) tier is the QoS tier for init containers and app
containers alike.
Quota and limits are applied based on the effective Pod request and limit.
Pod level control groups (cgroups) are based on the effective Pod request and limit, the same as
the scheduler.
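As a worked example of these rules (container names and values are hypothetical), consider CPU requests only:

```yaml
# Hypothetical Pod fragment illustrating the effective request calculation.
spec:
  initContainers:
  - name: init-a
    resources:
      requests:
        cpu: 100m
  - name: init-b
    resources:
      requests:
        cpu: 200m       # effective init request = max(100m, 200m) = 200m
  containers:
  - name: app
    resources:
      requests:
        cpu: 150m       # sum of app container requests = 150m
# Pod's effective CPU request = max(200m, 150m) = 200m,
# which is what the scheduler and Pod cgroup use.
```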
A Pod can restart, causing re-execution of init containers, for the following reasons:
• The Pod infrastructure container is restarted. This is uncommon and would have to be
done by someone with root access to nodes.
• All containers in a Pod are terminated while restartPolicy is set to Always, forcing a
restart, and the init container completion record has been lost due to garbage collection.
The Pod will not be restarted when the init container image is changed, or the init container
completion record has been lost due to garbage collection. This applies for Kubernetes v1.20 and
later. If you are using an earlier version of Kubernetes, consult the documentation for the
version you are using.
What's next
• Read about creating a Pod that has an init container
• Learn how to debug init containers
• Read about an overview of kubelet and kubectl
• Learn about the types of probes: liveness, readiness, startup probe.
Disruptions
This guide is for application owners who want to build highly available applications, and thus
need to understand what types of disruptions can happen to Pods.
It is also for cluster administrators who want to perform automated cluster actions, like
upgrading and autoscaling clusters.
Except for the out-of-resources condition, all these conditions should be familiar to most users;
they are not specific to Kubernetes.
We call other cases voluntary disruptions. These include both actions initiated by the application
owner and those initiated by a Cluster Administrator. Typical application owner actions
include:
These actions might be taken directly by the cluster administrator, or by automation run by the
cluster administrator, or by your cluster hosting provider.
Ask your cluster administrator or consult your cloud provider or distribution documentation to
determine if any sources of voluntary disruptions are enabled for your cluster. If none are
enabled, you can skip creating Pod Disruption Budgets.
Caution: Not all voluntary disruptions are constrained by Pod Disruption Budgets. For
example, deleting deployments or pods bypasses Pod Disruption Budgets.
Dealing with disruptions
Here are some ways to mitigate involuntary disruptions:
The frequency of voluntary disruptions varies. On a basic Kubernetes cluster, there are no
automated voluntary disruptions (only user-triggered ones). However, your cluster
administrator or hosting provider may run some additional services which cause voluntary
disruptions. For example, rolling out node software updates can cause voluntary disruptions.
Also, some implementations of cluster (node) autoscaling may cause voluntary disruptions to
defragment and compact nodes. Your cluster administrator or hosting provider should have
documented what level of voluntary disruptions, if any, to expect. Certain configuration
options, such as using PriorityClasses in your pod spec can also cause voluntary (and
involuntary) disruptions.
Kubernetes offers features to help you run highly available applications even when you
introduce frequent voluntary disruptions.
As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A
PDB limits the number of Pods of a replicated application that are down simultaneously from
voluntary disruptions. For example, a quorum-based application would like to ensure that the
number of replicas running is never brought below the number needed for a quorum. A web
front end might want to ensure that the number of replicas serving load never falls below a
certain percentage of the total.
Cluster managers and hosting providers should use tools which respect PodDisruptionBudgets
by calling the Eviction API instead of directly deleting pods or deployments.
For example, the kubectl drain subcommand lets you mark a node as going out of service.
When you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out
of service. The eviction request that kubectl submits on your behalf may be temporarily
rejected, so the tool periodically retries all failed requests until all Pods on the target node are
terminated, or until a configurable timeout is reached.
A PDB specifies the number of replicas that an application can tolerate having, relative to how
many it is intended to have. For example, a Deployment which has a .spec.replicas: 5 is
supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then the
Eviction API will allow voluntary disruption of one (but not two) pods at a time.
The group of pods that comprise the application is specified using a label selector, the same as
the one used by the application's controller (deployment, stateful-set, etc).
The "intended" number of pods is computed from the .spec.replicas of the workload resource
that is managing those pods. The control plane discovers the owning workload resource by
examining the .metadata.ownerReferences of the Pod.
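Continuing the Deployment example with .spec.replicas: 5, a PDB that tolerates one voluntarily disrupted Pod at a time could look like this sketch (the PDB name and selector labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb          # hypothetical name
spec:
  minAvailable: 4          # Eviction API allows one of 5 replicas to be disrupted at a time
  selector:
    matchLabels:
      app: myapp           # must match the labels used by the workload's controller
```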
Involuntary disruptions cannot be prevented by PDBs; however they do count against the
budget.
Pods which are deleted or unavailable due to a rolling upgrade to an application do count
against the disruption budget, but workload resources (such as Deployment and StatefulSet) are
not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during
application updates is configured in the spec for the specific workload resource.
When a pod is evicted using the eviction API, it is gracefully terminated, honoring the
terminationGracePeriodSeconds setting in its PodSpec.
PodDisruptionBudget example
Consider a cluster with 3 nodes, node-1 through node-3. The cluster is running several
applications. One of them has 3 replicas initially called pod-a, pod-b, and pod-c. Another,
unrelated pod without a PDB, called pod-x, is also shown. Initially, the pods are laid out as
follows:
All 3 pods are part of a deployment, and they collectively have a PDB which requires there be at
least 2 of the 3 pods to be available at all times.
For example, assume the cluster administrator wants to reboot into a new kernel version to fix
a bug in the kernel. The cluster administrator first tries to drain node-1 using the kubectl drain
command. That tool tries to evict pod-a and pod-x. This succeeds immediately. Both pods go
into the terminating state at the same time. This puts the cluster in this state:
The deployment notices that one of the pods is terminating, so it creates a replacement called
pod-d. Since node-1 is cordoned, it lands on another node. Something has also created pod-y as
a replacement for pod-x.
(Note: for a StatefulSet, pod-a, which would be called something like pod-0, would need to
terminate completely before its replacement, which is also called pod-0 but has a different UID,
could be created. Otherwise, the example applies to a StatefulSet as well.)
At some point, the pods terminate, and the cluster looks like this:
At this point, if an impatient cluster administrator tries to drain node-2 or node-3, the drain
command will block, because there are only 2 available pods for the deployment, and its PDB
requires at least 2. After some time passes, pod-d becomes available.
Now, the cluster administrator tries to drain node-2. The drain command will try to evict the
two pods in some order, say pod-b first and then pod-d. It will succeed at evicting pod-b. But,
when it tries to evict pod-d, it will be refused because that would leave only one pod available
for the deployment.
The deployment creates a replacement for pod-b called pod-e. Because there are not enough
resources in the cluster to schedule pod-e the drain will again block. The cluster may end up in
this state:
At this point, the cluster administrator needs to add a node back to the cluster to proceed with
the upgrade.
You can see how Kubernetes varies the rate at which disruptions can happen, according to:
PreemptionByScheduler
Pod is due to be preempted by a scheduler in order to accommodate a new Pod with a
higher priority. For more information, see Pod priority preemption.
DeletionByTaintManager
Pod is due to be deleted by Taint Manager (which is part of the node lifecycle controller
within kube-controller-manager) due to a NoExecute taint that the Pod does not tolerate;
see taint-based evictions.
EvictionByEvictionAPI
Pod has been marked for eviction using the Kubernetes API.
DeletionByPodGC
Pod, that is bound to a no longer existing Node, is due to be deleted by Pod garbage
collection.
TerminationByKubelet
Pod has been terminated by the kubelet, because of either node pressure eviction or the
graceful node shutdown.
Note: A Pod disruption might be interrupted. The control plane might re-attempt to continue
the disruption of the same Pod, but it is not guaranteed. As a result, the DisruptionTarget
condition might be added to a Pod, but that Pod might then not actually be deleted. In such a
situation, after some time, the Pod disruption condition will be cleared.
When the PodDisruptionConditions feature gate is enabled, along with cleaning up the pods,
the Pod garbage collector (PodGC) will also mark them as failed if they are in a non-terminal
phase (see also Pod garbage collection).
When using a Job (or CronJob), you may want to use these Pod disruption conditions as part of
your Job's Pod failure policy.
• when there are many application teams sharing a Kubernetes cluster, and there is natural
specialization of roles
• when third-party tools or services are used to automate cluster management
Pod Disruption Budgets support this separation of roles by providing an interface between the
roles.
If you do not have such a separation of responsibilities in your organization, you may not need
to use Pod Disruption Budgets.
How to perform Disruptive Actions on your Cluster
If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes
in your cluster, such as a node or system software upgrade, here are some options:
What's next
• Follow steps to protect your application by configuring a Pod Disruption Budget.
• Learn about updating a deployment including steps to maintain its availability during the
rollout.
Ephemeral Containers
FEATURE STATE: Kubernetes v1.25 [stable]
This page provides an overview of ephemeral containers: a special type of container that runs
temporarily in an existing Pod to accomplish user-initiated actions such as troubleshooting. You
use ephemeral containers to inspect services rather than to build applications.
Sometimes, however, it's necessary to inspect the state of an existing Pod; for example, to
troubleshoot a hard-to-reproduce bug. In these cases you can run an ephemeral container in an
existing Pod to inspect its state and run arbitrary commands.
Ephemeral containers differ from other containers in that they lack guarantees for resources or
execution, and they will never be automatically restarted, so they are not appropriate for
building applications. Ephemeral containers are described using the same ContainerSpec as
regular containers, but many fields are incompatible and disallowed for ephemeral containers.
• Ephemeral containers may not have ports, so fields such as ports, livenessProbe,
readinessProbe are disallowed.
• Pod resource allocations are immutable, so setting resources is disallowed.
• For a complete list of allowed fields, see the EphemeralContainer reference
documentation.
Ephemeral containers are created using a special ephemeralcontainers handler in the API rather
than by adding them directly to pod.spec, so it's not possible to add an ephemeral container
using kubectl edit.
Like regular containers, you may not change or remove an ephemeral container after you have
added it to a Pod.
In particular, distroless images enable you to deploy minimal container images that reduce
attack surface and exposure to bugs and vulnerabilities. Since distroless images do not include a
shell or any debugging utilities, it's difficult to troubleshoot distroless images using kubectl exec
alone.
When using ephemeral containers, it's helpful to enable process namespace sharing so you can
view processes in other containers.
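Because ephemeral containers are created through the dedicated ephemeralcontainers handler, the usual way to add one is kubectl debug. A sketch (the Pod name, container name, and image are placeholders; --target shares the process namespace with the named container where the runtime supports it):

```shell
# Attach an interactive debugging container to an existing Pod.
kubectl debug -it my-pod --image=busybox:1.36 --target=my-container
```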
What's next
• Learn how to debug pods using ephemeral containers.
Guaranteed
Pods that are Guaranteed have the strictest resource limits and are least likely to face eviction.
They are guaranteed not to be killed until they exceed their limits or there are no lower-priority
Pods that can be preempted from the Node. They may not acquire resources beyond their
specified limits. These Pods can also make use of exclusive CPUs using the static CPU
management policy.
Criteria
• Every Container in the Pod must have a memory limit and a memory request.
• For every Container in the Pod, the memory limit must equal the memory request.
• Every Container in the Pod must have a CPU limit and a CPU request.
• For every Container in the Pod, the CPU limit must equal the CPU request.
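A minimal sketch of a Pod that meets these criteria (name and image are placeholders; every container sets equal requests and limits for both CPU and memory):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed         # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:                  # limits equal requests => QoS class Guaranteed
        cpu: "500m"
        memory: "256Mi"
```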
Burstable
Pods that are Burstable have some lower-bound resource guarantees based on the request, but
do not require a specific limit. If a limit is not specified, it defaults to a limit equivalent to the
capacity of the Node, which allows the Pods to flexibly increase their resources if resources are
available. In the event of Pod eviction due to Node resource pressure, these Pods are evicted
only after all BestEffort Pods are evicted. Because a Burstable Pod can include a Container that
has no resource limits or requests, a Pod that is Burstable can try to use any amount of node
resources.
Criteria
• The Pod does not meet the criteria for QoS class Guaranteed.
• At least one Container in the Pod has a memory or CPU request or limit.
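For example, a Pod whose only resource setting is a memory request (no limits) is classified as Burstable (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable          # hypothetical name
spec:
  containers:
  - name: app
    image: nginx:1.14.2
    resources:
      requests:                # a request without a matching limit => QoS class Burstable
        memory: "128Mi"
```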
BestEffort
Pods in the BestEffort QoS class can use node resources that aren't specifically assigned to Pods
in other QoS classes. For example, if you have a node with 16 CPU cores available to the
kubelet, and you assign 4 CPU cores to a Guaranteed Pod, then a Pod in the BestEffort QoS
class can try to use any amount of the remaining 12 CPU cores.
The kubelet prefers to evict BestEffort Pods if the node comes under resource pressure.
Criteria
A Pod has a QoS class of BestEffort if it doesn't meet the criteria for either Guaranteed or
Burstable. In other words, a Pod is BestEffort only if none of the Containers in the Pod have a
memory limit or a memory request, and none of the Containers in the Pod have a CPU limit or
a CPU request. Containers in a Pod can request other resources (not CPU or memory) and still
be classified as BestEffort.
Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in
Kubernetes. Memory requests and limits of containers in a pod are used to set the
memory.min and memory.high interfaces provided by the memory controller. When memory.min is set to
memory requests, memory resources are reserved and never reclaimed by the kernel; this is
how Memory QoS ensures memory availability for Kubernetes pods. And if memory limits are
set in the container, this means that the system needs to limit container memory usage;
Memory QoS uses memory.high to throttle workload approaching its memory limit, ensuring
that the system is not overwhelmed by instantaneous memory allocation.
Memory QoS relies on QoS class to determine which settings to apply; however, these are
different mechanisms that both provide controls over quality of service.
• Any Container exceeding a resource limit will be killed and restarted by the kubelet
without affecting other Containers in that Pod.
• If a Container exceeds its resource request and the node it runs on faces resource
pressure, the Pod it is in becomes a candidate for eviction. If this occurs, all Containers in
the Pod will be terminated. Kubernetes may create a replacement Pod, usually on a
different node.
• The resource request of a Pod is equal to the sum of the resource requests of its
component Containers, and the resource limit of a Pod is equal to the sum of the resource
limits of its component Containers.
• The kube-scheduler does not consider QoS class when selecting which Pods to preempt.
Preemption can occur when a cluster does not have enough resources to run all the Pods
you defined.
What's next
• Learn about resource management for Pods and Containers.
• Learn about Node-pressure eviction.
• Learn about Pod priority and preemption.
• Learn about Pod disruptions.
• Learn how to assign memory resources to containers and pods.
• Learn how to assign CPU resources to containers and pods.
• Learn how to configure Quality of Service for Pods.
User Namespaces
FEATURE STATE: Kubernetes v1.25 [alpha]
This page explains how user namespaces are used in Kubernetes pods. A user namespace
isolates the user running inside the container from the one in the host.
A process running as root in a container can run as a different (non-root) user in the host; in
other words, the process has full privileges for operations inside the user namespace, but is
unprivileged for operations outside the namespace.
You can use this feature to reduce the damage a compromised container can do to the host or
other pods on the same node. Several security vulnerabilities rated HIGH or
CRITICAL were not exploitable when user namespaces were active, and user namespaces are
expected to mitigate some future vulnerabilities too.
This is a Linux-only feature and support is needed in Linux for idmap mounts on the
filesystems used. This means:
• On the node, the filesystem you use for /var/lib/kubelet/pods/, or the custom directory
you configure for this, needs idmap mount support.
• All the filesystems used in the pod's volumes must support idmap mounts.
In practice this means you need at least Linux 6.3, as tmpfs started supporting idmap mounts in
that version. This is usually needed as several Kubernetes features use tmpfs (the service
account token that is mounted by default uses a tmpfs, Secrets use a tmpfs, etc.)
Some popular filesystems that support idmap mounts in Linux 6.3 are: btrfs, ext4, xfs, fat, tmpfs,
overlayfs.
In addition, support is needed in the container runtime to use this feature with Kubernetes
pods:
• CRI-O: version 1.25 (and later) supports user namespaces for containers.
containerd v1.7 is not compatible with the userns support in Kubernetes v1.27 to v1.28.
Kubernetes v1.25 and v1.26 used an earlier implementation that is compatible with containerd
v1.7, in terms of userns support. If you are using a version of Kubernetes other than 1.28, check
the documentation for that version of Kubernetes for the most relevant information. If there is a
newer release of containerd than v1.7 available for use, also check the containerd
documentation for compatibility information.
You can see the status of user namespaces support in cri-dockerd tracked in an issue on GitHub.
Introduction
User namespaces are a Linux feature that lets you map users in the container to different users
on the host. Furthermore, the capabilities granted to a pod in a user namespace are valid only in
the namespace and void outside of it.
A pod can opt-in to use user namespaces by setting the pod.spec.hostUsers field to false.
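A sketch of such a Pod spec (the name, image, and command are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-demo            # hypothetical name
spec:
  hostUsers: false             # opt in to a user namespace for this Pod
  containers:
  - name: shell
    image: debian:bookworm
    command: ["sleep", "infinity"]
```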
The kubelet will pick host UIDs/GIDs a pod is mapped to, and will do so in a way to guarantee
that no two pods on the same node use the same mapping.
The runAsUser, runAsGroup, fsGroup, etc. fields in the pod.spec always refer to the user inside
the container.
When this feature is enabled, the valid UIDs/GIDs are in the range 0-65535. This applies to files
and processes (runAsUser, runAsGroup, etc.).
Files using a UID/GID outside this range will be seen as belonging to the overflow ID, usually
65534 (configured in /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid). However,
it is not possible to modify those files, even by running as the 65534 user/group.
Most applications that need to run as root, but don't access other host namespaces or resources,
should continue to run fine without any changes if user namespaces are activated.
When creating a pod, by default, several new namespaces are used for isolation: a network
namespace to isolate the network of the container, a PID namespace to isolate the view of
processes, etc. If a user namespace is used, this will isolate the users in the container from the
users in the node.
This means containers can run as root and be mapped to a non-root user on the host. Inside the
container the process will think it is running as root (and therefore tools like apt, yum, etc.
work fine), while in reality the process doesn't have privileges on the host. You can verify this,
for example, if you check which user the container process is running by executing ps aux from
the host. The user ps shows is not the same as the user you see if you execute inside the
container the command id.
This abstraction limits what can happen if, for example, the container manages to escape to the
host. Given that the container is running as a non-privileged user on the host, what it can do to
the host is limited.
Furthermore, as the users in each pod are mapped to different, non-overlapping users on the
host, what pods can do to each other is also limited.
Capabilities granted to a pod are also limited to the pod's user namespace and are mostly invalid
outside of it; some are even completely void. Here are two examples:
• CAP_SYS_MODULE does not have any effect if granted to a pod using user namespaces,
the pod isn't able to load kernel modules.
• CAP_SYS_ADMIN is limited to the pod's user namespace and invalid outside of it.
Without a user namespace, a container running as root has, in the case of a container breakout,
root privileges on the node. And if any capabilities were granted to the container, those
capabilities are valid on the host too. None of this is true when user namespaces are in use.
If you want to know more details about what changes when user namespaces are in use, see
man 7 user_namespaces.
The kubelet assigns pods UIDs/GIDs above 65535. Therefore, to guarantee as much isolation as
possible, the UIDs/GIDs used by the host's files and the host's processes should stay in the
range 0-65535.
Note that this recommendation is important to mitigate the impact of CVEs like
CVE-2021-25741, where a pod can potentially read arbitrary files on the host. If the UIDs/GIDs
of the pod and the host don't overlap, what a pod can do is limited: the pod's UID/GID won't
match the host's file owner/group.
Limitations
When using a user namespace for the pod, it is disallowed to use other host namespaces. In
particular, if you set hostUsers: false then you are not allowed to set any of:
• hostNetwork: true
• hostIPC: true
• hostPID: true
What's next
• Take a look at Use a User Namespace With a Pod
Downward API
There are two ways to expose Pod and container fields to a running container: environment
variables, and as files that are populated by a special volume type. Together, these two ways of
exposing Pod and container fields are called the downward API.
It is sometimes useful for a container to have information about itself, without being overly
coupled to Kubernetes. The downward API allows containers to consume information about
themselves or the cluster without using the Kubernetes client or API server.
An example is an existing application that assumes a particular well-known environment
variable holds a unique identifier. One possibility is to wrap the application, but that is tedious
and error-prone, and it violates the goal of low coupling. A better option would be to use the
Pod's name as an identifier, and inject the Pod's name into the well-known environment
variable.
In Kubernetes, there are two ways to expose Pod and container fields to a running container:
• as environment variables
• as files in a downwardAPI volume
Together, these two ways of exposing Pod and container fields are called the downward API.
Available fields
Only some Kubernetes API fields are available through the downward API. This section lists
which fields you can make available.
You can pass information from available Pod-level fields using fieldRef. At the API level, the
spec for a Pod always defines at least one Container. You can pass information from available
Container-level fields using resourceFieldRef.
For some Pod-level fields, you can provide them to a container either as an environment
variable or using a downwardAPI volume. The fields available via either mechanism are:
metadata.name
the pod's name
metadata.namespace
the pod's namespace
metadata.uid
the pod's unique ID
metadata.annotations['<KEY>']
the value of the pod's annotation named <KEY> (for example,
metadata.annotations['myannotation'])
metadata.labels['<KEY>']
the text value of the pod's label named <KEY> (for example, metadata.labels['mylabel'])
spec.serviceAccountName
the name of the pod's service account
spec.nodeName
the name of the node where the Pod is executing
status.hostIP
the primary IP address of the node to which the Pod is assigned
status.hostIPs
the IP addresses of the node; a dual-stack version of status.hostIP, where the first entry is
always the same as status.hostIP. The field is available if you enable the PodHostIPs feature gate.
status.podIP
the pod's primary IP address (usually, its IPv4 address)
status.podIPs
the pod's IP addresses; a dual-stack version of status.podIP, where the first entry is always the
same as status.podIP
The following information is available through a downwardAPI volume fieldRef, but not as
environment variables:
metadata.labels
all of the pod's labels, formatted as label-key="escaped-label-value" with one label per line
metadata.annotations
all of the pod's annotations, formatted as annotation-key="escaped-annotation-value"
with one annotation per line
These container-level fields allow you to provide information about requests and limits for
resources such as CPU and memory.
resource: limits.cpu
A container's CPU limit
resource: requests.cpu
A container's CPU request
resource: limits.memory
A container's memory limit
resource: requests.memory
A container's memory request
resource: limits.hugepages-*
A container's hugepages limit
resource: requests.hugepages-*
A container's hugepages request
resource: limits.ephemeral-storage
A container's ephemeral-storage limit
resource: requests.ephemeral-storage
A container's ephemeral-storage request
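Putting the two mechanisms together, here is a sketch of a Pod that exposes its own name via fieldRef and its container's memory limit via resourceFieldRef, both as environment variables (the Pod name, container name, and variable names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: downward-demo          # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "echo $POD_NAME $MEM_LIMIT && sleep 3600"]
    resources:
      limits:
        memory: "64Mi"
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:              # Pod-level field
          fieldPath: metadata.name
    - name: MEM_LIMIT
      valueFrom:
        resourceFieldRef:      # container-level field
          containerName: app
          resource: limits.memory
```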
If CPU and memory limits are not specified for a container, and you use the downward API to
try to expose that information, then the kubelet defaults to exposing the maximum allocatable
value for CPU and memory based on the node allocatable calculation.
What's next
You can read about downwardAPI volumes.
You can try using the downward API to expose container- or Pod-level information:
• as environment variables
• as files in downwardAPI volume
Workload Resources
Kubernetes provides several built-in APIs for declarative management of your workloads and
the components of those workloads.
Ultimately, your applications run as containers inside Pods; however, managing individual Pods
would be a lot of effort. For example, if a Pod fails, you probably want to run a new Pod to
replace it. Kubernetes can do that for you.
You use the Kubernetes API to create a workload object that represents a higher abstraction
level than a Pod, and then the Kubernetes control plane automatically manages Pod objects on
your behalf, based on the specification for the workload object you defined.
A Deployment (and, indirectly, a ReplicaSet) is the most common way to run an application on
your cluster. A Deployment is a good fit for managing a stateless application workload on your
cluster, where any Pod in the Deployment is interchangeable and can be replaced if needed.
(Deployments are a replacement for the legacy ReplicationController API.)
A StatefulSet lets you manage one or more Pods – all running the same application code –
where the Pods rely on having a distinct identity. This is different from a Deployment where the
Pods are expected to be interchangeable. The most common use for a StatefulSet is to be able to
make a link between its Pods and their persistent storage. For example, you can run a
StatefulSet that associates each Pod with a PersistentVolume. If one of the Pods in the
StatefulSet fails, Kubernetes makes a replacement Pod that is connected to the same
PersistentVolume.
A DaemonSet defines Pods that provide facilities that are local to a specific node; for example, a
driver that lets containers on that node access a storage system. You use a DaemonSet when the
driver, or other node-level service, has to run on the node where it's useful. Each Pod in a
DaemonSet performs a role similar to a system daemon on a classic Unix / POSIX server. A
DaemonSet might be fundamental to the operation of your cluster, such as a plugin to let that
node access cluster networking, it might help you to manage the node, or it could provide less
essential facilities that enhance the container platform you are running. You can run
DaemonSets (and their pods) across every node in your cluster, or across just a subset (for
example, only install the GPU accelerator driver on nodes that have a GPU installed).
You can use a Job and / or a CronJob to define tasks that run to completion and then stop. A Job
represents a one-off task, whereas each CronJob repeats according to a schedule.
Deployments
A Deployment manages a set of Pods to run an application workload, usually one that doesn't
maintain state.
Note: Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the
main Kubernetes repository if your use case is not covered below.
Use Case
The following are typical use cases for Deployments:
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx
Pods:
controllers/nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In this example:
• The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the
.spec.replicas field.
• The .spec.selector field defines how the created ReplicaSet finds which Pods to manage. In
this case, you select a label that is defined in the Pod template (app: nginx). However,
more sophisticated selection rules are possible, as long as the Pod template itself satisfies
the rule.
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given
below to create the above Deployment:
If the Deployment is still being created, the output is similar to the following:
When you inspect the Deployments in your cluster, the following fields are displayed:
3. To see the Deployment rollout status, run kubectl rollout status deployment/nginx-deployment.
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
deployment "nginx-deployment" successfully rolled out
4. Run the kubectl get deployments again a few seconds later. The output is similar to this:
Notice that the Deployment has created all three replicas, and all replicas are up-to-date
(they contain the latest Pod template) and available.
5. To see the ReplicaSet (rs) created by the Deployment, run kubectl get rs. The output is
similar to this:
The HASH string is the same as the pod-template-hash label on the ReplicaSet.
6. To see the labels automatically generated for each Pod, run kubectl get pods --show-labels.
The output is similar to:
The created ReplicaSet ensures that there are three nginx Pods.
Note:
You must specify an appropriate selector and Pod template labels in a Deployment (in this case,
app: nginx).
Do not overlap labels or selectors with other controllers (including other Deployments and
StatefulSets). Kubernetes doesn't stop you from overlapping, and if multiple controllers have
overlapping selectors those controllers might conflict and behave unexpectedly.
Pod-template-hash label
The pod-template-hash label is added by the Deployment controller to every ReplicaSet that a
Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by
hashing the PodTemplate of the ReplicaSet and using the resulting hash as the label value that
is added to the ReplicaSet selector, Pod template labels, and in any existing Pods that the
ReplicaSet might have.
Updating a Deployment
Note: A Deployment's rollout is triggered if and only if the Deployment's Pod template (that
is, .spec.template) is changed, for example if the labels or container images of the template are
updated. Other updates, such as scaling the Deployment, do not trigger a rollout.
1. Let's update the nginx Pods to use the nginx:1.16.1 image instead of the nginx:1.14.2
image.
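One way to do this, matching the change-cause recorded in the rollout history later on this page, is:

```shell
kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
```

Alternatively, you can kubectl edit the Deployment and change .spec.template.spec.containers[0].image from nginx:1.14.2 to nginx:1.16.1.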
deployment.apps/nginx-deployment edited
or
• After the rollout succeeds, you can view the Deployment by running kubectl get
deployments. The output is similar to this:
• Run kubectl get rs to see that the Deployment updated the Pods by creating a new
ReplicaSet and scaling it up to 3 replicas, as well as scaling down the old ReplicaSet to 0
replicas.
kubectl get rs
• Running get pods should now show only the new Pods:
Next time you want to update these Pods, you only need to update the Deployment's Pod
template again.
Deployment ensures that only a certain number of Pods are down while they are being
updated. By default, it ensures that at least 75% of the desired number of Pods are up (25%
max unavailable).
Deployment also ensures that only a certain number of Pods are created above the
desired number of Pods. By default, it ensures that at most 125% of the desired number of
Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates
a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods
until a sufficient number of new Pods have come up, and does not create new Pods until a
sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are
available and that at max 4 Pods in total are available. In case of a Deployment with 4
replicas, the number of Pods would be between 3 and 5.
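These defaults correspond to the following strategy stanza, which you can also set explicitly in the Deployment spec:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # at least 75% of desired Pods stay up
      maxSurge: 25%         # at most 125% of desired Pods exist at once
```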
Get details of your Deployment:
•
kubectl describe deployments
Name: nginx-deployment
Namespace: default
CreationTimestamp: Thu, 30 Nov 2017 10:56:25 +0000
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=2
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-1564180365 (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 2m deployment-controller Scaled up replica set nginx-deployment-2035384211 to 3
Normal ScalingReplicaSet 24s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 1
Normal ScalingReplicaSet 22s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 2
Normal ScalingReplicaSet 22s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 2
Normal ScalingReplicaSet 19s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 1
Normal ScalingReplicaSet 19s deployment-controller Scaled up replica set nginx-deployment-1564180365 to 3
Normal ScalingReplicaSet 14s deployment-controller Scaled down replica set nginx-deployment-2035384211 to 0
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-
deployment-2035384211) and scaled it up to 3 replicas directly. When you updated the
Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and scaled it up
to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled
up the new ReplicaSet to 2 so that at least 3 Pods were available and at most 4 Pods were
created at all times. It then continued scaling up and down the new and the old
ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas
in the new ReplicaSet, and the old ReplicaSet is scaled down to 0.
Note: Kubernetes doesn't count terminating Pods when calculating the number of
availableReplicas, which must be between replicas - maxUnavailable and replicas + maxSurge.
As a result, you might notice that there are more Pods than expected during a rollout, and that
the total resources consumed by the Deployment are more than replicas + maxSurge until the
terminationGracePeriodSeconds of the terminating Pods expires.
Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created
to bring up the desired Pods. If the Deployment is updated, existing ReplicaSets that control
Pods whose labels match .spec.selector but whose template does not match .spec.template are
scaled down. Eventually, the new ReplicaSet is scaled to .spec.replicas and all old ReplicaSets
are scaled to 0.
If you update a Deployment while an existing rollout is in progress, the Deployment creates a
new ReplicaSet as per the update and starts scaling that up, and it rolls over the ReplicaSet that
it was previously scaling up: it adds that ReplicaSet to its list of old ReplicaSets and starts
scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2, but then
update the Deployment to create 5 replicas of nginx:1.16.1, when only 3 replicas of nginx:1.14.2
had been created. In that case, the Deployment immediately starts killing the 3 nginx:1.14.2
Pods that it had created, and starts creating nginx:1.16.1 Pods. It does not wait for the 5 replicas
of nginx:1.14.2 to be created before changing course.
It is generally discouraged to make label selector updates and it is suggested to plan your
selectors up front. In any case, if you need to perform a label selector update, exercise great
caution and make sure you have grasped all of the implications.
Note: In API version apps/v1, a Deployment's label selector is immutable after it gets created.
• Selector additions require the Pod template labels in the Deployment spec to be updated
with the new label too, otherwise a validation error is returned. This change is a non-
overlapping one, meaning that the new selector does not select ReplicaSets and Pods
created with the old selector, resulting in orphaning all old ReplicaSets and creating a
new ReplicaSet.
• Selector updates (changing the existing value in a selector key) result in the same behavior
as additions.
• Selector removals (removing an existing key from the Deployment selector) do not require
any changes in the Pod template labels. Existing ReplicaSets are not orphaned, and a new
ReplicaSet is not created, but note that the removed label still exists in any existing Pods
and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not
stable, such as when it is crash looping. By default, all of the Deployment's rollout history is
kept in the system so that you can roll back at any time (you can change that by modifying the
revision history limit).
Note: A Deployment's revision is created when a Deployment's rollout is triggered. This means
that the new revision is created if and only if the Deployment's Pod template (.spec.template) is
changed, for example if you update the labels or container images of the template. Other
updates, such as scaling the Deployment, do not create a Deployment revision, so that you can
facilitate simultaneous manual- or auto-scaling. This means that when you roll back to an
earlier revision, only the Deployment's Pod template part is rolled back.
• Suppose that you made a typo while updating the Deployment, by putting the image
name as nginx:1.161 instead of nginx:1.16.1:
• The rollout gets stuck. You can verify it by checking the rollout status:
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
• Press Ctrl-C to stop the above rollout status watch. For more information on stuck
rollouts, read more here.
• You see that the number of old replicas (adding the replica count from nginx-
deployment-1564180365 and nginx-deployment-2035384211) is 3, and the number of new
replicas (from nginx-deployment-3066724191) is 1.
kubectl get rs
• Looking at the Pods created, you see that 1 Pod created by new ReplicaSet is stuck in an
image pull loop.
Note: The Deployment controller stops the bad rollout automatically, and stops scaling
up the new ReplicaSet. This depends on the rollingUpdate parameters (maxUnavailable
specifically) that you have specified. Kubernetes by default sets the value to 25%.
Name: nginx-deployment
Namespace: default
CreationTimestamp: Tue, 15 Mar 2016 14:48:04 -0700
Labels: app=nginx
Selector: app=nginx
Replicas: 3 desired | 1 updated | 4 total | 3 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.161
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
OldReplicaSets: nginx-deployment-1564180365 (3/3 replicas created)
NewReplicaSet: nginx-deployment-3066724191 (1/1 replicas created)
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason
Message
--------- -------- ----- ---- ------------- -------- ------ -------
1m 1m 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-2035384211 to 3
22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 1
22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 2
22s 22s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 2
21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 1
21s 21s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-1564180365 to 3
13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled down replica set nginx-deployment-2035384211 to 0
13s 13s 1 {deployment-controller } Normal ScalingReplicaSet Scaled up replica set nginx-deployment-3066724191 to 1
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
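The revision history shown below can be retrieved with the following command (assuming the nginx-deployment example):

```shell
kubectl rollout history deployment/nginx-deployment
```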
deployments "nginx-deployment"
REVISION CHANGE-CAUSE
1 kubectl apply --filename=https://k8s.io/examples/controllers/nginx-
deployment.yaml
2 kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
3 kubectl set image deployment/nginx-deployment nginx=nginx:1.161
Follow the steps given below to roll back the Deployment from the current version to the
previous version, which is version 2.
1. Now you've decided to undo the current rollout and roll back to the previous revision:
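For example, still assuming the nginx-deployment object from this page:

```shell
# Roll back to the immediately previous revision
kubectl rollout undo deployment/nginx-deployment

# Alternatively, roll back to a specific revision
kubectl rollout undo deployment/nginx-deployment --to-revision=2
```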
For more details about rollout related commands, read kubectl rollout.
The Deployment is now rolled back to a previous stable revision. As you can see, a
DeploymentRollback event for rolling back to revision 2 is generated from Deployment
controller.
2. To check if the rollback was successful and the Deployment is running as expected, run:
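The detailed output that follows comes from describing the Deployment; for example:

```shell
kubectl get deployment nginx-deployment
kubectl describe deployment nginx-deployment
```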
Name: nginx-deployment
Namespace: default
CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=4
kubernetes.io/change-cause=kubectl set image deployment/nginx-
deployment nginx=nginx:1.16.1
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 12m deployment-controller Scaled up replica set nginx-
deployment-75675f5897 to 3
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-
deployment-c4747d96c to 1
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-
deployment-75675f5897 to 2
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-
deployment-c4747d96c to 2
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-
deployment-75675f5897 to 1
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-
deployment-c4747d96c to 3
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-
deployment-75675f5897 to 0
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-
deployment-595696685f to 1
Normal DeploymentRollback 15s deployment-controller Rolled back deployment
"nginx-deployment" to revision 2
Normal ScalingReplicaSet 15s deployment-controller Scaled down replica set nginx-
deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:
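For example, to scale the nginx-deployment example to 10 replicas:

```shell
kubectl scale deployment/nginx-deployment --replicas=10
```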
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler for
your Deployment and choose the minimum and maximum number of Pods you want to run
based on the CPU utilization of your existing Pods.
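For example, assuming resource metrics are available in the cluster:

```shell
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
```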
deployment.apps/nginx-deployment scaled
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same
time. When you (or an autoscaler) scale a RollingUpdate Deployment that is in the middle of a
rollout (either in progress or paused), the Deployment controller balances the additional
replicas across the existing active ReplicaSets in order to mitigate risk. This is referred to as
proportional scaling.
For example, suppose you are running a Deployment with 10 replicas, maxSurge=3, and
maxUnavailable=2.
• You update to a new image which happens to be unresolvable from inside the cluster.
• The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but
it's blocked due to the maxUnavailable requirement that you mentioned above. Check out
the rollout status:
kubectl get rs
• Then a new scaling request for the Deployment comes along. The autoscaler increments
the Deployment replicas to 15. The Deployment controller needs to decide where to add
these new 5 replicas. If you weren't using proportional scaling, all 5 of them would be
added in the new ReplicaSet. With proportional scaling, you spread the additional
replicas across all ReplicaSets. Bigger proportions go to the ReplicaSets with the most
replicas, and lower proportions go to ReplicaSets with fewer replicas. Any leftovers are
added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not
scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the
new ReplicaSet. The rollout process should eventually move all replicas to the new ReplicaSet,
assuming the new replicas become healthy. To confirm this, run:

kubectl get rs

The rollout status confirms how the replicas were added to each ReplicaSet.
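The paused confirmation that follows is produced by pausing the Deployment's rollout, which lets you apply fixes across multiple updates without triggering new rollouts; for the nginx-deployment example:

```shell
kubectl rollout pause deployment/nginx-deployment
```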
deployment.apps/nginx-deployment paused
deployments "nginx"
REVISION CHANGE-CAUSE
1 <none>
• Get the rollout status to verify that the existing ReplicaSet has not changed:
kubectl get rs
• You can make as many updates as you wish, for example, update the resources that will
be used:
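For example, the resource update mentioned above can be made with kubectl set resources (the resource values here are illustrative):

```shell
kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi
```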
The Deployment will continue to function in its initial state from before the rollout was
paused, but new updates to the Deployment will not have any effect as long as the
Deployment rollout is paused.
• Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with
all the new updates:
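For example:

```shell
kubectl rollout resume deployment/nginx-deployment
```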
deployment.apps/nginx-deployment resumed
Watch the status of the rollout until it's done.
kubectl get rs -w
kubectl get rs
Note: You cannot roll back a paused Deployment until you resume it.
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a
new ReplicaSet, it can be complete, or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
• The Deployment creates a new ReplicaSet.
• The Deployment is scaling up its newest ReplicaSet.
• The Deployment is scaling down its older ReplicaSet(s).
• New Pods become ready or available (ready for at least MinReadySeconds).
When the rollout becomes “progressing”, the Deployment controller adds a condition with the
following attributes to the Deployment's .status.conditions:
• type: Progressing
• status: "True"
• reason: NewReplicaSetCreated | reason: FoundNewReplicaSet | reason: ReplicaSetUpdated
You can monitor the progress for a Deployment by using kubectl rollout status.
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
• All of the replicas associated with the Deployment have been updated to the latest
version you've specified, meaning any updates you've requested have been completed.
• All of the replicas associated with the Deployment are available.
• No old replicas for the Deployment are running.
When the rollout becomes “complete”, the Deployment controller sets a condition with the
following attributes to the Deployment's .status.conditions:
• type: Progressing
• status: "True"
• reason: NewReplicaSetAvailable
This Progressing condition will retain a status value of "True" until a new rollout is initiated.
The condition holds even when availability of replicas changes (which does instead affect the
Available condition).
You can check if a Deployment has completed by using kubectl rollout status. If the rollout
completed successfully, kubectl rollout status returns a zero exit code.
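For example:

```shell
kubectl rollout status deployment/nginx-deployment
```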
echo $?
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever
completing. This can occur due to some of the following factors:
• Insufficient quota
• Readiness probe failures
• Image pull errors
• Insufficient permissions
• Limit ranges
• Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment
spec: .spec.progressDeadlineSeconds. This field denotes the number of
seconds the Deployment controller waits before indicating (in the Deployment status) that the
Deployment progress has stalled.
The following kubectl command sets the spec with progressDeadlineSeconds to make the
controller report lack of progress of a rollout for a Deployment after 10 minutes:
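For example, patching the nginx-deployment example with a 600-second deadline:

```shell
kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'
```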
deployment.apps/nginx-deployment patched
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition
with the following attributes to the Deployment's .status.conditions:
• type: Progressing
• status: "False"
• reason: ProgressDeadlineExceeded
This condition can also fail early and is then set to a status value of "False" due to reasons
such as ReplicaSetCreateError. Also, the deadline is not taken into account anymore once the
Deployment rollout completes.
See the Kubernetes API conventions for more information on status conditions.
Note: Kubernetes takes no action on a stalled Deployment other than to report a status
condition with reason: ProgressDeadlineExceeded. Higher level orchestrators can take
advantage of it and act accordingly, for example, roll back the Deployment to its previous
version.
Note: If you pause a Deployment rollout, Kubernetes does not check progress against your
specified deadline. You can safely pause a Deployment rollout in the middle of a rollout and
resume without triggering the condition for exceeding the deadline.
You may experience transient errors with your Deployments, either due to a low timeout that
you have set or due to any other kind of error that can be treated as transient. For example, let's
suppose you have insufficient quota. If you describe the Deployment you will notice the
following section:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
If you run kubectl get deployment nginx-deployment -o yaml, the Deployment status is similar
to this:
status:
availableReplicas: 2
conditions:
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: Replica set "nginx-deployment-4262182780" is progressing.
reason: ReplicaSetUpdated
status: "True"
type: Progressing
- lastTransitionTime: 2016-10-04T12:25:42Z
lastUpdateTime: 2016-10-04T12:25:42Z
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: 2016-10-04T12:25:39Z
lastUpdateTime: 2016-10-04T12:25:39Z
message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
object-counts, requested: pods=1, used: pods=3, limited: pods=2'
reason: FailedCreate
status: "True"
type: ReplicaFailure
observedGeneration: 3
replicas: 2
unavailableReplicas: 2
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status
and the reason for the Progressing condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
You can address an issue of insufficient quota by scaling down your Deployment, by scaling
down other controllers you may be running, or by increasing quota in your namespace. If you
satisfy the quota conditions and the Deployment controller then completes the Deployment
rollout, you'll see the Deployment's status update with a successful condition (status: "True" and
reason: NewReplicaSetAvailable).
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
type: Available with status: "True" means that your Deployment has minimum availability.
Minimum availability is dictated by the parameters specified in the deployment strategy. type:
Progressing with status: "True" means that your Deployment is either in the middle of a rollout
and it is progressing or that it has successfully completed its progress and the minimum
required new replicas are available (see the Reason of the condition for the particulars - in our
case reason: NewReplicaSetAvailable means that the Deployment is complete).
You can check if a Deployment has failed to progress by using kubectl rollout status. kubectl
rollout status returns a non-zero exit code if the Deployment has exceeded the progression
deadline.
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline
echo $?
All actions that apply to a complete Deployment also apply to a failed Deployment. You can
scale it up/down, roll back to a previous revision, or even pause it if you need to apply multiple
tweaks in the Deployment Pod template.
Clean up Policy
You can set .spec.revisionHistoryLimit field in a Deployment to specify how many old
ReplicaSets for this Deployment you want to retain. The rest will be garbage-collected in the
background. By default, it is 10.
Note: Explicitly setting this field to 0 will result in cleaning up all the history of your
Deployment, so that Deployment will not be able to roll back.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can
create multiple Deployments, one for each release, following the canary pattern described in
managing resources.
Writing a Deployment Spec
When the control plane creates new Pods for a Deployment, the .metadata.name of the
Deployment is part of the basis for naming those Pods. The name of a Deployment must be a
valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames.
For best compatibility, the name should follow the more restrictive rules for a DNS label.
A Deployment also needs a .spec section.
Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec.
The .spec.template is a Pod template. It has exactly the same schema as a Pod, except it is nested
and does not have an apiVersion or kind.
In addition to required fields for a Pod, a Pod template in a Deployment must specify
appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with
other controllers. See selector.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.
Should you manually scale a Deployment, for example via kubectl scale deployment/nginx-deployment --
replicas=X, and then update that Deployment based on a manifest (for example: by
running kubectl apply -f deployment.yaml), then applying that manifest overwrites the manual
scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for
a Deployment, don't set .spec.replicas.
Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
Selector
.spec.selector is a required field that specifies a label selector for the Pods targeted by this
Deployment.
A Deployment may terminate Pods whose labels match the selector if their template is different
from .spec.template or if the total number of such Pods exceeds .spec.replicas. It brings up new
Pods with .spec.template if the number of Pods is less than the desired number.
Note: You should not create other Pods whose labels match this selector, either directly, by
creating another Deployment, or by creating another controller such as a ReplicaSet or a
ReplicationController. If you do so, the first Deployment thinks that it created these other Pods.
Kubernetes does not stop you from doing this.
If you have multiple controllers that have overlapping selectors, the controllers will fight with
each other and won't behave correctly.
Strategy
.spec.strategy specifies the strategy used to replace old Pods by new ones. .spec.strategy.type
can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
Note: This will only guarantee Pod termination previous to creation for upgrades. If you
upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful
removal is awaited before any Pod of the new revision is created. If you manually delete a Pod,
the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately
(even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your
Pods, you should consider using a StatefulSet.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum
number of Pods that can be unavailable during the update process. The value can be an
absolute number (for example, 5) or a percentage of desired Pods (for example, 10%); the
default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of
desired Pods immediately when the rolling update starts. Once new Pods are ready, old
ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that
the total number of Pods available at all times during the update is at least 70% of the desired
Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number
of Pods that can be created over the desired number of Pods. The value can be an absolute
number or a percentage of desired Pods; the default value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately
when the rolling update starts, such that the total number of old and new Pods does not exceed
130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up
further, ensuring that the total number of Pods running at any time during the update is at
most 130% of desired Pods.
Here are some Rolling Update Deployment examples that use the maxUnavailable and
maxSurge:
• Max Unavailable
• Max Surge
• Hybrid
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
Min Ready Seconds
.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for
which a newly created Pod should be ready without any of its containers crashing, for it to be
considered available. This defaults to 0 (the Pod will be considered available as soon as it is
ready). To learn more about when a Pod is considered ready, see Container Probes.
Revision History Limit
More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be
cleaned up. In this case, a new Deployment rollout cannot be undone, since its revision history
is cleaned up.
Paused
.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only
difference between a paused Deployment and one that is not paused, is that any changes into
the PodTemplateSpec of the paused Deployment will not trigger new rollouts as long as it is
paused. A Deployment is not paused by default when it is created.
What's next
• Learn more about Pods.
• Run a stateless application using a Deployment.
• Read the Deployment object definition to understand the Deployment API.
• Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
• Use kubectl to create a Deployment.
ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time.
Usually, you define a Deployment and let that Deployment manage ReplicaSets automatically.
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As
such, it is often used to guarantee the availability of a specified number of identical Pods.
A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies
what resource the current object is owned by. All Pods acquired by a ReplicaSet have their
owning ReplicaSet's identifying information within their ownerReferences field. It's through
this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans
accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no
OwnerReference or the OwnerReference is not a Controller and it matches a ReplicaSet's
selector, it will be immediately acquired by said ReplicaSet.
This actually means that you may never need to manipulate ReplicaSet objects: use a
Deployment instead, and define your application in the spec section.
Example
controllers/frontend.yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: gcr.io/google_samples/gb-frontend:v3
Saving this manifest into frontend.yaml and submitting it to a Kubernetes cluster will create the
defined ReplicaSet and the Pods that it manages.
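Submitting the manifest can be done with, for example:

```shell
kubectl apply -f frontend.yaml
```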
kubectl get rs
And see the frontend one you created:
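The detailed state shown below comes from describing the ReplicaSet:

```shell
kubectl describe rs/frontend
```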
Name: frontend
Namespace: default
Selector: tier=frontend
Labels: app=guestbook
tier=frontend
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"apps/v1","kind":"ReplicaSet","metadata":{"annotations":{},"labels":
{"app":"guestbook","tier":"frontend"},"name":"frontend",...
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: tier=frontend
Containers:
php-redis:
Image: gcr.io/google_samples/gb-frontend:v3
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 117s replicaset-controller Created pod: frontend-wtsmm
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-b2zdv
Normal SuccessfulCreate 116s replicaset-controller Created pod: frontend-vcmts
And lastly you can check for the Pods brought up:
You can also verify that the owner reference of these pods is set to the frontend ReplicaSet. To
do this, get the yaml of one of the Pods running:
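For example (the Pod name is generated, so yours will differ from frontend-b2zdv):

```shell
kubectl get pods

kubectl get pods frontend-b2zdv -o yaml
```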
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-02-12T07:06:16Z"
  generateName: frontend-
  labels:
    tier: frontend
  name: frontend-b2zdv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: frontend
    uid: f391f6db-bb9b-4c09-ae74-6a1f77f3d5cf
...
Non-Template Pod acquisitions
While you can create bare Pods with no problems, it is strongly recommended to make sure
that the bare Pods do not have labels which match the selector of one of your ReplicaSets.
Take the previous frontend ReplicaSet example, and the Pods specified in the following
manifest:
pods/pod-rs.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    tier: frontend
spec:
  containers:
  - name: hello1
    image: gcr.io/google-samples/hello-app:2.0
---
apiVersion: v1
kind: Pod
metadata:
  name: pod2
  labels:
    tier: frontend
spec:
  containers:
  - name: hello2
    image: gcr.io/google-samples/hello-app:1.0
As those Pods do not have a Controller (or any object) as their owner reference and match the
selector of the frontend ReplicaSet, they will immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its
initial Pod replicas to fulfill its replica count requirement:
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the
ReplicaSet would be over its desired count.
The output shows that the new Pods are either already terminated, or in the process of being
terminated:
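You can fetch the Pods to see this:

```shell
kubectl get pods
```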
If instead you create the Pods first and then the ReplicaSet, you shall see that the ReplicaSet
has acquired the Pods and has created new ones only according to its spec, until the number
of its new Pods plus the original Pods matches its desired count. As fetching the Pods:
When the control plane creates new Pods for a ReplicaSet, the .metadata.name of the ReplicaSet
is part of the basis for naming those Pods. The name of a ReplicaSet must be a valid DNS
subdomain value, but this can produce unexpected results for the Pod hostnames. For best
compatibility, the name should follow the more restrictive rules for a DNS label.
Pod Template
The .spec.template is a pod template which is also required to have labels in place. In our
frontend.yaml example we had one label: tier: frontend. Be careful not to overlap with the
selectors of other controllers, lest they try to adopt this Pod.
For the template's restart policy field, .spec.template.spec.restartPolicy, the only allowed value
is Always, which is the default.
Pod Selector
The .spec.selector field is a label selector. As discussed earlier these are the labels used to
identify potential Pods to acquire. In our frontend.yaml example, the selector was:
matchLabels:
  tier: frontend
Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas. The
ReplicaSet will create/delete its Pods to match this number.
Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete. The Garbage collector
automatically deletes all of the dependent Pods by default.
When using the REST API or the client-go library, you must set propagationPolicy to
Background or Foreground in the -d option. For example:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
-H "Content-Type: application/json"
Deleting just a ReplicaSet
You can delete a ReplicaSet without affecting any of its Pods using kubectl delete with the --
cascade=orphan option. When using the REST API or the client-go library, you must set
propagationPolicy to Orphan. For example:
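This mirrors the Foreground deletion example above, with the propagationPolicy changed:

```shell
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
  -H "Content-Type: application/json"
```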
Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old
and new .spec.selector are the same, then the new one will adopt the old Pods. However, it will
not make any effort to make existing Pods match a new, different pod template. To update Pods
to a new spec in a controlled way, use a Deployment, as ReplicaSets do not support a rolling
update directly.
Isolating Pods from a ReplicaSet
You can remove Pods from a ReplicaSet by changing their labels. This technique may be used
to remove Pods from service for debugging, data recovery, etc. Pods that are removed in this
way will be replaced automatically (assuming that the number of replicas is not also changed).
Scaling a ReplicaSet
A ReplicaSet can be scaled up or down by updating the .spec.replicas field. The
ReplicaSet controller ensures that a desired number of Pods with a matching label selector are
available and operational.
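For example, one way to scale the frontend ReplicaSet from this page imperatively:

```shell
kubectl scale rs/frontend --replicas=5
```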
When scaling down, the ReplicaSet controller chooses which Pods to delete by sorting the
available Pods to prioritize scaling down Pods based on the following general algorithm:
1. Pending (and unschedulable) Pods are scaled down first.
2. If the controller.kubernetes.io/pod-deletion-cost annotation is set, then the Pod with the
lower value will come first.
3. Pods on nodes with more replicas come before Pods on nodes with fewer replicas.
4. If the Pods' creation times differ, the Pod that was created more recently comes before
the older Pod.
Pod deletion cost
Using the controller.kubernetes.io/pod-deletion-cost annotation, users can set a preference
regarding which Pods to remove first when downscaling a ReplicaSet. The annotation is set
on the Pod and represents the cost of deleting that Pod compared to other Pods belonging to
the same ReplicaSet; Pods with lower deletion cost are preferred to be deleted first.
The implicit value for this annotation for Pods that don't set it is 0; negative values are
permitted. Invalid values will be rejected by the API server.
This feature is beta and enabled by default. You can disable it using the feature gate
PodDeletionCost in both kube-apiserver and kube-controller-manager.
Note:
• This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion
order.
• Users should avoid updating the annotation frequently, such as updating it based on a
metric value, because doing so will generate a significant number of pod updates on the
apiserver.
The different pods of an application could have different utilization levels. On scale down, the
application may prefer to remove the pods with lower utilization. To avoid frequently updating
the pods, the application should update controller.kubernetes.io/pod-deletion-cost once before
issuing a scale down (setting the annotation to a value proportional to pod utilization level).
This works if the application itself controls the down scaling; for example, the driver pod of a
Spark deployment.
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That is, a ReplicaSet can
be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the
previous example.
controllers/hpa-rs.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-scaler
spec:
  scaleTargetRef:
    kind: ReplicaSet
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
Saving this manifest into hpa-rs.yaml and submitting it to a Kubernetes cluster should create
the defined HPA that autoscales the target ReplicaSet depending on the CPU usage of the
replicated Pods.
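Alternatively, you can create the equivalent autoscaler imperatively with kubectl:

```shell
kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50
```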
Alternatives to ReplicaSet
Deployment (recommended)
Deployment is an object which can own ReplicaSets and update them and their Pods via
declarative, server-side rolling updates. While ReplicaSets can be used independently, today
they're mainly used by Deployments as a mechanism to orchestrate Pod creation, deletion and
updates. When you use Deployments you don't have to worry about managing the ReplicaSets
that they create. Deployments own and manage their ReplicaSets. As such, it is recommended
to use Deployments when you want ReplicaSets.
Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces Pods that are deleted
or terminated for any reason, such as in the case of node failure or disruptive node
maintenance, such as a kernel upgrade. For this reason, we recommend that you use a
ReplicaSet even if your application requires only a single Pod. Think of it similarly to a process
supervisor, only it supervises multiple Pods across multiple nodes instead of individual
processes on a single node. A ReplicaSet delegates local container restarts to some agent on
the node, such as the kubelet.
Job
Use a Job instead of a ReplicaSet for Pods that are expected to terminate on their own (that is,
batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicaSet for Pods that provide a machine-level function, such as
machine monitoring or machine logging. These Pods have a lifetime that is tied to a machine
lifetime: the Pod needs to be running on the machine before other Pods start, and are safe to
terminate when the machine is otherwise ready to be rebooted/shutdown.
ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the same purpose, and
behave similarly, except that a ReplicationController does not support set-based selector
requirements as described in the labels user guide. As such, ReplicaSets are preferred over
ReplicationControllers.
What's next
• Learn about Pods.
• Learn about Deployments.
• Run a Stateless Application Using a Deployment, which relies on ReplicaSets to work.
• ReplicaSet is a top-level resource in the Kubernetes REST API. Read the ReplicaSet object
definition to understand the API for replica sets.
• Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
StatefulSets
A StatefulSet runs a group of Pods, and maintains a sticky identity for each of those Pods. This
is useful for managing applications that need persistent storage or a stable, unique network
identity.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering
and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec.
Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These pods
are created from the same spec, but are not interchangeable: each has a persistent identifier that
it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a
StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to
failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods
that replace any that have failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following:
• Stable, unique network identifiers.
• Stable, persistent storage.
• Ordered, graceful deployment and scaling.
• Ordered, automated rolling updates.
In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application
doesn't require any stable identifiers or ordered deployment, deletion, or scaling, you should
deploy your application using a workload object that provides a set of stateless replicas. A
Deployment or ReplicaSet may be better suited to your stateless needs.
Limitations
• The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner
based on the requested storage class, or pre-provisioned by an admin.
• Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the
StatefulSet. This is done to ensure data safety, which is generally more valuable than an
automatic purge of all related StatefulSet resources.
• StatefulSets currently require a Headless Service to be responsible for the network
identity of the Pods. You are responsible for creating this Service.
• StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet
is deleted. To achieve ordered and graceful termination of the pods in the StatefulSet, it is
possible to scale the StatefulSet down to 0 prior to deletion.
• When using Rolling Updates with the default Pod Management Policy (OrderedReady),
it's possible to get into a broken state that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi
Pod Selector
You must set the .spec.selector field of a StatefulSet to match the labels of
its .spec.template.metadata.labels. Failing to specify a matching Pod Selector will result in a
validation error during StatefulSet creation.
You can set the .spec.volumeClaimTemplates which can provide stable storage using
PersistentVolumes provisioned by a PersistentVolume Provisioner.
.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for
which a newly created Pod should be running and ready without any of its containers crashing,
for it to be considered available. This is used to check progression of a rollout when using a
Rolling Update strategy. This field defaults to 0 (the Pod will be considered available as soon as
it is ready). To learn more about when a Pod is considered ready, see Container Probes.
Pod Identity
StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity,
and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal,
that is unique over the Set. By default, pods will be assigned ordinals from 0 up through N-1.
The StatefulSet controller will also add a pod label with this index: apps.kubernetes.io/pod-
index.
Start ordinal
.spec.ordinals is an optional field that allows you to configure the integer ordinals assigned to
each Pod. It defaults to nil. You must enable the StatefulSetStartOrdinal feature gate to use this
field. Once enabled, you can configure the following option:
• .spec.ordinals.start: if set, Pods will be assigned ordinals from .spec.ordinals.start up
through .spec.ordinals.start + .spec.replicas - 1.
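As a sketch (assuming the StatefulSetStartOrdinal feature gate is enabled on your cluster), a StatefulSet that numbers its Pods starting at 2 would produce Pods named web-2, web-3, and web-4:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  ordinals:
    start: 2   # Pods get ordinals 2, 3 and 4 instead of 0, 1 and 2
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: "nginx"
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
```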
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal
of the Pod. The pattern for the constructed hostname is $(statefulset name)-$(ordinal). The
example above will create three Pods named web-0,web-1,web-2. A StatefulSet can use a
Headless Service to control the domain of its Pods. The domain managed by this Service takes
the form: $(service name).$(namespace).svc.cluster.local, where "cluster.local" is the cluster
domain. As each Pod is created, it gets a matching DNS subdomain, taking the form: $
(podname).$(governing service domain), where the governing service is defined by the
serviceName field on the StatefulSet.
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS
name for a newly-run Pod immediately. This behavior can occur when other clients in the
cluster have already sent queries for the hostname of the Pod before it was created. Negative
caching (normal in DNS) means that the results of previous failed lookups are remembered and
reused, even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
• Query the Kubernetes API directly (for example, using a watch) rather than relying on
DNS lookups.
• Decrease the time of caching in your Kubernetes DNS provider (typically this means
editing the config map for CoreDNS, which currently caches for 30 seconds).
As mentioned in the limitations section, you are responsible for creating the Headless Service
responsible for the network identity of the pods.
Here are some examples of choices for Cluster Domain, Service name, StatefulSet name, and
how that affects the DNS names for the StatefulSet's Pods.
| Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
|---|---|---|---|---|---|
| cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
| cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
| kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |
Note: Cluster Domain will be set to cluster.local unless otherwise configured.
Stable Storage
For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one
PersistentVolumeClaim. In the nginx example above, each Pod receives a single
PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage. If
no StorageClass is specified, then the default StorageClass will be used. When a Pod is
(re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with its
PersistentVolumeClaims. Note that the PersistentVolumes associated with the Pods'
PersistentVolumeClaims are not deleted when the Pods, or the StatefulSet, are deleted. This must
be done manually.
Pod index label
When the StatefulSet controller creates a Pod, the new Pod is labelled with apps.kubernetes.io/
pod-index. The value of this label is the ordinal index of the Pod. This label allows you to route
traffic to a particular pod index, filter logs/metrics using the pod index label, and more. Note
that the feature gate PodIndexLabel must be enabled for this feature; it is enabled by default.
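For instance, traffic can be routed to a single replica by selecting on this label. The following is a sketch (the Service name here is hypothetical; the selector labels match the nginx example above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-leader   # hypothetical name; routes only to the Pod with ordinal 0
spec:
  selector:
    app: nginx
    apps.kubernetes.io/pod-index: "0"   # label added by the StatefulSet controller
  ports:
  - port: 80
    name: web
```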
Deployment and Scaling Guarantees
When the nginx example above is created, three Pods will be deployed in the order web-0,
web-1, web-2. web-1 will not be deployed before web-0 is Running and Ready, and web-2 will
not be deployed until web-1 is Running and Ready. If web-0 should fail, after web-1 is Running
and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully
relaunched and becomes Running and Ready.
If a user were to scale the deployed example by patching the StatefulSet such that replicas=1,
web-2 would be terminated first. web-1 would not be terminated until web-2 is fully shutdown
and deleted. If web-0 were to fail after web-2 has been terminated and is completely shutdown,
but prior to web-1's termination, web-1 would not be terminated until web-0 is Running and
Ready.
Pod Management Policies
StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and
identity guarantees via its .spec.podManagementPolicy field.
OrderedReady pod management is the default for StatefulSets. It implements the behavior
described above.
Parallel pod management tells the StatefulSet controller to launch or terminate all Pods in
parallel, and to not wait for Pods to become Running and Ready or completely terminated prior
to launching or terminating another Pod. This option only affects the behavior for scaling
operations. Updates are not affected.
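For example, a minimal sketch of opting in to parallel Pod management (other fields as in the web example above):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  podManagementPolicy: Parallel   # launch or terminate all Pods without waiting on each other
  ...
```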
Update strategies
A StatefulSet's .spec.updateStrategy field allows you to configure and disable automated rolling
updates for containers, labels, resource request/limits, and annotations for the Pods in a
StatefulSet. There are two possible values:
OnDelete
When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet
controller will not automatically update the Pods in a StatefulSet. Users must manually
delete Pods to cause the controller to create new Pods that reflect modifications made to a
StatefulSet's .spec.template.
RollingUpdate
The RollingUpdate update strategy implements automated, rolling updates for the Pods in
a StatefulSet. This is the default update strategy.
Rolling Updates
When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller
will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod
termination (from the largest ordinal to the smallest), updating each Pod one at a time.
The Kubernetes control plane waits until an updated Pod is Running and Ready prior to
updating its predecessor. If you have set .spec.minReadySeconds (see Minimum Ready Seconds),
the control plane additionally waits that amount of time after the Pod turns ready, before
moving on.
You can control the maximum number of Pods that can be unavailable during an update by
specifying the .spec.updateStrategy.rollingUpdate.maxUnavailable field. The value can be an
absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). Absolute
number is calculated from the percentage value by rounding it up. This field cannot be 0. The
default setting is 1.
This field applies to all Pods in the range 0 to replicas - 1. If there is any unavailable Pod in the
range 0 to replicas - 1, it will be counted towards maxUnavailable.
Note: The maxUnavailable field is in Alpha stage and it is honored only by API servers that are
running with the MaxUnavailableStatefulSet feature gate enabled.
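A sketch of setting this field (assuming the MaxUnavailableStatefulSet feature gate is enabled on the API server):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2   # up to 2 Pods may be unavailable during the rolling update
  ...
```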
Forced rollback
When using Rolling Updates with the default Pod Management Policy (OrderedReady), it's
possible to get into a broken state that requires manual intervention to repair.
If you update the Pod template to a configuration that never becomes Running and Ready (for
example, due to a bad binary or application-level configuration error), StatefulSet will stop the
rollout and wait.
In this state, it's not enough to revert the Pod template to a good configuration. Due to a known
issue, StatefulSet will continue to wait for the broken Pod to become Ready (which never
happens) before it will attempt to revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet had already
attempted to run with the bad configuration. StatefulSet will then begin to recreate the Pods
using the reverted template.
PersistentVolumeClaim retention
FEATURE STATE: Kubernetes v1.27 [beta]
The optional .spec.persistentVolumeClaimRetentionPolicy field controls if and how PVCs are
deleted during the lifecycle of a StatefulSet. You must enable the StatefulSetAutoDeletePVC
feature gate to use this field. Once enabled, there are two policies you can configure for each
StatefulSet:
whenDeleted
configures the volume retention behavior that applies when the StatefulSet is deleted
whenScaled
configures the volume retention behavior that applies when the replica count of the
StatefulSet is reduced; for example, when scaling down the set.
For each policy that you can configure, you can set the value to either Delete or Retain.
Delete
The PVCs created from the StatefulSet volumeClaimTemplate are deleted for each Pod
affected by the policy. With the whenDeleted policy all PVCs from the
volumeClaimTemplate are deleted after their Pods have been deleted. With the
whenScaled policy, only PVCs corresponding to Pod replicas being scaled down are
deleted, after their Pods have been deleted.
Retain (default)
PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is
the behavior before this new feature.
Bear in mind that these policies only apply when Pods are being removed due to the
StatefulSet being deleted or scaled down. For example, if a Pod associated with a StatefulSet
fails due to node failure, and the control plane creates a replacement Pod, the StatefulSet retains
the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node
where the new Pod is about to launch.
The default for policies is Retain, matching the StatefulSet behavior before this new feature.
apiVersion: apps/v1
kind: StatefulSet
...
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete
...
The StatefulSet controller adds owner references to its PVCs, which are then deleted by the
garbage collector after the Pod is terminated. This enables the Pod to cleanly unmount all
volumes before the PVCs are deleted (and before the backing PV and volume are deleted,
depending on the retain policy). When you set the whenDeleted policy to Delete, an owner
reference to the StatefulSet instance is placed on all PVCs associated with that StatefulSet.
The whenScaled policy must delete PVCs only when a Pod is scaled down, and not when a Pod
is deleted for another reason. When reconciling, the StatefulSet controller compares its desired
replica count to the actual Pods present on the cluster. Any StatefulSet Pod whose ordinal is
greater than or equal to the replica count is condemned and marked for deletion. If the
whenScaled policy is Delete, the condemned Pods are first set as owners of the associated
StatefulSet template PVCs before the Pod is deleted. This causes the PVCs to be garbage
collected only after the condemned Pods have terminated.
This means that if the controller crashes and restarts, no Pod will be deleted before its owner
reference has been updated appropriate to the policy. If a condemned Pod is force-deleted while
the controller is down, the owner reference may or may not have been set up, depending on
when the controller crashed. It may take several reconcile loops to update the owner references,
so some condemned Pods may have set up owner references and others may not. For this
reason we recommend waiting for the controller to come back up, which will verify owner
references before terminating Pods. If that is not possible, the operator should verify the owner
references on PVCs to ensure the expected objects are deleted when Pods are force-deleted.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.
Should you manually scale a StatefulSet, for example via kubectl scale statefulset statefulset --
replicas=X, and then update that StatefulSet based on a manifest (for example, by running
kubectl apply -f statefulset.yaml), then applying that manifest overwrites the manual scaling
that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for
a Statefulset, don't set .spec.replicas. Instead, allow the Kubernetes control plane to manage
the .spec.replicas field automatically.
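As an illustrative sketch, a HorizontalPodAutoscaler can target the StatefulSet from the example above by kind and name (the replica bounds and CPU target here are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: web          # the HPA, not .spec.replicas, now controls the scale
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```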
What's next
• Learn about Pods.
• Find out how to use StatefulSets
◦ Follow an example of deploying a stateful application.
◦ Follow an example of deploying Cassandra with Stateful Sets.
◦ Follow an example of running a replicated stateful application.
◦ Learn how to scale a StatefulSet.
◦ Learn what's involved when you delete a StatefulSet.
◦ Learn how to configure a Pod to use a volume for storage.
◦ Learn how to configure a Pod to use a PersistentVolume for storage.
• StatefulSet is a top-level resource in the Kubernetes REST API. Read the StatefulSet object
definition to understand the API for stateful sets.
• Read about PodDisruptionBudget and how you can use it to manage application
availability during disruptions.
DaemonSet
A DaemonSet defines Pods that provide node-local facilities. These might be fundamental to the
operation of your cluster, such as a networking helper tool, or be part of an add-on.
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the
cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage
collected. Deleting a DaemonSet will clean up the Pods it created.
In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon. A
more complex setup might use multiple DaemonSets for a single type of daemon, but with
different flags and/or different memory and cpu requests for different hardware types.
Writing a DaemonSet Spec
Create a DaemonSet
You can describe a DaemonSet in a YAML file. For example, the daemonset.yaml file below
describes a DaemonSet that runs the fluentd-elasticsearch Docker image:
controllers/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # these tolerations are to have the daemonset runnable on control plane nodes
      # remove them if your control plane nodes should not run pods
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      # it may be desirable to set a high priority class to ensure that a DaemonSet Pod
      # preempts running Pods
      # priorityClassName: important
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion, kind, and metadata fields.
For general information about working with config files, see running stateless applications and
object management using kubectl.
Pod Template
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested
and does not have an apiVersion or kind.
In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify
appropriate labels (see pod selector).
Pod Selector
The .spec.selector field is a pod selector. It works the same as the .spec.selector of a Job.
You must specify a pod selector that matches the labels of the .spec.template. Also, once a
DaemonSet is created, its .spec.selector can not be mutated. Mutating the pod selector can lead
to the unintentional orphaning of Pods, and it was found to be confusing to users.
The .spec.selector must match the .spec.template.metadata.labels. Config with these two not
matching will be rejected by the API.
Note: If it's important that the DaemonSet Pod run on each node, it's often desirable to set
the .spec.template.spec.priorityClassName of the DaemonSet to a PriorityClass with a higher
priority, so that lower-priority Pods are evicted to make room for the DaemonSet Pod when a
node comes under resource pressure.
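For example, a sketch of such a PriorityClass (the name, value, and description are illustrative; the DaemonSet's Pod template would reference it via priorityClassName):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical   # hypothetical name
value: 1000000               # higher values preempt lower-priority Pods
globalDefault: false
description: "Used by DaemonSet Pods that must run on every node."
```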
The user can specify a different scheduler for the Pods of the DaemonSet, by setting the
.spec.template.spec.schedulerName field of the DaemonSet.
DaemonSet Pods are scheduled by the default scheduler; the DaemonSet controller adds a node
affinity term like the following to each DaemonSet Pod, binding it to its target node:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchFields:
      - key: metadata.name
        operator: In
        values:
        - target-host-name
You can add your own tolerations to the Pods of a DaemonSet as well, by defining these in the
Pod template of the DaemonSet.
Communicating with Daemon Pods
Some possible patterns for communicating with Pods in a DaemonSet are:
• Push: Pods in the DaemonSet are configured to send updates to another service, such as
a stats database. They do not have clients.
• NodeIP and Known Port: Pods in the DaemonSet can use a hostPort, so that the pods
are reachable via the node IPs. Clients know the list of node IPs somehow, and know the
port by convention.
• DNS: Create a headless service with the same pod selector, and then discover
DaemonSets using the endpoints resource or retrieve multiple A records from DNS.
• Service: Create a service with the same Pod selector, and use the service to reach a
daemon on a random node. (No way to reach specific node.)
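The DNS pattern can be sketched as a headless Service whose selector matches the DaemonSet example above (the port here is hypothetical, standing in for whatever port each daemon Pod exposes):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
spec:
  clusterIP: None   # headless: DNS returns the individual Pod addresses
  selector:
    name: fluentd-elasticsearch   # same selector as the DaemonSet's Pod template
  ports:
  - port: 24231     # hypothetical port exposed by each daemon Pod
    name: metrics
```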
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes
and delete Pods from newly not-matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to be
updated. Also, the DaemonSet controller will use the original template the next time a node
(even with the same name) is created.
You can delete a DaemonSet. If you specify --cascade=orphan with kubectl, then the Pods will
be left on the nodes. If you subsequently create a new DaemonSet with the same selector, the
new DaemonSet adopts the existing Pods. If any Pods need replacing the DaemonSet replaces
them according to its updateStrategy.
Alternatives to DaemonSet
Init scripts
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using
init, upstartd, or systemd). This is perfectly fine. However, there are several advantages to
running such processes via a DaemonSet:
• Ability to monitor and manage logs for daemons in the same way as applications.
• Same config language and tools (e.g. Pod templates, kubectl) for daemons and
applications.
• Running daemons in containers with resource limits increases isolation between daemons
from app containers. However, this can also be accomplished by running the daemons in
a container but not in a Pod.
Bare Pods
It is possible to create Pods directly which specify a particular node to run on. However, a
DaemonSet replaces Pods that are deleted or terminated for any reason, such as in the case of
node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, you
should use a DaemonSet rather than creating individual Pods.
Static Pods
It is possible to create Pods by writing a file to a certain directory watched by Kubelet. These
are called static pods. Unlike DaemonSet, static Pods cannot be managed with kubectl or other
Kubernetes API clients. Static Pods do not depend on the apiserver, making them useful in
cluster bootstrapping cases. Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have
processes which are not expected to terminate (e.g. web servers, storage servers).
Use a Deployment for stateless services, like frontends, where scaling up and down the number
of replicas and rolling out updates are more important than controlling exactly which host the
Pod runs on. Use a DaemonSet when it is important that a copy of a Pod always run on all or
certain hosts, if the DaemonSet provides node-level functionality that allows other Pods to run
correctly on that particular node.
For example, network plugins often include a component that runs as a DaemonSet. The
DaemonSet component makes sure that the node where it's running has working cluster
networking.
What's next
• Learn about Pods.
◦ Learn about static Pods, which are useful for running Kubernetes control plane
components.
• Find out how to use DaemonSets
◦ Perform a rolling update on a DaemonSet
◦ Perform a rollback on a DaemonSet (for example, if a roll out didn't work how you
expected).
• Understand how Kubernetes assigns Pods to Nodes.
• Learn about device plugins and add ons, which often run as DaemonSets.
• DaemonSet is a top-level resource in the Kubernetes REST API. Read the DaemonSet
object definition to understand the API for daemon sets.
Jobs
Jobs represent one-off tasks that run to completion and then stop.
A Job creates one or more Pods and will continue to retry execution of the Pods until a
specified number of them successfully terminate. As pods successfully complete, the Job tracks
the successful completions. When a specified number of successful completions is reached, the
task (ie, Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will
delete its active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The Job
object will start a new Pod if the first Pod fails or is deleted (for example due to a node
hardware failure or a node reboot).
If you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.
controllers/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
You can run this example with kubectl apply -f job.yaml, and then check on the status of the
Job with kubectl describe job pi. The output is similar to:
Name: pi
Namespace: default
Selector: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Labels: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
batch.kubernetes.io/job-name=pi
...
Annotations: batch.kubernetes.io/job-tracking: ""
Parallelism: 1
Completions: 1
Start Time: Mon, 02 Dec 2019 15:20:11 +0200
Completed At: Mon, 02 Dec 2019 15:21:16 +0200
Duration: 65s
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
Pod Template:
Labels: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
batch.kubernetes.io/job-name=pi
Containers:
pi:
Image: perl:5.34.0
Port: <none>
Host Port: <none>
Command:
perl
-Mbignum=bpi
-wle
print bpi(2000)
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 21s job-controller Created pod: pi-xf9p4
Normal Completed 18s job-controller Job completed
To view the Job as YAML, run kubectl get job pi -o yaml. The output is similar to:
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    batch.kubernetes.io/job-tracking: ""
  ...
  creationTimestamp: "2022-11-10T17:53:53Z"
  generation: 1
  labels:
    batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223
    batch.kubernetes.io/job-name: pi
  name: pi
  namespace: default
  resourceVersion: "4751"
  uid: 204fb678-040b-497f-9266-35ffa8716d14
spec:
  backoffLimit: 4
  completionMode: NonIndexed
  completions: 1
  parallelism: 1
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223
  suspend: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223
        batch.kubernetes.io/job-name: pi
    spec:
      containers:
      - command:
        - perl
        - -Mbignum=bpi
        - -wle
        - print bpi(2000)
        image: perl:5.34.0
        imagePullPolicy: IfNotPresent
        name: pi
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Never
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  active: 1
  ready: 0
  startTime: "2022-11-10T17:53:57Z"
  uncountedTerminatedPods: {}
To list all the Pods that belong to a Job in a machine readable form, you can use a command like
this:
pods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods
The output is similar to:
pi-5rwd7
Here, the selector is the same as the selector for the Job. The --output=jsonpath option specifies
an expression with the name from each Pod in the returned list.
View the standard output of one of the Pods with kubectl logs $pods. The output is similar to:
3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986
280348253421170679821480865132823066470938446095505822317253594081284811174502841027
019385211055596446229489549303819644288109756659334461284756482337867831652712019091
456485669234603486104543266482133936072602491412737245870066063155881748815209209628
292540917153643678925903600113305305488204665213841469519415116094330572703657595919
530921861173819326117931051185480744623799627495673518857527248912279381830119491298
336733624406566430860213949463952247371907021798609437027705392171762931767523846748
184676694051320005681271452635608277857713427577896091736371787214684409012249534301
465495853710507922796892589235420199561121290219608640344181598136297747713099605187
072113499999983729780499510597317328160963185950244594553469083026425223082533446850
352619311881710100031378387528865875332083814206171776691473035982534904287554687311
595628638823537875937519577818577805321712268066130019278766111959092164201989380952
572010654858632788659361533818279682303019520353018529689957736225994138912497217752
834791315155748572424541506959508295331168617278558890750983817546374649393192550604
009277016711390098488240128583616035637076601047101819429555961989467678374494482553
797747268471040475346462080466842590694912933136770289891521047521620569660240580381
501935112533824300355876402474964732639141992726042699227967823547816360093417216412
199245863150302861829745557067498385054945885869269956909272107975093029553211653449
872027559602364806654991198818347977535663698074265425278625518184175746728909777727
938000816470600161452491921732172147723501414419735685481613611573525521334757418494
684385233239073941433345477624168625189835694855620992192221842725502542568876717904
946016534668049886272327917860857843838279679766814541009538837863609506800642251252
051173929848960841284886269456042419652850222106611863067442786220391949450471237137
869609563643719172874677646575739624138908658326459958133904780275901
When the control plane creates new Pods for a Job, the .metadata.name of the Job is part of the
basis for naming those Pods. The name of a Job must be a valid DNS subdomain value, but this
can produce unexpected results for the Pod hostnames. For best compatibility, the name should
follow the more restrictive rules for a DNS label. Even when the name is a DNS subdomain, the
name must be no longer than 63 characters.
Job labels will have the batch.kubernetes.io/ prefix for job-name and controller-uid.
Pod Template
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested
and does not have an apiVersion or kind.
In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels
(see pod selector) and an appropriate restart policy.
Pod selector
The .spec.selector field is optional. In almost all cases you should not specify it. See section
specifying your own pod selector.
Parallel execution for Jobs
There are three main types of task suitable to run as a Job:
1. Non-parallel Jobs
◦ normally, only one Pod is started, unless the Pod fails.
◦ the Job is complete as soon as its Pod terminates successfully.
2. Parallel Jobs with a fixed completion count:
◦ specify a non-zero positive value for .spec.completions.
◦ the Job represents the overall task, and is complete when there are
.spec.completions successful Pods.
◦ when using .spec.completionMode="Indexed", each Pod gets a different index in the
range 0 to .spec.completions-1.
3. Parallel Jobs with a work queue:
◦ do not specify .spec.completions, default to .spec.parallelism.
◦ the Pods must coordinate amongst themselves or an external service to determine
what each should work on. For example, a Pod might fetch a batch of up to N items
from the work queue.
◦ each Pod is independently capable of determining whether or not all its peers are
done, and thus that the entire Job is done.
◦ when any Pod from the Job terminates with success, no new Pods are created.
◦ once at least one Pod has terminated with success and all Pods are terminated, then
the Job is completed with success.
◦ once any Pod has exited with success, no other Pod should still be doing any work
for this task or writing any output. They should all be in the process of exiting.
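The fixed-completion-count pattern with indexed completion mode can be sketched as follows (the Job name and image are illustrative; each Pod reads its index from the JOB_COMPLETION_INDEX environment variable that Kubernetes injects for Indexed Jobs):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo   # hypothetical name
spec:
  completions: 5
  parallelism: 3
  completionMode: Indexed   # each Pod receives an index in the range 0..4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        # JOB_COMPLETION_INDEX is injected by Kubernetes for Indexed Jobs
        command: ["sh", "-c", "echo processing item $JOB_COMPLETION_INDEX"]
```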
For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When
both are unset, both are defaulted to 1.
For a fixed completion count Job, you should set .spec.completions to the number of completions
needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a
non-negative integer.
For more information about how to make use of the different types of job, see the job patterns
section.
Controlling parallelism
Actual parallelism (number of pods running at any instant) may be more or less than requested
parallelism, for a variety of reasons:
• For fixed completion count Jobs, the actual number of pods running in parallel will not
exceed the number of remaining completions. Higher values of .spec.parallelism are
effectively ignored.
• For work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining
Pods are allowed to complete, however.
• If the Job Controller has not had time to react.
• If the Job controller failed to create Pods for any reason (lack of ResourceQuota, lack of
permission, etc.), then there may be fewer pods than requested.
• The Job controller may throttle new Pod creation due to excessive previous pod failures in
the same Job.
• When a Pod is gracefully shut down, it takes time to stop.
Completion mode
Jobs with fixed completion count - that is, jobs that have non null .spec.completions - can have a
completion mode that is specified in .spec.completionMode:
• NonIndexed (default): the Job is considered complete when there have been .spec.completions
successfully completed Pods. In other words, each Pod completion is homologous to each other.
• Indexed: the Pods of a Job get an associated completion index from 0 to .spec.completions-1.
The Job is considered complete when there is one successfully completed Pod for each index.
Note: Although rare, more than one Pod could be started for the same index (due to various
reasons such as node failures, kubelet restarts, or Pod evictions). In this case, only the first Pod
that completes successfully will count towards the completion count and update the status of
the Job. The other Pods that are running or completed for the same index will be deleted by the
Job controller once they are detected.
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the
node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the
.spec.template.spec.restartPolicy = "Never". When a Pod fails, then the Job controller starts a
new Pod. This means that your application needs to handle the case when it is restarted in a
new pod. In particular, it needs to handle temporary files, locks, incomplete output and the like
caused by previous runs.
By default, each pod failure is counted towards the .spec.backoffLimit limit, see pod backoff
failure policy. However, you can customize handling of pod failures by setting the Job's pod
failure policy.
Additionally, you can choose to count the pod failures independently for each index of an
Indexed Job by setting the .spec.backoffLimitPerIndex field (for more information, see backoff
limit per index).
If you do specify .spec.parallelism and .spec.completions both greater than 1, then there may be
multiple pods running at once. Therefore, your pods must also be tolerant of concurrency.
When the feature gates PodDisruptionConditions and JobPodFailurePolicy are both enabled,
and the .spec.podFailurePolicy field is set, the Job controller does not consider a terminating
Pod (a pod that has a .metadata.deletionTimestamp field set) as a failure until that Pod is
terminal (its .status.phase is Failed or Succeeded). However, the Job controller creates a
replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the
Job controller evaluates .backoffLimit and .podFailurePolicy for the relevant Job, taking this
now-terminated Pod into consideration.
If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an
immediate failure, even if that Pod later terminates with phase: "Succeeded".
Pod backoff failure policy
There are situations where you want to fail a Job after some amount of retries due to a logical
error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries
before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated
with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s
...) capped at six minutes.
The number of failures is calculated in two ways: the number of Pods with .status.phase =
"Failed"; and, when using restartPolicy = "OnFailure", the number of retries across all the
containers of Pods with .status.phase equal to Pending or Running. If either of the calculations
reaches the .spec.backoffLimit, the Job is considered failed.
Note: If your job has restartPolicy = "OnFailure", keep in mind that your Pod running the Job
will be terminated once the job backoff limit has been reached. This can make debugging the
Job's executable more difficult. We suggest setting restartPolicy = "Never" when debugging the
Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.
Backoff limit per index
When you run an indexed Job, you can choose to handle retries for pod failures independently
for each index. To do so, set the .spec.backoffLimitPerIndex field to specify the maximal number
of pod failures per index.
When the per-index backoff limit is exceeded for an index, Kubernetes considers the index as
failed and adds it to the .status.failedIndexes field. The succeeded indexes, those with a
successfully executed Pod, are recorded in the .status.completedIndexes field, regardless of
whether you set the backoffLimitPerIndex field.
Note that a failing index does not interrupt execution of other indexes. Once all indexes finish
for a Job where you specified a backoff limit per index, if at least one of those indexes did fail,
the Job controller marks the overall Job as failed, by setting the Failed condition in the status.
The Job gets marked as failed even if some, potentially nearly all, of the indexes were processed
successfully.
You can additionally limit the maximal number of indexes marked failed by setting the
.spec.maxFailedIndexes field. When the number of failed indexes exceeds the maxFailedIndexes
field, the Job controller triggers termination of all remaining running Pods for that Job. Once all
pods are terminated, the entire Job is marked failed by the Job controller, by setting the Failed
condition in the Job status.
/controllers/job-backoff-limit-per-index-example.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-example
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed  # required for the feature
  backoffLimitPerIndex: 1  # maximal number of failures per index
  maxFailedIndexes: 5      # maximal number of failed indexes before terminating the Job execution
  template:
    spec:
      restartPolicy: Never  # required for the feature
      containers:
      - name: example
        image: python
        command:  # The job fails as there is at least one failed index
                  # (all even indexes fail in here), yet all indexes
                  # are executed as maxFailedIndexes is not exceeded.
        - python3
        - -c
        - |
          import os, sys
          print("Hello world")
          if int(os.environ.get("JOB_COMPLETION_INDEX")) % 2 == 0:
            sys.exit(1)
In the example above, the Job controller allows for one restart for each of the indexes. When the
total number of failed indexes exceeds 5, then the entire Job is terminated.
Thus, an example status for that Job could look as follows:

status:
  completedIndexes: 1,3,5,7,9
  failedIndexes: 0,2,4,6,8
  succeeded: 5  # 1 succeeded pod for each of 5 succeeded indexes
  failed: 10    # 2 failed pods (1 retry) for each of 5 failed indexes
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
Additionally, you may want to use the per-index backoff along with a pod failure policy. When
using per-index backoff, there is a new FailIndex action available which allows you to avoid
unnecessary retries within an index.
Pod failure policy
A Pod failure policy, defined with the .spec.podFailurePolicy field, enables your cluster to
handle Pod failures based on the container exit codes and the Pod conditions.
In some situations, you may want to have a better control when handling Pod failures than the
control provided by the Pod backoff failure policy, which is based on the
Job's .spec.backoffLimit. These are some examples of use cases:
• To optimize costs of running workloads by avoiding unnecessary Pod restarts, you can
terminate a Job as soon as one of its Pods fails with an exit code indicating a software
bug.
• To guarantee that your Job finishes even if there are disruptions, you can ignore Pod
failures caused by disruptions (such as preemption, API-initiated eviction or taint-based
eviction) so that they don't count towards the .spec.backoffLimit limit of retries.
You can configure a Pod failure policy, in the .spec.podFailurePolicy field, to meet the above use
cases. This policy can handle Pod failures based on the container exit codes and the Pod
conditions.
/controllers/job-pod-failure-policy-example.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]  # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main      # optional
        operator: In             # one of: In, NotIn
        values: [42]
    - action: Ignore             # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget   # indicates Pod disruption
In the example above, the first rule of the Pod failure policy specifies that the Job should be
marked failed if the main container fails with the 42 exit code. The following are the rules for
the main container specifically:
• an exit code of 0 means that the container succeeded
• an exit code of 42 means that the entire Job failed
• any other exit code represents that the container failed, and hence the entire Pod. The Pod
will be re-created if the total number of restarts is below backoffLimit. If the backoffLimit is
reached, the entire Job is marked Failed.
Note: Because the Pod template specifies a restartPolicy: Never, the kubelet does not restart
the main container in that particular Pod.
The second rule of the Pod failure policy, specifying the Ignore action for failed Pods with
condition DisruptionTarget excludes Pod disruptions from being counted towards
the .spec.backoffLimit limit of retries.
Note: If the Job failed, either by the Pod failure policy or Pod backoff failure policy, and the Job
is running multiple Pods, Kubernetes terminates all the Pods in that Job that are still Pending or
Running.
Here are some requirements and semantics of the API:
• if you want to use a .spec.podFailurePolicy field for a Job, you must also define that Job's
pod template with .spec.restartPolicy set to Never.
• the Pod failure policy rules you specify under spec.podFailurePolicy.rules are evaluated in
order. Once a rule matches a Pod failure, the remaining rules are ignored. When no rule
matches the Pod failure, the default handling applies.
• you may want to restrict a rule to a specific container by specifying its name in
spec.podFailurePolicy.rules[*].onExitCodes.containerName. When not specified, the rule
applies to all containers. When specified, it should match one of the container or
initContainer names in the Pod template.
• you may specify the action taken when a Pod failure policy is matched by
spec.podFailurePolicy.rules[*].action. Possible values are:
◦ FailJob: use to indicate that the Pod's job should be marked as Failed and all
running Pods should be terminated.
◦ Ignore: use to indicate that the counter towards the .spec.backoffLimit should not
be incremented and a replacement Pod should be created.
◦ Count: use to indicate that the Pod should be handled in the default way. The
counter towards the .spec.backoffLimit should be incremented.
◦ FailIndex: use this action along with backoff limit per index to avoid unnecessary
retries within the index of a failed pod.
Note: When you use a podFailurePolicy, the job controller only matches Pods in the Failed
phase. Pods with a deletion timestamp that are not in a terminal phase (Failed or Succeeded) are
considered still terminating. This implies that terminating pods retain a tracking finalizer until
they reach a terminal phase. Since Kubernetes 1.27, Kubelet transitions deleted pods to a
terminal phase (see: Pod Phase). This ensures that deleted pods have their finalizers removed by
the Job controller.
Note: Starting with Kubernetes v1.28, when Pod failure policy is used, the Job controller
recreates terminating Pods only once these Pods reach the terminal Failed phase. This behavior
is similar to podReplacementPolicy: Failed. For more information, see Pod replacement policy.
Job termination and cleanup
By default, a Job will run uninterrupted unless a Pod fails (restartPolicy=Never) or a Container
exits in error (restartPolicy=OnFailure), at which point the Job defers to the .spec.backoffLimit
described above. Once .spec.backoffLimit has been reached the Job will be marked as failed and
any running Pods will be terminated.
Another way to terminate a Job is by setting an active deadline. Do this by setting the
.spec.activeDeadlineSeconds field of the Job to a number of seconds. The activeDeadlineSeconds
applies to the duration of the job, no matter how many Pods are created. Once a Job reaches
activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become
type: Failed with reason: DeadlineExceeded.
Example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
Note that both the Job spec and the Pod template spec within the Job have an
activeDeadlineSeconds field. Ensure that you set this field at the proper level.
Keep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no
automatic Job restart once the Job status is type: Failed. That is, the Job termination
mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit result in a
permanent Job failure that requires manual intervention to resolve.
Clean up finished jobs automatically
Finished Jobs are usually no longer needed in the system. Keeping them around in the system
will put pressure on the API server. If the Jobs are managed directly by a higher level controller,
such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based
cleanup policy.
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL
mechanism provided by a TTL controller for finished resources, by specifying
the .spec.ttlSecondsAfterFinished field of the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its
dependent objects, such as Pods, together with the Job. Note that when the Job is deleted, its
lifecycle guarantees, such as finalizers, will be honored.
For example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
The Job pi-with-ttl will be eligible to be automatically deleted 100 seconds after it finishes.
If the field is set to 0, the Job will be eligible to be automatically deleted immediately after it
finishes. If the field is unset, this Job won't be cleaned up by the TTL controller after it finishes.
Note:
It is recommended to set ttlSecondsAfterFinished field because unmanaged jobs (Jobs that you
created directly, and not indirectly through other workload APIs such as CronJob) have a
default deletion policy of orphanDependents causing Pods created by an unmanaged Job to be
left around after that Job is fully deleted. Even though the control plane eventually garbage
collects the Pods from a deleted Job after they either fail or complete, sometimes those lingering
pods may cause cluster performance degradation or, in the worst case, cause the cluster to go offline
due to this degradation.
You can use LimitRanges and ResourceQuotas to place a cap on the amount of resources that a
particular namespace can consume.
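As a sketch of that idea (the quota and namespace names below are hypothetical), a ResourceQuota can cap how many Pods a Job-heavy namespace may hold at once:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: job-pod-cap          # hypothetical name
  namespace: batch-jobs      # hypothetical namespace
spec:
  hard:
    pods: "10"               # at most 10 Pods may exist in this namespace at a time
```

With such a quota in place, the Job controller simply creates fewer Pods than requested until quota frees up, which is one of the reasons actual parallelism can be lower than .spec.parallelism.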
Job patterns
The Job object can be used to process a set of independent but related work items. These might
be emails to be sent, frames to be rendered, files to be transcoded, ranges of keys in a NoSQL
database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just
considering one set of work items that the user wants to manage together — a batch job.
There are several different patterns for parallel computation, each with strengths and
weaknesses. The tradeoffs are:
• One Job object for each work item, versus a single Job object for all work items. One Job
per work item creates some overhead for the user and for the system to manage large
numbers of Job objects. A single Job for all work items is better for large numbers of
items.
• Number of Pods created equals number of work items, versus each Pod can process
multiple work items. When the number of Pods equals the number of work items, the
Pods typically require less modification to existing code and containers. Having each
Pod process multiple work items is better for large numbers of items.
• Several approaches use a work queue. This requires running a queue service, and
modifications to the existing program or container to make it use the work queue. Other
approaches are easier to adapt to an existing containerised application.
• When the Job is associated with a headless Service, you can enable the Pods within a Job
to communicate with each other to collaborate in a computation.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs.
The pattern names are also links to examples and more detailed description.

Pattern                                   Single Job object   Fewer pods than work items?   Use app unmodified?
Queue with Pod Per Work Item              ✓                                                 sometimes
Queue with Variable Pod Count             ✓                   ✓
Indexed Job with Static Work Assignment   ✓                                                 ✓
Job with Pod-to-Pod Communication         ✓                   sometimes                     sometimes
Job Template Expansion                                                                      ✓
When you specify completions with .spec.completions, each Pod created by the Job controller
has an identical spec. This means that all pods for a task will have the same command line and
the same image, the same volumes, and (almost) the same environment variables. These
patterns are different ways to arrange for pods to work on different things.
This table shows the required settings for .spec.parallelism and .spec.completions for each of the
patterns. Here, W is the number of work items.

Pattern                                   .spec.completions   .spec.parallelism
Queue with Pod Per Work Item              W                   any
Queue with Variable Pod Count             null                any
Indexed Job with Static Work Assignment   W                   any
Job with Pod-to-Pod Communication         W                   W
Job Template Expansion                    1                   should be 1
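As a hedged sketch of the Pod-to-Pod communication pattern (all names below are placeholders), an Indexed Job can be paired with a headless Service that selects the Job's Pods, giving each Pod a stable DNS identity its peers can reach:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-svc               # hypothetical name
spec:
  clusterIP: None                 # headless: gives each Pod its own DNS record
  selector:
    job-name: example-job         # label the Job controller sets on its Pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job               # hypothetical name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed
  template:
    spec:
      subdomain: example-svc      # must match the headless Service name
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "hostname"]
```

Each Pod can then address its peers by hostname within the Service's subdomain instead of discovering them through an external queue.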
Advanced usage
Suspending a Job
When a Job is created, the Job controller will immediately begin creating Pods to satisfy the
Job's requirements and will continue to do so until the Job is complete. However, you may want
to temporarily suspend a Job's execution and resume it later, or start Jobs in suspended state
and have a custom controller decide later when to start them.
To suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you
want to resume it again, update it to false. Creating a Job with .spec.suspend set to true will
create it in the suspended state.
When a Job is resumed from suspension, its .status.startTime field will be reset to the current
time. This means that the .spec.activeDeadlineSeconds timer will be stopped and reset when a
Job is suspended and resumed.
When you suspend a Job, any running Pods that don't have a status of Completed will be
terminated with a SIGTERM signal. The Pods' graceful termination period will be honored and
your Pods must handle this signal within that period. This may involve saving progress for later
or undoing changes. Pods terminated this way will not count towards the Job's completions count.
An example Job definition in the suspended state can be like so:

apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  suspend: true
  parallelism: 1
  completions: 5
  template:
    spec:
      ...
You can also toggle Job suspension by patching the Job using the command line:

kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":true}}'

To resume it again, patch suspend back to false:

kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":false}}'
The Job's status can be used to determine if a Job is suspended or has been suspended in the past:

apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
  conditions:
  - lastProbeTime: "2021-02-05T13:14:33Z"
    lastTransitionTime: "2021-02-05T13:14:33Z"
    status: "True"
    type: Suspended
  startTime: "2021-02-05T13:13:48Z"
The Job condition of type "Suspended" with status "True" means the Job is suspended; the
lastTransitionTime field can be used to determine how long the Job has been suspended for. If
the status of that condition is "False", then the Job was previously suspended and is now
running. If such a condition does not exist in the Job's status, the Job has never been stopped.
Events are also created when the Job is suspended and resumed:
Name: myjob
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 12m job-controller Created pod: myjob-hlrpl
Normal SuccessfulDelete 11m job-controller Deleted pod: myjob-hlrpl
Normal Suspended 11m job-controller Job suspended
Normal SuccessfulCreate 3s job-controller Created pod: myjob-jvb44
Normal Resumed 3s job-controller Job resumed
The last four events, particularly the "Suspended" and "Resumed" events, are directly a result of
toggling the .spec.suspend field. In the time between these two events, we see that no Pods
were created, but Pod creation restarted as soon as the Job was resumed.
Mutable Scheduling Directives
In most cases, a parallel job will want the pods to run with constraints, like all in the same zone,
or all either on GPU model x or y but not a mix of both.
The suspend field is the first step towards achieving those semantics. Suspend allows a custom
queue controller to decide when a job should start; However, once a job is unsuspended, a
custom queue controller has no influence on where the pods of a job will actually land.
This feature allows updating a Job's scheduling directives before it starts, which gives custom
queue controllers the ability to influence pod placement while at the same time offloading
actual pod-to-node assignment to kube-scheduler. This is allowed only for suspended Jobs that
have never been unsuspended before.
The fields in a Job's pod template that can be updated are node affinity, node selector,
tolerations, labels, annotations and scheduling gates.
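For example, a custom queue controller might patch a still-suspended Job along these lines before flipping suspend to false (a sketch; the zone label value is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  suspend: true                   # scheduling directives are mutable only while suspended
  template:
    spec:
      nodeSelector:               # added by the queue controller before unsuspending
        topology.kubernetes.io/zone: zone-a   # placeholder zone value
      ...
```

Once the Job is unsuspended, kube-scheduler performs the actual pod-to-node assignment within the constraints the controller injected.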
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector. The system defaulting
logic adds this field when the Job is created. It picks a selector value that will not overlap with
any other jobs.
However, in some cases, you might need to override this automatically set selector. To do this,
you can specify the .spec.selector of the Job.
Be very careful when doing this. If you specify a label selector which is not unique to the pods
of that Job, and which matches unrelated Pods, then pods of the unrelated job may be deleted,
or this Job may count other Pods as completing it, or one or both Jobs may refuse to create Pods
or run to completion. If a non-unique selector is chosen, then other controllers (e.g.
ReplicationController) and their Pods may behave in unpredictable ways too. Kubernetes will
not stop you from making a mistake when specifying .spec.selector.
Here is an example of a case when you might want to use this feature.
Say Job old is already running. You want existing Pods to keep running, but you want the rest
of the Pods it creates to use a different pod template and for the Job to have a new name. You
cannot update the Job because these fields are not updatable. Therefore, you delete Job old but
leave its pods running, using kubectl delete jobs/old --cascade=orphan. Before deleting it, you
make a note of what selector it uses:
kind: Job
metadata:
  name: old
  ...
spec:
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...
Then you create a new Job with name new and you explicitly specify the same selector. Since
the existing Pods have label batch.kubernetes.io/controller-uid=a8f3d00d-
c6d2-11e5-9f87-42010af00002, they are controlled by Job new as well.
You need to specify manualSelector: true in the new Job since you are not using the selector
that the system normally generates for you automatically.
kind: Job
metadata:
  name: new
  ...
spec:
  manualSelector: true
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002. Setting
manualSelector: true tells the system that you know what you are doing and to allow this
mismatch.
Job tracking with finalizers
The control plane keeps track of the Pods that belong to any Job and notices if any such Pod is
removed from the API server. To do that, the Job controller creates Pods with the finalizer
batch.kubernetes.io/job-tracking. The controller removes the finalizer only after the Pod has
been accounted for in the Job status, allowing the Pod to be removed by other controllers or
users.
Note: See My pod stays terminating if you observe that pods from a Job are stuck with the
tracking finalizer.
Elastic Indexed Jobs
You can scale Indexed Jobs up or down by mutating both .spec.parallelism and .spec.completions
together, such that .spec.parallelism == .spec.completions. Use cases for elastic Indexed Jobs
include batch workloads which require scaling an indexed Job, such as MPI, Horovod, Ray, and
PyTorch training jobs.
Delayed creation of replacement Pods
By default, the Job controller recreates Pods as soon as they either fail or are terminating (have a
deletion timestamp). This means that, at a given time, when some of the Pods are terminating,
the number of running Pods for a Job can be greater than parallelism or greater than one Pod
per index (if you are using an Indexed Job).
You may choose to create replacement Pods only when the terminating Pod is fully terminal
(has status.phase: Failed). To do this, set the .spec.podReplacementPolicy: Failed. The default
replacement policy depends on whether the Job has a podFailurePolicy set. With no Pod failure
policy defined for a Job, omitting the podReplacementPolicy field selects the
TerminatingOrFailed replacement policy: the control plane creates replacement Pods
immediately upon Pod deletion (as soon as the control plane sees that a Pod for this Job has
deletionTimestamp set). For Jobs with a Pod failure policy set, the default
podReplacementPolicy is Failed, and no other value is permitted. See Pod failure policy to learn
more about Pod failure policies for Jobs.
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
Provided your cluster has the JobPodReplacementPolicy feature gate enabled, you can inspect the .status.terminating field
of a Job. The value of the field is the number of Pods owned by the Job that are currently
terminating.
apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
  terminating: 3  # three Pods are terminating and have not yet reached the Failed phase
Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is terminated and will not be
restarted. However, a Job will create new Pods to replace terminated ones. For this reason, we
recommend that you use a Job rather than a bare Pod, even if your application requires only a
single Pod.
Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller manages Pods
which are not expected to terminate (e.g. web servers), and a Job manages Pods that are
expected to terminate (e.g. batch tasks).
As discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to
OnFailure or Never. (Note: If RestartPolicy is not set, the default value is Always.)
Single Job starts controller Pod
Another pattern is for a single Job to create a Pod which then creates other Pods, acting as a
sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat
complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn
starts a Spark master controller (see spark example), runs a Spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job
object, but maintains complete control over what Pods are created and how work is assigned to
them.
What's next
• Learn about Pods.
• Read about different ways of running Jobs:
◦ Coarse Parallel Processing Using a Work Queue
◦ Fine Parallel Processing Using a Work Queue
◦ Use an indexed Job for parallel processing with static work assignment
◦ Create multiple Jobs based on a template: Parallel Processing using Expansions
• Follow the links within Clean up finished jobs automatically to learn more about how
your cluster can clean up completed and / or failed tasks.
• Job is part of the Kubernetes REST API. Read the Job object definition to understand the
API for jobs.
• Read about CronJob, which you can use to define a series of Jobs that will run based on a
schedule, similar to the UNIX tool cron.
• Practice how to configure handling of retriable and non-retriable pod failures using
podFailurePolicy, based on the step-by-step examples.
Automatic Cleanup for Finished Jobs
FEATURE STATE: Kubernetes v1.23 [stable]
When your Job has finished, it's useful to keep that Job in the API (and not immediately delete
the Job) so that you can tell whether the Job succeeded or failed.
Kubernetes' TTL-after-finished controller provides a TTL (time to live) mechanism to limit the
lifetime of Job objects that have finished execution.
The TTL-after-finished controller assumes that a Job is eligible to be cleaned up TTL seconds
after the Job has finished. The timer starts once the status condition of the Job changes to show
that the Job is either Complete or Failed; once the TTL has expired, that Job becomes eligible
for cascading removal. When the TTL-after-finished controller cleans up a job, it will delete it
cascadingly, that is to say it will delete its dependent objects together with it.
Kubernetes honors object lifecycle guarantees on the Job, such as waiting for finalizers.
You can set the TTL seconds at any time. Here are some examples for setting the
.spec.ttlSecondsAfterFinished field of a Job:
• Specify this field in the Job manifest, so that a Job can be cleaned up automatically some
time after it finishes.
• Manually set this field of existing, already finished Jobs, so that they become eligible for
cleanup.
• Use a mutating admission webhook to set this field dynamically at Job creation time.
Cluster administrators can use this to enforce a TTL policy for finished jobs.
• Use a mutating admission webhook to set this field dynamically after the Job has finished,
and choose different TTL values based on job status, labels. For this case, the webhook
needs to detect changes to the .status of the Job and only set a TTL when the Job is being
marked as completed.
• Write your own controller to manage the cleanup TTL for Jobs that match a particular
selector.
Caveats
Updating TTL for finished Jobs
You can modify the TTL period, e.g. .spec.ttlSecondsAfterFinished field of Jobs, after the job is
created or has finished. If you extend the TTL period after the existing ttlSecondsAfterFinished
period has expired, Kubernetes doesn't guarantee to retain that Job, even if an update to extend
the TTL returns a successful API response.
Time skew
Because the TTL-after-finished controller uses timestamps stored in the Kubernetes jobs to
determine whether the TTL has expired or not, this feature is sensitive to time skew in your
cluster, which may cause the control plane to clean up Job objects at the wrong time.
Clocks aren't always correct, but the difference should be very small. Please be aware of this
risk when setting a non-zero TTL.
What's next
• Read Clean up Jobs automatically
• Refer to the Kubernetes Enhancement Proposal (KEP) for adding this mechanism.
CronJob
A CronJob starts one-time Jobs on a repeating schedule.
FEATURE STATE: Kubernetes v1.21 [stable]
CronJob is meant for performing regular scheduled actions such as backups, report generation,
and so on. One CronJob object is like one line of a crontab (cron table) file on a Unix system. It
runs a job periodically on a given schedule, written in Cron format.
CronJobs have limitations and idiosyncrasies. For example, in certain circumstances, a single
CronJob can create multiple concurrent Jobs. See the limitations below.
When the control plane creates new Jobs and (indirectly) Pods for a CronJob, the
.metadata.name of the CronJob is part of the basis for naming those Pods. The name of a
CronJob must be a valid DNS subdomain value, but this can produce unexpected results for the
Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a
DNS label. Even when the name is a DNS subdomain, the name must be no longer than 52
characters. This is because the CronJob controller will automatically append 11 characters to
the name you provide and there is a constraint that the length of a Job name is no more than 63
characters.
Example
This example CronJob manifest prints the current time and a hello message every minute:
application/job/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
(Running Automated Tasks with a CronJob takes you through this example in more detail).
The .spec.schedule field is required. The value of that field follows the Cron syntax:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
# │ │ │ │ │
# * * * * *
The format also includes extended "Vixie cron" step values. As explained in the FreeBSD
manual:
Step values can be used in conjunction with ranges. Following a range with /
<number> specifies skips of the number's value through the range. For example,
0-23/2 can be used in the hours field to specify command execution every other
hour (the alternative in the V7 standard is 0,2,4,6,8,10,12,14,16,18,20,22). Steps are
also permitted after an asterisk, so if you want to say "every two hours", just use */2.
Note: A question mark (?) in the schedule has the same meaning as an asterisk *, that is, it
stands for any available value for a given field.
Other than the standard syntax, some macros like @monthly can also be used:

Entry                    Description                                                  Equivalent to
@yearly (or @annually)   Run once a year at midnight of 1 January                     0 0 1 1 *
@monthly                 Run once a month at midnight of the first day of the month   0 0 1 * *
@weekly                  Run once a week at midnight on Sunday morning                0 0 * * 0
@daily (or @midnight)    Run once a day at midnight                                   0 0 * * *
@hourly                  Run once an hour at the beginning of the hour                0 * * * *

To generate CronJob schedule expressions, you can also use web tools like crontab.guru.
Job template
The .spec.jobTemplate defines a template for the Jobs that the CronJob creates, and it is
required. It has exactly the same schema as a Job, except that it is nested and does not have an
apiVersion or kind. You can specify common metadata for the templated Jobs, such as labels or
annotations. For information about writing a Job .spec, see Writing a Job Spec.
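As a sketch, common metadata declared under .spec.jobTemplate.metadata is copied onto every Job the CronJob creates; the name, label, and annotation values here are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-labeled           # illustrative name
spec:
  schedule: "* * * * *"
  jobTemplate:
    metadata:
      labels:                   # applied to each Job this CronJob creates
        app: hello-labeled
      annotations:
        team: examples          # illustrative annotation
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            command: ["/bin/sh", "-c", "date"]
          restartPolicy: OnFailure
```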
Deadline for delayed Job start
The .spec.startingDeadlineSeconds field is optional. This field defines a deadline (in whole
seconds) for starting the Job, if that Job misses its scheduled time for any reason.
After missing the deadline, the CronJob skips that instance of the Job (future occurrences are
still scheduled). For example, if you have a backup job that runs twice a day, you might allow it
to start up to 8 hours late, but no later, because a backup taken any later wouldn't be useful: you
would instead prefer to wait for the next scheduled run.
For Jobs that miss their configured deadline, Kubernetes treats them as failed Jobs. If you don't
specify startingDeadlineSeconds for a CronJob, the Job occurrences have no deadline.
If the .spec.startingDeadlineSeconds field is set (not null), the CronJob controller measures the
time between when a job is expected to be created and now. If the difference is higher than that
limit, it will skip this execution.
For example, if it is set to 200, it allows a job to be created for up to 200 seconds after the actual
schedule.
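The twice-daily backup scenario above can be sketched as follows; the name, schedule, and image are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup                  # illustrative name
spec:
  schedule: "0 3,15 * * *"         # twice a day, at 03:00 and 15:00
  startingDeadlineSeconds: 28800   # skip a run that cannot start within 8 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox:1.28
            command: ["/bin/sh", "-c", "echo running backup"]
          restartPolicy: OnFailure
```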
Concurrency policy
The .spec.concurrencyPolicy field is optional. It specifies how to treat concurrent executions of
a Job that is created by this CronJob. The spec may specify one of the following concurrency
policies:
• Allow (default): The CronJob allows concurrently running Jobs.
• Forbid: The CronJob does not allow concurrent runs; if it is time for a new Job run and the
previous Job run hasn't finished yet, the CronJob skips the new Job run.
• Replace: If it is time for a new Job run and the previous Job run hasn't finished yet, the
CronJob replaces the currently running Job run with a new Job run.
Note that concurrency policy only applies to the Jobs created by the same CronJob. If there are
multiple CronJobs, their respective Jobs are always allowed to run concurrently.
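As a sketch, forbidding overlap for a potentially long-running task looks like this; the name and command are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: long-task               # illustrative name
spec:
  schedule: "*/5 * * * *"       # every five minutes
  concurrencyPolicy: Forbid     # skip a run while the previous Job is still running
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: task
            image: busybox:1.28
            command: ["/bin/sh", "-c", "sleep 600"]   # may outlast the schedule interval
          restartPolicy: OnFailure
```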
Schedule suspension
You can suspend execution of Jobs for a CronJob, by setting the optional .spec.suspend field to
true. The field defaults to false.
This setting does not affect Jobs that the CronJob has already started.
If you do set that field to true, all subsequent executions are suspended (they remain scheduled,
but the CronJob controller does not start the Jobs to run the tasks) until you unsuspend the
CronJob.
Caution: Executions that are suspended during their scheduled time count as missed jobs.
When .spec.suspend changes from true to false on an existing CronJob without a starting
deadline, the missed jobs are scheduled immediately.
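The relevant part of the spec is a single field; a minimal sketch:

```yaml
spec:
  suspend: true    # the controller stops creating new Jobs; running Jobs are unaffected
```

Setting the field back to false resumes the schedule.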
Jobs history limits
The .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields specify how many
completed and failed Jobs should be kept. Both fields are optional.
For another way to clean up jobs automatically, see Clean up finished jobs automatically.
Time zones
For CronJobs with no time zone specified, the kube-controller-manager interprets schedules
relative to its local time zone.
You can specify a time zone for a CronJob by setting .spec.timeZone to the name of a valid time
zone. For example, setting .spec.timeZone: "Etc/UTC" instructs Kubernetes to interpret the
schedule relative to Coordinated Universal Time.
A time zone database from the Go standard library is included in the binaries and used as a
fallback in case an external database is not available on the system.
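As a sketch, a CronJob that should fire at 08:30 UTC regardless of the controller's local time zone (the name and image are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: utc-report              # illustrative name
spec:
  schedule: "30 8 * * *"        # interpreted in the time zone below
  timeZone: "Etc/UTC"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: busybox:1.28
            command: ["/bin/sh", "-c", "date -u"]
          restartPolicy: OnFailure
```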
CronJob limitations
Unsupported TimeZone specification
The implementation of the CronJob API in Kubernetes 1.28 lets you set the .spec.schedule field
to include a timezone; for example: CRON_TZ=UTC * * * * * or TZ=UTC * * * * *.
Specifying a timezone that way is not officially supported (and never has been).
If you try to set a schedule that includes TZ or CRON_TZ timezone specification, Kubernetes
reports a warning to the client. Future versions of Kubernetes will prevent setting the unofficial
timezone mechanism entirely.
Modifying a CronJob
By design, a CronJob contains a template for new Jobs. If you modify an existing CronJob, the
changes you make will apply to new Jobs that start to run after your modification is complete.
Jobs (and their Pods) that have already started continue to run without changes. That is, the
CronJob does not update existing Jobs, even if those remain running.
Job creation
A CronJob creates a Job object approximately once per execution time of its schedule. The
scheduling is approximate because there are certain circumstances where two Jobs might be
created, or no Job might be created. Kubernetes tries to avoid those situations, but does not
completely prevent them. Therefore, the Jobs that you define should be idempotent.
Caution: If startingDeadlineSeconds is set to a value less than 10 seconds, the CronJob may not
be scheduled. This is because the CronJob controller checks things every 10 seconds.
For every CronJob, the CronJob Controller checks how many schedules it missed in the
duration from its last scheduled time until now. If there are more than 100 missed schedules,
then it does not start the job and logs the error.
Cannot determine if job needs to be started. Too many missed start time (> 100). Set or
decrease .spec.startingDeadlineSeconds or check clock skew.
It is important to note that if the startingDeadlineSeconds field is set (not nil), the controller
counts how many missed jobs occurred from the value of startingDeadlineSeconds until now
rather than from the last scheduled time until now. For example, if startingDeadlineSeconds is
200, the controller counts how many missed jobs occurred in the last 200 seconds.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For example,
if concurrencyPolicy is set to Forbid and a CronJob was attempted to be scheduled when there
was a previous schedule still running, then it would count as missed.
For example, suppose a CronJob is set to schedule a new Job every one minute beginning at
08:30:00, and its startingDeadlineSeconds field is not set. If the CronJob controller happens to be
down from 08:29:00 to 10:21:00, the job will not start as the number of missed jobs which
missed their schedule is greater than 100.
To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one
minute beginning at 08:30:00, and its startingDeadlineSeconds is set to 200 seconds. If the
CronJob controller happens to be down for the same period as the previous example (08:29:00
to 10:21:00), the Job will still start at 10:22:00. This happens as the controller now checks how
many missed schedules happened in the last 200 seconds (i.e., 3 missed schedules), rather than
from the last scheduled time until now.
The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn is
responsible for the management of the Pods it represents.
What's next
• Learn about Pods and Jobs, two concepts that CronJobs rely upon.
• Read about the detailed format of CronJob .spec.schedule fields.
• For instructions on creating and working with CronJobs, and for an example of a CronJob
manifest, see Running automated tasks with CronJobs.
• CronJob is part of the Kubernetes REST API. Read the CronJob API reference for more
details.
ReplicationController
Legacy API for managing workloads that can scale horizontally. Superseded by the Deployment
and ReplicaSet APIs.
Note: A Deployment that configures a ReplicaSet is now the recommended way to set up
replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one
time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of
pods is always up and available.
A simple case is to create one ReplicationController object to reliably run one instance of a Pod
indefinitely. A more complex use case is to run several identical replicas of a replicated service,
such as web servers.
controllers/replication.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
Run the example by downloading the example file and then running this command:

kubectl apply -f https://k8s.io/examples/controllers/replication.yaml

The output is similar to this:

replicationcontroller/nginx created

Check on the status of the ReplicationController using this command:

kubectl describe replicationcontrollers/nginx

The output is similar to this:
Name: nginx
Namespace: default
Selector: app=nginx
Labels: app=nginx
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason
Message
--------- -------- ----- ---- ------------- ---- ------ -------
20s 20s 1 {replication-controller } Normal SuccessfulCreate
Created pod: nginx-qrm3m
20s 20s 1 {replication-controller } Normal SuccessfulCreate
Created pod: nginx-3ntk0
20s 20s 1 {replication-controller } Normal SuccessfulCreate
Created pod: nginx-4ok8v
Here, three pods are created, but none is running yet, perhaps because the image is being
pulled. A little later, the same command may show:

Pods Status:    3 Running / 0 Waiting / 0 Succeeded / 0 Failed
To list all the pods that belong to the ReplicationController in a machine readable form, you can
use a command like this:

pods=$(kubectl get pods --selector=app=nginx --output=jsonpath={.items..metadata.name})
echo $pods

The output is similar to this:

nginx-3ntk0 nginx-4ok8v nginx-qrm3m
Here, the selector is the same as the selector for the ReplicationController (seen in the kubectl
describe output), and in a different form in replication.yaml. The --output=jsonpath option
specifies an expression that gets just the name from each pod in the returned list.
Writing a ReplicationController Manifest
As with all other Kubernetes config, a ReplicationController needs apiVersion, kind, and
metadata fields.
When the control plane creates new Pods for a ReplicationController, the .metadata.name of the
ReplicationController is part of the basis for naming those Pods. The name of a
ReplicationController must be a valid DNS subdomain value, but this can produce unexpected
results for the Pod hostnames. For best compatibility, the name should follow the more
restrictive rules for a DNS label.
For general information about working with configuration files, see object management.
A ReplicationController also needs a .spec section.
Pod Template
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested
and does not have an apiVersion or kind.
In addition to required fields for a Pod, a pod template in a ReplicationController must specify
appropriate labels and an appropriate restart policy. For labels, make sure not to overlap with
other controllers. See pod selector.
For local container restarts, ReplicationControllers delegate to an agent on the node, for
example the kubelet.
Labels on the ReplicationController
The ReplicationController can itself have labels (.metadata.labels). Typically, you would set
these the same as the .spec.template.metadata.labels; if .metadata.labels is not specified then it
defaults to .spec.template.metadata.labels. However, they are allowed to be different, and
the .metadata.labels do not affect the behavior of the ReplicationController.
Pod Selector
The .spec.selector field is a label selector. A ReplicationController manages all the pods with
labels that match the selector. It does not distinguish between pods that it created or deleted
and pods that another person or process created or deleted. This allows the
ReplicationController to be replaced without affecting the running pods.
Also you should not normally create any pods whose labels match this selector, either directly,
with another ReplicationController, or with another controller such as Job. If you do so, the
ReplicationController thinks that it created the other pods. Kubernetes does not stop you from
doing this.
If you do end up with multiple controllers that have overlapping selectors, you will have to
manage the deletion yourself (see below).
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas to the
number of pods you would like to have running concurrently. The number running at any time
may be higher or lower, such as if the replicas were just increased or decreased, or if a pod is
gracefully shutdown, and a replacement starts early.
Working with ReplicationControllers
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete. Kubectl will scale the
ReplicationController to zero and wait for it to delete each pod before deleting the
ReplicationController itself. If this kubectl command is interrupted, it can be restarted.
When using the REST API or client library, you need to do the steps explicitly (scale replicas to
0, wait for pod deletions, then delete the ReplicationController).
Deleting only a ReplicationController
You can delete a ReplicationController without affecting any of its pods. Using kubectl, specify
the --cascade=orphan option to kubectl delete. When using the REST API or client library, you
can delete the ReplicationController object.
Once the original is deleted, you can create a new ReplicationController to replace it. As long as
the old and new .spec.selector are the same, then the new one will adopt the old pods. However,
it will not make any effort to make existing pods match a new, different pod template. To
update pods to a new spec in a controlled way, use a rolling update.
Isolating pods from a ReplicationController
Pods may be removed from a ReplicationController's target set by changing their labels. This
technique may be used to remove pods from service for debugging and data recovery. Pods that
are removed in this way will be replaced automatically (assuming that the number of replicas is
not also changed).
Common usage patterns
Rescheduling
As mentioned above, whether you have 1 pod you want to keep running, or 1000, a
ReplicationController will ensure that the specified number of pods exists, even in the event of
node failure or pod termination (for example, due to an action by another control agent).
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually
or by an auto-scaling control agent, by updating the replicas field.
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods
one-by-one. The recommended approach is to create a new ReplicationController with 1 replica,
scale the new (+1) and old (-1) controllers one by one, and then delete the old controller after it
reaches 0 replicas. This predictably updates the set of pods regardless of unexpected failures.
Ideally, the rolling update controller would take application readiness into account, and would
ensure that a sufficient number of pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating label,
such as the image tag of the primary container of the pod, since it is typically image updates
that motivate rolling updates.
Multiple release tracks
In addition to running multiple releases of an application while a rolling update is in progress,
it is common to run multiple releases for an extended period of time, or even continuously,
using multiple release tracks. The tracks would be differentiated by labels.
For instance, a service might target all pods with tier in (frontend), environment in (prod). Now
say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a
new version of this component. You could set up a ReplicationController with replicas set to 9
for the bulk of the replicas, with labels tier=frontend, environment=prod, track=stable, and
another ReplicationController with replicas set to 1 for the canary, with labels tier=frontend,
environment=prod, track=canary. Now the service is covering both the canary and non-canary
pods. But you can mess with the ReplicationControllers separately to test things out, monitor
the results, etc.
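The stable/canary split described above can be sketched as two manifests that differ only in the track label, the replica count, and (in practice) the image version; the names and image tags here are illustrative:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: frontend-stable            # illustrative name
spec:
  replicas: 9                      # bulk of the replicas
  selector:
    tier: frontend
    environment: prod
    track: stable
  template:
    metadata:
      labels:
        tier: frontend
        environment: prod
        track: stable
    spec:
      containers:
      - name: frontend
        image: example/frontend:v1   # illustrative image
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: frontend-canary            # illustrative name
spec:
  replicas: 1                      # single canary replica
  selector:
    tier: frontend
    environment: prod
    track: canary
  template:
    metadata:
      labels:
        tier: frontend
        environment: prod
        track: canary
    spec:
      containers:
      - name: frontend
        image: example/frontend:v2   # illustrative image
```

A Service whose selector matches only tier=frontend and environment=prod then spans both tracks.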
Using ReplicationControllers with Services
Multiple ReplicationControllers can sit behind a single service, so that, for example, some traffic
goes to the old version, and some goes to the new version.
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived
as services. Services may be composed of pods controlled by multiple ReplicationControllers,
and it is expected that many ReplicationControllers may be created and destroyed over the
lifetime of a ser