Kubernetes Scheduler: Learning kube-scheduler

Official Documentation

Kubernetes Scheduler https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

Scheduling happens in two steps: filtering plus scoring.

Filtering

  • PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
  • PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
  • PodFitsResources: Checks if the Node has free resources (e.g., CPU and Memory) to meet the requirements of the Pod.
  • PodMatchNodeSelector: Checks if a Pod’s Node Selector matches the Node’s label(s).
  • NoVolumeZoneConflict: Evaluate if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
  • NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
  • MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that’s over a configured limit.
  • CheckNodeMemoryPressure: If a Node is reporting memory pressure, and there’s no configured exception, the Pod won’t be scheduled there.
  • CheckNodePIDPressure: If a Node is reporting that process IDs are scarce, and there’s no configured exception, the Pod won’t be scheduled there.
  • CheckNodeDiskPressure: If a Node is reporting storage pressure (a filesystem that is full or nearly full), and there’s no configured exception, the Pod won’t be scheduled there.
  • CheckNodeCondition: Nodes can report that they have a completely full filesystem, that networking isn’t available or that kubelet is otherwise not ready to run Pods. If such a condition is set for a Node, and there’s no configured exception, the Pod won’t be scheduled there.
  • PodToleratesNodeTaints: Checks if a Pod’s tolerations can tolerate the Node’s taints.
  • CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.

In short, these check: free ports, whether a specific hostname is requested, sufficient resources, a matching NodeSelector, volume-related constraints, disk/PID/memory pressure, node conditions, and taints versus tolerations.
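As an illustration, here is a hypothetical Pod manifest that exercises several of the predicates above: hostPort (PodFitsHostPorts), resource requests (PodFitsResources), nodeSelector (PodMatchNodeSelector), and tolerations (PodToleratesNodeTaints). The label and taint keys/values are made-up examples:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo
spec:
  nodeSelector:
    disktype: ssd            # hypothetical node label; PodMatchNodeSelector
  tolerations:
  - key: "dedicated"         # hypothetical taint; PodToleratesNodeTaints
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080         # node must have port 8080 free; PodFitsHostPorts
    resources:
      requests:              # node must have this much unreserved; PodFitsResources
        cpu: "500m"
        memory: "256Mi"
```

A node is filtered out as soon as any one predicate fails; only nodes passing all of them move on to scoring.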

Scoring

  • SelectorSpreadPriority: Spreads Pods across hosts, considering Pods belonging to the same Service, StatefulSet or ReplicaSet.
  • InterPodAffinityPriority: Computes a sum by iterating through the elements of weightedPodAffinityTerm and adding “weight” to the sum if the corresponding PodAffinityTerm is satisfied for that node; the node(s) with the highest sum are the most preferred.
  • LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
  • MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
  • RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
  • BalancedResourceAllocation: Favors nodes with balanced resource usage.
  • NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn’t run on the same Node.
  • NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
  • TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node’s rank taking that list into account.
  • ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
  • ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don’t have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
  • CalculateAntiAffinityPriorityMap: This policy helps implement pod anti-affinity.
  • EqualPriorityMap: Gives an equal weight of one to all nodes.

In short: spread pods where possible, pod affinity, preferred node affinity settings, node resource usage, taints, and bonus points for nodes that already have the required images cached.
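For example, NodeAffinityPriority above scores nodes against PreferredDuringSchedulingIgnoredDuringExecution terms; unlike a filter, a node that fails the preference is still schedulable, just ranked lower. A minimal sketch (the disktype label is a made-up example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      # A preference, not a requirement: nodes matching it get extra score.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80               # 1-100; contributes to the node's rank
        preference:
          matchExpressions:
          - key: disktype        # hypothetical node label
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
```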

Scheduling Speed Optimization for Large Clusters

Scheduler Performance Tuning: https://kubernetes.io/docs/concepts/scheduling/scheduler-perf-tuning/

Percentage of Nodes to Score: in a very large cluster there is no need to score every node. In a 1000-node cluster, if 500 nodes pass filtering, scoring just 10 to 20 of them and picking a suitable one is enough. The scoring scope can be adjusted as a percentage, and users can tune this value according to cluster size.
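This threshold is set via percentageOfNodesToScore in the scheduler's configuration file. A minimal sketch (the API group/version, kubescheduler.config.k8s.io/v1alpha1 here, depends on your cluster version):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
# Stop looking for feasible nodes once 50% of the cluster has been covered,
# then score only that subset instead of all nodes.
percentageOfNodesToScore: 50
```

Pass the file to kube-scheduler with the --config flag.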

Solving Uneven Zone Distribution: Pod Topology Spread Constraints

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar

The example above checks the skew by zone. Suppose there are two zones, A and B, each with several nodes, and zone A currently has 2 pods while zone B has 1.

When the next pod is scheduled with maxSkew: 1, it must go to zone B. Only by raising maxSkew to 2 or 3 could the next pod possibly land in zone A.
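Put together as a complete manifest, such a Pod might look like the sketch below. Note that the Pod's own labels must match the labelSelector for it to count toward the skew, and nodes need a zone label for the topologyKey to apply:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo
  labels:
    foo: bar               # must match labelSelector to be counted in the skew
spec:
  topologySpreadConstraints:
  - maxSkew: 1             # max allowed difference in matching pods between zones
    topologyKey: zone      # nodes must carry a "zone" label for this to apply
    whenUnsatisfiable: DoNotSchedule   # hard constraint; ScheduleAnyway makes it soft
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: nginx
```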

A known issue:

  • Scaling down a Deployment may result in an imbalanced Pod distribution.

Pod Overhead

https://kubernetes.io/docs/concepts/configuration/pod-overhead/

Pods have some resource overhead. In the traditional Linux container (Docker) approach, the overhead accounted to a Pod is limited to the infra (pause) container, yet running a Pod also incurs overhead in various system components, including: the kubelet (control loops), Docker, the kernel (various resources), and fluentd (logs).

This mainly accounts for the extra overhead of the kubelet, docker, and the kernel. Enabling this feature requires configuring the feature gate in several places, including the kubelet and the scheduler.
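Overhead is declared on a RuntimeClass and added to the requests of every Pod that uses it when the scheduler checks node fit. A sketch, assuming a hypothetical runtime class name and made-up overhead numbers (the API version, node.k8s.io/v1alpha1 here, varies by release):

```yaml
apiVersion: node.k8s.io/v1alpha1
kind: RuntimeClass
metadata:
  name: kata-fc            # hypothetical runtime class
handler: kata-fc
overhead:
  podFixed:                # added on top of container requests during scheduling
    cpu: "250m"
    memory: "120Mi"
```

A Pod opts in by setting spec.runtimeClassName: kata-fc.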

kube-scheduler Startup Flags

https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/

In addition, Kubernetes supports running multiple schedulers, and you can specify which scheduler to use on a pod.

https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/

If the default scheduler does not meet your scheduling needs, you can implement your own scheduler and configure it in the cluster, or run several schedulers side by side: the default-scheduler handles pods by default, while specific pods are scheduled by your custom scheduler.
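A pod picks a scheduler via spec.schedulerName; in the sketch below, my-scheduler is a made-up name for a custom scheduler you would deploy yourself:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-scheduler   # hypothetical custom scheduler name
  containers:
  - name: app
    image: nginx
```

If no scheduler with that name is running, the pod simply stays Pending.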

Kubernetes Native Scheduling and Extensibility

Basic scheduling policy: first filter on the scheduling constraints to get the list of feasible nodes, then score them and pick the highest-scoring one.

Below is a case study, from https://itnext.io/keep-you-kubernetes-cluster-balanced-the-secret-to-high-availability-17edf60d9cb7 (highly recommended).


Open-Source Projects for Scheduling Optimization

https://github.com/topics/k8s-sig-scheduling currently lists three scheduling-related projects: poseidon, kube-batch, and descheduler. There is also a personal project, resbalancer.

Their use cases differ: kube-batch targets batch scheduling; descheduler performs rebalancing, a kind of second-pass scheduling; poseidon tries to produce better placements by modelling scheduling as a flow-network problem.

Descheduler

Descheduler exists to make up for the limits of Kubernetes' own one-shot scheduling: once a pod is placed, the scheduler never revisits the decision. It runs as a periodic task and, following its implemented policies, rebalances the distribution of pods across the cluster.

This rescheduling task can run as a Kubernetes Job. For example, if business traffic bottoms out at 2 a.m., you can run the Job at that time, say once a week, so the cluster stays reasonably balanced. Alternatively, run the Descheduler a few hours after a release day.
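Descheduler's behaviour is driven by a policy file. A sketch in the v1alpha1 policy format, with made-up threshold numbers:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":        # evict duplicate pods of the same owner on one node
    enabled: true
  "LowNodeUtilization":      # move pods from over- to under-utilized nodes
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # nodes above any of these may have pods evicted
          cpu: 50
          memory: 50
          pods: 50
```

Evicted pods are then re-placed by the regular scheduler, which is what produces the rebalancing effect.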

A similar project is https://github.com/pusher/k8s-spot-rescheduler. In AWS Kubernetes clusters it reschedules pods from a group of heavily loaded nodes onto another group of nodes, which is roughly equivalent to labelling the two node groups and moving pods from one label to the other.

Kube-Batch

A batch scheduler aimed at machine learning / big data / HPC workloads. The gang scheduler in kubeflow is implemented on top of kube-batch.

https://www.jianshu.com/p/042692685cf4

kube-batch

On this basis, Huawei launched https://github.com/volcano-sh/scheduler

volcano

Poseidon (alpha: https://github.com/kubernetes-sigs/poseidon/releases/tag/v0.8; after the alpha release in May there have been no updates, and the main branch was last updated on 2019 Apr 4)

Kubernetes supports third-party schedulers. Firmament itself is written in C++ while Kubernetes is written in Go, so Poseidon acts as a bridge, integrating the Firmament scheduler into Kubernetes.

Firmament is a flow-network-based scheduler. It uses efficient batching and optimizes with a min-cost max-flow algorithm; combined with Firmament's scheduling policies, this yields very good pod placement.

https://zhuanlan.zhihu.com/p/35161270

As in the case study above, this is where we need a de-scheduler:

https://github.com/kubernetes-sigs/descheduler
