Kubernetes Scheduler: Learning kube-scheduler

Official Documentation

Kubernetes Scheduler https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/

Scheduling happens in two steps: filtering plus scoring.

Filtering

  • PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
  • PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
  • PodFitsResources: Checks if the Node has free resources (e.g., CPU and Memory) to meet the requirements of the Pod.
  • PodMatchNodeSelector: Checks if a Pod’s Node Selector matches the Node’s label(s).
  • NoVolumeZoneConflict: Evaluate if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
  • NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
  • MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that’s over a configured limit.
  • CheckNodeMemoryPressure: If a Node is reporting memory pressure, and there’s no configured exception, the Pod won’t be scheduled there.
  • CheckNodePIDPressure: If a Node is reporting that process IDs are scarce, and there’s no configured exception, the Pod won’t be scheduled there.
  • CheckNodeDiskPressure: If a Node is reporting storage pressure (a filesystem that is full or nearly full), and there’s no configured exception, the Pod won’t be scheduled there.
  • CheckNodeCondition: Nodes can report that they have a completely full filesystem, that networking isn’t available or that kubelet is otherwise not ready to run Pods. If such a condition is set for a Node, and there’s no configured exception, the Pod won’t be scheduled there.
  • PodToleratesNodeTaints: Checks if a Pod’s tolerations can tolerate the Node’s taints.
  • CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.

In short, these check: free ports, whether a specific hostname is requested, sufficient resources, a matching NodeSelector, volume-related constraints, disk/PID/memory pressure, node conditions, and taints versus tolerations.
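As an illustration, here is a hypothetical Pod manifest that exercises several of the predicates above: hostPort (PodFitsHostPorts), resource requests (PodFitsResources), nodeSelector (PodMatchNodeSelector), and tolerations (PodToleratesNodeTaints). The label and taint keys/values are made-up examples:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo
spec:
  nodeSelector:
    disktype: ssd            # hypothetical node label; PodMatchNodeSelector
  tolerations:
  - key: "dedicated"         # hypothetical taint; PodToleratesNodeTaints
    operator: "Equal"
    value: "batch"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080         # node must have port 8080 free; PodFitsHostPorts
    resources:
      requests:              # node must have this much unreserved; PodFitsResources
        cpu: "500m"
        memory: "256Mi"
```

A node is filtered out as soon as any one predicate fails; only nodes passing all of them move on to scoring.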

Scoring

  • SelectorSpreadPriority: Spreads Pods across hosts, considering Pods belonging to the same Service, StatefulSet or ReplicaSet.
  • InterPodAffinityPriority: Computes a sum by iterating through the elements of weightedPodAffinityTerm and adding “weight” to the sum if the corresponding PodAffinityTerm is satisfied for that node; the node(s) with the highest sum are the most preferred.
  • LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
  • MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
  • RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
  • BalancedResourceAllocation: Favors nodes with balanced resource usage.
  • NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn’t run on the same Node.
  • NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
  • TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node’s rank taking that list into account.
  • ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
  • ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don’t have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
  • CalculateAntiAffinityPriorityMap: This policy helps implement pod anti-affinity.
  • EqualPriorityMap: Gives an equal weight of one to all nodes.

In short: spread pods where possible, pod affinity, preferred node affinity settings, node resource usage, taints, and bonus points for nodes that already have the required images cached.
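For example, NodeAffinityPriority above scores nodes against PreferredDuringSchedulingIgnoredDuringExecution terms; unlike a filter, a node that fails the preference is still schedulable, just ranked lower. A minimal sketch (the disktype label is a made-up example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:
      # A preference, not a requirement: nodes matching it get extra score.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80               # 1-100; contributes to the node's rank
        preference:
          matchExpressions:
          - key: disktype        # hypothetical node label
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx
```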

Scheduling Speed Optimization for Large Clusters

Scheduler Performance Tuning: https://kubernetes.io/docs/concepts/scheduling/scheduler-perf-tuning/

Percentage of Nodes to Score: in a very large cluster there is no need to score every node. In a 1000-node cluster, if 500 nodes pass filtering, scoring just 10 to 20 of them and picking a suitable one is enough. The scoring scope can be adjusted as a percentage, and users can tune this value according to cluster size.
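This threshold is set via percentageOfNodesToScore in the scheduler's configuration file. A minimal sketch (the API group/version, kubescheduler.config.k8s.io/v1alpha1 here, depends on your cluster version):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
# Stop looking for feasible nodes once 50% of the cluster has been covered,
# then score only that subset instead of all nodes.
percentageOfNodesToScore: 50
```

Pass the file to kube-scheduler with the --config flag.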

Solving Uneven Zone Distribution: Pod Topology Spread Constraints

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        foo: bar

The example above checks the skew by zone. Suppose there are two zones, A and B, each with several nodes, and zone A currently has 2 pods while zone B has 1.

When the next pod is scheduled with maxSkew: 1, it must go to zone B. Only by raising maxSkew to 2 or 3 could the next pod possibly land in zone A.
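Put together as a complete manifest, such a Pod might look like the sketch below. Note that the Pod's own labels must match the labelSelector for it to count toward the skew, and nodes need a zone label for the topologyKey to apply:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spread-demo
  labels:
    foo: bar               # must match labelSelector to be counted in the skew
spec:
  topologySpreadConstraints:
  - maxSkew: 1             # max allowed difference in matching pods between zones
    topologyKey: zone      # nodes must carry a "zone" label for this to apply
    whenUnsatisfiable: DoNotSchedule   # hard constraint; ScheduleAnyway makes it soft
    labelSelector:
      matchLabels:
        foo: bar
  containers:
  - name: app
    image: nginx
```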

A known issue:

  • Scaling down a Deployment may result in an imbalanced Pod distribution.

Pod Overhead

https://kubernetes.io/docs/concepts/configuration/pod-overhead/

Pods have some resource overhead. In the traditional Linux container (Docker) approach, the overhead accounted to a Pod is limited to the infra (pause) container, yet running a Pod also incurs overhead in various system components, including: the kubelet (control loops), Docker, the kernel (various resources), and fluentd (logs).

This mainly accounts for the extra overhead of the kubelet, docker, and the kernel. Enabling this feature requires configuring the feature gate in several places, including the kubelet and the scheduler.
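Overhead is declared on a RuntimeClass and added to the requests of every Pod that uses it when the scheduler checks node fit. A sketch, assuming a hypothetical runtime class name and made-up overhead numbers (the API version, node.k8s.io/v1alpha1 here, varies by release):

```yaml
apiVersion: node.k8s.io/v1alpha1
kind: RuntimeClass
metadata:
  name: kata-fc            # hypothetical runtime class
handler: kata-fc
overhead:
  podFixed:                # added on top of container requests during scheduling
    cpu: "250m"
    memory: "120Mi"
```

A Pod opts in by setting spec.runtimeClassName: kata-fc.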

kube-scheduler Startup Flags

https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/

In addition, Kubernetes supports running multiple schedulers, and you can specify which scheduler to use on a pod.

https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/

If the default scheduler does not meet your scheduling needs, you can implement your own scheduler and configure it in the cluster, or run several schedulers side by side: the default-scheduler handles pods by default, while specific pods are scheduled by your custom scheduler.
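A pod picks a scheduler via spec.schedulerName; in the sketch below, my-scheduler is a made-up name for a custom scheduler you would deploy yourself:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-scheduler   # hypothetical custom scheduler name
  containers:
  - name: app
    image: nginx
```

If no scheduler with that name is running, the pod simply stays Pending.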

Kubernetes Native Scheduling and Extensibility

Basic scheduling policy: first filter on the scheduling constraints to get the list of feasible nodes, then score them and pick the highest-scoring one.

Below is a case study, from https://itnext.io/keep-you-kubernetes-cluster-balanced-the-secret-to-high-availability-17edf60d9cb7 (highly recommended).


Open-Source Projects for Scheduling Optimization

https://github.com/topics/k8s-sig-scheduling currently lists three scheduling-related projects: poseidon, kube-batch, and descheduler. There is also a personal project, resbalancer.

Their use cases differ: kube-batch targets batch scheduling; descheduler performs rebalancing, a kind of second-pass scheduling; poseidon tries to produce better placements by modelling scheduling as a flow-network problem.

Descheduler

Descheduler exists to make up for the limits of Kubernetes' own one-shot scheduling: once a pod is placed, the scheduler never revisits the decision. It runs as a periodic task and, following its implemented policies, rebalances the distribution of pods across the cluster.

This rescheduling task can run as a Kubernetes Job. For example, if business traffic bottoms out at 2 a.m., you can run the Job at that time, say once a week, so the cluster stays reasonably balanced. Alternatively, run the Descheduler a few hours after a release day.
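Descheduler's behaviour is driven by a policy file. A sketch in the v1alpha1 policy format, with made-up threshold numbers:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":        # evict duplicate pods of the same owner on one node
    enabled: true
  "LowNodeUtilization":      # move pods from over- to under-utilized nodes
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # nodes above any of these may have pods evicted
          cpu: 50
          memory: 50
          pods: 50
```

Evicted pods are then re-placed by the regular scheduler, which is what produces the rebalancing effect.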

A similar project is https://github.com/pusher/k8s-spot-rescheduler. In AWS Kubernetes clusters it reschedules pods from a group of heavily loaded nodes onto another group of nodes, which is roughly equivalent to labelling the two node groups and moving pods from one label to the other.

Kube-Batch

A batch scheduler aimed at machine learning / big data / HPC workloads. The gang scheduler in kubeflow is implemented on top of kube-batch.

https://www.jianshu.com/p/042692685cf4

kube-batch

On this basis, Huawei launched https://github.com/volcano-sh/scheduler

volcano

Poseidon (alpha: https://github.com/kubernetes-sigs/poseidon/releases/tag/v0.8; after the alpha release in May there have been no updates, and the main branch was last updated on 2019 Apr 4)

Kubernetes supports third-party schedulers. Firmament itself is written in C++ while Kubernetes is written in Go, so Poseidon acts as a bridge, integrating the Firmament scheduler into Kubernetes.

Firmament is a flow-network-based scheduler. It uses efficient batching and optimizes with a min-cost max-flow algorithm; combined with Firmament's scheduling policies, this yields very good pod placement.

https://zhuanlan.zhihu.com/p/35161270

As in the case study above, this is where we need a de-scheduler:

https://github.com/kubernetes-sigs/descheduler
