
rook-ceph-osd Init:CrashLoopBackOff after ungraceful K8s node restart #5027


Description

@birkb

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
The affected OSD will not be added back to the cluster after an ungraceful K8s node restart.

Expected behavior:

  • during a K8s node outage the Ceph cluster status should be degraded
  • after the K8s node has been started again, the OSD should be re-integrated into the Ceph cluster
  • no data loss, minimal replication effort

How to reproduce it (minimal and precise):

  • Shut down the K8s worker node: virsh destroy --graceful --domain k8s-worker-04
    • ceph status will report HEALTH_WARN after a while
    • one mon and one osd reported as lost
    • reduced total volume capacity
    • ceph volume access still possible
  • Start the K8s worker node: virsh start --domain k8s-worker-04
    • ceph status still HEALTH_WARN
    • one osd reported as lost
    • reduced total volume capacity
    • ceph volume access still possible
    • rook-ceph-osd-2-xyz is in Init:CrashLoopBackOff because of the following (see the inspection commands after the describe output below):
Controlled By:  ReplicaSet/rook-ceph-osd-2-85967dc998
Init Containers:
  activate-osd:
    Container ID:  docker://60636a068071e44c9600d251a0015f873352faf6b79394c99ed5910f52160073
    Image:         ceph/ceph:v14.2.8
    Image ID:      docker-pullable://ceph/ceph@sha256:a3d6360ee9685447bb316b1e4ce10229580ba81e37d111c479788446e7233eef
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c

      set -ex

      OSD_ID=2
      OSD_UUID=de214744-b37b-44ff-a5f0-5522102babb5
      OSD_STORE_FLAG="--bluestore"
      TMP_DIR=$(mktemp -d)
      OSD_DATA_DIR=/var/lib/ceph/osd/ceph-"$OSD_ID"

      # active the osd with ceph-volume
      ceph-volume lvm activate --no-systemd "$OSD_STORE_FLAG" "$OSD_ID" "$OSD_UUID"

      # copy the tmpfs directory to a temporary directory
      # this is needed because when the init container exits, the tmpfs goes away and its content with it
      # this will result in the emptydir to be empty when accessed by the main osd container
      cp --verbose --no-dereference "$OSD_DATA_DIR"/* "$TMP_DIR"/

      # unmount the tmpfs since we don't need it anymore
      umount "$OSD_DATA_DIR"

      # copy back the content of the tmpfs into the original osd directory
      cp --verbose --no-dereference "$TMP_DIR"/* "$OSD_DATA_DIR"

      # retain ownership of files to the ceph user/group
      chown --verbose --recursive ceph:ceph "$OSD_DATA_DIR"

      # remove the temporary directory
      rm --recursive --force "$TMP_DIR"

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 15 Mar 2020 18:06:29 +0100
      Finished:     Sun, 15 Mar 2020 18:06:30 +0100
    Ready:          False
    Restart Count:  5
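
For reference, the state above can be inspected with commands along these lines (the toolbox pod label assumes the stock toolbox.yaml; rook-ceph-osd-2-xyz is the pod from the describe output):

TOOLBOX=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
# overall cluster health and which OSD is down
kubectl -n rook-ceph exec -it "$TOOLBOX" -- ceph status
kubectl -n rook-ceph exec -it "$TOOLBOX" -- ceph osd tree
# state of the OSD pods and the error from the failing init container
kubectl -n rook-ceph get pods -l app=rook-ceph-osd
kubectl -n rook-ceph logs rook-ceph-osd-2-xyz -c activate-osd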

File(s) to submit:

  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/common.yaml
  • kubectl apply -f ceph-operator.yaml
    • based on https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/operator.yaml
    • rke-specific kubelet path added
- name: ROOK_CSI_KUBELET_DIR_PATH
  value: "/opt/rke/var/lib/kubelet"
  • kubectl apply -f ./rke/rook.io/ceph-cluster.yaml
    • based on https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/cluster.yaml
    • filter for k8s worker Ceph disks added
storage:
    deviceFilter: "^vd[b]"
  • enable pod disruption budgets
disruptionManagement:
    managePodBudgets: true
  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.2.6/cluster/examples/kubernetes/ceph/enable-csi-2.0-rbac.yaml
  • kubectl apply -f ceph-storageclass-erasurecoding.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-block-erasurecoding
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    # clusterID is the namespace where the rook cluster is running
    # If you change this namespace, also change the namespace below where the secret namespaces are defined
    clusterID: rook-ceph

    # If you want to use erasure coded pool with RBD, you need to create
    # two pools. one erasure coded and one replicated.
    # You need to specify the replicated pool here in the `pool` parameter, it is
    # used for the metadata of the images.
    # The erasure coded pool must be set as the `dataPool` parameter below.
    dataPool: ec-data-pool
    pool: replicated-metadata-pool

    # RBD image format. Defaults to "2".
    imageFormat: "2"

    # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
    imageFeatures: layering

    # The secrets contain Ceph admin credentials. These are generated automatically by the operator
    # in the same namespace as the cluster.
    csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
    csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
    csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    # Specify the filesystem type of the volume. If not specified, csi-provisioner
    # will set default as `ext4`.
    csi.storage.k8s.io/fstype: xfs
# uncomment the following to use rbd-nbd as mounter on supported nodes
# **IMPORTANT**: If you are using rbd-nbd as the mounter, during upgrade you will be hit a ceph-csi
# issue that causes the mount to be disconnected. You will need to follow special upgrade steps
# to restart your application pods. Therefore, this option is not recommended.
#mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete
  • kubectl apply -f ceph-erasurecodingpool.yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-metadata-pool
  namespace: rook-ceph
spec:
  replicated:
    size: 2
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ec-data-pool
  namespace: rook-ceph
spec:
  # Make sure you have enough nodes and OSDs running bluestore to support the replica size or erasure code chunks.
  # For the below settings, you need at least 3 OSDs on different nodes (because the `failureDomain` is `host` by default).
  erasureCoded:
    dataChunks: 2
    codingChunks: 1
  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/dashboard-loadbalancer.yaml
  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/toolbox.yaml
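
To confirm that the manifests above were applied as intended, roughly the following can be checked (resource names taken from the YAML above):

kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get cephblockpool replicated-metadata-pool ec-data-pool
kubectl get storageclass rook-ceph-block-erasurecoding
# the deviceFilter "^vd[b]" should produce one osd-prepare job per worker that picks up /dev/vdb
kubectl -n rook-ceph get jobs -l app=rook-ceph-osd-prepare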

Environment:

  • OS: RancherOS 1.5.5
  • Kernel: 4.14.138-rancher
  • hardware configuration:
    • one physical server (8 cores, 64GB RAM, 2x SSD)
    • KVM
    • 3x Master Node VMs
    • 4x Worker Node VMs
    • same SSD for all VMs
  • Rook version: 1.2.5 and later 1.2.6
  • Storage backend version: 14.2.7 and later 14.2.8
  • Kubernetes version: v1.15.9-rancher1-1 and later v1.16.6-rancher1-2
  • Kubernetes cluster type: rke 1.0.4
    • kubelet settings for rook paths added
kubelet:
    extra_args:
      volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      root-dir: /opt/rke/var/lib/kubelet
    extra_binds:
      - "/usr/libexec/kubernetes/kubelet-plugins/volume/exec:/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
      - "/var/lib/kubelet/plugins_registry:/var/lib/kubelet/plugins_registry"
      - "/var/lib/kubelet/pods:/var/lib/kubelet/pods:shared,z"
      - "/opt/rke/var/lib/kubelet:/opt/rke/var/lib/kubelet:shared,z"
  • Storage backend status: HEALTH_WARN
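
Since the kubelet root-dir is non-standard here, one extra sanity check (labels from the stock Rook CSI daemonset) is that the CSI plugin pods are running and that ROOK_CSI_KUBELET_DIR_PATH matches the root-dir above:

# CSI plugin daemonset pods should be Running on all four workers
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide
# the operator env var should match the kubelet root-dir configured above
kubectl -n rook-ceph get deployment rook-ceph-operator -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ROOK_CSI_KUBELET_DIR_PATH")].value}'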

Troubleshooting:

  • I have tried to clean up the Ceph config on the worker node, but without success:
sudo shred -n 1 -z /dev/vdb
sudo lvremove --select lv_name=~'osd-.*'
sudo vgremove --select vg_name=~'ceph-.*'
sudo pvremove /dev/vdb
sudo rm -rfv /dev/ceph-*
sudo rm -rfv /var/lib/rook
  • I have also tried to remove the OSD from the Ceph cluster config, but then the operator complains about the missing OSD:
ceph osd out osd.2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm osd.2
kubectl -n rook-ceph delete pod rook-ceph-osd-2-xyz
  • Deleting the rook-ceph-osd-2 deployment also did not fix the problem:
kubectl -n rook-ceph delete deployment rook-ceph-osd-2
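
For completeness, the step that would normally follow the cleanup above is restarting the operator so that it re-runs OSD provisioning against the wiped disk; a sketch, assuming the stock labels from the example manifests:

# restart the operator to trigger a new orchestration run
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
# a new osd-prepare job and OSD deployment should then appear for the wiped node
kubectl -n rook-ceph get jobs -l app=rook-ceph-osd-prepare
kubectl -n rook-ceph get deployments -l app=rook-ceph-osd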

Sorry for the lengthy description, but there are a lot of moving parts involved.
