
rook-ceph-osd Init:CrashLoopBackOff after ungraceful K8s node restart #5027


Description

@birkb

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
The affected OSD will not be added back to the cluster after an ungraceful K8s node restart.

Expected behavior:

  • during a K8s node outage the Ceph cluster status should be degraded
  • after the K8s node has been started again, the OSD should be re-integrated into the Ceph cluster
  • no data loss, minimal replication effort

How to reproduce it (minimal and precise):

  • Shut down the K8s worker node: virsh destroy --graceful --domain k8s-worker-04
    • ceph status will report HEALTH_WARN after a while
    • one mon and one osd reported as lost
    • reduced total volume capacity
    • ceph volume access still possible
  • Start the K8s worker node: virsh start --domain k8s-worker-04
    • ceph status still HEALTH_WARN
    • one osd reported as lost
    • reduced total volume capacity
    • ceph volume access still possible
    • rook-ceph-osd-2-xyz is in Init:CrashLoopBackOff because of the following (see the inspection commands after the describe output below):
Controlled By:  ReplicaSet/rook-ceph-osd-2-85967dc998
Init Containers:
  activate-osd:
    Container ID:  docker://60636a068071e44c9600d251a0015f873352faf6b79394c99ed5910f52160073
    Image:         ceph/ceph:v14.2.8
    Image ID:      docker-pullable://ceph/ceph@sha256:a3d6360ee9685447bb316b1e4ce10229580ba81e37d111c479788446e7233eef
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c

      set -ex

      OSD_ID=2
      OSD_UUID=de214744-b37b-44ff-a5f0-5522102babb5
      OSD_STORE_FLAG="--bluestore"
      TMP_DIR=$(mktemp -d)
      OSD_DATA_DIR=/var/lib/ceph/osd/ceph-"$OSD_ID"

      # active the osd with ceph-volume
      ceph-volume lvm activate --no-systemd "$OSD_STORE_FLAG" "$OSD_ID" "$OSD_UUID"

      # copy the tmpfs directory to a temporary directory
      # this is needed because when the init container exits, the tmpfs goes away and its content with it
      # this will result in the emptydir to be empty when accessed by the main osd container
      cp --verbose --no-dereference "$OSD_DATA_DIR"/* "$TMP_DIR"/

      # unmount the tmpfs since we don't need it anymore
      umount "$OSD_DATA_DIR"

      # copy back the content of the tmpfs into the original osd directory
      cp --verbose --no-dereference "$TMP_DIR"/* "$OSD_DATA_DIR"

      # retain ownership of files to the ceph user/group
      chown --verbose --recursive ceph:ceph "$OSD_DATA_DIR"

      # remove the temporary directory
      rm --recursive --force "$TMP_DIR"

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 15 Mar 2020 18:06:29 +0100
      Finished:     Sun, 15 Mar 2020 18:06:30 +0100
    Ready:          False
    Restart Count:  5
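
For reference, the state above can be inspected with commands along these lines (the toolbox pod label assumes the stock toolbox.yaml; rook-ceph-osd-2-xyz is the pod from the describe output):

TOOLBOX=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
# overall cluster health and which OSD is down
kubectl -n rook-ceph exec -it "$TOOLBOX" -- ceph status
kubectl -n rook-ceph exec -it "$TOOLBOX" -- ceph osd tree
# state of the OSD pods and the error from the failing init container
kubectl -n rook-ceph get pods -l app=rook-ceph-osd
kubectl -n rook-ceph logs rook-ceph-osd-2-xyz -c activate-osd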

File(s) to submit:

  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/common.yaml
  • kubectl apply -f ceph-operator.yaml
    • based on https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/operator.yaml
    • rke-specific kubelet path added
- name: ROOK_CSI_KUBELET_DIR_PATH
  value: "/opt/rke/var/lib/kubelet"
  • kubectl apply -f ./rke/rook.io/ceph-cluster.yaml
    • based on https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/cluster.yaml
    • filter for k8s worker Ceph disks added
storage:
    deviceFilter: "^vd[b]"
  • enable pod disruption budgets
disruptionManagement:
    managePodBudgets: true
  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.2.6/cluster/examples/kubernetes/ceph/enable-csi-2.0-rbac.yaml
  • kubectl apply -f ceph-storageclass-erasurecoding.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: rook-ceph-block-erasurecoding
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    # clusterID is the namespace where the rook cluster is running
    # If you change this namespace, also change the namespace below where the secret namespaces are defined
    clusterID: rook-ceph

    # If you want to use erasure coded pool with RBD, you need to create
    # two pools. one erasure coded and one replicated.
    # You need to specify the replicated pool here in the `pool` parameter, it is
    # used for the metadata of the images.
    # The erasure coded pool must be set as the `dataPool` parameter below.
    dataPool: ec-data-pool
    pool: replicated-metadata-pool

    # RBD image format. Defaults to "2".
    imageFormat: "2"

    # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
    imageFeatures: layering

    # The secrets contain Ceph admin credentials. These are generated automatically by the operator
    # in the same namespace as the cluster.
    csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
    csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
    csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    # Specify the filesystem type of the volume. If not specified, csi-provisioner
    # will set default as `ext4`.
    csi.storage.k8s.io/fstype: xfs
# uncomment the following to use rbd-nbd as mounter on supported nodes
# **IMPORTANT**: If you are using rbd-nbd as the mounter, during upgrade you will be hit a ceph-csi
# issue that causes the mount to be disconnected. You will need to follow special upgrade steps
# to restart your application pods. Therefore, this option is not recommended.
#mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete
  • kubectl apply -f ceph-erasurecodingpool.yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-metadata-pool
  namespace: rook-ceph
spec:
  replicated:
    size: 2
---
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ec-data-pool
  namespace: rook-ceph
spec:
  # Make sure you have enough nodes and OSDs running bluestore to support the replica size or erasure code chunks.
  # For the below settings, you need at least 3 OSDs on different nodes (because the `failureDomain` is `host` by default).
  erasureCoded:
    dataChunks: 2
    codingChunks: 1
  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/dashboard-loadbalancer.yaml
  • kubectl apply -f https://raw.githubusercontent.com/rook/rook/[v1.2.5|v1.2.6]/cluster/examples/kubernetes/ceph/toolbox.yaml
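
To confirm that the manifests above were applied as intended, roughly the following can be checked (resource names taken from the YAML above):

kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get cephblockpool replicated-metadata-pool ec-data-pool
kubectl get storageclass rook-ceph-block-erasurecoding
# the deviceFilter "^vd[b]" should produce one osd-prepare job per worker that picks up /dev/vdb
kubectl -n rook-ceph get jobs -l app=rook-ceph-osd-prepare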

Environment:

  • OS: RancherOS 1.5.5
  • Kernel: 4.14.138-rancher
  • hardware configuration:
    • one physical server (8 cores, 64GB RAM, 2x SSD)
    • KVM
    • 3x Master Node VMs
    • 4x Worker Node VMs
    • same SSD for all VMs
  • Rook version: 1.2.5 and later 1.2.6
  • Storage backend version: 14.2.7 and later 14.2.8
  • Kubernetes version: v1.15.9-rancher1-1 and later v1.16.6-rancher1-2
  • Kubernetes cluster type: rke 1.0.4
    • kubelet settings for rook paths added
kubelet:
    extra_args:
      volume-plugin-dir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec
      root-dir: /opt/rke/var/lib/kubelet
    extra_binds:
      - "/usr/libexec/kubernetes/kubelet-plugins/volume/exec:/usr/libexec/kubernetes/kubelet-plugins/volume/exec"
      - "/var/lib/kubelet/plugins_registry:/var/lib/kubelet/plugins_registry"
      - "/var/lib/kubelet/pods:/var/lib/kubelet/pods:shared,z"
      - "/opt/rke/var/lib/kubelet:/opt/rke/var/lib/kubelet:shared,z"
  • Storage backend status: HEALTH_WARN
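
Since the kubelet root-dir is non-standard here, one extra sanity check (labels from the stock Rook CSI daemonset) is that the CSI plugin pods are running and that ROOK_CSI_KUBELET_DIR_PATH matches the root-dir above:

# CSI plugin daemonset pods should be Running on all four workers
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide
# the operator env var should match the kubelet root-dir configured above
kubectl -n rook-ceph get deployment rook-ceph-operator -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ROOK_CSI_KUBELET_DIR_PATH")].value}'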

Troubleshooting:

  • I have tried to clean up the Ceph config on the worker node, but without success:
sudo shred -n 1 -z /dev/vdb
sudo lvremove --select lv_name=~'osd-.*'
sudo vgremove --select vg_name=~'ceph-.*'
sudo pvremove /dev/vdb
sudo rm -rfv /dev/ceph-*
sudo rm -rfv /var/lib/rook
  • I have also tried to remove the OSD from the Ceph cluster config, but then the operator complains about the missing OSD:
ceph osd out osd.2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm osd.2
kubectl -n rook-ceph delete pod rook-ceph-osd-2-xyz
  • Deleting the rook-ceph-osd-2 deployment also did not fix the problem:
kubectl -n rook-ceph delete deployment rook-ceph-osd-2
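
For completeness, the step that would normally follow the cleanup above is restarting the operator so that it re-runs OSD provisioning against the wiped disk; a sketch, assuming the stock labels from the example manifests:

# restart the operator to trigger a new orchestration run
kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
# a new osd-prepare job and OSD deployment should then appear for the wiped node
kubectl -n rook-ceph get jobs -l app=rook-ceph-osd-prepare
kubectl -n rook-ceph get deployments -l app=rook-ceph-osd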

Sorry for the lengthy description, but there are a lot of moving parts involved.
