Description
We are using openebs/zfs and have multiple reports of user issues that boil down to Docker running with the zfs storage driver.
Pods are stuck in terminating state:
container <ID> driver "zfs" failed to remove root filesystem: exit status 1: "zfs fs destroy -r <dataset-path>" => cannot open '<dataset-path>': dataset does not exist
Reports:
- https://jira.ixsystems.com/browse/NAS-112418
- https://www.truenas.com/community/threads/openebs-zfs-driver-removal-in-progress-for-4-weeks.96192/
- https://help.nextcloud.com/t/solved-cannot-start-service-db-error-creating-zfs-mount-after-zfs-snapshot-rollback/61695
There seems to be a workaround: manually creating the missing datasets makes Docker happy and things return to normal, but this is tedious.
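For anyone applying the workaround by hand, here is a rough sketch of how the missing dataset names could be pulled out of the daemon logs. The function names `missing_datasets` and `print_create_commands` are made up for illustration, and it is an assumption that a plain `zfs create -p` is enough for Docker to proceed. The script only prints the commands (a dry run) so they can be reviewed before anything is run against the pool.

```shell
#!/bin/sh
# Hypothetical helper, not part of Docker or OpenEBS.
# Reads log text on stdin and prints each dataset that dockerd
# reported as missing, one per line, without duplicates.
missing_datasets() {
    grep -o "cannot open '[^']*': dataset does not exist" \
        | sed "s/^cannot open '//; s/': dataset does not exist\$//" \
        | sort -u
}

# Dry run: print a "zfs create -p" command per missing dataset
# instead of executing it, so the list can be checked first.
print_create_commands() {
    missing_datasets | while IFS= read -r ds; do
        printf 'zfs create -p %s\n' "$ds"
    done
}
```

After sourcing the file, the log text can be piped in, for example `journalctl -u docker --no-pager | print_create_commands` (the unit name `docker` is an assumption and may differ on your system).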
Steps to reproduce the issue:
We have not found concrete steps to reproduce. It appears to be a race condition or a non-atomic operation: the driver could not finish removing the layer datasets and ends up in a crash loop. Most of the time it happens after a system restart or power cut, when pods are stuck in the Terminating state and produce the following logs:
Describe the results you received:
truenas# docker ps -a
CONTAINER ID   IMAGE                COMMAND                  CREATED       STATUS                     PORTS   NAMES
d39f9d831680   openebs/zfs-driver   "/usr/local/bin/zfs-…"   4 weeks ago   Removal In Progress                k8s_openebs-zfs-plugin_openebs-zfs-node-ds5sd_kube-system_d692ac39-ac25-4da6-adbb-93615d9cef3b_203
4f7053116229   rancher/pause:3.1    "/pause"                 4 weeks ago   Exited (255) 4 weeks ago           k8s_POD_openebs-zfs-node-ds5sd_kube-system_d692ac39-ac25-4da6-adbb-93615d9cef3b_27
truenas# k get pods -A
NAMESPACE     NAME                       READY   STATUS        RESTARTS   AGE
kube-system   openebs-zfs-node-ds5sd     0/2     Terminating   315        106d
kube-system   coredns-7448499f4d-hmsv9   1/1     Running       0          8h
kube-system   openebs-zfs-controller-0   5/5     Running       0          8h
#######
## LOGS:
#######
Oct 27 01:36:10 truenas k3s[8629]: I1027 01:36:10.703743 8629 scope.go:111] "RemoveContainer" containerID="d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b"
Oct 27 01:36:10 truenas dockerd[8071]: time="2021-10-27T01:36:10.739172443+02:00" level=error msg="Error removing mounted layer d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b: exit status 1: \"/usr/sbin/zfs fs destroy -r storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24\" => cannot open 'storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24': dataset does not exist\n"
Oct 27 01:36:10 truenas dockerd[8071]: time="2021-10-27T01:36:10.739870694+02:00" level=error msg="Handler for DELETE /v1.41/containers/d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b returned error: container d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b: driver \"zfs\" failed to remove root filesystem: exit status 1: \"/usr/sbin/zfs fs destroy -r storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24\" => cannot open 'storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24': dataset does not exist\n"
Oct 27 01:36:10 truenas k3s[8629]: E1027 01:36:10.740889 8629 remote_runtime.go:296] "RemoveContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to remove container \"d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b\": Error response from daemon: container d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b: driver \"zfs\" failed to remove root filesystem: exit status 1: \"/usr/sbin/zfs fs destroy -r storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24\" => cannot open 'storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24': dataset does not exist" containerID="d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b"
Oct 27 01:36:10 truenas k3s[8629]: E1027 01:36:10.741271 8629 kuberuntime_gc.go:146] "Failed to remove container" err="rpc error: code = Unknown desc = failed to remove container \"d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b\": Error response from daemon: container d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b: driver \"zfs\" failed to remove root filesystem: exit status 1: \"/usr/sbin/zfs fs destroy -r storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24\" => cannot open 'storage_404/ix-applications/docker/3114cdddca0eb01f1de848925eba4d027796b424b91f12d843673e174667bd24': dataset does not exist" containerID="d39f9d83168090d640c8473d2311ec52be722e58951549aef32f187e59f90f8b"
Describe the results you expected:
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Client: Docker Engine - Community
 Version:           20.10.11
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        dea9396
 Built:             Thu Nov 18 00:37:22 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.11
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       847da18
  Built:            Thu Nov 18 00:35:31 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Output of docker info:
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.3-docker)
  scan: Docker Scan (Docker Inc., v0.9.0)

Server:
 Containers: 36
  Running: 33
  Paused: 0
  Stopped: 3
 Images: 23
 Server Version: 20.10.11
 Storage Driver: zfs
  Zpool: evo
  Zpool Health: ONLINE
  Parent Dataset: evo/ix-applications/docker
  Space Used By Parent: 2543288320
  Space Available: 16198967296
  Parent Quota: no
  Compression: lz4
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.10.70+truenas
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 6
 Total Memory: 13.53GiB
 Name: truenas.local
 ID: NDIM:5V6V:MQ5E:QNFP:VPAB:4EYO:YKNL:EF55:WW5R:2IQM:ARAF:5NIG
 Docker Root Dir: /mnt/evo/ix-applications/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):