This issue occurs when trying to deploy new hosts using 18.03.1-ce. Previously workloads would run correctly using ~17.06.
There are a bunch of goroutines stuck on image remove. There is also one stuck on container remove that seems to be holding the lock:
1: select [728 minutes] [Created by http.(*Server).Serve @ server.go:2720]
transport transport.go:256 (*Stream).Header(*Stream(#502), #659, #264, #689)
grpc call.go:64 recvResponse(Context(#721), dialOptions(#189), ClientTransport(#685), *callInfo(0x0), *Stream(0x0), interface{}(0x0), ...)
grpc call.go:279 invoke(Context(#721), string(#642, len=40), interface{}(#651), interface{}(#654), *ClientConn(#150), CallOption(0x0), ...)
containerd grpc.go:18 namespaceInterceptor.unary(Context(#636), string(#720, len=842350551064), interface{}(#642), interface{}(#651), *ClientConn(#654), UnaryInvoker(#693), ...)
containerd grpc.go:34 unary)-fm(string(#720, len=842350551064), #642, 0x28, #651, #517, #654, #693, #150, #658, ...)
grpc call.go:141 Invoke(Context(#720), string(#642, len=40), interface{}(#651), interface{}(#654), *ClientConn(#150), CallOption(0x0), ...)
v1 tasks.pb.go:468 (*tasksClient).Kill(*tasksClient(#429), Context(#720), *KillRequest(#517), CallOption(0x0), CallOption(0x0), CallOption(#396))
containerd task.go:181 (*task).Kill(*task(#533), Context(#687), Signal(0x9), KillOpts(0x0), KillOpts(0x0))
libcontainerd client_daemon.go:389 (*client).SignalProcess(*client(#132), Context(#687), string(#316, len=64), string(#635, len=4), 9, #305, #434)
daemon kill.go:179 (*Daemon).kill(*Daemon(#33), *Container(#548), 9, 0x2, 0x2)
daemon kill.go:99 (*Daemon).killWithSignal(*Daemon(#33), *Container(#548), 9, 0x0, 0x0)
daemon kill.go:169 (*Daemon).killPossiblyDeadProcess(*Daemon(#33), *Container(#548), 9, #265, #615)
daemon kill.go:129 (*Daemon).Kill(*Daemon(#33), *Container(#548), 0x0, 0x0)
daemon delete.go:93 (*Daemon).cleanupContainer(*Daemon(#33), *Container(#548), bool(0x101), bool(0x0))
daemon delete.go:48 (*Daemon).ContainerRm(*Daemon(#33), string(#567, len=64), *ContainerRmConfig(#237), 0x0, 0x0)
container container_routes.go:494 (*containerRouter).deleteContainers(*containerRouter(#208), Context(#721), ResponseWriter(#684), *Request(#315), map[string]string(#514), 0x5)
container container.go:68 deleteContainers)-fm(*containerRouter(#721), #516, #684, #274, #315, #514, #721, #516)
middleware experimental.go:27 ExperimentalMiddleware.WrapHandler.func1(func(#721), #684, #274, #315, #514, #721, #516)
middleware version.go:62 VersionMiddleware.WrapHandler.func1(func(#721), #684, #274, #315, #514, #606, 0x40)
authorization middleware.go:59 (*Middleware).WrapHandler.func1(*Middleware(#721), func(#515), #274, #315, #514, #721, #515)
server server.go:137 (*Server).makeHTTPHandler.func1(*Server(#684), APIFunc(#274))
http server.go:1918 HandlerFunc.ServeHTTP(ReadCloser(#203), func(#274))
mux mux.go:103 (*Router).ServeHTTP(*Router(#42), ResponseWriter(#684), *Request(#315))
server router_swapper.go:29 (*routerSwapper).ServeHTTP(*routerSwapper(#238), ResponseWriter(#684), *Request(#315))
http server.go:2619 serverHandler.ServeHTTP(*Server(#43), #684, #274, #315)
http server.go:1801 (*conn).serve(#508, #686, #466)
1: semacquire [16 minutes] [Created by http.(*Server).Serve @ server.go:2720]
sync sema.go:71 runtime_SemacquireMutex(*uint32(#491), bool(0x0))
sync mutex.go:134 (*Mutex).Lock(*Mutex(#490))
container state.go:241 (*State).IsRunning(*State(#490), 0x47)
daemon image_delete.go:378 (*Daemon).checkImageDeleteConflict.func1(*Daemon(#548), ID(#579))
container memory_store.go:62 (*memoryStore).First(*memoryStore(#188), StoreFilter(#547))
daemon image_delete.go:380 (*Daemon).checkImageDeleteConflict(*Daemon(#33), ID(#538), conflictType(0xf))
daemon image_delete.go:313 (*Daemon).imageDeleteHelper(*Daemon(#33), ID(#538), *<unknown>(#523), bool(0x10100), bool(0x1))
daemon image_delete.go:178 (*Daemon).ImageDelete(*Daemon(#33), string(#504, len=11), bool(0x100), bool(#616), #644, #554)
image image_routes.go:199 (*imageRouter).deleteImages(*imageRouter(#227), Context(#721), ResponseWriter(#684), *Request(#79), map[string]string(#476), 0x5)
image image.go:42 deleteImages)-fm(*imageRouter(#721), #478, #684, #266, #79, #476, #721, #478)
middleware experimental.go:27 ExperimentalMiddleware.WrapHandler.func1(func(#721), #684, #266, #79, #476, #721, #478)
middleware version.go:62 VersionMiddleware.WrapHandler.func1(func(#721), #684, #266, #79, #476, #700, 0x40)
authorization middleware.go:59 (*Middleware).WrapHandler.func1(*Middleware(#721), func(#477), #266, #79, #476, #721, #477)
server server.go:137 (*Server).makeHTTPHandler.func1(*Server(#684), APIFunc(#266))
http server.go:1918 HandlerFunc.ServeHTTP(ReadCloser(#74), func(#266))
mux mux.go:103 (*Router).ServeHTTP(*Router(#42), ResponseWriter(#684), *Request(#79))
server router_swapper.go:29 (*routerSwapper).ServeHTTP(*routerSwapper(#238), ResponseWriter(#684), *Request(#79))
http server.go:2619 serverHandler.ServeHTTP(*Server(#43), #684, #266, #79)
http server.go:1801 (*conn).serve(#503, #686, #568)
I looked around for related issues, but couldn't find anything recent.
Trying to `image rm` any image will block and never complete. There were 5 containers running on the host. Trying to `inspect` or `rm` two of them had the same behaviour (blocked and never completed).
Full output from SIGUSR1: goroutine-stacks-2018-05-15T151422Z.log