This repository was archived by the owner on Feb 1, 2021. It is now read-only.
Info hangs while events are being flushed to slow readers #2718
Closed
Description
In the events handler, swarm managers currently hold a write lock while flushing events to listeners (https://github.com/docker/swarm/blob/master/api/events.go#L120-L142). If this flush takes longer than expected, for example because of a slow or inactive reader, all /info requests on the leader swarm manager hang for that duration. The info handler calculates the number of listeners, which requires a read lock on the events handler (https://github.com/docker/swarm/blob/master/api/events.go#L149).
Steps to reproduce:
- Provision 1 classic swarm manager and 1 docker node
- Create several event listeners from different nodes using `docker events`, and suspend these processes in the shell using `Ctrl+Z`. This keeps the TCP connection open, but the listeners will not actively read any events.
- Populate events on the cluster by continuously running `docker -H manager:port run --rm hello-world` with a 0.1 second gap.
- Perform a `docker -H manager:port info` and expect it to hang
Potential courses of action for a resolution:
- Remove the number of event listeners from the response of Info. The CLI doesn't pretty-print that field, so the use case is fairly limited. Technically this is a functionality regression, so it should be the least preferred resolution.
- Remove the read lock on the `Size` method of the event handler. Instead, we can maintain an integer counter that is incremented and decremented appropriately together with entries to `eh.ws`. We can use a separate lock to ensure atomicity of the `delete`/decrement and `add`/increment operations, and have the `Size` operation hold that lock instead, which will have a much smaller critical section.
- Flush the `http.ResponseWriter` outside of the critical section of the `eventsHandler.Handle` method. This may result in events appearing after delays, or in large chunks, so I'm not sure if that's the easiest approach.