This repository was archived by the owner on Feb 1, 2021. It is now read-only.

Info hangs while events are being flushed to slow readers #2718

@alexmavr

In the events handler, swarm managers currently hold a write lock while flushing events to listeners (https://github.com/docker/swarm/blob/master/api/events.go#L120-L142). If the flush takes longer than expected, for example because of a slow or inactive reader, every /info request on the leader swarm manager hangs for that duration. The info handler computes the number of listeners, which requires a read lock on the events handler (https://github.com/docker/swarm/blob/master/api/events.go#L149), and that read lock cannot be acquired until the write lock is released.

Steps to reproduce:

  1. Provision 1 classic swarm manager and 1 docker node
  2. Create several event listeners from different nodes using docker events, and suspend these processes in the shell using Ctrl+Z. This keeps the TCP connections open, but the listeners do not read any events.
  3. Populate events on the cluster by continuously running docker -H manager:port run --rm hello-world with a 0.1 second gap between runs.
  4. Run docker -H manager:port info and expect it to hang.

Potential courses of action for a resolution:

  1. Remove the number of event listeners from the response of Info. The CLI doesn't pretty-print that field, so the use case is fairly limited. Technically this is a functionality regression, so it should be the least preferred resolution.
  2. Remove the read lock on the Size method of the event handler. Instead, we can maintain an integer counter that is incremented and decremented together with additions to and deletions from eh.ws. We can use a separate lock to ensure atomicity of the delete/decrement and add/increment operations, and have the Size operation hold that lock instead, which gives it a much smaller critical section.
  3. Flush the http.ResponseWriter outside of the critical section of the eventsHandler.Handle method. This may result in events appearing after delays, or in large chunks, so I'm not sure that's the easiest approach.
