Zero-downtime deployments with rolling upgrades #30321

@mvdstam

Hey all,

Use case
I have a front-end proxy which listens on ports 80 and 443. My applications are deployed as microservices behind this proxy, which acts as a layer-7 load balancer for those applications. These applications are, for example, php:7-apache containers which simply listen on port 80 for HTTP requests. The idea is to have at least 2 replicas for a service, so they can be upgraded incrementally using the rolling upgrade functionality that comes with Docker Swarm.
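For reference, a setup like the one described could be created roughly as follows. This is only a sketch: the function names, the service name, and the image tags are placeholders of mine, and the `DOCKER` variable exists only so the commands can be previewed (for example with `DOCKER=echo`) instead of executed.

```shell
#!/bin/bash
# Hypothetical helpers; DOCKER can be overridden (e.g. DOCKER=echo)
# to print the commands instead of running them.

# create_service NAME IMAGE
# Creates a service with two replicas that is updated one task at a time.
create_service() {
  "${DOCKER:-docker}" service create \
    --name "$1" \
    --replicas 2 \
    --update-parallelism 1 \
    --update-delay 10s \
    "$2"
}

# roll_out NAME IMAGE
# Triggers a rolling update of an existing service to a new image.
roll_out() {
  "${DOCKER:-docker}" service update --image "$2" "$1"
}
```

Usage would look like `create_service web php:7-apache` followed later by `roll_out web <new image>`.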

Issue
At this point, I'm not sure if zero-downtime deployments using the rolling update functionality of Docker Swarm are even possible, at least for my use case. There is one major issue for me here.

When containers are stopped during a rolling update, they are always stopped with the same signal: SIGTERM, followed by SIGKILL after a grace period. Many images, like the aforementioned Apache-based image, don't shut down gracefully on SIGTERM and need a different signal to do so. I created an issue (#25696) for this as well, but it didn't make the 1.13 release. I don't see how the current rolling upgrade system can work for any container that isn't specifically designed to shut down gracefully when it receives a SIGTERM. In my situation, upgrading the service leads to intermittent HTTP 502 errors until the upgrade is complete. I can't imagine this not being a problem for anyone else, unless I'm missing something obvious.
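To make the stop sequence concrete, here is a rough shell imitation of it: send SIGTERM, wait up to a grace period, then send SIGKILL. This is not Docker's implementation; `stop_with_grace` and `STOP_RESULT` are names I made up for illustration.

```shell
#!/bin/bash
# stop_with_grace PID GRACE_SECONDS
# Sends SIGTERM, then SIGKILL if the process is still alive after the
# grace period. Sets STOP_RESULT to "term" or "kill" accordingly.
# PID is assumed to be a child of the current shell.
stop_with_grace() {
  local pid="$1" grace="$2" waited=0
  STOP_RESULT="term"
  kill -TERM "$pid" 2>/dev/null || true
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      # Grace period is over: force the process to stop.
      kill -KILL "$pid" 2>/dev/null || true
      wait "$pid" 2>/dev/null || true
      STOP_RESULT="kill"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

A process that exits on SIGTERM ends up with `STOP_RESULT=term`, but so does one like Apache that exits immediately without draining its connections, which is exactly the problem here.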

Possible workaround
Wrap the main command of an image that needs to shut down gracefully in a wrapper script:

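#!/bin/bash

# Relay SIGTERM/SIGINT to Apache as SIGWINCH (graceful stop).
shut_down() {
  kill -SIGWINCH "${SCRIPT_PID}"
}

trap 'shut_down' SIGTERM SIGINT

# start_apache stands in for the image's actual start command.
start_apache &
SCRIPT_PID="$!"

# wait returns early when a trapped signal arrives, so keep
# waiting until Apache has really exited.
while kill -0 "${SCRIPT_PID}" 2>/dev/null; do
  wait "${SCRIPT_PID}"
done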

This would immediately fix the issue I'm having, since any SIGINT or SIGTERM that reaches the container would simply be relayed as a SIGWINCH, which makes my Apache container shut down gracefully. However, it would mean modifying every image I use to include this script. It's also non-standard and, to be fair, a nasty solution.
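A slightly more generic variant of the same idea takes the target signal and the command as arguments, so one script could be shared across images. Again just a sketch: `relay_as` is a name I made up, and it assumes bash is available in the image.

```shell
#!/bin/bash
# relay_as SIGNAL COMMAND [ARGS...]
# Runs COMMAND in the background and forwards SIGTERM/SIGINT to it as
# SIGNAL, then waits until the command has actually exited.
relay_as() {
  local sig="$1"
  shift
  "$@" &
  local child=$!
  local status=0
  # Expand sig and child now so the trap does not depend on local scope.
  trap "kill -s $sig $child 2>/dev/null" TERM INT
  # wait returns early when a trapped signal arrives, so loop until
  # the child is really gone.
  while kill -0 "$child" 2>/dev/null; do
    status=0
    wait "$child" || status=$?
  done
  trap - TERM INT
  return "$status"
}
```

For the Apache case the call would be something like `relay_as WINCH start_apache`, with `start_apache` replaced by the image's real start command.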

What's the recommended course of action here? Am I missing a piece of the puzzle, or simply overlooking something? Also: even if this issue were solved, would I get true zero-downtime deployments with the rolling upgrade functionality of Docker Swarm Mode? In other words, are containers actually removed from the ingress load-balancing pool before the stop_signal is sent during the upgrade, or would I still get HTTP 502 errors from containers that are still being load-balanced to while they are shutting down?

Labels: area/networking, area/swarm, kind/enhancement
