fleet: server monitor fails to shutdown process #1716
After a short etcd blip, fleet's agent and engine break, but the fleetd process stays up. This affects v0.13.0.
The symptoms are as follows: the Monitor detects that the server failed its heartbeat and asks all components to shut down, but the shutdown never completes. Most components are dead, yet the server process is still up and serves:
{"error":{"code":503,"message":"fleet server unable to communicate with etcd"}}
The full error log is here:
```
Dec 07 19:28:55 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:236: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR reconciler.go:59: Failed getting current cluster state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: WARN engine.go:117: Engine completed reconciliation in 4.004575849s
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:237: Server monitor triggered: Monitor timed out before successful heartbeat
Dec 07 19:29:03 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:30:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:248: Timed out waiting for server to shut down
```
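For context on why the shutdown can wedge: fleet's server signals its components by closing a stop channel and then waits on a `sync.WaitGroup` (`s.stopc` and `s.wg` in the code below). If even one component goroutine is blocked on something other than the stop channel, say a hanging etcd call, it never calls `Done()` and `wg.Wait()` blocks forever. A minimal standalone sketch of that failure mode (the two components here are hypothetical, not fleet's actual ones):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	stopc := make(chan struct{})

	// A well-behaved component: exits promptly when stopc closes.
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-stopc
	}()

	// A stuck component: blocked on something other than stopc
	// (e.g. a hanging etcd call), so it never reaches Done().
	wg.Add(1)
	go func() {
		defer wg.Done()
		select {} // blocks forever, ignoring stopc
	}()

	close(stopc) // ask all components to shut down

	done := make(chan struct{})
	go func() {
		wg.Wait() // never returns: the stuck component never calls Done()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("clean shutdown")
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for shutdown") // the state fleet ends up in
	}
}
```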
The curious bit is this code:
https://github.com/coreos/fleet/blob/v0.13.0/server/server.go#L248
```go
func (s *Server) Supervise() {
	sd, err := s.mon.Monitor(s.hrt, s.killc)
	if sd {
		log.Infof("Server monitor triggered: told to shut down")
	} else {
		log.Errorf("Server monitor triggered: %v", err)
	}
	close(s.stopc)
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(shutdownTimeout):
		log.Errorf("Timed out waiting for server to shut down")
		sd = true
	}
	if !sd {
		log.Infof("Restarting server")
		s.SetRestartServer(true)
		s.Run()
		s.SetRestartServer(false)
	}
}
```

I think that after "Timed out waiting for server to shut down", the server should just crash immediately. As written, the timeout branch sets `sd = true`, which skips the restart block, so `Supervise` simply returns and leaves the half-dead process serving 503s.
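One possible shape of that fix, assuming fleetd runs under a supervisor such as systemd with `Restart=on-failure` (a sketch of the suggestion above, not an actual fleet patch): make the timeout branch terminate the process instead of falling through.

```go
select {
case <-done:
case <-time.After(shutdownTimeout):
	log.Errorf("Timed out waiting for server to shut down")
	// Components are wedged; a half-dead server only answers 503s.
	// Exit non-zero (requires importing "os") so the init system
	// can restart fleetd from a clean slate.
	os.Exit(1)
}
```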