Skip to content
This repository was archived by the owner on Jan 30, 2020. It is now read-only.
This repository was archived by the owner on Jan 30, 2020. It is now read-only.

fleet: server monitor fails to shutdown process #1716

@mwitkow

Description

@mwitkow

After a short etcd blip, fleet has issues on its agent and engine, but the process remains up. This is affecting v0.13.0

The symptoms are as follows: the Monitor detects the server failed heartbeat, asks all components to shut down, but the shutdown of all components never completes. This means that most `components are dead, the server process is still up, but serves:
{"error":{"code":503,"message":"fleet server unable to communicate with etcd"}}

The full error log is here:

Dec 07 19:28:55 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:56 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:236: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: ERROR reconciler.go:59: Failed getting current cluster state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:28:59 eu2-prod-core-hasu fleetd[3250]: WARN engine.go:117: Engine completed reconciliation in 4.004575849s
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR job.go:109: failed fetching all Units from etcd: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:120: Failed fetching Units from Registry: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:01 eu2-prod-core-hasu fleetd[3250]: ERROR reconcile.go:73: Unable to determine agent's desired state: client: etcd cluster is unavailable or misconfigured
Dec 07 19:29:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:237: Server monitor triggered: Monitor timed out before successful heartbeat
Dec 07 19:29:03 eu2-prod-core-hasu fleetd[3250]: ERROR engine.go:221: Engine leadership lost, renewal failed: client: etcd cluster is unavailable or misconfigured
Dec 07 19:30:02 eu2-prod-core-hasu fleetd[3250]: ERROR server.go:248: Timed out waiting for server to shut down

The curious bit is this code:
https://github.com/coreos/fleet/blob/v0.13.0/server/server.go#L248

func (s *Server) Supervise() {
	sd, err := s.mon.Monitor(s.hrt, s.killc)
	if sd {
		log.Infof("Server monitor triggered: told to shut down")
	} else {
		log.Errorf("Server monitor triggered: %v", err)
	}
	close(s.stopc)
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(shutdownTimeout):
		log.Errorf("Timed out waiting for server to shut down")
		sd = true
	}
	if !sd {
		log.Infof("Restarting server")
		s.SetRestartServer(true)
		s.Run()
		s.SetRestartServer(false)
	}
}

I think that after Timed out waiting for server to shut down the server should just crash immediately.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions