This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Units fail after a restart #974

@paddycarver

Here's my setup: a cluster of one machine running a bunch of units, some of which depend on others. When the machine gets restarted, some of the units come back online and others fail to. From the logs, it looks like the units that fail do so because they depend on a unit that has not been written yet. Is there some way to either 1) not reload the units until all units have been written, or 2) have the reloading take "Requires" into account? Or is my problem that I'm losing my entire cluster, so any promise of restarting goes out the window at that point?
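
To make option 2 concrete, here is a rough sketch of what I mean by taking "Requires" into account: order the units so that each one is written only after the units it requires. This is not how fleet works internally, just an illustration. The `write_order` helper and the `requires` mapping are hypothetical, the unit names come from the logs below, and `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def write_order(requires):
    """Return unit names ordered so each unit follows the units it requires.

    `requires` maps a unit name to the list of units named in its Requires=.
    """
    sorter = TopologicalSorter()
    for unit, deps in requires.items():
        # add(node, *predecessors): the dependencies must come before the unit.
        sorter.add(unit, *deps)
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(sorter.static_order())


# Hypothetical dependency map, mirroring the Requires= lines of my units.
requires = {
    "hgweb-nginx.1.service": ["hgweb.1.service"],
    "hgweb.1.service": ["code-ssh.1.service"],
    "code-ssh.1.service": [],
}

order = write_order(requires)
print(order)
```

With an ordering like this, code-ssh.1.service would be on disk before hgweb.1.service is started, which is the step that fails in the logs.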

The logs:

-- Reboot --
Oct 17 04:12:26 localhost systemd[1]: Starting fleet daemon...
Oct 17 04:12:26 localhost systemd[1]: Started fleet daemon.
Oct 17 04:12:31 infra01.c.secondbit-infra.internal fleetd[562]: INFO fleet.go:144: No provided or default config file found - proceeding without
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: INFO server.go:137: Establishing etcd connectivity
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Update /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 100ms
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Update /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 200ms
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Update /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 400ms
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:32 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Update /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 800ms
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Create /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 100ms
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Create /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 200ms
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Create /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 400ms
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: INFO client.go:278: Failed getting response from http://localhost:4001/: dial tcp 127.0.0.1:4001: connection refused
Oct 17 04:12:33 infra01.c.secondbit-infra.internal fleetd[562]: ERROR client.go:200: Unable to get result for {Create /_coreos.com/fleet/machines/27383b7352cd4b66b4c29861c2f0e8d1/object}, retrying in 800ms
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO server.go:148: Starting server components
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO engine.go:149: Engine leadership acquired
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:218: Writing systemd unit hgweb-nginx.1.service (551b)
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:142: Instructing systemd to reload units
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:218: Writing systemd unit hgweb.1.service (475b)
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=LoadUnit job=hgweb-nginx.1.service reason="unit scheduled here but not loaded"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:142: Instructing systemd to reload units
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: ERROR manager.go:80: Failed to trigger systemd unit hgweb-nginx.1.service start: Unit code-ssh.1.service failed to load: No such file or directory.
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=hgweb-nginx.1.service reason="unit currently loaded but desired state is launched"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:218: Writing systemd unit code-backup.1.service (877b)
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=LoadUnit job=hgweb.1.service reason="unit scheduled here but not loaded"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:142: Instructing systemd to reload units
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: ERROR manager.go:80: Failed to trigger systemd unit hgweb.1.service start: Unit code-ssh.1.service failed to load: No such file or directory.
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=hgweb.1.service reason="unit currently loaded but desired state is launched"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:218: Writing systemd unit code-backup.1.timer (153b)
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=LoadUnit job=code-backup.1.service reason="unit scheduled here but not loaded"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:142: Instructing systemd to reload units
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:218: Writing systemd unit code-ssh.1.service (1495b)
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=LoadUnit job=code-backup.1.timer reason="unit scheduled here but not loaded"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:142: Instructing systemd to reload units
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:78: Triggered systemd unit code-backup.1.timer start: job=1112
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=code-backup.1.timer reason="unit currently loaded but desired state is launched"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=LoadUnit job=code-ssh.1.service reason="unit scheduled here but not loaded"
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:78: Triggered systemd unit code-ssh.1.service start: job=1113
Oct 17 04:12:35 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=code-ssh.1.service reason="unit currently loaded but desired state is launched"
Oct 17 04:49:48 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:89: Triggered systemd unit hgweb.1.service stop: job=1568
Oct 17 04:49:48 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StopUnit job=hgweb.1.service reason="unit currently launched but desired state is loaded"
Oct 17 04:49:56 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:78: Triggered systemd unit hgweb.1.service start: job=1569
Oct 17 04:49:56 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=hgweb.1.service reason="unit currently loaded but desired state is launched"
Oct 17 04:50:03 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:89: Triggered systemd unit hgweb-nginx.1.service stop: job=1654
Oct 17 04:50:03 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StopUnit job=hgweb-nginx.1.service reason="unit currently launched but desired state is loaded"
Oct 17 04:50:10 infra01.c.secondbit-infra.internal fleetd[562]: INFO manager.go:78: Triggered systemd unit hgweb-nginx.1.service start: job=1655
Oct 17 04:50:10 infra01.c.secondbit-infra.internal fleetd[562]: INFO reconcile.go:274: AgentReconciler completed task: type=StartUnit job=hgweb-nginx.1.service reason="unit currently loaded but desired state is launched"
core@infra01 ~ $ fleetctl cat hgweb-nginx.1.service
[Unit]
Description=hgweb-nginx
After=hgweb.1.service
Requires=hgweb.1.service
BindsTo=hgweb.1.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill hgweb-nginx-1
ExecStartPre=-/usr/bin/docker rm hgweb-nginx-1
ExecStartPre=/usr/bin/docker pull secondbit/hgweb-nginx
ExecStart=/usr/bin/docker run --rm --link hgweb-1:hgweb --volumes-from code-1 --volumes-from hgweb-1 --name hgweb-nginx-1 -p=80:80 secondbit/hgweb-nginx
ExecStop=/usr/bin/docker stop hgweb-nginx-1

[X-Fleet]
X-Conflicts=hgweb-nginx.*.service
MachineOf=hgweb.1.service
core@infra01 ~ $ fleetctl cat hgweb.1.service
[Unit]
Description=hgweb
After=code-ssh.1.service
Requires=code-ssh.1.service
BindsTo=code-ssh.1.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill hgweb-1
ExecStartPre=-/usr/bin/docker rm hgweb-1
ExecStartPre=/usr/bin/docker pull secondbit/hgweb
ExecStart=/usr/bin/docker run --rm --volumes-from code-1 --name hgweb-1 -p=3031:3031 secondbit/hgweb
ExecStop=/usr/bin/docker stop hgweb-1

[X-Fleet]
X-Conflicts=hgweb.*.service
MachineOf=code-ssh.1.service
core@infra01 ~ $
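
For reference, the dependency information is already in the unit files above, so a loader could in principle read it before deciding what to write first. A minimal sketch, assuming the only thing of interest is the `Requires=` lines in the `[Unit]` section; `required_units` is a hypothetical helper and not part of fleet or systemd:

```python
def required_units(unit_text):
    """Collect the unit names listed in Requires= lines of the [Unit] section."""
    deps = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # Track which section we are in, e.g. "Unit", "Service", "X-Fleet".
            section = line[1:-1]
        elif section == "Unit" and line.startswith("Requires="):
            # Requires= takes a space-separated list of unit names.
            deps.extend(line[len("Requires="):].split())
    return deps


# The [Unit] and [X-Fleet] parts of hgweb.1.service as shown above.
hgweb_unit = """\
[Unit]
Description=hgweb
After=code-ssh.1.service
Requires=code-ssh.1.service
BindsTo=code-ssh.1.service

[X-Fleet]
X-Conflicts=hgweb.*.service
MachineOf=code-ssh.1.service
"""

print(required_units(hgweb_unit))
```

This ignores `BindsTo=` and `After=`, which a real implementation would probably need to consider as well.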
