Skip to content

Borg Deadlocked #813

@rumpelsepp

Description

@rumpelsepp

This night all my backups failed, due to some kind of deadlock. I have three servers, one contains the borg repository, the other two servers push their backups into that (=the same) repository. I do hourly backups, the diff between the hourly backups is usally just about 9 MB or something and normally they are really fast. The setup worked well for about 1 week.

To avoid races, I schedule my hourly cron with sleep $(jot -r 1 1 600) && backup.sh, which generates a random delay from 0 to 10 minutes. Additionally, I have set the --lock-wait flag to 1800. This night a got the following emails:

Server:

Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote: Borg 1.0.0: exception in RPC call:
Remote: Traceback (most recent call last):
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/remote.py", line 94, in serve
Remote:     res = f(*args)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 429, in put
Remote:     self.prepare_txn(self.get_transaction_id())
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/repository.py", line 178, in prepare_txn
Remote:     self.lock.upgrade()
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 298, in upgrade
Remote:     self.acquire(exclusive=True, remove=SHARED)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 257, in acquire
Remote:     self._wait_for_readers_finishing(remove, sleep)
Remote:   File "/usr/local/lib/python3.4/site-packages/borg/locking.py", line 286, in _wait_for_readers_finishing
Remote:     raise LockTimeout(self.path)
Remote: borg.locking.LockTimeout: /var/borg/backup/lock
Remote: Platform: FreeBSD korriban 10.2-RELEASE-p9 FreeBSD 10.2-RELEASE-p9 #0: Thu Jan 14 01:32:46 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Remote: Borg: 1.0.0  Python: CPython 3.4.4
Remote: PID: 11415  CWD: /var/borg
Remote: sys.argv: ['/usr/local/bin/borg', 'serve', '--umask=077']
Remote: SSH_ORIGINAL_COMMAND: None
Remote: 
Remote Exception (see remote log for the traceback)
Platform: FreeBSD coruscant 10.2-RELEASE-p14 FreeBSD 10.2-RELEASE-p14 #0: Wed Mar 16 20:46:12 UTC 2016     [email protected]:/usr/obj/usr/src/sys/GENERIC amd64 amd64
Borg: 1.0.0  Python: CPython 3.4.4
PID: 59110  CWD: /root
sys.argv: ['/usr/local/bin/borg', 'create', '-C', 'lz4', '--lock-wait', '1800', '[email protected]:/var/borg/backup::coruscant-2016-03-30-010210-CEST', '/usr/home', '/etc', '/usr/local/etc']
SSH_ORIGINAL_COMMAND: None

Client 1

Failed to create/acquire the lock /root/.cache/borg/018468490212573b7a255fdbeadc052964eb2fee8e0446b79abc7fead592bfb2/lock.exclusive (timeout).

Client 2

Failed to create/acquire the lock /root/.cache/borg/018468490212573b7a255fdbeadc052964eb2fee8e0446b79abc7fead592bfb2/lock.exclusive (timeout).

The server has too much borg processes left:

$ ps aux | grep borg
borg      11410   1,5  0,9 153684  38712  -  Ss    1:02am      5:18,73 /usr/local/bin/python3.4 /usr/local/bin/borg serve --umask=077
root      11407   0,0  0,2  86496   7468  -  Is    1:02am      0:00,05 sshd: borg [priv] (sshd)
borg      11409   0,0  0,2  86496  10304  -  I     1:02am      0:00,24 sshd: borg@notty (sshd)
root      25779   0,0  0,2  86496   7460  -  Is    5:08am      0:00,05 sshd: borg [priv] (sshd)
borg      25781   0,0  0,2  86496   7596  -  I     5:08am      0:00,01 sshd: borg@notty (sshd)
borg      25782   0,0  0,9 149588  38024  -  Is    5:08am      2:22,57 /usr/local/bin/python3.4 /usr/local/bin/borg serve --umask=077
root      35819   0,0  0,2  86496   7468  -  Is    8:00am      0:00,05 sshd: borg [priv] (sshd)
borg      35821   0,0  0,2  86496   7480  -  I     8:00am      0:00,01 sshd: borg@notty (sshd)
borg      35822   0,0  0,9 149332  37648  -  Ss    8:00am      0:01,22 /usr/local/bin/python3.4 /usr/local/bin/borg serve --umask=077
root      36103   0,0  0,2  86496   7460  -  Is    8:08am      0:00,05 sshd: borg [priv] (sshd)
borg      36105   0,0  0,2  86496   7472  -  I     8:08am      0:00,01 sshd: borg@notty (sshd)
borg      36106   0,0  0,9 149332  37632  -  Ss    8:08am      0:01,22 /usr/local/bin/python3.4 /usr/local/bin/borg serve --umask=077
stefan    37298   0,0  0,1  18824   2564  1  S+    8:28am      0:00,01 grep borg

It seems that it deadlocked very badly; any ideas what went wrong? Is this a bug or misconfiguration?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions