Description
The bug is caused by a race condition in hostcfgd: the daemon has no graceful shutdown flow, and because its filesystem modification sequence is not atomic, an interrupted update can leave configuration files empty.
Simplified, the flow is:
- Config file is moved to a backup copy
- SIGTERM is received and execution is interrupted
- Original config file is now missing from the filesystem
- On the next startup, the original config file is re-created as an empty file
- On the following update, the now-empty config file is moved to the backup copy
- Both files are empty
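The sequence above can be replayed on a scratch file without touching /etc. This is only a simulation under assumptions: `rewrite` and `replay` are hypothetical helpers that mimic the order of operations in `modify_single_file` with plain Python file calls, and the `interrupt_after_backup` flag stands in for SIGTERM arriving after the backup move.

```python
import os
import tempfile


def rewrite(path, interrupt_after_backup=False):
    """Mimic the non-atomic sequence: write .new, move to .old, move .new in."""
    # Generate the new content. A missing input yields an empty .new file,
    # just as a failed sed leaves its redirected output empty.
    content = ""
    if os.path.exists(path):
        with open(path) as src:
            content = src.read()
    with open(path + ".new", "w") as dst:
        dst.write(content)
    # Back up the current file. A missing source is skipped, like the
    # unchecked `mv -f` call whose failure is ignored.
    if os.path.exists(path):
        os.replace(path, path + ".old")
    if interrupt_after_backup:
        return  # stands in for SIGTERM arriving at this point
    os.replace(path + ".new", path)


def replay(directory):
    """Run the three updates and return (size of file, size of backup)."""
    path = os.path.join(directory, "login")
    with open(path, "w") as f:
        f.write("@include common-auth\n")
    rewrite(path, interrupt_after_backup=True)  # original file is now missing
    rewrite(path)                               # file is re-created empty
    rewrite(path)                               # empty file overwrites the backup
    return os.path.getsize(path), os.path.getsize(path + ".old")


with tempfile.TemporaryDirectory() as scratch:
    print(replay(scratch))
```

In this replay the good content survives in the backup after the second update and is only lost on the third one, which matches the last two steps of the list.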
LOG:
Apr 4 05:15:42.112943 sonic INFO hostcfgd: file size check pass: /etc/pam.d/sshd size is (2139) bytes
Apr 4 05:15:42.123196 sonic ERR hostcfgd: file size check failed: /etc/pam.d/login is empty, file corrupted
Apr 4 05:15:42.150275 sonic INFO hostcfgd: file size check pass: /etc/nsswitch.conf size is (494) bytes
Apr 4 05:15:42.170420 sonic INFO hostcfgd: file size check pass: /etc/nsswitch.conf size is (494) bytes
DUMP:
sonic_dump_20240404_051657/etc/pam.d
root@sonic: pam.d$ ls -la | grep "login\|sshd"
-rw-r--r-- 1 root root 0 Apr 4 05:18 login
-rw-r--r-- 1 root root 0 Apr 4 05:18 login.old
-rw-r--r-- 1 root root 2139 Apr 4 05:18 sshd
-rw-r--r-- 1 root root 2139 Apr 4 05:18 sshd.old
root@sonic: pam.d$ cat login
root@sonic: pam.d$ cat login.old
https://github.com/sonic-net/sonic-host-services/blob/master/scripts/hostcfgd#L726
    # Modify common-auth include file in /etc/pam.d/login, sshd.
    # /etc/pam.d/sudo is not handled, because it would change the existing
    # behavior. It can be modified once a config knob is added for sudo.
    if os.path.isfile(PAM_AUTH_CONF):
        self.modify_single_file(ETC_PAMD_SSHD, [ "/^@include/s/common-auth$/common-auth-sonic/" ])
        self.modify_single_file(ETC_PAMD_LOGIN, [ "/^@include/s/common-auth$/common-auth-sonic/" ])
    else:
        self.modify_single_file(ETC_PAMD_SSHD, [ "/^@include/s/common-auth-sonic$/common-auth/" ])
        self.modify_single_file(ETC_PAMD_LOGIN, [ "/^@include/s/common-auth-sonic$/common-auth/" ])
https://github.com/sonic-net/sonic-host-services/blob/master/scripts/hostcfgd#L609
    def modify_single_file(self, filename, operations=None):
        if operations:
            e_list = ['-e'] * len(operations)
            e_operations = [item for sublist in zip(e_list, operations) for item in sublist]
            with open(filename+'.new', 'w') as f:
                subprocess.call(["sed"] + e_operations + [filename], stdout=f)
            subprocess.call(["mv", '-f', filename, filename+'.old'])
            subprocess.call(['mv', '-f', filename+'.new', filename])
            self.check_file_not_empty(filename)
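For comparison, here is a minimal sketch of a rewrite that never removes the original file. It is not the hostcfgd implementation and not necessarily what the mitigation does: `atomic_rewrite` and its `transform` callback are hypothetical names, and the sketch assumes Python 3 and that the temporary file can be created in the same directory as the target, so that `os.replace` is a same-filesystem rename.

```python
import os
import shutil
import tempfile


def atomic_rewrite(path, transform):
    """Replace `path` with transform(old_text) without ever removing `path`."""
    with open(path) as src:
        new_text = transform(src.read())
    if not new_text:
        raise ValueError("refusing to replace {} with empty content".format(path))

    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so the final rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=os.path.basename(path) + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())          # push the data to disk before renaming
        shutil.copymode(path, tmp_path)     # keep the original permission bits
        shutil.copy2(path, path + ".old")   # the backup is a copy, not a move
        os.replace(tmp_path, path)          # atomic rename on POSIX
    except BaseException:
        # Leave no stray temp file behind if anything above failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this ordering an interruption at any point leaves `path` holding either the complete old content or the complete new content. The `.old` copy itself is not written atomically, so a kill during `copy2` could still truncate the backup, but not the live file.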
https://github.com/sonic-net/sonic-host-services/blob/master/scripts/hostcfgd#L596
    def check_file_not_empty(self, filename):
        exists = os.path.exists(filename)
        if not exists:
            syslog.syslog(syslog.LOG_ERR, "file size check failed: {} is missing".format(filename))
            return
        size = os.path.getsize(filename)
        if size == 0:
            syslog.syslog(syslog.LOG_ERR, "file size check failed: {} is empty, file corrupted".format(filename))
            return
        syslog.syslog(syslog.LOG_INFO, "file size check pass: {} size is ({}) bytes".format(filename, size))
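The check above only reports an empty file after the fact. As a rough sketch of the missing graceful shutdown, SIGTERM could be held back while a file update is in progress and delivered again once it completes. `defer_sigterm` is a hypothetical helper, not something present in hostcfgd. It assumes it is used from the main thread, since Python only allows signal handlers to be installed there, and it does not protect child processes such as `sed` or `mv` if they are signalled directly.

```python
import contextlib
import os
import signal


@contextlib.contextmanager
def defer_sigterm():
    """Hold back SIGTERM until the wrapped block has finished."""
    pending = []
    # Temporarily record the signal instead of acting on it.
    previous = signal.signal(
        signal.SIGTERM, lambda signum, frame: pending.append(signum)
    )
    try:
        yield
    finally:
        # Restore the previous disposition, then deliver the signal again
        # now that the files are in a consistent state.
        signal.signal(signal.SIGTERM, previous)
        if pending:
            os.kill(os.getpid(), signal.SIGTERM)
```

A caller would wrap each update, for example `with defer_sigterm(): cfg.modify_single_file(...)`, so that termination happens between updates rather than in the middle of one.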
The mitigation attempt: sonic-net/sonic-host-services#36
Steps to reproduce the issue:
- Copy hostcfgd module
root@sonic:/home/admin# cp -fv /usr/local/bin/hostcfgd ./hostcfg
- Run script
    #!/usr/bin/env python

    from hostcfg import AaaCfg
    from hostcfg import ETC_PAMD_SSHD
    from hostcfg import ETC_PAMD_LOGIN

    cfg = AaaCfg()
    cond = True

    # Toggle the PAM include back and forth until interrupted
    while True:
        if cond:
            cfg.modify_single_file(ETC_PAMD_SSHD, [ "/^@include/s/common-auth-sonic$/common-auth/" ])
            cfg.modify_single_file(ETC_PAMD_LOGIN, [ "/^@include/s/common-auth-sonic$/common-auth/" ])
            cond = False
        else:
            cfg.modify_single_file(ETC_PAMD_SSHD, [ "/^@include/s/common-auth$/common-auth-sonic/" ])
            cfg.modify_single_file(ETC_PAMD_LOGIN, [ "/^@include/s/common-auth$/common-auth-sonic/" ])
            cond = True
- Press Ctrl+C
2024 Jul 30 14:57:19.285135 sonic ERR test.py: file size check failed: /etc/pam.d/sshd is empty, file corrupted
2024 Jul 30 14:57:19.289114 sonic ERR test.py: file size check failed: /etc/pam.d/login is empty, file corrupted
2024 Jul 30 14:57:19.292988 sonic ERR test.py: file size check failed: /etc/pam.d/sshd is empty, file corrupted
2024 Jul 30 14:57:19.297005 sonic ERR test.py: file size check failed: /etc/pam.d/login is empty, file corrupted
2024 Jul 30 14:57:19.301031 sonic ERR test.py: file size check failed: /etc/pam.d/sshd is empty, file corrupted
Describe the results you received:
Apr 4 05:15:42.112943 sonic INFO hostcfgd: file size check pass: /etc/pam.d/sshd size is (2139) bytes
Apr 4 05:15:42.123196 sonic ERR hostcfgd: file size check failed: /etc/pam.d/login is empty, file corrupted
Apr 4 05:15:42.150275 sonic INFO hostcfgd: file size check pass: /etc/nsswitch.conf size is (494) bytes
Apr 4 05:15:42.170420 sonic INFO hostcfgd: file size check pass: /etc/nsswitch.conf size is (494) bytes
Describe the results you expected:
No errors are expected
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):