[202012] Fix race condition between networking service and interface-config service #142

Closed
Junchao-Mellanox wants to merge 1 commit into 202012 from cp-fix-race-cond

Conversation

@Junchao-Mellanox
Owner

Why I did it

This PR fixes a bug where the management port eth0 may lose its IP address even though the user has configured a static IP for eth0. The issue is not always reproducible; the flow that reproduces it is:

  1. Systemd starts the networking service, which applies a DHCP-based configuration and is assigned an IP address by DHCP.
  2. Systemd starts the interface-config service, which depends on the networking service.
  3. The interface-config service runs “ifdown --force eth0” (check line), but the networking service is still running, so the command fails with “error: Another instance of this program is already running.” This error is printed by the ifupdown2 library, which is the main process of the networking service. As a result, ifdown has no effect here and the IP address of eth0 is not brought down.
  4. The interface-config service updates /etc/network/interfaces to the static configuration.
  5. The interface-config service runs “systemctl restart networking”. This command kills the previous networking-related processes (log: networking.service: Main process exited, code=killed, status=15/TERM) and tries to reconfigure the IP address with the static configuration. However, it detects that the configured IP and the existing IP are the same, so it does not actually program the IP address into the kernel. Hence the IP address is still the one obtained from DHCP. (This could be a bug in ifupdown2: the previous IP came from DHCP and the new IP is static, yet it treats them as the same instead of reconfiguring the IP.)
  6. When the DHCP lease expires, the kernel removes the IP address from eth0 and the issue reproduces.

The issue is not always reproducible because the networking service usually finishes quickly, so step 3 is not hit.
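One way to spot the state described in steps 5 and 6 is to check whether the address on eth0 still has a finite lifetime. The following is a minimal diagnostic sketch, not part of this PR: the helper name `addr_is_dynamic` is hypothetical, and it assumes that iproute2 marks a DHCP-leased address with the `dynamic` flag in `ip -o -4 addr show` output, while a statically configured address shows `valid_lft forever`.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): succeeds if any IPv4 address on the
# given interface carries the "dynamic" flag, i.e. has a finite lifetime.
addr_is_dynamic() {
    local dev="$1"
    # -o prints one record per line, so the flags and the lifetimes for an
    # address stay together; grep -w matches "dynamic" as a whole word.
    ip -o -4 addr show dev "$dev" 2>/dev/null | grep -qw dynamic
}

# Usage (hypothetical):
#   if addr_is_dynamic eth0; then
#       echo "eth0 still holds a leased address"
#   fi
```

If this reports a dynamic address after the static configuration has been applied, the address is likely still the DHCP one and would disappear when the lease expires.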

How I did it

Check the networking service state before running “ifdown --force eth0”, and wait for it to finish if it is still activating.
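The wait described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the function name `wait_while_activating`, the 30-second default timeout, and the 1-second polling interval are all hypothetical. It relies on `systemctl is-active`, which prints the current unit state (for example `active`, `activating`, `inactive`, or `failed`).

```shell
#!/bin/bash
# Hypothetical helper (names and timeout are assumptions, not from the PR):
# poll a systemd unit until it is no longer in the "activating" state.
# Returns 0 once the unit has left "activating", 1 if the timeout expires.
wait_while_activating() {
    local unit="$1"
    local timeout="${2:-30}"   # seconds; assumed default
    local waited=0
    local state

    while [ "$waited" -lt "$timeout" ]; do
        # is-active prints the state; its exit code is non-zero for any
        # state other than "active", so only the printed text is used here.
        state="$(systemctl is-active "$unit" 2>/dev/null)"
        if [ "$state" != "activating" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Usage (hypothetical):
#   wait_while_activating networking 30 || echo "networking still activating"
#   ifdown --force eth0
```

Polling with a bounded timeout keeps the interface-config service from blocking indefinitely if the networking service never leaves the activating state. What to do on timeout is a separate design decision that this sketch leaves to the caller.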

How to verify it

Manual test.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

…rvice (sonic-net#10573)

Conflicts:
	files/image_config/interfaces/interfaces-config.sh
Junchao-Mellanox requested a review from keboliu on May 6, 2022 01:42
Junchao-Mellanox pushed a commit that referenced this pull request Oct 25, 2022
79edf66 Longxiang Lyu Wed Aug 17 08:12:37 2022 +0800 Fix azure pipeline (#118)
8e0f2c6 Longxiang Lyu Wed Aug 17 08:36:07 2022 +0800 Update linkmgr health after getting default route update (#117)
b14ffb8 Jing Zhang Wed Aug 17 15:44:37 2022 -0700 [active-active] post mux metrics events (#123)
a30dbb3 Jing Zhang Thu Aug 18 18:16:04 2022 -0700 Update handleMuxConfigNotification logic (#125)
e14aaba Jing Zhang Tue Aug 23 10:02:17 2022 -0700 [active-active] Remove unnecessary mux wait timeout logs (#122)
cc83717 Longxiang Lyu Fri Sep 2 02:17:53 2022 +0800 Fix mux config (#128)
5429281 Mai Bui Thu Sep 1 17:44:04 2022 -0400 [linkmgrd] Replace memset function in link_prober (#126)
b5aaec1 Jing Zhang Fri Sep 9 14:01:03 2022 -0700 [active-active] shutdown link prober when starting as isolated (#130)
75f02cf Jing Zhang Tue Sep 13 10:34:32 2022 -0700 [active-standby] update warmboot reconciliation logic (#129)
a5a9f90 Hua Liu Fri Sep 16 09:54:32 2022 +0800 Install libyang to azure pipeline (#132)
6fe4f0f Jing Zhang Tue Sep 20 10:10:16 2022 -0700 [Active-Active] flaky LinkmgrdBootupSequence unit tests (#134)
ea68e8c Jing Zhang Wed Sep 21 10:52:18 2022 -0700 Post switchover reasons to STATE DB (#131)
60c35b5 Jing Zhang Thu Sep 22 13:00:41 2022 -0700 [Active-Active] server side admin forwarding state sync up (#133)
08e1be5 Jing Zhang Mon Sep 26 10:59:27 2022 -0700 [Active-Active] avoid being stuck in unknown after process init (#136)
2579988 Jing Zhang Mon Oct 3 09:40:55 2022 -0700 [Active-Standby] fix syslog flood caused by unkown -> standby switchovers (#137)
7e9f670 Jing Zhang Wed Oct 5 10:03:45 2022 -0700 [Active-Active] Retry config mux mode standby (#139)
23feb3b Jing Zhang Wed Oct 5 15:22:58 2022 -0700 [Active-Active] Post link prober stats to state db (#140)
e650098 Jing Zhang Fri Oct 7 15:27:17 2022 -0700 [Active-Active] Update default route shutdown heartbeat logic (#141)
d0653e7 Jing Zhang Tue Oct 11 10:22:02 2022 -0700 [Active-Standby] avoid posting mux metrics event when receiving unsolicited mux state notification (#142)

dcf6460 Longxiang Lyu Fri Oct 21 12:15:42 2022 +0800 [active-active] Add support to send/handle mux probe request (#147)
fdf42ed Longxiang Lyu Fri Oct 21 10:34:47 2022 +0800 Fix link prober state event report twice issue (#149)
5fd19a3 Longxiang Lyu Mon Oct 17 09:20:27 2022 +0800 [active-active] Fix config reload (#145)

sign-off: Jing Zhang [email protected]
Junchao-Mellanox pushed a commit that referenced this pull request Nov 4, 2022
…rm-common] advance submodule head (sonic-net#12492)

linkmgrd:
* d7d6635 2022-10-21 | Fix link prober state event report twice issue (#149) (HEAD -> 202205) [Longxiang Lyu]
* 0ef3296 2022-10-21 | [active-active] Add support to send/handle mux probe request (#147) [Longxiang Lyu]
* a66fa34 2022-10-17 | [active-active] Fix config reload (#145) [Longxiang Lyu]
* 7e1c820 2022-10-11 | [Active-Standby] avoid posting mux metrics event when receiving unsolicited mux state notification  (#142) [Jing Zhang]
* 237cfd2 2022-10-07 | [Active-Active] Update default route shutdown heartbeat logic (#141) [Jing Zhang]

utilities:
* 415d30e 2022-10-23 | [techsupport] Adding FRR EVPN dumps (sonic-net#2442) (HEAD -> 202205) [Sudharsan Dhamal Gopalarathnam]
* b3ffe45 2022-10-21 | [show][muxcable] add support for show mux firmware version all (sonic-net#2441) [vdahiya12]
* 7d68534 2022-10-19 | [app_ext] [auto-ts] Add available_mem_threshold option (sonic-net#2423) [Vivek]
* 52b9c16 2022-10-07 | [muxcable][config] add CLI support for mux mode detach (sonic-net#2425) [Jing Zhang]
* 14646ff 2022-10-10 | [show priority-group drop counters] Remove backup with cached PG drop counters after 'config reload' (sonic-net#2386) [Andriy Yurkiv]
* dffcc53 2022-10-11 | Add a subcommand to display a hexdump of transceiver EEPROM page (sonic-net#2379) [mihirpat1]
* 86175c2 2022-10-17 | [chassis]Add fabric counter cli commands (sonic-net#1860) [Maxime Lorrillere]

swss:
* 6fe0afd 2022-10-25 | [portsorch] remove port OID from saiOidToAlias map on port deletion (sonic-net#2483) (HEAD -> 202205, github/202205) [Stepan Blyshchak]
* 7290d66 2022-10-07 | [vlanmgr] Disable `arp_evict_nocarrier` for vlan host intf (sonic-net#2469) [Longxiang Lyu]
* d074001 2022-10-05 | [chassis][voq]Collect counters for fabric links (sonic-net#1944) [Maxime Lorrillere]
* 3a0353a 2022-10-18 | [counters][202205] Improve performance by polling only configured ports buffer queue/pg counters (sonic-net#2474) [Vadym Hlushko]
* 2feb39d 2022-10-14 | [202205] [crm] Fix issue with continues EXCEEDED and CLEAR logs for ACL group/table counters (sonic-net#2482) [Volodymyr Samotiy]

sairedis:
* 326b630 2022-10-21 | [gbsyncd] Add asic db prefix for channel NOTIFICATIONS (sonic-net#1129) (HEAD -> 202205) [Junhua Zhai]

platform-daemon:
* 6dbda9b 2022-10-25 | [ycabled] fix no port/state returned by grpc server (sonic-net#308) (HEAD -> 202205) [vdahiya12]
* 3d1228a 2022-10-20 | Fix xcvrd to support 400G ZR optic (sonic-net#293) [Bohan Yang]

platform-common:
* c04d710 2022-09-29 | Read CMIS data path state duration (sonic-net#312) (HEAD -> 202205) [Bohan Yang]

Signed-off-by: Ying Xie <[email protected]>

Junchao-Mellanox pushed a commit that referenced this pull request Jan 12, 2023
commit aa8fe6deff466909909430f00598d2dba9490904 (HEAD -> 202012, origin/202012)
Author: Jing Zhang [email protected]
Date: Tue Oct 11 10:22:02 2022 -0700

[Active-Standby] avoid posting mux metrics event when receiving unsolicited mux state notification  (#142)

Description of PR
Summary:
Fixes # (issue)

This PR is to fix incorrect mux metrics timestamps caused by unsolicited mux state notification.

Sign-off: Jing Zhang [email protected]
sign-off: Jing Zhang [email protected]
Junchao-Mellanox pushed a commit that referenced this pull request Jul 25, 2023
…ically (sonic-net#15886)

src/sonic-restapi

* a69ba06 - (HEAD -> 202205, origin/master, origin/HEAD, origin/202205, master) [actions] Support Semgrep by Github Actions (#144) (3 weeks ago) [Mai Bui]
* 6b242a3 - [Ci] Upgrade python 2 to python 3 (#145) (3 weeks ago) [xumia]
* 1c50caa - prevent downcasting of 64-bit integer (#142) (2 months ago) [Mai Bui]
* de26989 - Use -race detector when building and testing (#141) (3 months ago) [Lawrence Lee]
* 9fe2eff - [go] Update Go to version 1.15 (#140) (3 months ago) [Lawrence Lee]
Junchao-Mellanox pushed a commit that referenced this pull request Aug 20, 2024
…utomatically (sonic-net#19897)

#### Why I did it
src/sonic-host-services
```
* 39e31a9 - (HEAD -> master, origin/master, origin/HEAD) Fix modify_single_file generate empty file issue (#145) (26 hours ago) [Hua Liu]
* 1891b0a - Add dbus service to read file stat (#142) (2 days ago) [isabelmsft]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit that referenced this pull request Dec 11, 2024
…omatically (sonic-net#20409)

#### Why I did it
src/sonic-mgmt-common
```
* b91a4df - (HEAD -> master, origin/master, origin/HEAD) PortChannel Interface Static Support - OpenConfig Yang (#142) (9 hours ago) [Satoru Shinohara]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit that referenced this pull request Mar 7, 2025
…atically (sonic-net#786)

#### Why I did it
src/sonic-utilities
```
* 0f9ff3c7 - (HEAD -> 202412, origin/202412) [code sync] Merge code from sonic-net/sonic-utilities:202411 to 202412 (#142) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit that referenced this pull request Feb 25, 2026
…ically (sonic-net#25551)

#### Why I did it
src/sonic-dash-ha
```
* a6cf697 - (HEAD -> master, origin/master, origin/HEAD) update dash-api submodule (3 hours ago) [Jing Zhang]
* 64022eb - Change convert_pb_to_json to parse proto encoded value from binary input (#142) (9 hours ago) [yue-fred-gao]
* 53fb250 - [ci] fix build error and save binaries (#144) (28 hours ago) [Jing Zhang]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit that referenced this pull request Mar 18, 2026
…ically (sonic-net#25742)

#### Why I did it
src/sonic-dash-ha
```
* 8f9893d - (HEAD -> 202511, origin/master, origin/HEAD, origin/202511, master) Create bfd sessions only to NPU participating ha-set (#143) (9 days ago) [yue-fred-gao]
* a6cf697 - update dash-api submodule (10 days ago) [Jing Zhang]
* 64022eb - Change convert_pb_to_json to parse proto encoded value from binary input (#142) (10 days ago) [yue-fred-gao]
* 53fb250 - [ci] fix build error and save binaries (#144) (11 days ago) [Jing Zhang]
* d01ed94 - Add .github/copilot-instructions.md for AI-assisted development (#140) (2 weeks ago) [rustiqly]
* 2b6b37c - Write DASH_DPU_RESET_INFO_TABLE when dpu midplane or control plane down (#137) (4 weeks ago) [yue-fred-gao]
* 9b3c0bf - Add bfd rewrite on pmon change. (#136) (4 weeks ago) [dypet]
* af44396 - [build] Disable debian helper auto install for cargo project. (#135) (5 weeks ago) [Liu Shilong]
* 17e2e0b - Implement bfd pinned state (#134) (5 weeks ago) [yue-fred-gao]
* c04969e - switch to using libboost1.83 (#133) (6 weeks ago) [yijingyan2]
* b38d8fb - Change to DBConnector::clone_timeout_async (#132) (3 months ago) [yue-fred-gao]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Junchao-Mellanox pushed a commit that referenced this pull request Mar 18, 2026
…02511

8f9893d (HEAD -> master, origin/master, origin/HEAD) Create bfd sessions only to NPU participating ha-set (#143)
a6cf697 update dash-api submodule
64022eb Change convert_pb_to_json to parse proto encoded value from binary input (#142)
53fb250 [ci] fix build error and save binaries  (#144)
d01ed94 Add .github/copilot-instructions.md for AI-assisted development (#140)
2b6b37c Write DASH_DPU_RESET_INFO_TABLE when dpu midplane or control plane down (#137)
9b3c0bf Add bfd rewrite on pmon change. (#136)
af44396 [build] Disable debian helper auto install for cargo project. (#135)
17e2e0b Implement bfd pinned state (#134)
c04969e switch to using libboost1.83 (#133)
b38d8fb Change to DBConnector::clone_timeout_async (#132)

sign-off: Jing Zhang [email protected]