[warm-upgrade][202012] Slow Celestica platform init in rc.local causes lacp-teardown #10152

@vaibhavhd

Description

202012 Warm upgrade failure on Dx010 TOR.

Steps to reproduce the issue:

  1. Warm upgrade a 6100 device from any older image to a new 202012 image.
  2. If a test is running, it will catch the failure. Otherwise, to catch it manually, check syslog for signs of LAG flaps.
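
For the manual check in step 2, a minimal sketch of a syslog scan might look like the following. The regexes and the sample lines are illustrative assumptions, not the exact messages a SONiC image emits; adjust them to whatever your device actually logs when a PortChannel goes down.

```python
import re
from typing import Iterable, List, Sequence

# ASSUMPTION: these patterns are hypothetical guesses at what a LAG flap
# looks like in syslog. They are not taken from the issue or from SONiC docs.
HYPOTHETICAL_FLAP_PATTERNS = (
    r"PortChannel\d+.*\bdown\b",
    r"\blacp\b.*\b(expired|teardown)\b",
)


def find_lag_flap_lines(
    lines: Iterable[str],
    patterns: Sequence[str] = HYPOTHETICAL_FLAP_PATTERNS,
) -> List[str]:
    """Return every line that matches at least one pattern (case-insensitive)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [ln.rstrip("\n") for ln in lines if any(c.search(ln) for c in compiled)]


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "Feb  1 10:00:01 tor INFO example: PortChannel0001 oper state went down",
        "Feb  1 10:00:02 tor INFO example: interface Ethernet0 is up",
    ]
    for hit in find_lag_flap_lines(sample):
        print(hit)
```

To run it against a real log, pass an open file object, for example `find_lag_flap_lines(open("/var/log/syslog"))`.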

Describe the results you received:

We are hitting issues when warm-upgrading Celestica devices running SONiC from any image to a 202012 branch image.

Short description of the issue:

  1. Warm upgrade fails on TOR due to LAG(s) flap.
  2. LAGs flap due to 90s lacp-session timeout, and lacp-teardown is initiated from the T1 neighbors.
  3. The LACP session takes more than 90s to resume because the reboot process takes longer than before in the 202012 warm bootup path.
  4. When investigating this, I found that:
    a. The degradation is seen specifically in the first-boot steps in rc.local: installing and enabling platform-modules takes a long time in the 202012 branch.
    b. For comparison, time taken for rc.local processing:
      i. Same-image warm reboot: ~3s.
      ii. Cross-branch or in-branch warm “upgrades” to a 202012 image: ~30s.
    c. This difference in the boot-up path is the degradation in the 202012 upgrade scenario, and it causes points 1 and 2 above.
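
The timing in the list above can be sketched as a simple budget check. The 90s figure matches the standard LACP long timeout (three missed LACPDUs at the 30s slow rate), and the ~3s and ~30s rc.local figures come from this report. The baseline control-plane gap used in the example is a hypothetical number, since the total warm-reboot time is not stated here.

```python
# LACP slow-rate timing: a partner declares the session expired after
# missing three LACPDUs sent at the 30 s slow periodic interval.
LACP_SLOW_PERIODIC_S = 30
LACP_TIMEOUT_MULTIPLIER = 3
LACP_LONG_TIMEOUT_S = LACP_SLOW_PERIODIC_S * LACP_TIMEOUT_MULTIPLIER

# Approximate rc.local processing times reported in this issue.
RC_LOCAL_SAME_IMAGE_S = 3
RC_LOCAL_UPGRADE_S = 30


def gap_after_upgrade(baseline_gap_s: float) -> float:
    """Control-plane gap once the extra rc.local time is added to a baseline."""
    return baseline_gap_s + (RC_LOCAL_UPGRADE_S - RC_LOCAL_SAME_IMAGE_S)


def lacp_margin(control_plane_gap_s: float,
                timeout_s: float = LACP_LONG_TIMEOUT_S) -> float:
    """Seconds left before the neighbor tears down the LAG (negative = flap)."""
    return timeout_s - control_plane_gap_s


if __name__ == "__main__":
    # ASSUMPTION: 70 s is a made-up same-image gap, used only for illustration.
    hypothetical_baseline_s = 70
    upgraded = gap_after_upgrade(hypothetical_baseline_s)
    print("same-image margin:", lacp_margin(hypothetical_baseline_s), "s")
    print("upgrade margin:", lacp_margin(upgraded), "s")
```

The point of the sketch is that the extra ~27s in rc.local alone can turn a positive margin into a negative one, which is consistent with the T1 neighbors initiating the teardown.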

Note that this is specific to the 202012 branch: I tried a 201811 in-branch upgrade and saw that rc.local processing time is much shorter.

This is a blocker for warm upgrades, hence we need a faster resolution for this.

Questions:

  1. Why does platform initialization (enable platform-modules-dx010) take longer in 202012 than in 201811?
  2. Can we reduce this time? Is it possible to defer some of the operations in this step until later (after warmboot completes)?
  3. There is an error seen when installing the Python2 package: a) do we need this installation, and b) why is the ERROR seen?

Describe the results you expected:

No LAG should flap after a warm reboot.

Unblocked, shorter rc.local processing.

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

dx010-202012-54-54-warm.txt
dx010-202012-53-54-warm.txt
