Added support for Pensando-elba platform for trixie build #25518
yxieca merged 6 commits into sonic-net:master
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Sahil Chaudhari <[email protected]>
Force-pushed from 59dd175 to b672219
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
files/dsc/dpu.init
Outdated
| # Mount /var/log as tmpfs if not already mounted (required for monit filesystem check) | ||
| if ! mountpoint -q /var/log; then | ||
| mount -t tmpfs -o size=100M,mode=0755 tmpfs /var/log | ||
| # Recreate essential log directories | ||
| mkdir -p /var/log/journal | ||
| mkdir -p /var/log/swss | ||
| mkdir -p /var/log/sonic | ||
| fi | ||
| systemd-tmpfiles --create --prefix /var/log/journal | ||
| systemctl restart systemd-journald |
Why is this being added here? Shouldn't this be handled from initramfs?
Fixed it with the varlog_size cmdline variable in platform.conf.
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Related PR - sonic-net/sonic-linux-kernel#538 |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
vvolam
left a comment
LGTM. Please address Saikrishna's comment.
Signed-off-by: Sahil Chaudhari <[email protected]>
Force-pushed from 520630a to e8fdbe5
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @saiarcot895, please review the updates from Sahil and sign off when you get a chance. Hi @vmittal-msft, please help merge this to 202511 once all checks pass. |
Pull request overview
This PR adds support for the Pensando-elba platform on Debian trixie with the 6.12 kernel. The changes primarily involve platform driver updates to support the new kernel version, hardware revision detection for different board variants (Mtfuji V1 and V2), and improvements to DPU initialization and health monitoring.
Changes:
- Updated kernel drivers (ionic, mdev, pciesvc) for 6.12 kernel API compatibility
- Added board revision detection and hardware-specific sensor mappings for Mtfuji V1 and V2 variants
- Refactored DPU health monitoring and initialization scripts
- Modified systemd service dependencies for Pensando platform
Reviewed changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/watchdog.py | Changed watchdog device path from watchdog1 to watchdog0 for trixie |
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/thermal.py | Added board revision detection and separate sensor mappings for Mtfuji V1 and V2 |
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/sensor.py | Added voltage/current sensor mappings per board revision |
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/pcie.py | New PCIe utility implementation for device discovery |
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/helper.py | Added get_board_rev() and get_slot_id() helper methods |
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/component.py | Fixed firmware version parsing to handle new image naming |
| platform/pensando/sonic-platform-modules-dpu/sonic_platform/chassis.py | Updated sensor counts per board revision and added slot ID masking |
| platform/pensando/sonic-platform-modules-dpu/setup.py | Pinned grpcio-tools version to <=1.66.2 |
| platform/pensando/sonic-platform-modules-dpu/dpu/utils/fetch_dpu_status | Refactored health monitoring with data plane state management |
| platform/pensando/sonic-platform-modules-dpu/dpu/utils/dpu_pensando_util.py | Added function to disable unused containers |
| platform/pensando/sonic-platform-modules-dpu/dpu/utils/dpu_db_util.py | Removed duplicate code and simplified helper functions |
| files/dsc/dpu.init | Added SSH key regeneration and robust DHCP handling with timeout logic |
| platform/pensando/platform.conf | Added varlog_size boot parameter and SSH host key cleanup |
| platform/pensando/dsc-drivers/src/drivers/linux/pciesvc/* | Implemented unity build approach for kexec compatibility on 6.12 kernel |
| platform/pensando/dsc-drivers/src/drivers/linux/mdev/mdev_drv.c | Updated mdev_remove signature and class_create call for 6.12 API |
| platform/pensando/dsc-drivers/src/drivers/linux/eth/ionic/* | Updated ethtool and tracing APIs for 6.12 kernel |
| platform/pensando/dsc-drivers/debian/* | Updated module installation paths for trixie build |
| platform/pensando/docker-syncd-pensando/* | Changed base image from bullseye to bookworm |
| files/build_templates/sonic_debian_extension.j2 | Added Pensando-specific systemd service masking and dependencies |
| device/pensando/arm64-elba-asic-flash128-r0/plugins/ssd_util.py | Added SSD utility using eMMC backend |
| build_debian.sh | Updated device tree paths for new kernel naming convention |
| self._api_helper = APIHelper() | ||
| self.index = thermal_index + 1 | ||
| self.board_id = g_board_id | ||
| self.board_rev = g_board_rev |
Missing global declaration for g_board_rev. Line 72 accesses the global variable g_board_rev, but it's not declared as global in the init method (only g_board_id is declared on line 67 in the old code). Since g_board_rev is set as a module-level global in _thermals_available(), it needs to be declared as global here to be accessed correctly. Add 'global g_board_rev' after the existing global g_board_id declaration.
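For context, here is a minimal sketch of the Python scoping rule this comment turns on. A `global` statement is required only when a function assigns to a module-level name; a plain read resolves without it. The class, the helper `detect_board_rev`, and the values below are simplified, hypothetical stand-ins, not the actual thermal.py code.

```python
# Hypothetical stand-ins for the module-level state in thermal.py.
g_board_id = 130
g_board_rev = None


def detect_board_rev(rev):
    # Assigning to a module-level name needs `global`; without it,
    # Python would bind a new local variable instead.
    global g_board_rev
    g_board_rev = rev


class Thermal:
    def __init__(self, thermal_index):
        self.index = thermal_index + 1
        # Plain reads resolve to the module-level names without `global`.
        self.board_id = g_board_id
        self.board_rev = g_board_rev


detect_board_rev(2)
thermal = Thermal(0)
```

Adding `global g_board_rev` in `__init__` is harmless and can document intent, but the read works either way.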
| #try: | ||
| # cmd = f'sonic-cfggen -a "{{\\"INTERFACE\\": {{\\"Ethernet0\\": {{}},\\"Ethernet0|18.{slot_id}.202.1/31\\": {{}}}}}}" --write-to-db' | ||
| # run_cmd(cmd) | ||
| #except Exception as e: | ||
| # log_err("failed to set Ethernet0 ip due to {}".format(e)) |
Commented-out code should be removed rather than left in the codebase. If this code might be needed in the future, it should be documented in version control history or added as a TODO comment explaining why it's disabled. Leaving commented code reduces maintainability and creates confusion about whether this functionality is intentional or incomplete.
| #try: | |
| # cmd = f'sonic-cfggen -a "{{\\"INTERFACE\\": {{\\"Ethernet0\\": {{}},\\"Ethernet0|18.{slot_id}.202.1/31\\": {{}}}}}}" --write-to-db' | |
| # run_cmd(cmd) | |
| #except Exception as e: | |
| # log_err("failed to set Ethernet0 ip due to {}".format(e)) | |
| # TODO: Add Ethernet0 IP configuration here if required for this platform. |
| if self.board_rev == self._api_helper.mtfuji_rev_v2: | ||
| self.voltage_sensor_mapping = VOLTAGE_SENSOR_MAPPING_V2 |
Use elif instead of two separate if statements. Line 89 checks a condition on board_rev that is mutually exclusive with the check in the preceding if statement, yet the change uses a plain if. To stay consistent with that pattern and make the mutual exclusivity clear, consider using elif on line 89.
| except: | ||
| return -1 |
Bare except clause catches all exceptions without specifying the exception type. This can mask programming errors and make debugging difficult. Consider catching specific exceptions (e.g., ValueError, IOError) or at minimum use 'except Exception:' to allow KeyboardInterrupt and SystemExit to propagate.
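A minimal sketch of the narrower handler suggested here, assuming the guarded code reads a numeric value from a file. The function name `read_sensor_value` and the choice of exceptions are illustrative assumptions; the real method may need a different set.

```python
def read_sensor_value(path):
    """Return the integer stored in a file, or -1 if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # OSError: the file is missing or unreadable.
        # ValueError: the content is not a valid integer.
        # KeyboardInterrupt and SystemExit are not caught and propagate.
        return -1
```

Catching only the expected failure modes keeps the `-1` fallback while letting unexpected errors (for example a `NameError` from a typo) surface during testing.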
| if self.board_rev == self._api_helper.mtfuji_rev_v1: | ||
| self.sensor_mapping = self.SENSOR_MAPPING_MTFUJI_V1 | ||
| if self.board_rev == self._api_helper.mtfuji_rev_v2: | ||
| self.sensor_mapping = self.SENSOR_MAPPING_MTFUJI_V2 |
Use elif instead of two separate if statements. Lines 80 and 82 both check conditions on board_rev, but they are mutually exclusive. Using elif makes this intention clearer and slightly more efficient by avoiding the second condition check when the first is true.
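The suggested shape can be sketched as follows. The revision constants and mapping contents are hypothetical placeholders; only the if/elif structure is the point.

```python
# Hypothetical revision codes and mappings, for illustration only.
MTFUJI_REV_V1 = 1
MTFUJI_REV_V2 = 2
SENSOR_MAPPING_MTFUJI_V1 = {1: "sensor_a"}
SENSOR_MAPPING_MTFUJI_V2 = {1: "sensor_a", 2: "sensor_b"}


def select_sensor_mapping(board_rev, default=None):
    """Pick the sensor mapping for a board revision."""
    if board_rev == MTFUJI_REV_V1:
        return SENSOR_MAPPING_MTFUJI_V1
    elif board_rev == MTFUJI_REV_V2:
        # Evaluated only when the V1 check fails, which makes the
        # mutual exclusivity of the two branches explicit.
        return SENSOR_MAPPING_MTFUJI_V2
    # Unknown revision: leave the decision to the caller.
    return default
```

An explicit fallback also makes it visible what happens for a revision that matches neither branch.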
| except: | ||
| return "N/A" |
Bare except clause catches all exceptions without specifying the exception type. This can mask programming errors and make debugging difficult. Consider catching specific exceptions (e.g., ValueError, IOError) or at minimum use 'except Exception:' to allow KeyboardInterrupt and SystemExit to propagate.
| def config_setup(): | ||
| try: | ||
| from sonic_platform.chassis import Chassis | ||
| slot_id = Chassis().get_my_slot() | ||
| except Exception as e: | ||
| log_err("failed to get slot id due to {}".format(e)) | ||
|
|
||
| try: | ||
| cmd = f'sonic-cfggen -a "{{\\"INTERFACE\\": {{\\"Ethernet0\\": {{}},\\"Ethernet0|18.{slot_id}.202.1/31\\": {{}}}}}}" --write-to-db' | ||
| run_cmd(cmd) | ||
| except Exception as e: | ||
| log_err("failed to set Ethernet0 ip due to {}".format(e)) | ||
| #try: | ||
| # cmd = f'sonic-cfggen -a "{{\\"INTERFACE\\": {{\\"Ethernet0\\": {{}},\\"Ethernet0|18.{slot_id}.202.1/31\\": {{}}}}}}" --write-to-db' | ||
| # run_cmd(cmd) | ||
| #except Exception as e: | ||
| # log_err("failed to set Ethernet0 ip due to {}".format(e)) | ||
|
|
||
| setup_platform_components_json(slot_id) | ||
| disable_unused_containers() |
Variable slot_id may be undefined if an exception is raised on line 107. If Chassis().get_my_slot() raises, slot_id is never assigned, yet it is used on line 117 in the setup_platform_components_json call. Initialize slot_id before the try block (e.g., slot_id = -1) or handle the undefined case explicitly.
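One way to apply this suggestion, as a sketch. The slot lookup and the setup step are passed in as callables here purely so the example is self-contained; they stand in for Chassis().get_my_slot() and setup_platform_components_json. Returning early on failure is an assumed design choice, not something the PR specifies.

```python
def config_setup(get_slot, setup_components, log_err=print):
    """Run platform setup, tolerating a failed slot lookup."""
    slot_id = -1  # sentinel so the name is always bound
    try:
        slot_id = get_slot()
    except Exception as e:
        log_err("failed to get slot id due to {}".format(e))

    if slot_id < 0:
        # Assumption: without a valid slot there is nothing useful to set up.
        return False

    setup_components(slot_id)
    return True
```

Whether to return early or continue with the sentinel value depends on what the downstream setup can tolerate.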
| if self.board_rev == self._api_helper.mtfuji_rev_v2: | ||
| self.current_sensor_mapping = CURRENT_SENSOR_MAPPING_V2 |
Use elif instead of two separate if statements. Line 187 checks a condition on board_rev that is mutually exclusive with the check on line 185. Using elif makes the mutual exclusivity clear and avoids the unnecessary second condition check.
| mdev: | ||
| @echo "===> Building MDEV driver " | ||
| $(MAKE) -C $(KSRC) V=1 M=$(KMOD_OUT_DIR) src=$(KMOD_SRC_DIR)/mdev $(KOPT) | ||
| #$(MAKE) -C $(KSRC) V=1 M=$(KMOD_SRC_DIR)/mdev KBUILD_OUTPUT=$(KMOD_OUT_DIR) $(KOPT) |
Commented-out code should be removed rather than left in the Makefile. The commented line on 102 appears to be an old build command with KBUILD_OUTPUT that has been replaced. If this might be needed in the future, document it in version control history or add a comment explaining why it's disabled.
| #$(MAKE) -C $(KSRC) V=1 M=$(KMOD_SRC_DIR)/mdev KBUILD_OUTPUT=$(KMOD_OUT_DIR) $(KOPT) |
| function generate_ssh_host_keys() | ||
| { | ||
| # Generate SSH host keys | ||
| log_msg "Removing existing SSH host keys" | ||
| rm -rfd /etc/ssh/ssh_host* | ||
| sleep 1 | ||
| ssh-keygen -A | ||
| systemctl restart ssh.service | ||
| log_msg "SSH host keys generated successfully" | ||
| } | ||
|
|
||
| function start_polaris() | ||
| { | ||
| # Run only if ssh.service is NOT active (inactive/failed/not-found) | ||
| if ! systemctl is-active --quiet ssh.service; then | ||
| log_msg "ssh.service is not active; regenerating host keys and restarting ssh" | ||
| generate_ssh_host_keys | ||
| fi |
The combination of generate_ssh_host_keys() and its use in start_polaris() will delete /etc/ssh/ssh_host* and regenerate SSH host keys on every boot, since dpu.init starts in runlevel S before ssh.service is active. This makes the device’s SSH host identity effectively ephemeral, preventing clients from reliably detecting man-in-the-middle attacks because host keys appear to change on each reboot. Limit host key regeneration to first install or the case where no host keys exist (e.g., checking for missing key files or a one-time flag) so SSH host keys remain stable across normal reboots.
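The "regenerate only when no host keys exist" check can be sketched as below. This is written in Python for illustration even though dpu.init is a shell script; the function names are hypothetical, and `generate` stands in for whatever actually creates the keys (for example invoking `ssh-keygen -A`).

```python
import glob
import os


def host_keys_missing(ssh_dir="/etc/ssh"):
    """Return True when no private host key file exists in ssh_dir."""
    return not glob.glob(os.path.join(ssh_dir, "ssh_host_*_key"))


def ensure_host_keys(ssh_dir="/etc/ssh", generate=None):
    """Generate host keys only if none exist; return True if generation ran."""
    if not host_keys_missing(ssh_dir):
        # Keys are already present: keep them so the host identity
        # stays stable across reboots.
        return False
    if generate is not None:
        generate()
    return True
```

With this guard, existing keys are never deleted on a normal boot, so clients see the same host identity after a reboot.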
yxieca
left a comment
Reviewed: Pensando trixie/kernel 6.12 support changes; looks good.
|
Cherry-pick PR to 202511: #25685 |
…25518)
What is the motivation for this PR
Latest master moved to trixie (6.12 kernel). Update Pensando drivers/scripts/makefiles/plugins to support 6.12.
How did you do it
Updated dsc-drivers, Pensando scripts, makefiles and plugins for 6.12; built using Pensando artifacts (1.87.0-SS-18-release).
How did you verify/test it
Loaded image on Pensando DPU on Mtfuji DSS; all dockers up and interfaces up.
Signed-off-by: Sahil Chaudhari <[email protected]>
Signed-off-by: Feng Pan <[email protected]>
What is the motivation for this PR
Latest master moved to trixie (6.12 kernel). Update Pensando drivers/scripts/makefiles/plugins to support 6.12.
How did you do it
Updated dsc-drivers, Pensando scripts, makefiles and plugins for 6.12; built using Pensando artifacts (1.87.0-SS-18-release).
How did you verify/test it
Loaded image on Pensando DPU on Mtfuji DSS; all dockers up and interfaces up.
Signed-off-by: Sahil Chaudhari <[email protected]>
Signed-off-by: dprital <[email protected]>
Why I did it
The latest master branch has moved to the Debian trixie environment, which uses the 6.12 kernel. This PR modifies the dsc-drivers, Pensando scripts, makefiles, and plugins to support the 6.12 kernel.
How I did it
How to verify it
Load the image on a Pensando DPU on Mtfuji DSS. All Docker containers should be up and running, and all interfaces should be up.