
Update machine-learning/a3-ultragpu-8g/nemo-framework to fix segmentation fault error #4725

Merged
SwarnaBharathiMantena merged 1 commit into GoogleCloudPlatform:develop from SwarnaBharathiMantena:swarnabm/update_nemofw_ex
Oct 6, 2025

Conversation

@SwarnaBharathiMantena
Contributor

Updates to fix the Segmentation Fault error encountered when following the steps in the README file.

Updates include:

  1. Convert the Dockerfile to a multi-stage build.
  2. Set up ldconfig in the Dockerfile so that nccl-gib is loaded.

Explanation:

  • The earlier ldconfig lines had no effect. Before the multi-stage build, /usr/local/gib/lib64 did not exist in the Docker build environment, so ldconfig could not find the libnccl / libnccl-net files. Those files only appear later, when the workload runs with the pyxis/enroot --container-mounts option.
  • The NeMo 24.12 container is built with ENV NCCL_VERSION=2.22.3, but nccl-gib on a3-ultra requires 2.23 or greater.
    With the multi-stage build, we copy a compatible libnccl into the container and configure ldconfig properly. That way, when PyTorch / NeMo starts, it loads both our nccl-gib network plugin and the compatible libnccl.so.
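
The two failure conditions described above can be illustrated with a small check. This is only a sketch, not code from the PR: the helper names `parse_version`, `meets_minimum`, and `nccl_libs_in` are hypothetical, and the only values taken from the description are the 2.23 minimum, the 2.22.3 version string, and the /usr/local/gib/lib64 path.

```python
from pathlib import Path

# Minimum NCCL version that the PR description states for nccl-gib on a3-ultra.
MIN_NCCL = (2, 23)


def parse_version(text):
    """Turn a dotted version string such as '2.22.3' into (2, 22, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(version, minimum=MIN_NCCL):
    """Return True when the version is at or above the minimum (tuple comparison)."""
    return parse_version(version) >= minimum


def nccl_libs_in(directory):
    """List libnccl* file names in a directory; empty if the directory is missing."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.glob("libnccl*"))


if __name__ == "__main__":
    # The version the NeMo 24.12 image is built with, per the description.
    print(meets_minimum("2.22.3"))
    # At image build time this path may not exist yet, which yields an empty list.
    print(nccl_libs_in("/usr/local/gib/lib64"))
```

Inside a container, the version string could instead be read from the `NCCL_VERSION` environment variable mentioned above. An empty list from `nccl_libs_in` corresponds to the situation where ldconfig has nothing to register.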

Note: The example uses NeMo 1.0, and work to support NeMo 2.0 is in progress.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@SwarnaBharathiMantena added the release-bugfix label (Added to release notes under the "Bug fixes" heading.) Oct 6, 2025
@SwarnaBharathiMantena merged commit 02f5a0a into GoogleCloudPlatform:develop Oct 6, 2025
11 of 62 checks passed
@SwarnaBharathiMantena deleted the swarnabm/update_nemofw_ex branch October 13, 2025 05:28

2 participants