
Conversation

@bart0sh (Contributor) commented Mar 9, 2022

This PR adds support for injecting CDI devices into containers.

CDI injection has been supported in Podman since v3.2.0 and was recently implemented in CRI-O.

This is also an essential part of work to implement Dynamic Resource Allocation in Kubernetes.
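For readers unfamiliar with CDI: devices are described by vendor-provided Spec files (JSON or YAML) placed in well-known directories such as /etc/cdi, and containers request them by qualified name (e.g. vendor.example.com/gpu=gpu0). A minimal, hypothetical Spec for illustration (the vendor, device, and paths are invented):

```json
{
  "cdiVersion": "0.3.0",
  "kind": "vendor.example.com/gpu",
  "devices": [
    {
      "name": "gpu0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/vendor-gpu0" }
        ],
        "env": [
          "VENDOR_VISIBLE_DEVICES=0"
        ]
      }
    }
  ]
}
```

Injecting the device "vendor.example.com/gpu=gpu0" applies the listed containerEdits (device nodes, environment variables, mounts, hooks) to the container's OCI runtime spec.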

@k8s-ci-robot

Hi @bart0sh. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bart0sh bart0sh force-pushed the cdi6 branch 2 times, most recently from 8dd2e31 to 9b8ad75 Compare March 9, 2022 19:43
@mikebrow (Member) commented Mar 9, 2022

/ok-to-test

theopenlab-ci bot commented Mar 9, 2022

Build succeeded.

  • containerd-build-arm64 : RETRY_LIMIT in 3m 37s (non-voting)
  • containerd-test-arm64 : RETRY_LIMIT in 5m 41s (non-voting)
  • containerd-integration-test-arm64 : RETRY_LIMIT in 25m 08s (non-voting)

"errors"
"fmt"

"github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
Contributor:

It looks like the current spec version is 0.3.0. Does the git tag follow the spec version? How do you intend on handling upgrades and backward compatibility?

Contributor:

When you make a bug fix that is also a new spec? Also since this is an issue with the current CNI setup, what are your thoughts on upgrades and backward compatibility?

Contributor (author):

@MikeZappa87 so far that has been the case. As the CDI API is in an alpha state, we have been updating the spec and the code together for any bug fixes and major API updates. We're going to do that until we reach the first major release. After that we may consider using different API paths for incompatible API versions.

@klihub (Member) commented Mar 15, 2022

I think even if the code/git and the Spec versions are in sync now, going forward they will almost inevitably diverge. IMO the package version and the Spec version, while not completely orthogonal, are not supposed to be updated in tandem. Instead, both should follow semantic versioning on their own. I think this implies that bumping the Spec version will always result in bumping the git tag, but not the other way around. For instance, changing the externally visible package programmatic interfaces/abstractions would result in a git tag bump, but it would leave the Spec version alone.

As for handling the evolution of the Spec, I think our default goal should be that when we change the Spec version, we provide transparent conversion (while loading the Spec) from a (reasonable window of) previous version(s) to the latest one, whenever this is possible. And when we design/decide Spec changes, we should keep this in mind and aim for such a possibility to exist.

Contributor:

Hi @MikeZappa87, we discussed this in our WG meeting yesterday and just wanted to check in.

I think the short answer is that we don't have versioning quite figured out at this stage. For the time being we are using the following approach:

  • API changes / bug fixes will result in a patch-version bump. This will only be applied to the git tag and will not affect the spec version.
  • Spec changes will result in a minor-version bump. Here, v0.3.0 is our first tagged release. There will be a git tag associated with each spec-version bump.

I don't think that we have hard guarantees on backward compatibility at this stage which we feel is reflected by a non-v1 spec version. This is not to say that we will be making breaking changes to the spec at every version bump, but rather that we cannot commit to not having to make breaking changes -- especially now that we are seeing adoption of CDI in podman, cri-o, and containerd.

I see the following immediate action items from our side:

  • We will also consider whether separating the spec and API makes sense at this stage and reevaluate this periodically.
  • If you have any insights on how to handle spec versioning going forward, or examples of projects that you feel could serve as good models, please reach out to us.

With the above points considered, do you see this as a blocker for getting this PR merged (given that this functionality is gated by a feature flag)?

@MikeZappa87 (Contributor) commented Apr 4, 2022

Is it possible that a GitHub issue is created for this? Also, could a TODO be added to the code with a link to that GitHub issue?

Contributor:

I have created cncf-tags/container-device-interface#60 to track this on our end. Please let me know if there is insufficient context there.

@mikebrow (Member) commented Apr 6, 2022

that works.. we might need a longer term tracking issue for changes expected in containerd over time up to some desired golden state.. :-) and that could link back over to the cod/cod issues & prs

theopenlab-ci bot commented Mar 11, 2022

Build succeeded.

Comment on lines 342 to 350
registry := cdi.GetRegistry(cdi.WithSpecDirs(cdiSpecDirs...))
if err = registry.Refresh(); err != nil {
// We don't consider registry refresh failure a fatal error.
// For instance, a dynamically generated invalid CDI Spec file for
// any particular vendor shouldn't prevent injection of devices of
// different vendors. CDI itself knows better and it will fail the
// injection if necessary.
log.G(ctx).Warnf("CDI registry refresh failed: %v", err)
}
Member:

Correct me if I'm wrong, but it seems GetRegistry returns a singleton object and Refresh takes a lock on the object.
...
Wouldn't it introduce lock contention here? Do we need to refresh the registry every time?

Yes, it is a singleton object which is locked during both Refresh() and CDI device resolution. We don't need to refresh the registry (at least not the brute force way we do now), as long as we make sure in some way that it's up to date. We haven't added any logic for that yet, but we'll get down to it now.
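For context on the resolution mentioned above: CDI devices are requested by fully-qualified names of the form vendor.example.com/class=name. A stdlib-only sketch of that naming scheme (simplified for illustration; the real parser in the cdi package performs stricter validation of each part):

```go
package main

import (
	"fmt"
	"strings"
)

// parseQualifiedName splits a CDI device reference of the form
// "vendor.example.com/class=name" into its vendor, class, and
// device-name parts. Simplified: the cdi package additionally
// validates the characters allowed in each part.
func parseQualifiedName(device string) (vendor, class, name string, err error) {
	kind, name, ok := strings.Cut(device, "=")
	if !ok || name == "" {
		return "", "", "", fmt.Errorf("invalid CDI device %q: expected kind=name", device)
	}
	vendor, class, ok = strings.Cut(kind, "/")
	if !ok || vendor == "" || class == "" {
		return "", "", "", fmt.Errorf("invalid CDI device %q: kind must be vendor/class", device)
	}
	return vendor, class, name, nil
}

func main() {
	vendor, class, name, err := parseQualifiedName("vendor.example.com/gpu=gpu0")
	if err != nil {
		panic(err)
	}
	fmt.Println(vendor, class, name) // vendor.example.com gpu gpu0
}
```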

Member:

@kzys Once we have the implementation for automatic refresh of the CDI registry merged, we won't need to explicitly trigger a manual refresh here.

@kzys (Member) commented Mar 11, 2022

Is CDI Linux-only? It seems some of CDI's dependencies are not buildable on Windows.

@kzys kzys mentioned this pull request Mar 11, 2022
@bart0sh (Contributor, Author) commented Mar 14, 2022

Is CDI Linux-only? It seems some of CDI's dependencies are not buildable on Windows.

Yes, it's currently Linux-only. I'll update the PR to fix build breaks on non-Linux platforms.
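In Go, this kind of platform gating is typically done with build constraints: the Linux-only wiring lives in a constrained file, with a stub for other platforms. An illustrative fragment (the file and package names here are hypothetical, not the actual layout in the PR):

```go
//go:build linux

// cdi_linux.go: CDI wiring, compiled only on Linux.
// A companion file guarded by "//go:build !linux" would provide
// no-op stubs so other platforms still build.
package sandbox
```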

@bart0sh (Contributor, Author) commented Mar 14, 2022

@kzys

I'll update the PR to fix build breaks on non-Linux platforms.

done

theopenlab-ci bot commented Mar 14, 2022

Build succeeded.

@bart0sh (Contributor, Author) commented Mar 15, 2022

@mikebrow can you re-trigger the Windows integration test? I'm not sure its failure is caused by the PR changes, as they only touch Linux code.

@mikebrow (Member):

@mikebrow can you re-trigger the Windows integration test? I'm not sure its failure is caused by the PR changes, as they only touch Linux code.

that's a known issue you'll need to rebase

@bart0sh (Contributor, Author) commented Mar 15, 2022

@mikebrow thanks. rebased.

@bart0sh bart0sh force-pushed the cdi6 branch 5 times, most recently from e2a6153 to 66cc0d7 Compare March 22, 2022 21:14
theopenlab-ci bot commented Apr 6, 2022

Build succeeded.

@mikebrow (Member) left a review comment:

LGTM

@bart0sh (Contributor, Author) commented Apr 6, 2022

@mxpv @mikebrow thank you for approving this. Can you merge or suggest whom to ask to merge it?

@containerd containerd deleted a comment from theopenlab-ci bot Apr 6, 2022
@mxpv mxpv merged commit d6cf0d8 into containerd:main Apr 6, 2022
@bart0sh (Contributor, Author) commented Apr 7, 2022

@mxpv @mikebrow Thank you for merging it! Very much appreciated.

Should I submit this PR against the release/1.6 branch if I want it released sooner than in 1.7?

@fuweid (Member) commented Apr 7, 2022

Backports are for bug fixes and security fixes. This change is a new feature, and a big one; I don't think it should be backported.

@mxpv mxpv mentioned this pull request Apr 19, 2022
@klueska commented Aug 4, 2022

@mikebrow, @fuweid
Should this be added to the 1.7 milestone so it doesn't get lost for release notes purposes?
