-
Notifications
You must be signed in to change notification settings - Fork 2
Fix live migration when AMX is enabled #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix live migration when AMX is enabled #46
Conversation
ea1dd7a to
159bedc
Compare
phip1611
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this and fixing the issue :) I left some remarks
6c344af to
bbd008d
Compare
bbd008d to
dd47163
Compare
dd47163 to
4e60410
Compare
1d70fb6 to
7ba61cc
Compare
phip1611
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good work! I think this is a clear improvement! Thanks for working on this. I left a few remarks
In order for guests to use AMX it is necessary to ask the kernel to enable the related state components for guests. While cloud hypervisor already does this, we would prefer to extract the logic into a stand alone (reusable) function. In this commit we only introduce the error type that will later be part of the enable_amx_state_components function's signature. Signed-Off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
7ba61cc to
115cfae
Compare
phip1611
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now, thanks! I left one remark but I already approve it
We introduce a static enable_amx_state_components method on the XSaveState struct that will be used in a follow up commit. We will also extend the logic of what this method does in the near future. Signed-Off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
We use the central amx_state_components enabling function in the CpuManager constructor. This way we can make changes to the AMX related state components functionality without needing to update the CpuManager's constructor. Signed-Off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP <[email protected]>
We will need to query KVM for the size of the xsave struct in a follow up commit. This commit introduces the necessary method on the hypervisor trait to do that. Signed-Off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
AMX requires dynamically enabling certain large state components which leads to an increase in the size of kvm_xsave. This was not taken into account by Cloud hypervisor until now. We solve this by refactoring `XSaveState` to directly wrap `kvm::Xsave` and always ensure (via a OnceLocked static variable) that all operations on the wrapped xsave state obtain an instance of the right size. Signed-Off-by: Oliver Anderson <[email protected]> On-behalf-of: SAP [email protected]
115cfae to
1a214ba
Compare
Closes cloud-hypervisor#5800
Variable length Xsave state
When XSAVE state components are dynamically enabled, the de-facto length of
kvm_xsavechanges and this needs to be accounted for. I do this by updating the pre-existingXSaveStatestruct to wrap the FAM-structXsavefrom thekvm_bindingscrate directly and I track the FAM length in a Oncelocked static variable.How this relates to the upcoming CPU profiles feature
This fix is necessary to properly test the upcoming CPU Profiles feature, but the
XSaveState::enable_amx_state_componentsmethod introduced here will also be called during CPU profile generation and also prior to checking CPU compatibility when receiving a live migration command.Risks
There are a few things we should keep in mind with regards to this fix:
kvm_bindingscrate is sound. I am not entirely convinced that this will be the case in the eyes of the Rust abstract machine, but I think we can investigate that more in the future.cargo nextestruns each test in a separate process hence as long as that is used for running the unit/integration tests this should not be a problem.XsaveStatewill now heap allocate whenever AMX is enabled. Each initialization will require roughly 1KB of memory on the heap. We have however determined that this will never be called in any hot paths.How this PR has been tested
This was tested by running an AMX workload (loading and mutiplying some matrices, then printing the result in a loop)
inside a guest on a Sapphire rapids server and then migrating it to a Granite rapids server. The migration was successful and the program kept running.