
Conversation

@Luap99 (Member) commented Apr 1, 2025

I saw a flake in parallel podman testing: podman images can fail if the manifest was removed at just the right time. In general, listing should never fail when another image or manifest is removed in parallel.

Change the logic to convert to a manifest and only collect the digests in the success case, ignoring all other errors, to make the listing more robust.

I observed the following error from podman images: Error: locating image "xxx" for loading instance list: locating image with ID "xxx": image not known
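
As a rough sketch of the approach described above (the image, manifestList, toManifestList, and collectInstanceDigests names below are hypothetical stand-ins, not the actual containers/common API):

```go
package main

import "fmt"

type image struct{ id string }

type manifestList struct{ digests []string }

// toManifestList stands in for converting an image into a manifest list.
// It can fail if the image or manifest was removed by a parallel process
// between listing and inspection.
func toManifestList(img image) (*manifestList, error) {
	return nil, fmt.Errorf("locating image with ID %q: image not known", img.id)
}

// collectInstanceDigests gathers the digests referenced by all manifest
// lists. Conversion errors are ignored so that a concurrently removed
// image can never make the overall listing fail.
func collectInstanceDigests(images []image) map[string]struct{} {
	digests := make(map[string]struct{})
	for _, img := range images {
		list, err := toManifestList(img)
		if err != nil {
			// The image may have been removed in parallel; treat it
			// as referencing nothing instead of failing the listing.
			continue
		}
		for _, d := range list.digests {
			digests[d] = struct{}{}
		}
	}
	return digests
}

func main() {
	imgs := []image{{id: "xxx"}, {id: "yyy"}}
	// Both conversions fail here, yet the listing still succeeds.
	fmt.Println(len(collectInstanceDigests(imgs)))
}
```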

openshift-ci bot (Contributor) commented Apr 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Apr 1, 2025
@Luap99 (Member, Author) commented Apr 1, 2025

cc @mtrmac @rhatdan

@mtrmac (Contributor) left a comment

ACK to the general idea (we don’t hold the storage lock long enough to get a consistent snapshot, and if a manifest list goes away between MultiList and querying the “dangling” property, it’s reasonable to consider the components dangling).

I’m worried about silently returning invalid data on I/O errors, especially when that can lead to removing images, though. Can this be restricted to only silently ignore the two relevant errors?

if err != nil {
	return nil, err
}
// ignore errors, common errors are
A Collaborator left a comment

ctx is unused as of now, lint test is failing.

@Luap99 (Member, Author) commented Apr 1, 2025

I’m worried about silently returning invalid data on I/O errors, especially when that can lead to removing images, though. Can this be restricted to only silently ignore the two relevant errors?

I wasn't sure these would be the only errors it can encounter, and I wanted to err on the safe side: podman images should just work.
It is certainly possible to check for specific errors and not ignore others if that is preferred.
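
For illustration, restricting the ignored errors could look roughly like the following; errImageUnknown is a hypothetical sentinel standing in for whatever the storage layer actually returns, and fs.ErrNotExist is only an assumed second candidate:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// errImageUnknown is a hypothetical sentinel standing in for the "image not
// known" error returned when an image disappears between list and lookup.
var errImageUnknown = errors.New("image not known")

// isExpectedRemovalError reports whether an error is one a parallel
// image/manifest removal is expected to cause. Only these are safe to
// ignore; anything else (e.g. real I/O errors) should still be returned.
func isExpectedRemovalError(err error) bool {
	return errors.Is(err, errImageUnknown) || errors.Is(err, fs.ErrNotExist)
}

func main() {
	fmt.Println(isExpectedRemovalError(fmt.Errorf("loading instance list: %w", errImageUnknown))) // true
	fmt.Println(isExpectedRemovalError(errors.New("read manifest: input/output error")))          // false
}
```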

@mtrmac (Contributor) commented Apr 1, 2025

if a manifest list goes away between MultiList and querying the “dangling” property, it’s reasonable to consider the components dangling.

… ugh, this can theoretically violate causality:

  • A refers to C
  • We run MultiList
  • User adds B referring to C
  • User deletes A
  • We see that A has been removed, and don’t see its reference to C. We have never seen B. We think that C is dangling, but it was not dangling at any point in time.

Is that possible? Worth worrying about? Right now we don’t have the infrastructure to do anything else… I guess we would want something like MultiListOptions.ReturnAllManifestsData?! And then teach the manifest list parsers to work with raw data instead of image references?

@Luap99 (Member, Author) commented Apr 1, 2025

Is that possible? Worth worrying about? Right now we don’t have the infrastructure to do anything else… I guess we would want something like MultiListOptions.ReturnAllManifestsData?! And then teach the manifest list parsers to work with raw data instead of image references?

If we talk about operations like podman system/image prune, they are extremely racy by design. Every single one of them. Since no global locks are taken for the main logic of these commands, they all do something like list + remove. Even if you make the manifest listing "atomic", the time between list and rm will still be unlocked. A manifest list with a reference to that image could have been added after listing but before removal. To fix that, we would need to teach the storage layer not to remove an image if it is part of a manifest.
And even that would not be enough: currently something like podman run IMAGE will pull the image, and then there is a window in which the image is unused after the pull, before we actually create the storage container that references it.

So overall I don't think we need to worry about this case too much here. All I care about is making the commands consistent so that they at least don't randomly fail, which is far worse IMO.

@mtrmac (Contributor) commented Apr 1, 2025

WFM. I have little sympathy for a user adding a reference to an untagged image while running prune concurrently; but the fact that podman run is not resistant to a concurrent prune is a strong argument that this is not something we are really aiming to support at the moment.

@mheon (Member) commented Apr 2, 2025

LGTM

@mtrmac (Contributor) left a comment

LGTM, feel free to merge as is.

Luap99 added 2 commits April 9, 2025 13:54
I saw a flake in parallel podman testing, podman images can fail if the
manifest was removed at the right time. In general listing should never
be able to fail when another image or manifest is removed in parallel.

Change the logic to convert to manifest and only collect the digests in
the success case and ignore all other errors to make the listing more
robust.

I observed the following error from podman images:
Error: locating image "xxx" for loading instance list: locating image with ID "xxx": image not known

Signed-off-by: Paul Holzinger <[email protected]>

All other errors are returned wrapped with the image ID, so do the same
when the manifest blob decoding fails.

Signed-off-by: Paul Holzinger <[email protected]>
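
As a minimal illustration of the wrapping described in the second commit (decodeManifest, loadInstanceList, and the image type are hypothetical stand-ins, not the code touched by the PR):

```go
package main

import (
	"errors"
	"fmt"
)

type image struct{ ID string }

// decodeManifest stands in for parsing the raw manifest blob; here it
// always fails in order to demonstrate the wrapping.
func decodeManifest(raw []byte) (map[string]any, error) {
	return nil, errors.New("invalid manifest blob")
}

// loadInstanceList wraps a decoding failure with the image ID, matching how
// the other errors in this code path are reported.
func loadInstanceList(img image, raw []byte) (map[string]any, error) {
	list, err := decodeManifest(raw)
	if err != nil {
		return nil, fmt.Errorf("locating image %q for loading instance list: %w", img.ID, err)
	}
	return list, nil
}

func main() {
	_, err := loadInstanceList(image{ID: "xxx"}, nil)
	fmt.Println(err)
}
```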
@Luap99 (Member, Author) commented Apr 9, 2025

Created a podman test PR just to see if this breaks anything: containers/podman#25840

@Luap99 (Member, Author) commented Apr 9, 2025

Podman PR looks good, PTAL again

@mtrmac (Contributor) commented Apr 9, 2025

/lgtm

@mtrmac (Contributor) commented Apr 9, 2025

Thanks!

openshift-merge-bot merged commit f71a7a6 into containers:main Apr 9, 2025
15 checks passed
Luap99 deleted the list-manifest branch April 9, 2025 18:57