
Conversation

@Luap99 (Member) commented Apr 1, 2025

I saw a flake in parallel podman testing: podman images can fail if the manifest was removed at just the right time. In general, listing should never fail when another image or manifest is removed in parallel.

Change the logic to convert to a manifest and only collect the digests in the success case, ignoring all other errors, to make the listing more robust.

I observed the following error from podman images: Error: locating image "xxx" for loading instance list: locating image with ID "xxx": image not known
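
As a rough sketch of the approach described above (the image, manifestList, toManifestList, and collectInstanceDigests names below are hypothetical stand-ins, not the actual containers/common API):

```go
package main

import "fmt"

type image struct{ id string }

type manifestList struct{ digests []string }

// toManifestList stands in for converting an image into a manifest list.
// It can fail if the image or manifest was removed by a parallel process
// between listing and inspection.
func toManifestList(img image) (*manifestList, error) {
	return nil, fmt.Errorf("locating image with ID %q: image not known", img.id)
}

// collectInstanceDigests gathers the digests referenced by all manifest
// lists. Conversion errors are ignored so that a concurrently removed
// image can never make the overall listing fail.
func collectInstanceDigests(images []image) map[string]struct{} {
	digests := make(map[string]struct{})
	for _, img := range images {
		list, err := toManifestList(img)
		if err != nil {
			// The image may have been removed in parallel; treat it
			// as referencing nothing instead of failing the listing.
			continue
		}
		for _, d := range list.digests {
			digests[d] = struct{}{}
		}
	}
	return digests
}

func main() {
	imgs := []image{{id: "xxx"}, {id: "yyy"}}
	// Both conversions fail here, yet the listing still succeeds.
	fmt.Println(len(collectInstanceDigests(imgs)))
}
```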

openshift-ci bot (Contributor) commented Apr 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Apr 1, 2025
@Luap99 (Member, Author) commented Apr 1, 2025

cc @mtrmac @rhatdan

@mtrmac (Contributor) left a comment

ACK to the general idea (we don’t hold the storage lock long enough to get a consistent snapshot, and if a manifest list goes away between MultiList and querying the “dangling” property, it’s reasonable to consider the components dangling).

I’m worried about silently returning invalid data on I/O errors, especially when that can lead to removing images, though. Can this be restricted to only silently ignore the two relevant errors?

if err != nil {
	return nil, err
}
// ignore errors, common errors are
A Collaborator left a comment

ctx is unused as of now, lint test is failing.

@Luap99 (Member, Author) commented Apr 1, 2025

I’m worried about silently returning invalid data on I/O errors, especially when that can lead to removing images, though. Can this be restricted to only silently ignore the two relevant errors?

I wasn't sure these would be the only errors it can encounter, and I wanted to err on the safe side: podman images should just work.
It is certainly possible to check for specific errors and not ignore others if that is preferred.
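
For illustration, restricting the ignored errors could look roughly like the following; errImageUnknown is a hypothetical sentinel standing in for whatever the storage layer actually returns, and fs.ErrNotExist is only an assumed second candidate:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// errImageUnknown is a hypothetical sentinel standing in for the "image not
// known" error returned when an image disappears between list and lookup.
var errImageUnknown = errors.New("image not known")

// isExpectedRemovalError reports whether an error is one a parallel
// image/manifest removal is expected to cause. Only these are safe to
// ignore; anything else (e.g. real I/O errors) should still be returned.
func isExpectedRemovalError(err error) bool {
	return errors.Is(err, errImageUnknown) || errors.Is(err, fs.ErrNotExist)
}

func main() {
	fmt.Println(isExpectedRemovalError(fmt.Errorf("loading instance list: %w", errImageUnknown))) // true
	fmt.Println(isExpectedRemovalError(errors.New("read manifest: input/output error")))          // false
}
```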

@mtrmac (Contributor) commented Apr 1, 2025

if a manifest list goes away between MultiList and querying the “dangling” property, it’s reasonable to consider the components dangling.

… ugh, this can theoretically violate causality:

  • A refers to C
  • We run MultiList
  • User adds B referring to C
  • User deletes A
  • We see that A has been removed, and don’t see its reference to C. We have never seen B. We think that C is dangling, but it was not dangling at any point in time.

Is that possible? Worth worrying about? Right now we don’t have the infrastructure to do anything else… I guess we would want something like MultiListOptions.ReturnAllManifestsData?! And then teach the manifest list parsers to work with raw data instead of image references?

@Luap99 (Member, Author) commented Apr 1, 2025

Is that possible? Worth worrying about? Right now we don’t have the infrastructure to do anything else… I guess we would want something like MultiListOptions.ReturnAllManifestsData?! And then teach the manifest list parsers to work with raw data instead of image references?

If we talk about operations like podman system/image prune, they are extremely racy by design. Every single one of them. Since no global locks are taken for the main logic of these commands, they all do something like list + remove. Even if you make the manifest listing "atomic", the time between list and rm will still be unlocked. A manifest list with a reference to that image could have been added after listing but before removal. To fix that, we would need to teach the storage layer not to remove an image if it is part of a manifest.
And even that would not be enough: currently something like podman run IMAGE will pull the image, and then there is a window in which the image is unused after the pull, before we actually create the storage container that references it.

So overall I don't think we need to worry about this case too much here. All I care about is making the commands consistent so that they at least don't randomly fail, which is far worse IMO.

@mtrmac (Contributor) commented Apr 1, 2025

WFM. I have little sympathy for a user adding a reference to an untagged image while running prune concurrently; but the fact that podman run is not resistant to a concurrent prune is a strong argument that this is not something we are really aiming to support at the moment.

@mheon (Member) commented Apr 2, 2025

LGTM

@mtrmac (Contributor) left a comment

LGTM, feel free to merge as is.

Luap99 added 2 commits April 9, 2025 13:54
I saw a flake in parallel podman testing, podman images can fail if the
manifest was removed at the right time. In general listing should never
be able to fail when another image or manifest is removed in parallel.

Change the logic to convert to manifest and only collect the digests in
the success case and ignore all other errors to make the listing more
robust.

I observed the following error from podman images:
Error: locating image "xxx" for loading instance list: locating image with ID "xxx": image not known

Signed-off-by: Paul Holzinger <[email protected]>

All other errors are returned wrapped with the image ID, so do the same
when the manifest blob decoding fails.

Signed-off-by: Paul Holzinger <[email protected]>
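
As a minimal illustration of the wrapping described in the second commit (decodeManifest, loadInstanceList, and the image type are hypothetical stand-ins, not the code touched by the PR):

```go
package main

import (
	"errors"
	"fmt"
)

type image struct{ ID string }

// decodeManifest stands in for parsing the raw manifest blob; here it
// always fails in order to demonstrate the wrapping.
func decodeManifest(raw []byte) (map[string]any, error) {
	return nil, errors.New("invalid manifest blob")
}

// loadInstanceList wraps a decoding failure with the image ID, matching how
// the other errors in this code path are reported.
func loadInstanceList(img image, raw []byte) (map[string]any, error) {
	list, err := decodeManifest(raw)
	if err != nil {
		return nil, fmt.Errorf("locating image %q for loading instance list: %w", img.ID, err)
	}
	return list, nil
}

func main() {
	_, err := loadInstanceList(image{ID: "xxx"}, nil)
	fmt.Println(err)
}
```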
@Luap99 (Member, Author) commented Apr 9, 2025

Created a podman test PR just to see if this breaks anything: containers/podman#25840

@Luap99 (Member, Author) commented Apr 9, 2025

Podman PR looks good, PTAL again

@mtrmac (Contributor) commented Apr 9, 2025

/lgtm

@mtrmac (Contributor) commented Apr 9, 2025

Thanks!

openshift-merge-bot merged commit f71a7a6 into containers:main Apr 9, 2025
15 checks passed
Luap99 deleted the list-manifest branch April 9, 2025 18:57