store: new API ApplyStagedLayer #1826

giuseppe · 2024-02-12T12:08:33Z

Add a race-condition-free alternative to using CreateLayer and ApplyDiffFromStagingDirectory, ensuring the store is locked for the entire duration while the layer is being created and populated.

Signed-off-by: Giuseppe Scrivano [email protected]

The related patch for c/image is:

diff --git a/storage/storage_dest.go b/storage/storage_dest.go
index 88e492b7..51b30a64 100644
--- a/storage/storage_dest.go
+++ b/storage/storage_dest.go
@@ -149,7 +149,7 @@ func (s *storageImageDestination) Close() error {
 	}
 	for _, v := range s.diffOutputs {
 		if v.Target != "" {
-			_ = s.imageRef.transport.store.CleanupStagingDirectory(v.Target)
+			_ = s.imageRef.transport.store.CleanupStagedLayer(v)
 		}
 	}
 	return os.RemoveAll(s.directory)
@@ -669,11 +669,6 @@ func (s *storageImageDestination) commitLayer(index int, info addedLayerInfo, si
 			return false, fmt.Errorf("index %d out of range for configOCI.RootFS.DiffIDs", index)
 		}
 
-		layer, err := s.imageRef.transport.store.CreateLayer(id, parentLayer, nil, "", false, nil)
-		if err != nil {
-			return false, err
-		}
-
 		// let the storage layer know what was the original uncompressed layer.
 		flags := make(map[string]interface{})
 		flags[expectedLayerDiffIDFlag] = configOCI.RootFS.DiffIDs[index]
@@ -682,8 +677,15 @@ func (s *storageImageDestination) commitLayer(index int, info addedLayerInfo, si
 			Flags: flags,
 		}
 
-		if err := s.imageRef.transport.store.ApplyDiffFromStagingDirectory(layer.ID, diffOutput.Target, diffOutput, options); err != nil {
-			_ = s.imageRef.transport.store.Delete(layer.ID)
+		args := storage.ApplyStagedLayerOptions{
+			ID:          id,
+			ParentLayer: parentLayer,
+
+			DiffOutput:       diffOutput,
+			DiffOptions:      options,
+		}
+		layer, err := s.imageRef.transport.store.ApplyStagedLayer(args)
+		if err != nil {
 			return false, err
 		}
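
As an illustration of the description above, here is a minimal, self-contained sketch of why doing the creation and the population under a single lock avoids the race. Everything in it (`toyStore`, `createLayer`, `applyStaged`, `applyStagedLayer`, `errDuplicateID`) is a hypothetical stand-in written for this example; it is not the c/storage implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyStore is a hypothetical in-memory stand-in for a layer store.
type toyStore struct {
	mu     sync.Mutex
	layers map[string]string // layer ID -> contents ("" means created but still empty)
}

// errDuplicateID is a hypothetical sentinel error for this sketch.
var errDuplicateID = errors.New("layer ID already in use")

// createLayer and applyStaged model the old two-call sequence: the lock is
// released between the calls, so other users can observe an empty layer.
func (s *toyStore) createLayer(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.layers[id]; ok {
		return errDuplicateID
	}
	s.layers[id] = ""
	return nil
}

func (s *toyStore) applyStaged(id, staged string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.layers[id] = staged
}

// applyStagedLayer models the combined operation: the layer is created and
// populated while the lock is held, so no empty layer is ever visible.
func (s *toyStore) applyStagedLayer(id, staged string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.layers[id]; ok {
		return errDuplicateID
	}
	s.layers[id] = staged
	return nil
}

// contents returns a layer's contents and whether the layer exists.
func (s *toyStore) contents(id string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.layers[id]
	return c, ok
}

func main() {
	s := &toyStore{layers: map[string]string{}}

	// Two-step sequence: between the calls the layer exists but is empty.
	_ = s.createLayer("a")
	c, ok := s.contents("a")
	fmt.Printf("after createLayer: exists=%v contents=%q\n", ok, c)
	s.applyStaged("a", "rootfs-a")

	// Single-step sequence: the layer appears fully populated.
	_ = s.applyStagedLayer("b", "rootfs-b")
	c, ok = s.contents("b")
	fmt.Printf("after applyStagedLayer: exists=%v contents=%q\n", ok, c)
}
```

The sketch only models visibility to concurrent users of a running store; it says nothing about what happens if the process dies half-way through.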

openshift-ci · 2024-02-12T12:08:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [giuseppe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

store.go

mtrmac · 2024-02-12T17:14:50Z

store.go

+
+	StagingDirectory string
+	DiffOutput       *drivers.DriverWithDifferOutput
+	DiffOptions      *drivers.ApplyDiffWithDifferOpts


Another cleanup opportunity: ApplyDiffWithDiffer and ApplyDiffFromStagingDirectory should have separate options types, so that callers aren’t tempted to set options that are completely ignored.

I guess that’s non-blocking…

I agree the API can be improved, and this is a good chance. In more detail, how would you improve this part?

store.go

mtrmac · 2024-02-12T17:19:10Z

Thanks for working on this!

giuseppe · 2024-02-12T20:34:40Z

@mtrmac thanks for the review.

I've addressed your comments, except one pending question: #1826 (comment)

mtrmac

Changing the API, after looking a bit more, seems to require a bunch of fairly intrusive changes all at once… so I don’t think it is worth blocking this PR on them.

Or maybe you can see some way to make all of that simple.

layers.go

mtrmac · 2024-02-12T20:59:40Z

store.go

+
+	StagingDirectory string
+	DiffOutput       *drivers.DriverWithDifferOutput
+	DiffOptions      *drivers.ApplyDiffWithDifferOpts


I’m not sure I have the whole picture … All I really wanted here was to have separate ApplyDiffWithDifferOptions and ApplyDiffFromStagingDirectoryOptions, and to drop fields that are not relevant to either operation.

But looking at ApplyDiffOptions below, even that seems fairly invasive to do to the maximum extent.

Looking a bit further… A lot of this might be better as separate PRs, and probably not right now?

Is there any caller of ApplyDiffWithDiffer writing to a pre-existing layer? If not, maybe that can be just dropped. In the driver, the usingComposefs paths have already diverged, which seems like a good reason to either share more code (not in this PR!), or to remove the redundant code path entirely.

From a c/image perspective, we now have two “apply” functions which are ~opposites of each other, and it always takes me a bit to tell which one is “stage” and which one is “commit”. So something like s/ApplyDiffWithDiffer/StageChangesWithDiffer/ would be nice — but that also somewhat depends on the above.

Actually having ApplyDiffOptions inside ApplyDiffWithDifferOptions is not a good semantic match at all — ApplyDiffOptions all revolve around the ApplyDiffOptions.Diff stream, and that is completely ignored in those paths.
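
A minimal sketch of the concern, using made-up types rather than the real c/storage ones: `sharedOptions`, `stagedCommitOptions`, `commitStagedShared` and `commitStaged` are all hypothetical names. It only shows how a dedicated options type removes the field that one code path would otherwise ignore.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// sharedOptions is a hypothetical options type reused by two operations.
type sharedOptions struct {
	Diff  io.Reader         // only meaningful when applying from a stream
	Flags map[string]string // meaningful to both operations
}

// stagedCommitOptions is a hypothetical dedicated type: it has no Diff field,
// so a caller cannot set an option that committing staged data would ignore.
type stagedCommitOptions struct {
	Flags map[string]string
}

// commitStagedShared accepts the shared type and silently ignores opts.Diff.
func commitStagedShared(opts sharedOptions) string {
	return fmt.Sprintf("committed with %d flag(s)", len(opts.Flags))
}

// commitStaged accepts the dedicated type; every field it takes is used.
func commitStaged(opts stagedCommitOptions) string {
	return fmt.Sprintf("committed with %d flag(s)", len(opts.Flags))
}

func main() {
	flags := map[string]string{"example-flag": "value"}

	// Compiles, but the Diff stream is never read: an easy mistake to make.
	fmt.Println(commitStagedShared(sharedOptions{
		Diff:  strings.NewReader("ignored"),
		Flags: flags,
	}))

	// With the dedicated type, setting Diff here would be a compile error.
	fmt.Println(commitStaged(stagedCommitOptions{Flags: flags}))
}
```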

… and using the graphdriver.Differ interface to connect the overlay driver and the zstd deduplicator code seems like an imprecise fit as well; maybe that should be just a c-storage-private interface with no exposed methods, so that c/storage can change the mechanics over time.

As mentioned elsewhere, it would be convenient for CleanupStagingDirectory to consume DriverWithDifferOutput, so that c/image doesn’t need to care about .Target at all.

mtrmac · 2024-02-12T21:09:03Z

store.go

+
+	StagingDirectory string
+	DiffOutput       *drivers.DriverWithDifferOutput
+	DiffOptions      *drivers.ApplyDiffWithDifferOpts


A much more general, and much more vague, thought is that the “stage”+“commit” model might be interesting for non-chunked layers as well.

Right now, on the PutBlob path, c/image:

putBlobToPendingFile stores the input stream into a file, fully consuming it. (That must happen to validate digests.) That is fully parallel.

commitLayer extracts the file into the graph driver’s layer. On some graph drivers (DM, VFS), that is inherently serial; in overlay, that is just a tar extraction which could, in principle, be fully parallel.

creates the layer record, with parent links, etc. That is inherently serial, or maybe it could be parallel but it is anyway cheap enough not to worry about.

It seems potentially interesting to see whether the “extract tar” part could be parallelized — and whether it would be better. I can imagine that this part is I/O heavy enough that doing two of them would just slow things down; that probably needs building and measuring.
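
The shape being discussed (independent, potentially parallel staging followed by a cheap, ordered commit) can be sketched as below. This is a toy model only: `stage`, `stageAll`, `commitAll` and `record` are hypothetical names, `stage` merely upper-cases a string in place of a real extraction, and the sketch makes no claim about whether parallel extraction would actually be faster.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// record is a hypothetical layer record with a parent link.
type record struct{ ID, Parent string }

// stage stands in for the expensive per-layer work (for example, extraction).
// It does not depend on any other layer, so it can run concurrently.
func stage(blob string) string { return strings.ToUpper(blob) }

// stageAll stages every blob in its own goroutine and waits for all of them.
func stageAll(blobs []string) []string {
	staged := make([]string, len(blobs))
	var wg sync.WaitGroup
	for i, b := range blobs {
		wg.Add(1)
		go func(i int, b string) {
			defer wg.Done()
			staged[i] = stage(b) // each goroutine writes only its own slot
		}(i, b)
	}
	wg.Wait()
	return staged
}

// commitAll is serial: each record needs the ID of the previously committed
// layer as its parent, so the order matters.
func commitAll(staged []string) []record {
	records := make([]record, 0, len(staged))
	parent := ""
	for _, id := range staged {
		records = append(records, record{ID: id, Parent: parent})
		parent = id
	}
	return records
}

func main() {
	for _, r := range commitAll(stageAll([]string{"base", "app"})) {
		fmt.Printf("layer %s (parent %q)\n", r.ID, r.Parent)
	}
}
```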

The “store stream” + “extract” + “commit” parts seem very similar to what the chunked path does in the “convert” case. OTOH “true chunked” input doesn’t have a stream in the first place…

But then, worrying about the traditional tar layers is backwards-looking. What would an ideal API for creating a composefs layer look like? I didn’t look into that and I have no idea. Would that be relevant for building the chunked one?

mtrmac · 2024-02-12T21:09:48Z

store.go

+
+	StagingDirectory string
+	DiffOutput       *drivers.DriverWithDifferOutput
+	DiffOptions      *drivers.ApplyDiffWithDifferOpts


… I think Dan would bite our head off if we worked on changing the traditional-tar API right now :)

rhatdan · 2024-02-14T14:40:19Z

LGTM

mtrmac

On second thought, I’m afraid this isn’t sufficient.

It is fine for a running system: it prevents other processes from observing the WIP layer.

But it doesn’t handle crashes sufficiently. For that, the layer metadata needs to be saved to disk with incompleteFlag (so that if we are recovering from a crash, we delete everything), and after the contents are set up, the flag is removed again.

(Or, hypothetically, we could first write the on-disk contents and only afterwards write the layer metadata?? But that would be a new unproven code path, and we would have to worry about re-creating a layer on top of previously partially-created files. Seems risky, when the other path is well-understood.)

Very roughly speaking, I think this can be done by applying the staged data from inside layerStore.create, around the place where applyDiffWithOptions is called for non-chunked layers.
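
The “save metadata with an incomplete flag, populate, then clear the flag” sequence can be sketched as follows. This is a toy model under stated assumptions: `toyDisk`, `layerRecord`, `create` and `cleanupIncomplete` are hypothetical names, the maps stand in for state that would survive a crash, and the `crashMidway` parameter simulates the process dying while the contents are being written.

```go
package main

import (
	"errors"
	"fmt"
)

// layerRecord is a hypothetical persisted layer record.
type layerRecord struct {
	ID         string
	Incomplete bool
}

// toyDisk stands in for on-disk state that survives a crash.
type toyDisk struct {
	records  map[string]layerRecord
	contents map[string]string
}

func newToyDisk() *toyDisk {
	return &toyDisk{records: map[string]layerRecord{}, contents: map[string]string{}}
}

// create saves the record with the incomplete flag first, then populates the
// contents, and only then clears the flag.
func (d *toyDisk) create(id, data string, crashMidway bool) error {
	d.records[id] = layerRecord{ID: id, Incomplete: true}
	if crashMidway {
		d.contents[id] = data[:len(data)/2] // partially written contents
		return errors.New("simulated crash while populating layer")
	}
	d.contents[id] = data
	d.records[id] = layerRecord{ID: id, Incomplete: false}
	return nil
}

// cleanupIncomplete models crash recovery: every layer still flagged as
// incomplete is deleted together with whatever contents it has.
func (d *toyDisk) cleanupIncomplete() []string {
	var removed []string
	for id, rec := range d.records {
		if rec.Incomplete {
			delete(d.records, id)
			delete(d.contents, id)
			removed = append(removed, id)
		}
	}
	return removed
}

func main() {
	d := newToyDisk()
	_ = d.create("good", "complete contents", false)
	err := d.create("bad", "complete contents", true)
	fmt.Println("create:", err)
	fmt.Println("removed on recovery:", d.cleanupIncomplete())
	fmt.Println("remaining layers:", len(d.records))
}
```

Because the flag is persisted before any contents exist, recovery never has to guess whether a layer's files are trustworthy: a flagged record always means “delete everything”.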

store.go

mtrmac · 2024-02-14T22:06:48Z

The API design LGTM.

For the record:

The related patch for c/image is:

diff --git a/storage/storage_dest.go b/storage/storage_dest.go
index 88e492b7..51b30a64 100644
--- a/storage/storage_dest.go
+++ b/storage/storage_dest.go

+		layer, err := s.imageRef.transport.store.ApplyStagedLayer(args)
+		if err != nil {

This path in c/image also needs to check for, and succeed with, ErrDuplicateID. And it would be convenient, and consistent with PutLayer, for the new function to return a layer value with the duplicate object in this case.
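
A minimal sketch of that caller-side pattern, with hypothetical stand-ins throughout (`toyStore`, `layer`, `applyStaged`, `commitLayer`, and a local `errDuplicateID` sentinel): the store returns the already-existing layer together with the duplicate-ID error, and the caller treats that specific error as success.

```go
package main

import (
	"errors"
	"fmt"
)

// layer is a hypothetical layer value.
type layer struct{ ID string }

// errDuplicateID is a hypothetical sentinel error for this sketch.
var errDuplicateID = errors.New("that ID is already in use")

// toyStore is a hypothetical in-memory store keyed by layer ID.
type toyStore struct{ layers map[string]*layer }

// applyStaged creates the layer, or returns the existing layer together with
// a wrapped errDuplicateID if the ID is already present.
func (s *toyStore) applyStaged(id string) (*layer, error) {
	if l, ok := s.layers[id]; ok {
		return l, fmt.Errorf("creating layer %q: %w", id, errDuplicateID)
	}
	l := &layer{ID: id}
	s.layers[id] = l
	return l, nil
}

// commitLayer treats a duplicate ID as success and uses the existing layer;
// any other error is returned to the caller.
func commitLayer(s *toyStore, id string) (*layer, error) {
	l, err := s.applyStaged(id)
	if err != nil && !errors.Is(err, errDuplicateID) {
		return nil, err
	}
	return l, nil
}

func main() {
	s := &toyStore{layers: map[string]*layer{}}
	first, _ := commitLayer(s, "layer-1")
	second, err := commitLayer(s, "layer-1") // same ID again
	fmt.Println("second commit error:", err)
	fmt.Println("same layer returned:", first == second)
}
```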

store.go

giuseppe · 2024-02-15T08:48:29Z

> On second thought, I’m afraid this isn’t sufficient.
>
> It is fine for a running system: it prevents other processes from observing the WIP layer.
>
> But it doesn’t handle crashes sufficiently. For that, the layer metadata needs to be saved to disk with incompleteFlag (so that if we are recovering from a crash, we delete everything), and after the contents are set up, the flag is removed again.
>
> (Or, hypothetically, we could first write the on-disk contents and only afterwards write the layer metadata?? But that would be a new unproven code path, and we would have to worry about re-creating a layer on top of previously partially-created files. Seems risky, when the other path is well-understood.)
>
> Very roughly speaking, I think this can be done by applying the staged data from inside layerStore.create, around the place where applyDiffWithOptions is called for non-chunked layers.

you are right, we need to use the incompleteFlag. I've pushed a new version where I set the incompleteFlag, barely tested. I'll work more on it through the day

giuseppe · 2024-02-15T15:02:46Z

@mtrmac what do you think of the last version?

mtrmac

I’d rather prefer if the incompleteFlag remained an internal implementation detail of layers.go.

giuseppe · 2024-02-15T17:35:45Z

> I’d rather prefer if the incompleteFlag remained an internal implementation detail of layers.go.

moved the incompleteFlag usage to layers.go

mtrmac

This is a half-way step, but the `partial` option is still basically “leave the layer incomplete” and/or “I promise to call `applyDiffFromStagingDirectory` later” (with neither effect documented), with `store.go` being responsible for that.

It seems to me

	if diff != nil {
		if size, err = r.applyDiffWithOptions …
+	else if staged != nil {
+		if size, err = r.applyDiffFromStagingDirectory …
	else { …

should be possible and not too invasive a change.
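
A runnable toy version of that three-way branch, under stated assumptions: `create`, `stagedOutput` and `newDiff` here are hypothetical stand-ins, reading the stream to count bytes replaces the real diff application, and the staged size is simply taken from the struct.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// stagedOutput is a hypothetical description of already-staged layer data.
type stagedOutput struct {
	Target string
	Size   int64
}

// newDiff is a small helper for this example that wraps a string in a stream.
func newDiff(s string) io.Reader { return strings.NewReader(s) }

// create picks exactly one way to populate the new layer: from a diff
// stream, from staged data, or neither (an empty layer).
func create(diff io.Reader, staged *stagedOutput) (size int64, source string, err error) {
	switch {
	case diff != nil:
		// Stand-in for applying the stream: consume it and count the bytes.
		size, err = io.Copy(io.Discard, diff)
		return size, "stream", err
	case staged != nil:
		// Stand-in for moving the staged data into place.
		return staged.Size, "staged", nil
	default:
		return 0, "empty", nil
	}
}

func main() {
	size, source, _ := create(newDiff("tar bytes"), nil)
	fmt.Println(source, size)

	size, source, _ = create(nil, &stagedOutput{Target: "/tmp/staging", Size: 42})
	fmt.Println(source, size)

	size, source, _ = create(nil, nil)
	fmt.Println(source, size)
}
```

Keeping the choice inside one function means the caller never has to promise to “apply the staged data later”; the layer is either fully populated when `create` returns or the call fails.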

layers.go

giuseppe · 2024-02-15T18:28:48Z

pushed a new version, I moved the applyDiffFromStagingDirectory() call inside create()

mtrmac

ACK.

layers.go

store.go

mtrmac

/lgtm

Thanks!

giuseppe · 2024-02-16T08:05:03Z

@mtrmac the related change in c/image: containers/image#2301

openshift-ci bot added the do-not-merge/work-in-progress label Feb 12, 2024

openshift-ci bot added the approved label Feb 12, 2024

giuseppe mentioned this pull request Feb 12, 2024

Creation of Zstd:chunked layers seems racy containers/image#1979

Closed

giuseppe force-pushed the put-partial-layers branch from d190380 to ba0d7f7 Compare February 12, 2024 12:09

mtrmac reviewed Feb 12, 2024

giuseppe force-pushed the put-partial-layers branch 3 times, most recently from 0a76801 to 5e58781 Compare February 12, 2024 20:34

giuseppe marked this pull request as ready for review February 13, 2024 08:03

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 13, 2024

mtrmac reviewed Feb 13, 2024

giuseppe force-pushed the put-partial-layers branch from 5e58781 to cc2d131 Compare February 14, 2024 09:43

giuseppe changed the title ~~store: new API PutLayerFromStagingDirectory~~ store: new API PutLayerFromStaging Feb 14, 2024

giuseppe force-pushed the put-partial-layers branch from cc2d131 to ec4f82c Compare February 14, 2024 16:19

giuseppe changed the title ~~store: new API PutLayerFromStaging~~ store: new API ApplyStagedLayer Feb 14, 2024

mtrmac requested changes Feb 14, 2024

mtrmac reviewed Feb 14, 2024

giuseppe force-pushed the put-partial-layers branch from ec4f82c to 41deea0 Compare February 15, 2024 08:47

giuseppe force-pushed the put-partial-layers branch from 41deea0 to a9c9b49 Compare February 15, 2024 12:19

mtrmac mentioned this pull request Feb 15, 2024

storage: enable partial images by default #1833

Merged

mtrmac reviewed Feb 15, 2024

giuseppe force-pushed the put-partial-layers branch from a9c9b49 to e5163a1 Compare February 15, 2024 17:35

mtrmac reviewed Feb 15, 2024

giuseppe force-pushed the put-partial-layers branch 2 times, most recently from 34fbf02 to 4c51bea Compare February 15, 2024 18:28

mtrmac reviewed Feb 15, 2024

mtrmac mentioned this pull request Aug 27, 2025

Zstd(:chunked) work tracking checklist containers/container-libs#205

Open

37 tasks

giuseppe added 4 commits February 15, 2024 21:56

store: split PutLayer

091f854

this is needed by the following commit. Signed-off-by: Giuseppe Scrivano <[email protected]>

driver: simplify ApplyDiffFromStagingDirectory

c6de01c

enforce that the stagingDirectory must have the same value as the diffOutput.Target variable. This allows simplifying the internal API. Signed-off-by: Giuseppe Scrivano <[email protected]>

store: new API ApplyStagedLayer

21ed482

Add a race-condition-free alternative to using CreateLayer and ApplyDiffFromStagingDirectory, ensuring the store is locked for the entire duration while the layer is being created and populated. Signed-off-by: Giuseppe Scrivano <[email protected]>

store: new API CleanupStagedLayer

d36d6c1

It uses the diff output as input and callers are not expected to know about the Target directory. Signed-off-by: Giuseppe Scrivano <[email protected]>

giuseppe force-pushed the put-partial-layers branch from 4c51bea to d36d6c1 Compare February 15, 2024 20:57

mtrmac reviewed Feb 16, 2024

openshift-ci bot assigned mtrmac Feb 16, 2024

openshift-ci bot added the lgtm label Feb 16, 2024

openshift-merge-bot bot merged commit 6f63bc4 into containers:main Feb 16, 2024

giuseppe mentioned this pull request Feb 16, 2024

storage: use the new ApplyStagedLayer interface containers/image#2301

Merged

mtrmac mentioned this pull request Aug 27, 2025

Zstd(:chunked) work tracking checklist containers/container-libs#210

Open

37 tasks
