Fix file exists error when restoring remote snapshot after unexpected…#2091

Merged: ktock merged 1 commit into containerd:main from wswsmao:main on Jul 24, 2025
Conversation

wswsmao (Contributor) commented Jul 23, 2025

After PR #2076, in scenarios where the process restarts unexpectedly (such as due to OOM), restoring a remote snapshot may fail if the target directory already exists. For example:

{"error":"failed to create new snapshotter: failed to restore remote snapshot: failed to create remote snapshot directory: sha256:52fa3204fe00dd4d492873408e2ef89c13e142748931086998cc2eca69549b48: mkdir /var/lib/containerd-stargz-grpc/snapshotter/snapshots/1: file exists","level":"fatal","msg":"failed to configure snapshotter","time":"2025-07-23T12:48:12.070612675Z"}

This PR adjusts the logic in restoreRemoteSnapshot so that if mkdir fails because the directory already exists, it is treated as a result of an ungraceful shutdown and the process continues.

wswsmao (Contributor, Author) commented Jul 23, 2025

However, this situation may lead to fscache duplicating cached data. There are two possible solutions:

  1. Since this scenario is caused by an abnormal exit, users are expected to manually clean up the cache, and we can update the documentation to remind users of this requirement.
  2. Clean up the cache on every startup. Currently, the logic ensures that the cache is cleaned up on graceful exit, so in theory, the cache directory should be empty on each restart.

wswsmao closed and reopened this pull request three times on Jul 24, 2025.

ktock (Member) left a comment

LGTM. The CI flakiness doesn't seem to be related to this PR. I'll work on fixing that in a separate patch.

ktock (Member) commented Jul 24, 2025

> However, this situation may lead to fscache duplicating cached data. There are two possible solutions:
>
> 1. Since this scenario is caused by an abnormal exit, users are expected to manually clean up the cache, and we can update the documentation to remind users of this requirement.
> 2. Clean up the cache on every startup. Currently, the logic ensures that the cache is cleaned up on graceful exit, so in theory, the cache directory should be empty on each restart.

Let's take approach 1 for now.

ktock merged commit 3aa69ea into containerd:main on Jul 24, 2025
211 of 217 checks passed