Skip to content

fusemanager: fix container fail after ttl timeout in detach mode#1905

Merged
ktock merged 1 commit intocontainerd:mainfrom
wswsmao:main
Jan 7, 2025
Merged

fusemanager: fix container fail after ttl timeout in detach mode#1905
ktock merged 1 commit intocontainerd:mainfrom
wswsmao:main

Conversation

@wswsmao
Copy link
Copy Markdown
Contributor

@wswsmao wswsmao commented Dec 16, 2024

In detach mode, when containerd-stargz-grpc exits normally, it sends an Unmount request to the fuse manager. Additionally, during the startup of containerd-stargz-grpc, the restoreRemoteSnapshot function cleans up previous mountpoints. If there are still running containers at this time, it can lead to issues when the TTL cache expires, resulting in abnormal behavior of the containers.

I considered several solutions:

  1. The approach in the current PR, where users should restart containerd-stargz-grpc using SIGKILL, and then skip the cleanup step in restoreRemoteSnapshot.
  2. Setting ResolveResultEntryTTLSec to an infinitely large value to leverage the TTL cache for ensuring the normal operation of containers. However, this would still lead to failures if the containers attempt to access uncached content.
  3. Implementing a complex mechanism to determine if any running containers are using the mountpoints, and if so, skipping the cleanup.

After careful consideration, I have decided to proceed with the first approach.

Comment on lines +747 to +748
// In detach mode, rs is taken over by fusemanager,
// and there may be running containers, so we skip clean
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to mention about fusemanager as snapshot/snapshot.go is agnostic about fusemanager.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return fmt.Errorf("failed to unmount %s: %w", m.Mountpoint, err)
// In detach mode, rs is taken over by fusemanager,
// and there may be running containers, so we skip clean
if !o.detach {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just retruning from this function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done,if o.detach will return directly

docs/overview.md Outdated

### Important Considerations

Before restarting the `containerd-stargz-grpc` process, if there are running containers, it is crucial to use `SIGKILL` to terminate the `containerd-stargz-grpc` process. This approach prevents the normal shutdown sequence from attempting to clean up the mountpoints of the running containers, which could lead to disruptions in their availability. By using `SIGKILL`, you ensure that the process is forcefully terminated without affecting the ongoing operations of the containers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a doc about when the user should use SIGTERM?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wswsmao
Copy link
Copy Markdown
Contributor Author

wswsmao commented Dec 26, 2024

all done

@wswsmao
Copy link
Copy Markdown
Contributor Author

wswsmao commented Jan 6, 2025

@ktock

Copy link
Copy Markdown
Member

@ktock ktock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@ktock ktock merged commit 928a4dd into containerd:main Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants