Skip to content

ignore update status file when no space#11458

Open
yylt wants to merge 2 commits intocontainerd:mainfrom
yylt:nospace
Open

ignore update status file when no space#11458
yylt wants to merge 2 commits intocontainerd:mainfrom
yylt:nospace

Conversation

@yylt
Copy link
Copy Markdown
Contributor

@yylt yylt commented Mar 3, 2025

Fix #7247

Failed to start recovre, the root cause is inconsistent data between cri and container in metadata

Ensure consistency, the following manual recovery could be resolve.

  - stop containerd service, like `systemctl stop containerd`
  - disable cri plugin in config.toml and start containerd service
  - find container and task which reversed, and stop task, delete container to release disk, which most like 
    - `ctr -n k8s.io t stop [id]`
    - `ctr -n k8s.io c rm [id]`
  - enable cri plugin in config.toml
  - restart containerd service, `systemctl restart containerd`

As described in here, ignoring noSpace error is a more convenient way to handle such issues

@k8s-ci-robot
Copy link
Copy Markdown

Hi @yylt. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dosubot dosubot Bot added the area/cri Container Runtime Interface (CRI) label Mar 3, 2025
@yylt yylt force-pushed the nospace branch 2 times, most recently from 36fe892 to ce34076 Compare March 3, 2025 10:03
@yylt
Copy link
Copy Markdown
Contributor Author

yylt commented Mar 3, 2025

/cc @fuweid @mikebrow

@k8s-ci-robot k8s-ci-robot requested review from fuweid and mikebrow March 3, 2025 12:29
Comment thread internal/cri/store/container/status.go Outdated
}

// writeFileInplace writes the data to the file in-place.
func writeFileInplace(path string, data []byte, perm os.FileMode) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "in-place"? How does this differ from os.WriteFile?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inplace write cannot guarantee to be a transaction and may result in incompleteness file

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just use os.WriteFile here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use file descripter execute Sync to make sure had saved on disk, try to minimize the time that may cause inconsistent JSON struct content in the file

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not update status file when root directory full

Comment thread internal/cri/store/container/status.go Outdated
}
if err := continuity.AtomicWriteFile(s.path, data, 0600); err != nil {
return fmt.Errorf("failed to checkpoint status to %q: %w", s.path, err)
if strings.Contains(err.Error(), "no space") {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.Is(err, unix.ENOSPC) may work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use syscall.ENOSPC which is depending on the underlying system

runtime "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func TestIssue7247(t *testing.T) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs an explanation of the issue

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@yylt yylt force-pushed the nospace branch 3 times, most recently from b5fa99a to 2e4d435 Compare March 5, 2025 03:12
Comment thread internal/cri/store/container/status.go Outdated
path := filepath.Join(root, "status")
if err := continuity.AtomicWriteFile(path, data, 0600); err != nil {
return nil, fmt.Errorf("failed to checkpoint status to %q: %w", path, err)
if strings.Contains(err.Error(), "no space") {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed a fix plan in #11504.

Please take a look on that.

My idea is to introduce a new StoreStatus which is used in restart.go file only.
In your patch, in place doesn't make sense to me. We can't guarantee it always work.
Based on that, we can just ignore no-space error and use memory status directly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks better and has been modified.

@yylt yylt changed the title inplace update status file when no space ignore update status file when no space Mar 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) needs-ok-to-test size/L

Projects

Status: Needs Triage

Development

Successfully merging this pull request may close these issues.

Containerd restart failed with message "failed to recover state: failed to reserve container name xxx: name xxx is reserved for xxx"

4 participants