
Prevent GC from scheduling itself with 0 period. #9795

Merged
mxpv merged 1 commit into containerd:main from catap:prevent-zero-timer
Feb 21, 2024

Conversation

@catap
Contributor

@catap catap commented Feb 8, 2024

On startup gcTimeSum might run fast and return 0; in that case the algorithm turns into an infinite loop that simply consumes CPU on a timer firing without any interval.

Closes: #5089

@k8s-ci-robot

Hi @catap. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@catap catap force-pushed the prevent-zero-timer branch 2 times, most recently from fc72892 to 1e1d037, on February 8, 2024 at 21:55
@dmcgowan
Member

dmcgowan commented Feb 8, 2024

I'm curious what log.G(ctx).WithField("d", gcTime).Trace("garbage collected") is outputting in this case. It is not a condition I would have expected to happen. Is the pause threshold also the default?

Comment thread: plugins/gc/scheduler.go (Outdated)
// anyway, on startup gcTimeSum might work fast and return 0
// on this case algorithm above turns in infinity loop which
// simple consume CPU on timer which fires without any interval
if interval < time.Second {
Member

Rather than define a lower bound for the interval, I think it would be better to define it for the avg. A sensible value there may be 5ms. With the default threshold of 0.02 that would have the interval set at a minimum 245ms, but adjustable by changing the pause threshold.

Contributor Author

That still makes it possible to run into the same issue, if someone sets the threshold to 1, for example.

My approach should help a lightly loaded Docker machine that runs in the background; in that case 1s doesn't make any difference.

Member

That still makes it possible to run into the same issue, if someone sets the threshold to 1, for example.

.5 is the max value and .02 is the default. The pauseThreshold is the configurable value to modulate the interval, though. Setting the threshold to 0.0 would also solve the problem in this case, by keeping the interval at a second.

A better solution would be to update the default pauseThreshold along with changing the lower bound on avg, to get a better default interval for unloaded machines and generally spend less time checking the gc variables. We already have an event system that can reschedule gc in the cases when it is expected to run following some user action, and we commonly see clients using this now.

Contributor Author

Agreed and updated the PR.

Thanks for the explanation.

@catap
Contributor Author

catap commented Feb 9, 2024

I'm curious what log.G(ctx).WithField("d", gcTime).Trace("garbage collected") is outputting in this case. It is not a condition I would have expected to happen. Is the pause threshold also the default?

Nothing.

It is a clean VM (Alpine 3.19) where I've installed only Docker. On start it doesn't do anything, so the interval turns into 0, which makes it loop without any limit, like for (;;);.

@catap catap requested a review from dmcgowan February 9, 2024 01:02
@catap catap force-pushed the prevent-zero-timer branch 3 times, most recently from 2ed629a to 5153b4e, on February 9, 2024 at 13:03
Comment thread: plugins/gc/scheduler.go (Outdated)
// on this case the algorithm turns in infinity loop which
// simple consume CPU on timer which fires without any delay.
// Use 5ms as fallback value to have interval 245ms for that case.
if avg == 0 {
Member

The time.Duration is actually in nanoseconds, so it should never actually be zero here. Rather than just comparing < 5000000, it probably makes more sense to define a const like const minimumGCTime = float64(5 * time.Millisecond) and use that both for the less-than comparison and for setting avg.

@catap catap force-pushed the prevent-zero-timer branch 2 times, most recently from 6774776 to e365070, on February 10, 2024 at 01:46
@catap
Contributor Author

catap commented Feb 10, 2024 via email

@catap catap requested a review from dmcgowan February 10, 2024 12:09
@catap
Contributor Author

catap commented Feb 14, 2024

@dmcgowan ping here

@dmcgowan
Member

Code change looks good. The comment above it could be cleaned up. Maybe remove the first 3 lines and just have

// Enforce that avg is no less than minimumGCTime to prevent immediate rescheduling

@dmcgowan dmcgowan added this to the 2.0 milestone Feb 14, 2024
@dmcgowan dmcgowan added the cherry-pick/1.6.x and cherry-pick/1.7.x (change to be cherry picked to release/1.7 branch) labels Feb 14, 2024
@catap catap force-pushed the prevent-zero-timer branch from e365070 to a422176 on February 15, 2024 at 10:11
@catap
Contributor Author

catap commented Feb 15, 2024 via email

On startup `gcTimeSum` might run fast and return `0`; in that case the algorithm turns into an infinite loop which simply consumes CPU on a timer that fires without any interval.

Use `5ms` as a fallback to get a `245ms` interval in that case.

Closes: containerd#5089

Signed-off-by: Kirill A. Korinsky <[email protected]>
@catap catap force-pushed the prevent-zero-timer branch from a422176 to c876612 on February 15, 2024 at 10:33
@mxpv mxpv added this pull request to the merge queue Feb 21, 2024
Merged via the queue into containerd:main with commit 67ff3db Feb 21, 2024
@catap catap deleted the prevent-zero-timer branch February 21, 2024 21:48
@catap
Contributor Author

catap commented Apr 22, 2024

@dmcgowan, @mxpv shall it be backported to 1.6 and 1.7?

@dmcgowan
Member

Yes, this one is awaiting backport. Right now the automated backport is not working, but anyone is free to open the backport PRs.

@catap
Contributor Author

catap commented Apr 22, 2024

@dmcgowan here it is!

@dmcgowan dmcgowan added the cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch) and cherry-picked/1.7.x (PR commits are cherry-picked into release/1.7 branch) labels, and removed the cherry-pick/1.6.x and cherry-pick/1.7.x labels, Apr 23, 2024

Labels

cherry-picked/1.6.x (PR commits are cherry-picked into release/1.6 branch)
cherry-picked/1.7.x (PR commits are cherry-picked into release/1.7 branch)
needs-ok-to-test
size/XS

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Uses a lot of CPU when idle

4 participants