Skip to content
This repository was archived by the owner on Jan 19, 2026. It is now read-only.

fix: Correctly create PolicyEvaluatorPre and consume eval timeouts#1284

Merged
viccuad merged 6 commits intokubewarden:mainfrom
viccuad:fix/timeout
Sep 23, 2025
Merged

fix: Correctly create PolicyEvaluatorPre and consume eval timeouts#1284
viccuad merged 6 commits intokubewarden:mainfrom
viccuad:fix/timeout

Conversation

@viccuad
Copy link
Copy Markdown
Member

@viccuad viccuad commented Sep 19, 2025

Description

Fix #1270.
With this, both integration tests and coverage (which run integration tests) should not be flaky anymore.

This PR fixes 2 bugs:

  1. Ensure we enable eval timeout even if we don't have policy-server global timeout.

  2. Before, we were creating a PolicyEvaluator per compiled module.
    This is not enough, as each policy can now have a different
    spec.timeoutEvalSeconds.
    This bug made the timeout integration test fail in a flaky manner.
    The integration tests have 2 policies, sleep and sleep-1s-timeout.
    They are saved in a hashmap (nondeterministic order). If on the run the
    sleep policy was registered first, the sleep-1s-timeout would reuse
    the sleep PolicyEvaluator, which meant that it didn't have the
    spec.timeoutEvalSeconds assigned and the test would fail.
    To fix this:

    • Extend the check when registering and creating a PolicyEvaluatorPre
      from only the module to the module + policy_evaluation_limit_seconds.
      Add new PolicyEvaluatorPreKey{} with those values as key for the
      hashmap of policies and PolicyEvaluatorPre.
    • Add module_digest_and_eval_timeout_to_policy_evaluator_pre, needed
      to obtain the eval timeout used in the key when we are rehydrating.

Test

Extended existing unit test for creation of PolicyEvaluators.
Use existing integration tests for the timeoutEvalSeconds feature.
Add configurable logging output to integration tests.

Additional Information

Tradeoff

Irrelevant space cost increase of +N on worst case, N: number of policies. As we have a new hashmap of policy_id, eval timeout.

Potential improvement

Maybe drop the commit that adds configurable logging output to integration tests (defaults to INFO) if it slows test runs.

@viccuad viccuad requested a review from a team as a code owner September 19, 2025 12:25
@viccuad viccuad changed the title fix: Correctly create PolicyEvaluatorPre and consume timeouts for spec.timeoutEvalSeconds feature fix: Correctly create PolicyEvaluatorPre and consume eval timeouts Sep 19, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 20.00000% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 26.61%. Comparing base (d9504b9) to head (82cec0f).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
src/evaluation/evaluation_environment.rs 33.33% 8 Missing ⚠️
src/lib.rs 0.00% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1284      +/-   ##
==========================================
- Coverage   26.62%   26.61%   -0.02%     
==========================================
  Files          16       16              
  Lines        1044     1052       +8     
==========================================
+ Hits          278      280       +2     
- Misses        766      772       +6     
Flag Coverage Δ
unit-tests 26.61% <20.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@viccuad
Copy link
Copy Markdown
Member Author

viccuad commented Sep 19, 2025

Dropped the commit adding logging to the integration tests, which slowed them in CI and made them fail.

@flavio
Copy link
Copy Markdown
Member

flavio commented Sep 20, 2025

@viccuad you approach works, but it leads to more memory being used. That's because, when multiple policies share the same module, but have different timeout values, we end up creating multiple "pre" objects. This leads to more memory being used by the policy server.

I've looked into the codebase and I discovered that the evaluation timeout doesn't actually need to be bound to the "pre" objects. Instead, it can be provided to a "pre" object at rehydration time.

This required quite some changes across different projects:

Stuff left to be done:

  • update kwctl to consume the new version of policy evaluator. The API changed, we will have to do some small tweaks. I don't think this is going to be hard, since we don't use epoch deadline in there (AFAIK)
  • merge and tag all the crates involved

I'll be traveling next week, but I'll find time to review your comments. Please, ping me on slack whenever you leave a message on one of these PRs

Before, we were creating a PolicyEvaluator per compiled module.
This is not enough, as each policy can now have a different
spec.timeoutEvalSeconds.

This bug made the timeout integration test fail in a flaky manner.
The integration tests have 2 policies, `sleep` and `sleep-1s-timeout`.
They are saved in a hashmap (nondeterministic order). If on the run the
`sleep` policy was registered first, the `sleep-1s-timeout` would reuse
the `sleep` PolicyEvaluator, which meant that it didn't have the
spec.timeoutEvalSeconds assigned and the test would fail.

To fix this:
- Extend the check when registering and creating a PolicyEvaluatorPre
  from only the module to the module + policy_evaluation_limit_seconds.
  Add new `PolicyEvaluatorPreKey{}` with those values as key for the
  hashmap of policies and PolicyEvaluatorPre.
- Add `module_digest_and_eval_timeout_to_policy_evaluator_pre`, needed
  to obtain the eval timeout used in the key when we are rehydrating.

Signed-off-by: Víctor Cuadrado Juan <[email protected]>
Ensure we enable eval timeout even if we don't have policy-server global timeout.

Signed-off-by: Víctor Cuadrado Juan <[email protected]>
@flavio
Copy link
Copy Markdown
Member

flavio commented Sep 20, 2025

I just did a force push because I had to solve some conflicts between this branch and main.

Copy link
Copy Markdown
Member Author

@viccuad viccuad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM, thanks! (can't approve, I'm also author)

Copy link
Copy Markdown
Member

@jvanz jvanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering all the changes related to this fix (all the other PRs opened by Flavio), the code looks fine. But there is a linter error that we can fix to make the CI green.

@viccuad
Copy link
Copy Markdown
Member Author

viccuad commented Sep 22, 2025

The lint fix would be the same as this one:
d83a9bd
We can either port it or rebase after merging that one.

The prior commit that fixes kubewarden#1270
causes a higher memory usage. That happens because, if the same module
is used by two different policies, each one of them with different epoch deadlines,
two "pre" objects would be created and kept in memory.

This happens because the epoch deadline is bound to the "pre" object.

This limitation has been changed with PR
kubewarden/policy-evaluator#766.

This commit uses the new API introduced by the PR linked above, allowing
to go back to the old memory consumption.

Signed-off-by: Flavio Castelli <[email protected]>
Signed-off-by: Flavio Castelli <[email protected]>
@flavio
Copy link
Copy Markdown
Member

flavio commented Sep 23, 2025

I've updated the PR to use officially tagged crates, plus I've fixed the linter error in the way as @viccuad suggested

@viccuad
Copy link
Copy Markdown
Member Author

viccuad commented Sep 23, 2025

Thanks, merging!

@viccuad viccuad merged commit 92cd9c0 into kubewarden:main Sep 23, 2025
11 of 14 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flaky tests of timeoutSeconds and timeoutEvalSeconds

3 participants