Skip to content

Replace api.Rule with types.PolicyEntry#40213

Merged
joestringer merged 5 commits intocilium:mainfrom
TheBeeZee:anp-refactor
Oct 14, 2025
Merged

Replace api.Rule with types.PolicyEntry#40213
joestringer merged 5 commits intocilium:mainfrom
TheBeeZee:anp-refactor

Conversation

@TheBeeZee
Copy link
Copy Markdown
Contributor

Refactor policy engine to use PolicyEntry as the internal representation of policies, as described in CFP-39646.

Fixes: #40015

Refactor policy engine to use PolicyEntry as the internal representation of policies, as described in CFP-39646.

@TheBeeZee TheBeeZee requested review from a team as code owners June 25, 2025 22:57
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 25, 2025
@github-actions github-actions bot added sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. cilium-cli This PR contains changes related with cilium-cli kind/community-contribution This was a contribution made by a community member. labels Jun 25, 2025
@TheBeeZee TheBeeZee marked this pull request as draft June 25, 2025 22:59
@TheBeeZee TheBeeZee force-pushed the anp-refactor branch 7 times, most recently from e7a6663 to 442a26f Compare June 27, 2025 17:01
@squeed squeed self-requested a review July 8, 2025 15:08
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Jul 8, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 8, 2025
@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/ci-e2e-upgrade

1 similar comment
@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/ci-e2e-upgrade

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/ci-kpr

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of minor things that seem out of place but otherwise it seems like this is almost ready to merge.

@joestringer
Copy link
Copy Markdown
Member

Why are the tests failing? Please triage.

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

Why are the tests failing? Please triage.

For ci-ginkgo: #41254

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/test

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

ci-kpr failure: #41974
ci-ginkgo failure: #41254

@joestringer
Copy link
Copy Markdown
Member

ci-ginkgo failure: #41254

That issue is marked as duplicate of #41194, which was fixed on August 18 via #41234. If you're now observing the same symptoms then either that issue was not fixed, or there's a new issue with the same symptoms. Sounds like we need to either reopen or file a fresh issue for this, otherwise we will continue to ignore it. Given that the same test seemed to fail on Sept 30 as well (per #40213 (comment)), this seems to be recurring. Are we sure this is affecting the main branch as well or could there be a regression in this PR?

But wait, the latest failure isn't even failing in the same test as that issue. The symptoms seem different.

I'll re-run the tests to get a better view into whether the tests are reliably failing or not.

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good. I wonder whether we may be missing some optimizations with the aggregatedSelectors stuff being removed, but it is definitely a cleanup. This should also address a long-running tech debt issue #8353 🎉

Last blocker is whether this PR introduces failures in the testsuite. Let's see how the testing results come out.

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/test

@maintainer-s-little-helper
Copy link
Copy Markdown

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

The following ci-integration test flakiness is caused by this PR:

--- FAIL: TestMergeL7PolicyEgressWithMultipleSelectors (0.00s)

    l4_filter_test.go:194: 
        	Error Trace:	/home/runner/work/cilium/cilium/pkg/policy/l4_filter_test.go:194
        	            				/home/runner/work/cilium/cilium/pkg/policy/l4.go:1475
        	            				/home/runner/work/cilium/cilium/pkg/policy/l4_filter_test.go:164
        	            				/home/runner/work/cilium/cilium/pkg/policy/l4_filter_test.go:225
        	            				/home/runner/work/cilium/cilium/pkg/policy/l4_filter_test.go:279
        	            				/home/runner/work/cilium/cilium/pkg/policy/rule_test.go:2455
        	Error:      	Should be true
        	Test:       	TestMergeL7PolicyEgressWithMultipleSelectors
        	Messages:   	Expected: {"priority":101,"http":[{"method":"GET"},{"host":"foo"}]}
        	            	Actual: {"priority":101,"http":[{"host":"foo"},{"method":"GET"}]}

The ordering of L7 rules within PerSelectorPolicy becomes non-deterministic. The current Equal method insists that the order of HTTP rules (and DNS, L7 and Kafka rules, too) match exactly, which is not necessary.

I have added a commit that changes the DeepEqual for L7Rule so that the order of HTTP, FQDN and L7 rules does not matter. I am not changing the DeepEqual for Kafka rules, as they are not causing test flakiness right now, Kafka rules are marked as deprecated and fixing it would require changes to an external package.

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/test

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

The two ci-ginkgo failures appear to be flakes. The first one is failing because the node doesn't become health (filed issue), the second one is failing because eBPF cleanup fails after the test succeeds (filed issue). Both appear to be unrelated to this PR.

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

ci-clustermesh failure looks to be the same as this ci-ginkgo failure: #42125

time=2025-10-10T17:59:29.331812219Z level=error source=/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:805 msg="endpoint regeneration failed" ciliumEndpointName=/ endpointID=3250 containerID="" desiredPolicyRevision=0 k8sPodName=/ datapathPolicyRevision=0 ipv4=11.10.0.58 containerInterface="" identity=4 ipv6="" subsys=endpoint error="loading eBPF collection into the kernel: map cilium_percpu_trace_id: pin map to /sys/fs/bpf/tc/globals/cilium_percpu_trace_id: file exists" (1 occurrences)

The key part being:

loading eBPF collection into the kernel: map cilium_percpu_trace_id: pin map to /sys/fs/bpf/tc/globals/cilium_percpu_trace_id: file exists"

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/test

1 similar comment
@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/test

The recent policy engine refactor introduced non-deterministic ordering of L7Rules within the PerSelectorPolicy struct. This causes test flakiness because the `Equal` method's use of `DeepEqual` is order-sensitive when comparing slices.

To stabilize the comparison, this commit uses `deepequal-gen` annotations to generate a custom `DeepEqual` method that compares L7Rules slices without regard to order. New slice types were introduced for this purpose, as the annotation cannot be applied directly to struct fields.

**Scope:** The Kafka slice is intentionally excluded from this change. Kafka rules are deprecated, and modifying them would require changes to an external package. No test flakiness has been observed for these rules.

Signed-off-by: Blaz Zupan <[email protected]>
@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/test

Copy link
Copy Markdown
Contributor

@viktor-kurchenko viktor-kurchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/ci-ginkgo

1 similar comment
@TheBeeZee
Copy link
Copy Markdown
Contributor Author

/ci-ginkgo

@joestringer
Copy link
Copy Markdown
Member

The more times we run ci-ginkgo and hit the same failure, the less sure I am whether there's no regression here. The failure rate for f14-datapath-service-ns-xdp-2 is high, but the workflow history shows it often passing, and even in the 20 scheduled runs from the last week it still passed around half of those runs. I'm not sure the probability of hitting the same failure this many times in a row against this PR, but I wouldn't expect failures this many times.

I retriggered the target job for the third time just now at this link: https://github.com/cilium/cilium/actions/runs/18501946770 .

@joestringer
Copy link
Copy Markdown
Member

Thanks for your patience and efforts on this @TheBeeZee !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/agent Cilium agent related. cilium-cli This PR contains changes related with cilium-cli kind/community-contribution This was a contribution made by a community member. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.

Projects

Status: Released

Development

Successfully merging this pull request may close these issues.

CFP-39646: Decouple internal policy engine from external API