Skip to content

multi-pool: add support for excluding IP pools from masquerade#40132

Merged
gandro merged 1 commit intocilium:mainfrom
hassenius:addsnatexclusion
Dec 16, 2025
Merged

multi-pool: add support for excluding IP pools from masquerade#40132
gandro merged 1 commit intocilium:mainfrom
hassenius:addsnatexclusion

Conversation

@hassenius
Copy link
Copy Markdown

@hassenius hassenius commented Jun 19, 2025

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

This adds a new flag only-masquerade-default-pool which disables masquerading for pods which have IPs allocated non-default pools.

A per-pool annotation (ipam.cilium.io/skip-masquerade) is also added to configure this behaviour in a more granular way.

This can be particularly useful when using the newly supported combination of multi-pool ipam and tunneling, where you may have the default pool be tunnelled, and additional pools handled natively (for example in conjunction with bgp control plane)

This feature is supported with a new stateDB table for CiliumPodIPPools. At IP allocation time, the chosen pool is looked up on stateDB to look for the skip-masquerade annotation.

Fixes: #40131

@hassenius hassenius requested review from a team as code owners June 19, 2025 14:55
@hassenius hassenius requested review from bimmlerd, derailed and ti-mo June 19, 2025 14:55
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 19, 2025
@hassenius hassenius requested a review from gentoo-root June 19, 2025 14:55
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jun 19, 2025
@hassenius hassenius marked this pull request as draft June 19, 2025 15:57
@hassenius
Copy link
Copy Markdown
Author

I have tried this on a local build, and it functionally works for our use case.
Probably need more tests cover and documentation but I'd like some feedback and thoughts on high level approach, and what is there so far.

@joestringer joestringer added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jun 20, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 20, 2025
@hassenius hassenius force-pushed the addsnatexclusion branch 2 times, most recently from c52d715 to da8d87d Compare June 22, 2025 09:35
@hassenius
Copy link
Copy Markdown
Author

Thanks for the review so far @bimmlerd

I've made an attempt at moving the configuration into the iptables cell as requested, and I've updated the iptables implementation such that it adds a rule at the top of the CILIUM_POST_nat chain. This also opens the possibility of support an array (i.e. comma separated list of CIDR) as input, which would be useful for this feature, but not a pattern I remember having seen in cilium yet. Would appreciate thoughts on this.

For the bpf implementation I'm unsure how to pass this. Would really appreciate some suggestions.

@ti-mo ti-mo removed request for a team and ti-mo June 23, 2025 12:06
@alimehrabikoshki
Copy link
Copy Markdown
Contributor

Hi @bimmlerd

I added support for multiple CIDRs to the flag (comma-separated) and changed the BPF implementation to use LPM trie maps.

I'd appreciate a review on the way the config is done now, and where the map update logic is put in loader.go.

The ip-masq-agent BPF map lookup was the same as ours so I made a common function for that as well.

I've also found that by default the iptables implementation doesn't masquerade the additional IP pools, but leaving in our iptables implementation anyway to be more explicit about it.

Thanks

@bimmlerd
Copy link
Copy Markdown
Member

bimmlerd commented Jun 24, 2025

Administrative note: we're in feature freeze at the moment - since this is new it'll not be merged until 1.18 is branched. Freeze usually also means that review of bugfixes is prioritized at the cost of feature PRs such as this one, might take a bit of patience 🙏

that said, had a brief look and it looks better - but the CODEOWNERs will give detailed reviews

@joestringer joestringer removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jul 1, 2025
@hahasheminejad
Copy link
Copy Markdown

hahasheminejad commented Jul 10, 2025

We're encountering a similar issue where a subnet advertised via BGP (used as a secondary Cilium IP pool) requires internet access through an external firewall that performs sNAT. While this PR is currently on hold (or is there a chance it could be merged soon?), we're wondering if it's possible to specify a wildcard range in ipMasqAgent.config.nonMasqueradeCIDRs to represent all internet ranges except a subnet such as 10.0.0.0/8?

Specifying 0.0.0.0/0 in nonMasqueradeCIDRs disables NAT for all pods, which causes issues - specifically, cluster workloads can't reach external services anymore since node-level masquerading is no longer applied.

@hassenius Are there any known workarounds to achieve this behavior in the meantime? Also, is there an estimated timeline for when this PR might be merged or released?

@alimehrabikoshki
Copy link
Copy Markdown
Contributor

Are there any known workarounds to achieve this behavior in the meantime? Also, is there an estimated timeline for when this PR might be merged or released?

The workaround we will use until this is merged is to revert back to iptables masq - so no BPF-based masquerade. Since this is a new feature I imagine it will go into 1.19.

@hahasheminejad
Copy link
Copy Markdown

Are there any known workarounds to achieve this behavior in the meantime? Also, is there an estimated timeline for when this PR might be merged or released?

The workaround we will use until this is merged is to revert back to iptables masq - so no BPF-based masquerade. Since this is a new feature I imagine it will go into 1.19.

Thank you.

We have come up with this nonMasqueradeCIDR list to exclude our internal subnet (10.0.0.0/8). It's been working fine so far 😊

nonMasqueradeCIDRs:
      - "0.0.0.0/5"
      - "8.0.0.0/7"
      - "11.0.0.0/8"
      - "12.0.0.0/6"
      - "16.0.0.0/4"
      - "32.0.0.0/3"
      - "64.0.0.0/2"
      - "128.0.0.0/1"

@alimehrabikoshki
Copy link
Copy Markdown
Contributor

alimehrabikoshki commented Nov 24, 2025

@julianwiedmann & @brb I've re-done the PR to implement both configuration knobs that I mentioned, --only-masquerade-default-pool as a global configuration to skip masquerade for all non-default pools, and ipam.cilium.io/skip-masquerade as a per-pool annotation. I've not run it on a cluster yet to test (waiting for the Push dev charts CI to run) but I'd love to hear some feedback on the general direction.

edit: I did a cursory look, fixed a couple silly bugs, but overall behaviour looks like its working. I only tested the basic happy path as it's quite late so I'll keep testing over the next few days.

k8s-node-2> db/show podippools
Name      v4CIDRs                                                                               v4MaskSize   v6CIDRs   v6MaskSize   Flags
default   10.233.64.0/18                                                                        24                     -
other     10.235.200.0/24, 10.235.201.0/24, 10.235.202.0/24, 10.235.203.0/24, 10.235.204.0/24   32                     -            SkipMasquerade=true

This is the stateDB table - the CIDRs field might get quite unwieldy on larger pools so I might remove that or limit how many are shown here.

@thehonker
Copy link
Copy Markdown

thehonker commented Nov 25, 2025

  • i can confirm that this does indeed work as expected via the annotation method for tcp and udp traffic (as far as i have tested it) using quay.io/cilium/cilium-ci:915be6b1ac6ce2f467fae807f75841549353f9fe on top of rke2
  • we have clusters and workloads this can be tested against further if y'all need people to test things irl
  • thank you for implementing this, it'll allow me to eliminate the very last of our docker-compose based infra <3

@alimehrabikoshki
Copy link
Copy Markdown
Contributor

alimehrabikoshki commented Nov 26, 2025

Looked into the failing test "K8s Network E2E tests (dual)" - looks to be timing out on endpoint creation. Found some other pipelines from unrelated MRs with the same issue (I downloaded the logs and checked):
https://github.com/cilium/cilium/actions/runs/19664117723/job/56316862487
https://github.com/cilium/cilium/actions/runs/19672493789/job/56344795734
https://github.com/cilium/cilium/actions/runs/19461327067/job/55686038082

Could we give it a retry?

Also thanks @thehonker - appreciate you testing it out

Copy link
Copy Markdown
Member

@fristonio fristonio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cilium/endpoint: Looks good 🚀
One question to clarify my understanding of the behavior.

Copy link
Copy Markdown
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fast turnaround! I like that we're now getting much closer to a proposal how we expect folks to actually use this feature. And we don't require them to push down magic IP ranges into the datapath :).

Let's see how IPAM folks feel about this.

Copy link
Copy Markdown
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I just looked at the docs, I've got minor style suggestions (inline below).

@qmonnet
Copy link
Copy Markdown
Member

qmonnet commented Dec 1, 2025

/test

Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level, I think this is a solid approach. Structure-wise, I've left some feedback for sig-ipam

@alimehrabikoshki
Copy link
Copy Markdown
Contributor

@julianwiedmann if you are happy with the latest changes, could you remove the dont-merge/discussion label?

Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPAM-wise this looks good to me!

@gandro
Copy link
Copy Markdown
Member

gandro commented Dec 9, 2025

/test

Copy link
Copy Markdown
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a few minor comments inline.

@alimehrabikoshki
Copy link
Copy Markdown
Contributor

Thanks a lot @julianwiedmann, appreciate the review (and the idea for this approach). I've removed the unnecessary add_world_entry and set_identity_mark calls from the tests and added two additional tests to verify behaviour without the SKIP_MASQ flag

@devodev
Copy link
Copy Markdown
Contributor

devodev commented Dec 9, 2025

/test

This adds a new flag only-masquerade-default-pool which disables masquerading for pods which have IPs allocated non-default pools.

A per-pool annotation (ipam.cilium.io/skip-masquerade) is also added to configure this behaviour in a more granular way.

This can be particularly useful when using the newly supported combination of multi-pool ipam and tunneling, where you may have the default pool be tunnelled, and additional pools handled natively (for example in conjunction with bgp control plane)

This feature is supported with a new stateDB table for CiliumPodIPPools. At IP allocation time, the chosen pool is looked up on stateDB to look for the skip-masquerade annotation.

Signed-off-by: alimehrabikoshki <[email protected]>
@gandro
Copy link
Copy Markdown
Member

gandro commented Dec 16, 2025

/test

@julianwiedmann
Copy link
Copy Markdown
Member

Looks like we have CI fallout in the multipool workflow from this change, although it's not clear to me why CI on the PR itself didn't catch it:

example

time=2025-12-16T13:12:45.808654792Z level=error source=/go/src/github.com/cilium/cilium/vendor/github.com/cilium/hive/cell/lifecycle.go:129 msg="Start hook failed" function="cmd.daemonLegacyInitialization.func1 (cmd/daemon_main.go:1294) (agent.controlplane.daemon)" error="daemon configuration failed: Unable to allocate router IP for family ipv4: IP pool 'default' not found in stateDB table" (1 occurrences)

@gandro
Copy link
Copy Markdown
Member

gandro commented Dec 16, 2025

Seems like it's a race. But yeah, indeed strange that we didn't catch it in this PR. We need to check in multiPoolManager.waitForAllPools if the StateDB table also contains the pool, before we return from NewIPAM

@gandro
Copy link
Copy Markdown
Member

gandro commented Dec 17, 2025

For posterity: The CI issue has been fixed: #43380

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/ipam IP address management, including cloud IPAM area/multipool Affects Multi-Pool IPAM feature/bpf-masquerading feature/snat Relates to SNAT or Masquerading of traffic kind/community-contribution This was a contribution made by a community member. release-note/misc This PR makes changes that have no direct user impact.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

CFP: Allow opting CiliumPodIPPools out from SNAT / masquerading