fix GetOutboundIp auto-detection on ipv6-only systems #1356
On an IPv6-only node, the `daemon` would fail to enable the cache, and, similarly, `exec` would crash on start-up with `Error: unable to determine outbound IP address.`

The issue stems from `GetOutboundIp` assuming that 8.8.8.8 being reachable is equivalent to the node having internet access (which is of course not true for IPv6-only nodes). I changed this (in the first commit) to also try the IPv6 address of Google's public DNS, and took the opportunity to fall back to Cloudflare's as well.

A related issue (handled in the second commit) is that `artifactcache.StartHandler` was always passed `""` (forcing auto-detection) even if `cache.host` was set in the config. I don't claim to fully understand what the different components are, but forwarding `cache.host` to it seems to work?

Scaleway offers dirt-cheap (<1 EUR/month) IPv6-only nodes, and I wanted to use some for my runners.
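The multi-address fallback described above might look roughly like this (a sketch, not the runner's actual code; the helper name and candidate list are illustrative):

```go
package main

import (
	"fmt"
	"net"
)

// getOutboundIP "dials" a UDP socket towards each candidate address in turn
// and returns the local address the kernel selects. No packet is ever sent;
// the dial only consults the routing table, so the first candidate whose
// address family actually has a route wins.
func getOutboundIP(candidates []string) (net.IP, error) {
	var lastErr error
	for _, addr := range candidates {
		conn, err := net.Dial("udp", addr)
		if err != nil {
			lastErr = err // e.g. "network is unreachable" for an IPv4 target on an IPv6-only node
			continue
		}
		ip := conn.LocalAddr().(*net.UDPAddr).IP
		conn.Close()
		return ip, nil
	}
	return nil, fmt.Errorf("unable to determine outbound IP address: %w", lastErr)
}

func main() {
	ip, err := getOutboundIP([]string{
		"8.8.8.8:53",                // Google public DNS, IPv4
		"[2001:4860:4860::8888]:53", // Google public DNS, IPv6
		"1.1.1.1:53",                // Cloudflare, IPv4
		"[2606:4700:4700::1111]:53", // Cloudflare, IPv6
	})
	fmt.Println(ip, err)
}
```

On an IPv6-only node the first dial fails with a route error and the IPv6 candidate is used instead, which is the gap the first commit closes.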
Changed title from *fix GetOutbooundIp auto-detection on ipv6-only systems* to *fix GetOutboundIp auto-detection on ipv6-only systems*.

I think the IPv6-capable detection seems like a good change. The overall approach to detection is a bit too clever for my liking, but adding additional IPv6 addresses seems like an OK workaround for the gap you've identified. I don't think we can make a practical automated test for this, so this part of the change is good as-is.
The cache config change would be a regression of #1088. There are two different servers being run by the runner: a proxy that the actions connect to, and a cache server that the proxy connects to. You're changing the cache server, which isn't what `cfg.Cache.Host` is intended & documented to control.

@mfenniak thanks for the explanation, I reverted the second commit.
There's something about the overall architecture I don't get: why do any of these internal services need an external IP address rather than just using loopback? It seems to extend the attack surface unnecessarily to listen on public interfaces.
@cristicbz wrote in #1356 (comment):
It's complicated. Here's my understanding, which may be incomplete.
The first requirement is to create a proxy network port which can be accessed by a job container. A loopback address wouldn't work for this because `localhost` would reference the container, not the runner. The runner itself may be within a container, and therefore we need the container's IP address and not an IP from the host.

The "outbound" IP is not necessarily a public IP. It's the IP address of the interface which the default route is applied to; in an IPv4 network that's usually a non-routable RFC 1918 address. It isn't really important that it is used for internet access... it serves more the role of "this is a reasonable default IP address which I can access both from the runner and from the job containers".

The second requirement is to create a cache server network port which can be accessed by the runner (after the proxy is hit). This has different requirements depending on whether the cache server is internal or external. If it's external, it needs to be accessible from the cache clients (the proxies). If it's internal, it could probably be a loopback address.
This patch is failing CI as-is and needs a `make fmt` run on it.

Done. Sorry, I ran `go fmt`, which didn't complain; I didn't realize this had stricter linting requirements.

Because we're not actually doing anything with the UDP network connection -- it's UDP, which is stateless, and nothing is sent, and it's not even to a DNS port 🤣 -- in retrospect adding in Cloudflare's addresses is probably not meaningful. It's really just a hack to get the outbound IP which does nothing but resolve the route table. But I think it has no harm either, so all good. 👍
@mfenniak
I'm not set up to trivially check this for an IPv4 node, but for my little IPv6-only node at least, it does seem to resolve to the public IP address.

And I can confirm that if I `curl` the VPS's public IP address from my machine, I get a 404 Not Found from the services running on there. I don't know if you have an IPv4 node where you can check that assertion empirically, but this seems like a significant vulnerability if I'm right.
I can confirm that this seems to happen on a node with IPv4 as well; do you see something else if you try? I was expecting the runner not to listen on ANY ports on a public interface. I'm running:
- `forgejo-runner` built from `main`
- `config.yml`

Ok. This is clearly intended behaviour, I just don't understand how this is safe:

```go
listener, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) // listen on all interfaces
```

(in both cacheproxy & artifactcache)

I would expect these services to listen on the network's gateway (i.e. `docker network inspect bridge -f "{{ (index .IPAM.Config 0).Gateway }}"`); of course this gets messier with ephemeral networks.

Ah, I see your concern. The cache handlers always bind themselves on all network interfaces. This doesn't relate to `GetOutboundIP` directly; that IP is just used as an address that can be used to access the network ports once they're open.

Within the designed scope of the APIs that are exposed for caching, they are fairly safe for exposure -- there should be no way to access cached data without a credential that is embedded into the `ACTIONS_CACHE_URL` env variable. As a workaround, if you want to maximize security against design flaws and vulnerabilities, you can consider a host-based firewall -- a default-DENY rule for incoming connections on your public interface should address the issue, but you'll have to test that it doesn't affect traffic from your containers (or local jobs, if you're using a host-based executor model).

It would be difficult to design a better auto-detection solution that would meet the runner's needs, so the only plausible direction to me to improve this would be to let administrators manually configure the bind address. I think adding manual configuration options like that would be reasonable. I wish it could be automatic instead, but I can't imagine a plausible way.
This discussion should probably continue in a separate issue if you'd like to file one, in order to make it more searchable for future users.
I believe #229 requests the option to set a bind address.