fix GetOutboundIp auto-detection on ipv6-only systems #1356

Merged
mfenniak merged 1 commit from cristicbz/forgejo-runner:try-ipv6-for-outbound into main 2026-02-07 18:19:07 +00:00
Contributor

On an IPv6-only node, the daemon would fail to enable the cache, with

Could not start the cache server, cache will be disabled: unable to determine outbound IP address

and, similarly, exec would just crash on start-up with Error: unable to determine outbound IP address.

The issue stems from GetOutboundIp assuming that 8.8.8.8 being reachable is equivalent to the node having internet access (which, of course, is not true for IPv6-only nodes). In the first commit I changed this to also try the IPv6 address of Google's public DNS, and took the opportunity to fall back to Cloudflare's as well.

A related issue (handled in the second commit) is that artifactcache.StartHandler was always passed "" (forcing auto-detection) even if cache.host was set in the config. I don't claim to fully understand what the different components are, but forwarding cache.host to that seems to work?

Scaleway offers dirt cheap (<1 EUR/month) IPv6-only nodes. I wanted to use some for my runners.

  • other
    • PR: fix GetOutboundIp auto-detection on ipv6-only systems
cristicbz changed title from fix GetOutbooundIp auto-detection on ipv6-only systems to fix GetOutboundIp auto-detection on ipv6-only systems 2026-02-07 15:53:18 +00:00
Owner

I think the IPv6-capable detection seems like a good change. The overall approach to detection is a bit too clever for my liking, but adding additional IPv6 addresses seems like an OK workaround for the gap you've identified. I don't think we can make a practical automated test for this, so this part of the change is good as-is.

The cache config change would be a regression of #1088. There are two different servers being run by the runner; a proxy that the actions connect to, and a cache server that the proxy connects to. You're changing the cache server, which isn't what cfg.Cache.Host is intended & documented to control.

cristicbz force-pushed try-ipv6-for-outbound from 23840a560b
Some checks failed
issue-labels / release-notes (pull_request_target) Successful in 5s
checks / Build Forgejo Runner (pull_request) Has been cancelled
checks / Build unsupported platforms (pull_request) Has been cancelled
checks / runner exec tests (pull_request) Has been cancelled
checks / integration tests (docker-latest) (pull_request) Has been cancelled
checks / integration tests (docker-stable) (pull_request) Has been cancelled
checks / validate mocks (pull_request) Has been cancelled
checks / validate pre-commit-hooks file (pull_request) Has been cancelled
to f2200fee29
Some checks failed
cascade / forgejo (pull_request_target) Has been skipped
cascade / end-to-end (pull_request_target) Has been skipped
cascade / debug (pull_request_target) Has been skipped
issue-labels / release-notes (pull_request_target) Successful in 4s
checks / Build Forgejo Runner (pull_request) Failing after 33s
checks / runner exec tests (pull_request) Has been skipped
checks / Build unsupported platforms (pull_request) Has been skipped
checks / integration tests (docker-stable) (pull_request) Has been skipped
checks / integration tests (docker-latest) (pull_request) Has been skipped
checks / validate pre-commit-hooks file (pull_request) Successful in 47s
checks / validate mocks (pull_request) Failing after 47s
2026-02-07 17:20:08 +00:00
Author
Contributor

@mfenniak thanks for the explanation, I reverted the second commit.

There's something about the overall architecture I don't get: why do any of these internal services need an external IP address rather than just using loopback? Listening on public interfaces seems to extend the attack surface unnecessarily.

Owner

@cristicbz wrote in #1356 (comment):

There's something about the overall architecture i don't get---why any of these internal services need an external IP adress vs being able to just use loopback; it seems to extend the attack surface unnecessarily to listen on public interfaces.

It's complicated. Here's my understanding, which may be incomplete.

The first requirement is to create a proxy network port which can be accessed by a job container. A loopback address wouldn't work for this because localhost would reference the container, not the runner. The runner itself may be within a container, and therefore we need the container's IP address and not an IP from the host.

The "outbound" IP is not necessarily a public IP. It's the IP address of the interface which the default route is applied to; in an IPv4 network that's usually a non-routable RFC1918 address. It isn't really important that it is used for internet access... it serves more of a role of "this is a reasonable default IP address which I can access both from the runner, and from the

The second requirement is to create a cache server network port which can be accessed by the runner (after the proxy is hit). This has different requirements depending on whether the cache server is internal, or external. If it's external, it needs to be accessible from the cache clients (the proxies). If it's internal, it could probably be a loopback address.

Owner

This patch is failing CI as-is and needs a make fmt run on it.

cristicbz force-pushed try-ipv6-for-outbound from f2200fee29
Some checks failed
cascade / forgejo (pull_request_target) Has been skipped
cascade / end-to-end (pull_request_target) Has been skipped
cascade / debug (pull_request_target) Has been skipped
issue-labels / release-notes (pull_request_target) Successful in 4s
checks / Build Forgejo Runner (pull_request) Failing after 33s
checks / runner exec tests (pull_request) Has been skipped
checks / Build unsupported platforms (pull_request) Has been skipped
checks / integration tests (docker-stable) (pull_request) Has been skipped
checks / integration tests (docker-latest) (pull_request) Has been skipped
checks / validate pre-commit-hooks file (pull_request) Successful in 47s
checks / validate mocks (pull_request) Failing after 47s
to 4d573ee4de
All checks were successful
issue-labels / release-notes (pull_request_target) Successful in 9s
checks / Build Forgejo Runner (pull_request) Successful in 33s
checks / validate pre-commit-hooks file (pull_request) Successful in 47s
checks / validate mocks (pull_request) Successful in 52s
checks / Build unsupported platforms (pull_request) Successful in 27s
checks / runner exec tests (pull_request) Successful in 38s
Integration tests for the release process / release-simulation (pull_request) Successful in 5m12s
checks / integration tests (docker-latest) (pull_request) Successful in 13m21s
checks / integration tests (docker-stable) (pull_request) Successful in 14m56s
cascade / debug (pull_request_target) Has been skipped
cascade / end-to-end (pull_request_target) Successful in 9s
cascade / forgejo (pull_request_target) Successful in 33s
2026-02-07 17:41:36 +00:00
Author
Contributor

Done. Sorry, I ran go fmt, which didn't complain; I didn't realize this project had stricter linting requirements.

Owner

Because we're not actually doing anything with the UDP network connection -- it's UDP which is stateless, and nothing is sent, and it's not even to a DNS port 🤣 -- in retrospect adding in Cloudflare's addresses is probably not meaningful. It's really just a hack to get the outbound IP which does nothing but resolve the route table. But I think it has no harm either, so all good. 👍

Author
Contributor

@mfenniak

The "outbound" IP is not necessarily a public IP. It's the IP address of the interface which the default route is applied to; in an IPv4 network that's usually a non-routable RFC1918 address. It isn't really important that it is used for internet access... it serves more of a role of "this is a reasonable default IP address which I can access both from the runner, and from the

I'm not set up to trivially check this on an IPv4 node, but on my little IPv6-only node, at least, it does seem to resolve to the public IP address:

```
root@***:~# ss -lntup | grep forgejo
tcp   LISTEN 0      4096               *:42689            *:*    users:(("forgejo-runner",pid=943,fd=8))
tcp   LISTEN 0      4096               *:33341            *:*    users:(("forgejo-runner",pid=943,fd=7))
```

And I can confirm that if I curl the VPS's public IP address from my machine, I get a 404 Not Found from the services running there.

Don't know if you have an IPv4 node where you can check that assertion empirically, but this seems like a significant vulnerability if I'm right.

Author
Contributor

I can confirm that this also seems to happen on a node with IPv4; do you see something different if you try? I was expecting the runner not to listen on ANY ports on a public interface. I'm running:

  1. Latest forgejo-runner built from main
  2. podman 5.4.2 with rootless containers
  3. No config.yml

OK, this is clearly intended behaviour; I just don't understand how it is safe:

```go
	listener, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) // listen on all interfaces
	if err != nil {
		return nil, err
	}
	server := &http.Server{
		ReadHeaderTimeout: 2 * time.Second,
		Handler:           router,
	}
```

(in both cacheproxy & artifactcache)

I would expect these services to listen on the network's gateway address (i.e. docker network inspect bridge -f "{{ (index .IPAM.Config 0).Gateway }}"); of course, this gets messier with ephemeral networks.

Owner

Ah, I see your concern. The cache handlers always bind to all network interfaces. This isn't directly related to GetOutboundIP; that IP is only used as the address through which the network ports are reached once they're open.

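https://code.forgejo.org/forgejo/runner/src/commit/52767745cc696967439470176195b3b950107679/act/artifactcache/handler.go#L93

```go
listener, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) // listen on all interfaces
```

https://code.forgejo.org/forgejo/runner/src/commit/52767745cc696967439470176195b3b950107679/act/cacheproxy/handler.go#L98

```go
listener, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) // listen on all interfaces
```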

Within the designed scope of the APIs that are exposed for caching, they are fairly safe for exposure -- there should be no way to access cached data without a credential that is embedded into the ACTIONS_CACHE_URL env variable. As a workaround if you want to maximize security against design flaws and vulnerabilities, you can consider a host-based firewall -- a default-DENY incoming connection on your public interface should address the issue, but you'll have to test that it doesn't affect traffic from your containers (or local jobs, if you're using a host-based executor model).

It would be difficult to design a better auto-detection solution that meets the runner's needs, so the only plausible improvement I can see would require administrators to configure the bind address manually. I think adding manual configuration options like that would be reasonable. I wish it could be automatic instead, but I can't imagine a plausible way to do it.
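A manual option could be as simple as an optional bind address that falls back to today's all-interfaces behaviour when unset. The `CacheServerConfig` type and its `BindAddress` field are hypothetical and do not exist in the runner's configuration; this only sketches the shape such an option might take.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// CacheServerConfig is a hypothetical configuration section; BindAddress
// is not an existing runner option.
type CacheServerConfig struct {
	BindAddress string // empty: listen on all interfaces (current behaviour)
	Port        int    // 0: let the kernel choose a free port
}

// listenAddr turns the config into the address string for net.Listen.
// JoinHostPort brackets IPv6 literals such as a ULA gateway address.
func (c CacheServerConfig) listenAddr() string {
	return net.JoinHostPort(c.BindAddress, strconv.Itoa(c.Port))
}

func main() {
	fmt.Println(CacheServerConfig{}.listenAddr())
	fmt.Println(CacheServerConfig{BindAddress: "10.88.0.1", Port: 4000}.listenAddr())
	fmt.Println(CacheServerConfig{BindAddress: "fd00::1", Port: 4000}.listenAddr())
}
```

Keeping the empty value as "all interfaces" means existing setups behave the same unless an administrator opts in.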

This discussion should probably continue in a separate issue if you'd like to file one, in order to make it more searchable for future users.

Member

I believe #229 requests the option to set a bind address.
