Bug Description
Work on the event loop can interrupt the Undici lifecycle for making requests, causing errors to be thrown even when there is no problem with the underlying connection. For example, if a fetch request is started and then work on the event loop takes more than 10 seconds (default connect timeout), Undici will throw a UND_ERR_CONNECT_TIMEOUT error even if the connection could be established very quickly.
I believe what is happening is:
- When the fetch request is started, Undici starts the work to make a connection. Undici calls setTimeout with the value of the connectTimeoutMs to throw an error and cancel the connection if it takes too long (https://github.com/nodejs/undici/blob/main/lib/core/connect.js). It makes a call to GetAddrInfoReqWrap (https://github.com/nodejs/node/blob/main/lib/dns.js#L221), but this is asynchronous, so processing of the callback is delayed until the next iteration of the event loop.
- User tasks block the event loop for a long period of time.
- The onConnectTimeout timer is run because the previous task took longer than the timeout. onConnectTimeout calls setImmediate with a function to destroy the socket and throw the error (https://github.com/nodejs/undici/blob/main/lib/core/connect.js).
- The GetAddrInfoReq lookup callback (emitLookup in node:net) is run. This code begins the TCP connection (internalConnect is called in https://github.com/nodejs/node/blob/main/lib/net.js#L1032), but that is also asynchronous, so it won't finish in this round of the event loop.
- The setImmediate function is run in the next phase, which destroys the socket and throws the UND_ERR_CONNECT_TIMEOUT error.
- Undici never gets a chance to handle the TCP connection response.
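The ordering described in the steps above can be modelled without Undici or the network. The following is a minimal sketch, not Undici's code: `simulateRace`, `blockEventLoop`, and the two constants are hypothetical names, `fs.stat` stands in for the getaddrinfo lookup (both complete on the libuv thread pool and have their callbacks run in the poll phase), and a 1 ms timer stands in for the TCP connect.

```typescript
import { stat } from "node:fs";

// Hypothetical values for illustration; Undici's default connect timeout is 10 s.
const CONNECT_TIMEOUT_MS = 50;
const BLOCK_MS = 200;

// Busy-wait to simulate CPU-bound user code holding the event loop.
function blockEventLoop(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

function simulateRace(): Promise<string[]> {
  return new Promise((resolve) => {
    const events: string[] = [];
    let destroyed = false;

    // Stand-in for the connect timeout timer: when it fires, it defers the
    // "destroy socket" step with setImmediate, as described above.
    setTimeout(() => {
      events.push("timeout fired");
      setImmediate(() => {
        destroyed = true;
        events.push("socket destroyed");
      });
    }, CONNECT_TIMEOUT_MS);

    // Stand-in for the asynchronous lookup. Its callback starts the
    // "connect", which cannot finish in the same loop iteration.
    stat(".", () => {
      events.push("lookup callback");
      setTimeout(() => {
        events.push(destroyed ? "connect ignored" : "connected");
        resolve(events);
      }, 1);
    });

    // User code blocks for longer than the timeout, so the timer is already
    // expired when the loop resumes and runs before the lookup callback.
    blockEventLoop(BLOCK_MS);
  });
}
```

Calling `simulateRace().then(console.log)` is expected to show the timeout firing first, then the lookup callback, then the socket being destroyed, with the final step reporting `connect ignored` even though the simulated connect would have completed within a few milliseconds.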
Internally at Vercel, we have been seeing a high number of these UND_ERR_CONNECT_TIMEOUT issues while pre-rendering pages in our Next.js application. I can't run this task on my local machine, so it's harder to debug, but it's a CPU-intensive task and moving fetch requests to a worker thread eliminated the Undici errors. We tried other suggestions, such as --dns-result-order=ipv4first, and verified that we were not seeing any packet loss, but none of this resolved the issue. Increasing the connect timeout resolves the issue in the reproduction but not in our Next.js build (which I can't explain).
Reproducible By
A minimal reproduction is available at https://github.com/mknichel/undici-connect-timeout-errors.
We can reproduce the behavior on Node 18.x and 20.x, with both Undici 5.24.0 and the latest version (6.19.2).
Expected Behavior
The Undici request lifecycle could operate on a separate thread that does not get blocked by user code. Separating it from the user code would remove the impact of any user code on requests.
To test this theory, we created a dispatcher that proxied the fetch request to a dedicated worker thread (new Worker from worker_threads). This eliminated all the Undici errors that we were seeing in our Next.js build.
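The property this relies on can be shown in isolation. The following is a minimal sketch, not the dispatcher we used: `runInWorkerWhileBlocked`, `blockEventLoop`, and `WORKER_SOURCE` are hypothetical names, and a short timer inside the worker stands in for the proxied fetch. It only demonstrates that a worker's event loop keeps making progress while the main thread is blocked.

```typescript
import { Worker } from "node:worker_threads";

// Worker body as plain JavaScript, evaluated inline via { eval: true }.
// A real proxy would perform the request here; a 10 ms timer stands in for it.
const WORKER_SOURCE = `
  const { parentPort } = require("node:worker_threads");
  setTimeout(() => {
    parentPort.postMessage({ finishedAt: Date.now() });
  }, 10);
`;

// Busy-wait to simulate CPU-bound user code on the main thread.
function blockEventLoop(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin
  }
}

interface Timings {
  workerFinishedAt: number;
  mainUnblockedAt: number;
}

function runInWorkerWhileBlocked(blockMs: number): Promise<Timings> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(WORKER_SOURCE, { eval: true });
    let mainUnblockedAt = 0;

    // The message is queued while the main thread is busy and delivered
    // once it is free again; the timestamp inside it was taken in the worker.
    worker.once("message", (msg: { finishedAt: number }) => {
      resolve({ workerFinishedAt: msg.finishedAt, mainUnblockedAt });
    });
    worker.once("error", reject);

    blockEventLoop(blockMs);
    mainUnblockedAt = Date.now();
  });
}
```

With a block of, say, 500 ms, `workerFinishedAt` should be earlier than `mainUnblockedAt`: the worker's timer ran on its own event loop, and only the delivery of the result had to wait for the main thread.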
Logs & Screenshots
In the minimal reproduction, the error is:
TypeError: fetch failed
    at fetch (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/[email protected]/node_modules/undici/index.js:112:13)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at fetchExample (/Users/mknichel/code/tmp/undici-connect-timeout-errors/index.ts:21:20)
    at main (/Users/mknichel/code/tmp/undici-connect-timeout-errors/index.ts:66:3) {
  [cause]: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/[email protected]/node_modules/undici/lib/core/connect.js:190:24)
      at /Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/[email protected]/node_modules/undici/lib/core/connect.js:133:46
      at Immediate._onImmediate (/Users/mknichel/code/tmp/undici-connect-timeout-errors/node_modules/.pnpm/[email protected]/node_modules/undici/lib/core/connect.js:174:9)
      at process.processImmediate (node:internal/timers:478:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}
In our Next.js builds, the error is:
TypeError: fetch failed
    at node:internal/deps/undici/undici:12618:11
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async s (elided path)
    at async elided path {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:7760:28)
      at node:internal/deps/undici/undici:7716:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:7748:13)
      at process.processImmediate (node:internal/timers:478:21)
      at process.callbackTrampoline (node:internal/async_hooks:130:17) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}
Environment
The reproduction repo was erroring for me on macOS 14.4, while internally we are seeing issues on AWS EC2 Intel machines.
Additional context
Vercel/Next.js users have reported UND_ERR_CONNECT_TIMEOUT issues to us: