fix: escape bug by zekth · Pull Request #181 · fastify/fast-uri

zekth · 2026-06-07T08:24:51Z

The lib was using the deprecated global escape() function to percent-encode
characters in two hot paths: normalizePathEncoding and escapePreservingEscapes. For any non-ASCII input, escape() produces output that is not valid per RFC 3986, and in some cases produces output that silently disagrees with every other URI parser.

The bug

escape() is a leftover from early JavaScript. It does not implement RFC 3986
percent-encoding. See: https://262.ecma-international.org/9.0/index.html#sec-escape-string

The encoding is partly based on the encoding described in RFC 1738, but the entire encoding specified in this standard is described above without regard to the contents of RFC 1738. This encoding does not reflect changes to RFC 1738 made by RFC 3986.

1. BMP characters (code > 0xFF) → non-standard `%uXXXX`

escape('日') === '%u65E5'        // invalid — RFC 3986 only allows %XX
// expected (UTF-8 bytes): '%E6%97%A5'

%uXXXX is a Netscape-era extension. No RFC, no current spec, and most URI
parsers (Node's URL, browsers, decodeURIComponent) either reject it or
treat the % as literal.

2. Latin-1 characters (0x80..0xFF) → wrong byte

escape('é') === '%E9'            // single Latin-1 byte, NOT UTF-8
// expected: '%C3%A9'

This is the most dangerous case. The output looks like a valid percent
escape, so naive consumers don't notice anything wrong. But decodeURIComponent('%E9')
throws URIError: URI malformed, and any downstream HTTP client that re-encodes
the path as UTF-8 will produce a different URI than the one fast-uri returned.

3. Surrogate pairs (emoji, supplementary plane) → two `%uXXXX` halves

escape('😀') === '%uD83D%uDE00'  // two surrogate halves, not a real code point
// expected: '%F0%9F%98%80'

Where it triggered

Anywhere serialize, normalize, or resolve produced output for a URI
containing non-ASCII characters. Examples:

fastUri.serialize({ path: '/café' })
// before: '/caf%E9'              — broken
// after:  '/caf%C3%A9'            — valid UTF-8

fastUri.normalize('http://example.com/日本')
// before: 'http://example.com/%u65E5%u672C'      — invalid
// after:  'http://example.com/%E6%97%A5%E6%9C%AC' — valid UTF-8

fastUri.normalize('http://example.com/🚀')
// before: 'http://example.com/%uD83D%uDE80'      — invalid
// after:  'http://example.com/%F0%9F%9A%80'      — valid UTF-8

ASCII-only paths were unaffected — escape() happens to match UTF-8 for
ASCII, so the bug only fires when a non-ASCII character is present.

Why this matters beyond correctness

The risk is that fast-uri and the next tool in the chain end up with
different ideas of what the URI is. A URL string that contains é
goes into a check function and comes out looking one way; the same string
gets sent over the network by fetch and comes out looking a different
way. To a developer reading the code, both calls "handle the same URL"
— but they don't.

Two concrete examples of how this breaks things:

A safety check that gets fooled. Imagine an app that only allows
outbound requests to https://api.example.com/.... The check uses
fast-uri to parse the URL and compares the host against an allowed
list. An attacker submits a URL containing a Latin-1 character that
fast-uri encodes as %E9. The check sees one string and approves it.
The actual HTTP request, made by fetch, encodes the same character
as %C3%A9 and goes to a different path on the server. The two halves
of the system disagreed and the check did nothing.
A signed URL that stops matching itself. Some APIs sign a URL on
one side and verify the signature on the other. If the signing side
uses fast-uri to canonicalize the URL (producing %E9) and the
verifying side uses a UTF-8-aware parser (producing %C3%A9), the
two sides hash different inputs and the signature check fails — or
worse, a careful attacker can use the gap to forge a URL that passes
on one side but means something different on the other.

In short: when two tools that are supposed to agree on what a URL means
silently disagree, security checks built on top of that agreement stop
working. Fixing the encoder so its output matches what every other tool
produces removes the disagreement.

Benchmarks

Comparison of median throughput on the same machine:

Task	Before (broken `escape`)	After (UTF-8)	Δ
serialize ascii path	1,142,857	1,333,333	+16.7%
serialize ascii path with spaces	1,200,480	1,412,429	+17.7%
serialize cjk path	1,262,626	1,000,000	−20.8%
serialize mixed path	1,090,513	1,091,703	+0.1%
serialize emoji path	1,410,437	1,410,437	0%
serialize long ascii path	107,619	124,347	+15.5%
serialize long utf8 path	165,508	131,148	−20.8%
normalize utf8 uri	727,273	705,716	−3.0%
parse utf8 uri	2,000,000	2,000,000	0%

Tradeoffs

ASCII paths get faster (+15–17%). The precomputed BYTE_HEX lookup
table is cheaper than the deprecated escape() global call, and the
inlined hot path lets V8 specialize better.

Non-ASCII paths get slower in ops/sec (~20% on pure CJK / long UTF-8
inputs). This is not really a regression — the old code was producing
1 char per input char (e.g. %u65E5, 6 chars) while the new code produces
the correct UTF-8 bytes (e.g. %E6%97%A5, 9 chars, ~3× the work). On a
per-byte-of-output basis, the new encoder is actually faster than the old
one. The ops/sec comparison is apples-to-oranges because the two versions
produce different output sizes.

No memory growth. The encoder is fully inlined into the two callers,
adds no per-call allocations (no objects, no arrays, no intermediate
buffers). The only added allocation is the one-time BYTE_HEX[256] table
at module load.

Behavior change for non-ASCII URIs. Anyone who was depending on the
broken %uXXXX output will see different results. Given that the previous
output was not valid per any URI spec, we consider this a bug fix rather
than a breaking change. The major-version stance should be maintainers'
call.

Checklist

run npm run test && npm run benchmark --if-present
tests and/or benchmarks are included
documentation is changed or added
commit message and code follows the Developer's Certification of Origin
and the Code of conduct

gurgunday

lgtm

mcollina

lgtm

fix: escape bug

ef83721

zekth requested a review from Uzlopak June 7, 2026 08:25

gurgunday approved these changes Jun 7, 2026

View reviewed changes

mcollina approved these changes Jun 7, 2026

View reviewed changes

mcollina merged commit 24ef47c into main Jun 7, 2026
44 checks passed

mcollina deleted the fix/escape-bug branch June 7, 2026 13:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: escape bug#181

fix: escape bug#181
mcollina merged 1 commit into
mainfrom
fix/escape-bug

zekth commented Jun 7, 2026

Uh oh!

gurgunday left a comment

Uh oh!

mcollina left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

zekth commented Jun 7, 2026

The bug

1. BMP characters (code > 0xFF) → non-standard %uXXXX

2. Latin-1 characters (0x80..0xFF) → wrong byte

3. Surrogate pairs (emoji, supplementary plane) → two %uXXXX halves

Where it triggered

Why this matters beyond correctness

Benchmarks

Tradeoffs

Checklist

Uh oh!

gurgunday left a comment

Choose a reason for hiding this comment

Uh oh!

mcollina left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. BMP characters (code > 0xFF) → non-standard `%uXXXX`

3. Surrogate pairs (emoji, supplementary plane) → two `%uXXXX` halves