Skip to content

fix: escape bug#181

Merged
mcollina merged 1 commit into
mainfrom
fix/escape-bug
Jun 7, 2026
Merged

fix: escape bug#181
mcollina merged 1 commit into
mainfrom
fix/escape-bug

Conversation

@zekth

@zekth zekth commented Jun 7, 2026

Copy link
Copy Markdown
Member

The lib was using the deprecated global escape() function to percent-encode
characters in two hot paths: normalizePathEncoding and escapePreservingEscapes. For any non-ASCII input, escape() produces output that is not valid per RFC 3986, and in some cases produces output that silently disagrees with every other URI parser.

The bug

escape() is a leftover from early JavaScript. It does not implement RFC 3986
percent-encoding. See: https://262.ecma-international.org/9.0/index.html#sec-escape-string

The encoding is partly based on the encoding described in RFC 1738, but the entire encoding specified in this standard is described above without regard to the contents of RFC 1738. This encoding does not reflect changes to RFC 1738 made by RFC 3986.

1. BMP characters (code > 0xFF) → non-standard %uXXXX

escape('日') === '%u65E5'        // invalid — RFC 3986 only allows %XX
// expected (UTF-8 bytes): '%E6%97%A5'

%uXXXX is a Netscape-era extension. No RFC, no current spec, and most URI
parsers (Node's URL, browsers, decodeURIComponent) either reject it or
treat the % as literal.

2. Latin-1 characters (0x80..0xFF) → wrong byte

escape('é') === '%E9'            // single Latin-1 byte, NOT UTF-8
// expected: '%C3%A9'

This is the most dangerous case. The output looks like a valid percent
escape, so naive consumers don't notice anything wrong. But decodeURIComponent('%E9')
throws URIError: URI malformed, and any downstream HTTP client that re-encodes
the path as UTF-8 will produce a different URI than the one fast-uri returned.

3. Surrogate pairs (emoji, supplementary plane) → two %uXXXX halves

escape('😀') === '%uD83D%uDE00'  // two surrogate halves, not a real code point
// expected: '%F0%9F%98%80'

Where it triggered

Anywhere serialize, normalize, or resolve produced output for a URI
containing non-ASCII characters. Examples:

fastUri.serialize({ path: '/café' })
// before: '/caf%E9'              — broken
// after:  '/caf%C3%A9'            — valid UTF-8

fastUri.normalize('http://example.com/日本')
// before: 'http://example.com/%u65E5%u672C'      — invalid
// after:  'http://example.com/%E6%97%A5%E6%9C%AC' — valid UTF-8

fastUri.normalize('http://example.com/🚀')
// before: 'http://example.com/%uD83D%uDE80'      — invalid
// after:  'http://example.com/%F0%9F%9A%80'      — valid UTF-8

ASCII-only paths were unaffected — escape() happens to match UTF-8 for
ASCII, so the bug only fires when a non-ASCII character is present.

Why this matters beyond correctness

The risk is that fast-uri and the next tool in the chain end up with
different ideas of what the URI is
. A URL string that contains é
goes into a check function and comes out looking one way; the same string
gets sent over the network by fetch and comes out looking a different
way. To a developer reading the code, both calls "handle the same URL"
— but they don't.

Two concrete examples of how this breaks things:

  • A safety check that gets fooled. Imagine an app that only allows
    outbound requests to https://api.example.com/.... The check uses
    fast-uri to parse the URL and compares the host against an allowed
    list. An attacker submits a URL containing a Latin-1 character that
    fast-uri encodes as %E9. The check sees one string and approves it.
    The actual HTTP request, made by fetch, encodes the same character
    as %C3%A9 and goes to a different path on the server. The two halves
    of the system disagreed and the check did nothing.

  • A signed URL that stops matching itself. Some APIs sign a URL on
    one side and verify the signature on the other. If the signing side
    uses fast-uri to canonicalize the URL (producing %E9) and the
    verifying side uses a UTF-8-aware parser (producing %C3%A9), the
    two sides hash different inputs and the signature check fails — or
    worse, a careful attacker can use the gap to forge a URL that passes
    on one side but means something different on the other.

In short: when two tools that are supposed to agree on what a URL means
silently disagree, security checks built on top of that agreement stop
working. Fixing the encoder so its output matches what every other tool
produces removes the disagreement.

Benchmarks

Comparison of median throughput on the same machine:

Task Before (broken escape) After (UTF-8) Δ
serialize ascii path 1,142,857 1,333,333 +16.7%
serialize ascii path with spaces 1,200,480 1,412,429 +17.7%
serialize cjk path 1,262,626 1,000,000 −20.8%
serialize mixed path 1,090,513 1,091,703 +0.1%
serialize emoji path 1,410,437 1,410,437 0%
serialize long ascii path 107,619 124,347 +15.5%
serialize long utf8 path 165,508 131,148 −20.8%
normalize utf8 uri 727,273 705,716 −3.0%
parse utf8 uri 2,000,000 2,000,000 0%

Tradeoffs

ASCII paths get faster (+15–17%). The precomputed BYTE_HEX lookup
table is cheaper than the deprecated escape() global call, and the
inlined hot path lets V8 specialize better.

Non-ASCII paths get slower in ops/sec (~20% on pure CJK / long UTF-8
inputs).
This is not really a regression — the old code was producing
1 char per input char (e.g. %u65E5, 6 chars) while the new code produces
the correct UTF-8 bytes (e.g. %E6%97%A5, 9 chars, ~3× the work). On a
per-byte-of-output basis, the new encoder is actually faster than the old
one. The ops/sec comparison is apples-to-oranges because the two versions
produce different output sizes.

No memory growth. The encoder is fully inlined into the two callers,
adds no per-call allocations (no objects, no arrays, no intermediate
buffers). The only added allocation is the one-time BYTE_HEX[256] table
at module load.

Behavior change for non-ASCII URIs. Anyone who was depending on the
broken %uXXXX output will see different results. Given that the previous
output was not valid per any URI spec, we consider this a bug fix rather
than a breaking change. The major-version stance should be maintainers'
call.

Checklist

@zekth zekth requested a review from Uzlopak June 7, 2026 08:25

@gurgunday gurgunday left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mcollina mcollina left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mcollina mcollina merged commit 24ef47c into main Jun 7, 2026
44 checks passed
@mcollina mcollina deleted the fix/escape-bug branch June 7, 2026 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants