fix: escape bug#181
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The lib was using the deprecated global
escape()function to percent-encodecharacters in two hot paths:
normalizePathEncodingandescapePreservingEscapes. For any non-ASCII input,escape()produces output that is not valid per RFC 3986, and in some cases produces output that silently disagrees with every other URI parser.The bug
escape()is a leftover from early JavaScript. It does not implement RFC 3986percent-encoding. See: https://262.ecma-international.org/9.0/index.html#sec-escape-string
1. BMP characters (code > 0xFF) → non-standard
%uXXXX%uXXXXis a Netscape-era extension. No RFC, no current spec, and most URIparsers (Node's
URL, browsers,decodeURIComponent) either reject it ortreat the
%as literal.2. Latin-1 characters (0x80..0xFF) → wrong byte
This is the most dangerous case. The output looks like a valid percent
escape, so naive consumers don't notice anything wrong. But
decodeURIComponent('%E9')throws
URIError: URI malformed, and any downstream HTTP client that re-encodesthe path as UTF-8 will produce a different URI than the one
fast-urireturned.3. Surrogate pairs (emoji, supplementary plane) → two
%uXXXXhalvesWhere it triggered
Anywhere
serialize,normalize, orresolveproduced output for a URIcontaining non-ASCII characters. Examples:
ASCII-only paths were unaffected —
escape()happens to match UTF-8 forASCII, so the bug only fires when a non-ASCII character is present.
Why this matters beyond correctness
The risk is that
fast-uriand the next tool in the chain end up withdifferent ideas of what the URI is. A URL string that contains
égoes into a check function and comes out looking one way; the same string
gets sent over the network by
fetchand comes out looking a differentway. To a developer reading the code, both calls "handle the same URL"
— but they don't.
Two concrete examples of how this breaks things:
A safety check that gets fooled. Imagine an app that only allows
outbound requests to
https://api.example.com/.... The check usesfast-urito parse the URL and compares the host against an allowedlist. An attacker submits a URL containing a Latin-1 character that
fast-uriencodes as%E9. The check sees one string and approves it.The actual HTTP request, made by
fetch, encodes the same characteras
%C3%A9and goes to a different path on the server. The two halvesof the system disagreed and the check did nothing.
A signed URL that stops matching itself. Some APIs sign a URL on
one side and verify the signature on the other. If the signing side
uses
fast-urito canonicalize the URL (producing%E9) and theverifying side uses a UTF-8-aware parser (producing
%C3%A9), thetwo sides hash different inputs and the signature check fails — or
worse, a careful attacker can use the gap to forge a URL that passes
on one side but means something different on the other.
In short: when two tools that are supposed to agree on what a URL means
silently disagree, security checks built on top of that agreement stop
working. Fixing the encoder so its output matches what every other tool
produces removes the disagreement.
Benchmarks
Comparison of median throughput on the same machine:
escape)Tradeoffs
ASCII paths get faster (+15–17%). The precomputed
BYTE_HEXlookuptable is cheaper than the deprecated
escape()global call, and theinlined hot path lets V8 specialize better.
Non-ASCII paths get slower in ops/sec (~20% on pure CJK / long UTF-8
inputs). This is not really a regression — the old code was producing
1 char per input char (e.g.
%u65E5, 6 chars) while the new code producesthe correct UTF-8 bytes (e.g.
%E6%97%A5, 9 chars, ~3× the work). On aper-byte-of-output basis, the new encoder is actually faster than the old
one. The ops/sec comparison is apples-to-oranges because the two versions
produce different output sizes.
No memory growth. The encoder is fully inlined into the two callers,
adds no per-call allocations (no objects, no arrays, no intermediate
buffers). The only added allocation is the one-time
BYTE_HEX[256]tableat module load.
Behavior change for non-ASCII URIs. Anyone who was depending on the
broken
%uXXXXoutput will see different results. Given that the previousoutput was not valid per any URI spec, we consider this a bug fix rather
than a breaking change. The major-version stance should be maintainers'
call.
Checklist
npm run test && npm run benchmark --if-presentand the Code of conduct