Response bodies are bytes, but the algorithm (as of #1107) uses a JS string as a response body.
Black box testing plan
At #1107 (comment) Boris gave a test plan that would allow us to figure out the string -> byte conversion in a black box way:
In terms of test matrix, if we need to determine this in a black-box way, it seems to me that the following are somewhat useful cases to test:
1. Return string is all ASCII (charCodeAt() < 128 for all indices).
2. Return string has charCodeAt() < 256 for all indices, but does not fall into case 1.
3. Return string does not have any charCodeAt() values corresponding to UTF-16 surrogate code unit values, but does not fall into cases 1 or 2.
4. Return string has surrogate code units which are all paired properly.
5. Return string has unpaired surrogate code units.
Each of these should be tested in situations in which the source of the javascript: URL is either UTF-8 or ISO-8859-1/Windows-1252. That is, either an iframe in a document with that encoding with src pointing to a javascript: URL, or a link in a document with that encoding with href pointing to a javascript: URL. Probably test both scenarios.
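To make the five cases concrete, here is a minimal sketch of a classifier that maps a string to its case number in the list above. The function names (`classify`, `isHighSurrogate`, `isLowSurrogate`) and the sample strings are illustrative assumptions, not part of the original test plan.

```javascript
// Hypothetical helpers for sorting candidate return strings into the
// five cases of the test matrix. Names are illustrative only.
function isHighSurrogate(u) { return u >= 0xD800 && u <= 0xDBFF; }
function isLowSurrogate(u) { return u >= 0xDC00 && u <= 0xDFFF; }

function classify(str) {
  let max = 0;
  let sawSurrogate = false;
  let sawUnpaired = false;
  for (let i = 0; i < str.length; i++) {
    const u = str.charCodeAt(i);
    if (u > max) max = u;
    if (isHighSurrogate(u)) {
      sawSurrogate = true;
      if (i + 1 < str.length && isLowSurrogate(str.charCodeAt(i + 1))) {
        i++; // properly paired: skip the low half
      } else {
        sawUnpaired = true; // high surrogate with no low half after it
      }
    } else if (isLowSurrogate(u)) {
      sawSurrogate = true;
      sawUnpaired = true; // low surrogate with no high half before it
    }
  }
  if (sawUnpaired) return 5;
  if (sawSurrogate) return 4;
  if (max < 128) return 1;
  if (max < 256) return 2;
  return 3;
}

// One sample string per case (assumed examples).
const samples = {
  1: "hello",        // ASCII only
  2: "caf\u00E9",    // contains U+00E9, below 256
  3: "\u20AC",       // U+20AC, above 255, no surrogates
  4: "\uD83D\uDE00", // one properly paired surrogate pair
  5: "\uD83D",       // lone high surrogate
};
```

Writing the samples with `\u` escapes keeps the test source itself ASCII, so the string under test does not depend on how the test file is decoded.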
The tests should look for the following things:
- What is the document.body.textContent of the resulting document?
- What is the document.charset of the resulting document?
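One possible way to set up such a test is sketched below. Both helper names (`javascriptUrlReturning`, `observe`) are hypothetical. The sketch assumes that building the return string with `String.fromCharCode` from ASCII-only source is acceptable, so that the returned code units do not depend on how the URL itself is decoded.

```javascript
// Hypothetical helper: build a javascript: URL whose script evaluates to a
// string made of exactly the given UTF-16 code units. The URL text is pure
// ASCII, so it reads the same in a UTF-8 or a windows-1252 document.
function javascriptUrlReturning(codeUnits) {
  return "javascript:String.fromCharCode(" + codeUnits.join(",") + ")";
}

// Hypothetical probe: collect the two observations listed above from the
// resulting document (for example, iframe.contentDocument after load).
function observe(doc) {
  return {
    textContent: doc.body.textContent,
    charset: doc.charset, // legacy alias of document.characterSet
  };
}
```

In a browser, the URL would be assigned to an iframe's `src` (or a link's `href`) in a document of the encoding under test, and `observe` would be called on the resulting document once it has loaded.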
Relevant implementer reports
Gecko
From #1107 (comment)
What Gecko does in terms of conversion to bytes is that it examines the returned string to see whether all charCodeAt() values are 255 or less. If so, the string is treated as byte-inflated ISO-8859-1 data (and a response is synthesized which has "ISO-8859-1" as its encoding, with the byte-deflated bytes as data). This allows generation of non-text data, dating back to when we supported javascript: in <img>, say.
Otherwise the return value is treated as a sequence of UTF-16 code units encoding a Unicode string and converted to UTF-8 bytes (insert handwaving about what happens to unpaired surrogates here). The synthesized response has "UTF-8" as its encoding.
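Read literally, the Gecko report above amounts to something like the following sketch. This is not Gecko's code: `geckoLikeEncode` is a hypothetical name, and the UTF-8 branch uses the standard `TextEncoder`, which replaces unpaired surrogates with U+FFFD. The report does not say what Gecko does with unpaired surrogates, so that part is an assumption.

```javascript
// Sketch of the string -> bytes conversion described in the Gecko report.
// Returns the synthesized response's encoding label and its body bytes.
function geckoLikeEncode(str) {
  // Step 1: are all code units 255 or less?
  let allByteSized = true;
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 255) {
      allByteSized = false;
      break;
    }
  }

  if (allByteSized) {
    // Treat the string as byte-inflated ISO-8859-1: one code unit, one byte.
    const bytes = new Uint8Array(str.length);
    for (let i = 0; i < str.length; i++) {
      bytes[i] = str.charCodeAt(i);
    }
    return { encoding: "ISO-8859-1", bytes };
  }

  // Step 2: otherwise treat the string as UTF-16 and convert to UTF-8.
  // ASSUMPTION: TextEncoder maps unpaired surrogates to U+FFFD; Gecko's
  // actual handling of unpaired surrogates is not stated in the report.
  return { encoding: "UTF-8", bytes: new TextEncoder().encode(str) };
}
```

Under this sketch, `"caf\u00E9"` yields the four bytes 63 61 66 E9 labelled ISO-8859-1, while `"\u20AC"` yields the three bytes E2 82 AC labelled UTF-8.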
Blink
From #1107 (comment) with further analysis by Boris in #1107 (comment):
I don't think Blink does conversion to bytes at all here. See the FIXME comment in https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/loader/DocumentWriter.cpp&l=75&ct=xref_jump_to_def&cl=GROK&gsn=appendReplacingData as called from https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/Source/core/loader/DocumentLoader.cpp&sq=package:chromium&l=684&rcl=1461583037 (DocumentLoader::replaceDocumentWhileExecutingJavaScriptURL).
EdgeHTML
From #1107 (comment)
In Edge we specify two code pages for transformation. The first is the calculated code page which is always CP_UCS_2 which translates to Unicode, ISO 10646 according to comments. We then specify as the source code page CPSRC_NATIVEDATA which means native data, known to be CP_UCS_2 so don't allow any sort of fallbacks.