Skip to content

Conversation

@jridgewell
Copy link
Member

Q                       A
Fixed Issues?
Patch: Bug Fix?
Major: Breaking Change? Yes
Minor: New Feature?
Tests Added + Pass? Yes
Documentation PR Link
Any Dependency Changes?
License MIT

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

Maybe we want to add a simple ascii option to toggle the behavior back?

@nicolo-ribaudo
Copy link
Member

nicolo-ribaudo commented Apr 6, 2020

Ref: #247, #4478.

Reading #4477, it seems that it was done not for compatibility with old engines but for non-utf8 encodings. Is it safe to output non-ASCII characters by default?

cc @mathiasbynens (Sorry, you are my Unicode expert 😛)

@jridgewell
Copy link
Member Author

I think there's a bit of compatibility risk for lone surrogates, which aren't transformed into \uHHHH format. We could fix this, so we're always serving valid utf-8.

it seems that it was done not for compatibility with old engines but for non-utf8 encodings. Is it safe to output non-ASCII characters by default?

In strings, yes. The risk is that users will have improperly setup servers (not serving with Content-Type: text/javascript; charset=utf-8 header). In which case, just fix your server.

Also, we still retain the ability to output ASCII only, you just need to configure it. The default should probably match what the user typed. And for foreign language programmers, they're probably typing stings containing non-ASCII chars. Being able to actually read the output is probably going to much nicer trying to understand whatever "\u1234" means.

@mathiasbynens
Copy link
Contributor

Yep, @jridgewell nailed it. Note that folks who cannot alter their server configuration to add charset=utf-8 to the Content-Type header can instead use <script charset=utf-8 src=...>. The nice thing about shipping ASCII-only code is that you don't need to do any of these, but that's mostly nice for frameworks/libraries, i.e. code that you'd expect people to copy. For your own apps, you tend to have control over either the HTML or the server configuration.

@mathiasbynens
Copy link
Contributor

I think what you really want is to output non-ASCII symbols as-is unless they're lone surrogates or unless they're whitespace or non-printable.

@jridgewell
Copy link
Member Author

output non-ASCII symbols as-is unless they're lone surrogates or unless they're whitespace or non-printable.

Is there a jsesc setting for this?

@mathiasbynens
Copy link
Contributor

No, but we could add one :)

@mathiasbynens
Copy link
Contributor

Or, we could tweak the existing minimal option: https://github.com/mathiasbynens/jsesc#minimal That's probably best.

@nicolo-ribaudo nicolo-ribaudo added the PR: Polish 💅 A type of pull request used for our changelog categories label Apr 7, 2020
jsescOption: {
quotes: "double",
wrap: true,
minimal: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that sometimes user may prefer to preserve the escaped form, especially for the space characters. i.e.

"\u{00a0}" // NBSP 
"\u{2005}" // 1/4 em

and
https://github.com/angular/angular/blob/9e70bcb34f91d439f5203dc22a44f323d02c4648/packages/compiler/src/ml_parser/html_whitespaces.ts#L19

I suggest we plug in the minimal option only when there is no escape sequence inside.

Copy link
Member

@nicolo-ribaudo nicolo-ribaudo Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it would be better to implement this in jsesc (#11384 (comment) / #11384 (comment)), since it would also benefit non-Babel users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're only able to detect what the user actually typed via the raw value, which is handled before printing the cooked value. If we only have the cooked value, we can only guess what the user intended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the whitespaces, I think we can add them to the whitelist of the minimal option.

Since it changes the default print behaviour, I consider this as a breaking change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The raw value already captures this appropriately, though. We only run the cooked value through jsesc when there's not a raw value. At that point, we have no idea what the user actually wrote in the source file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I do think we want to always escape things like NBSP though.

Maybe we should special-case all non-ASCII entries in require('unicode-13.0.0/General_Category/Space_Separator/symbols.js'); in jsesc’s minimal option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[ '0x20',
  '0xa0',
  '0x1680',
  '0x2000',
  '0x2001',
  '0x2002',
  '0x2003',
  '0x2004',
  '0x2005',
  '0x2006',
  '0x2007',
  '0x2008',
  '0x2009',
  '0x200a',
  '0x202f',
  '0x205f',
  '0x3000' ]

I think we should continue to print regular space. I could understand escaping the rest, though. @mathiasbynens, are you open to doing that in jsesc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we wouldn't want to escape any ASCII spaces. We should take non-ASCII entries only.

Definitely open to a patch that does this in jsesc's minimal option. We could use src/data.js to export a list of special-cased symbols, based on reading require('unicode-13.0.0/General_Category/Space_Separator/symbols.js'); (or /code-points.js) and filtering out all ASCII entries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to v3.0.1, which now escapes non-ASCII whitespace.


/*#__PURE__*/
React.createElement("div", null, "w \xA0 w");
React.createElement("div", null, "w   w");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output became less useful in this case, IMHO.


/*#__PURE__*/
React.createElement("div", null, "this should parse as nbsp: \xA0 ");
React.createElement("div", null, "this should parse as nbsp:   ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

jsescOption: {
quotes: "double",
wrap: true,
minimal: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I do think we want to always escape things like NBSP though.

Maybe we should special-case all non-ASCII entries in require('unicode-13.0.0/General_Category/Space_Separator/symbols.js'); in jsesc’s minimal option.

@mathiasbynens
Copy link
Contributor

[email protected] now escapes lone surrogates even when minimal: true thanks to @jridgewell!

@JLHwung JLHwung added the PR: Breaking Change 💥 A type of pull request used for our changelog categories for next major release label Apr 9, 2020
@JLHwung JLHwung added this to the Babel 8.0 milestone Apr 9, 2020
Copy link
Contributor

@JLHwung JLHwung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is a breaking change, can you rebase onto next-8-dev?

Though technically bumping jsesc to version 3 is also a breaking change. I think it is okay to split this change into a dedicated PR so we can merge it in next minor. AFAIK the non-minimal (defaults in babel v7) behaviour is unchanged between 2 and 3, isn't it?

@nicolo-ribaudo nicolo-ribaudo mentioned this pull request Apr 9, 2020
34 tasks
@jridgewell
Copy link
Member Author

AFAIK the non-minimal (defaults in babel v7) behaviour is unchanged between 2 and 3, isn't it?

Yes. But it might just be easier to land everything on next-8-dev. If we were to upgrade the v3 and a user were using minimal, they would get changed behavior.

jridgewell and others added 6 commits April 9, 2020 19:22
This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.
@jridgewell jridgewell changed the base branch from master to next-8-dev April 9, 2020 23:23
recordAndTupleSyntaxType: opts.recordAndTupleSyntaxType,
};

if (opts.jsonCompatibleStrings) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option has been removed: #9958

@nicolo-ribaudo nicolo-ribaudo merged commit bb9e45f into babel:next-8-dev Apr 11, 2020
nicolo-ribaudo added a commit that referenced this pull request Apr 11, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Apr 11, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
@jridgewell jridgewell deleted the minimal-strings branch April 11, 2020 21:09
nicolo-ribaudo added a commit that referenced this pull request Apr 20, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
arku pushed a commit to arku/babel that referenced this pull request Apr 20, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Apr 21, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Apr 21, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Apr 21, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Apr 22, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request May 8, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request May 9, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Jun 18, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Jun 20, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
nicolo-ribaudo added a commit that referenced this pull request Jul 5, 2020
* Output minimal strings by default

This changes our behavior from always outputting ASCII-safe strings to outputting Unicode. It's smaller, supported since at least ES5, and maybe foreign language programmers will be less surprised with the output.

* Update packages/babel-generator/src/index.js

Co-Authored-By: Huáng Jùnliàng <[email protected]>

* Escape lone surrogates until jsesc does it for us

* Use regex's unicode mode

* Upgrade jsesc to v3

* Update to jsesc 3.0.1

* Remove logic for removed option

Co-authored-by: Huáng Jùnliàng <[email protected]>
Co-authored-by: Nicolò Ribaudo <[email protected]>
@github-actions github-actions bot added the outdated A closed issue/PR that is archived due to age. Recommended to make a new issue label Jul 12, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 12, 2020
@JLHwung JLHwung removed this from the v8.0.0 milestone Aug 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

outdated A closed issue/PR that is archived due to age. Recommended to make a new issue PR: Breaking Change 💥 A type of pull request used for our changelog categories for next major release PR: Polish 💅 A type of pull request used for our changelog categories

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants