Skip to content

allow astral characters#190

Merged
puzrin merged 1 commit intonodeca:masterfrom
rlidwka:surrogates
May 13, 2015
Merged

allow astral characters#190
puzrin merged 1 commit intonodeca:masterfrom
rlidwka:surrogates

Conversation

@rlidwka
Copy link
Copy Markdown
Member

@rlidwka rlidwka commented May 13, 2015

According to YAML 1.2 spec, section 5.1, unicode range #x10000-#x10FFFF is considered printable characters.

However, parser excludes those characters trying to validate for #xD800-#xDFFF. This is fixed in this PR, so well-formed surrogate pairs pass validation, but lone surrogates don't.

It is still not a standard-compliant solution, because YAML seems to allow non-printables in quoted scalars for JSON compatibility, so validating the entire source code against non-printable regexp doesn't work well enough. I added a pending test case for this, suggesting to open another issue ticket about it.

@dervus
Copy link
Copy Markdown
Collaborator

dervus commented May 13, 2015

What's about performance of this?

@puzrin
Copy link
Copy Markdown
Member

puzrin commented May 13, 2015

That will be reworked. Wait.

@rlidwka
Copy link
Copy Markdown
Member Author

rlidwka commented May 13, 2015

@dervus, less than ideal probably. I'll try to speed that regexp up. But it is meant to be a temporary solution, because checking non-printables before parsing doesn't really work well (see the point about JSON compatibility, I'll create a separate ticket for it though).

@rlidwka
Copy link
Copy Markdown
Member Author

rlidwka commented May 13, 2015

@puzrin , I created tickets #191, #192 to track those two issues mentioned above

@rlidwka
Copy link
Copy Markdown
Member Author

rlidwka commented May 13, 2015

Joined both regexps into one.

If I didn't screw up, this could even be considered a performance improvement:

alex@y:/tmp/js-yaml/benchmark$ git checkout 3b92d922b^
HEAD is now at ad21884... Merge pull request #186 from preciousforever/improvement-custom_types-example
alex@y:/tmp/js-yaml/benchmark$ ./benchmark.js nodeca
Selected samples: (3 of 12)
 > document_nodeca_application
 > document_nodeca_database
 > document_nodeca_logger

Sample: document_nodeca_application.yaml (2055 characters)
 > current x 12,101 ops/sec ±0.57% (92 runs sampled)


Sample: document_nodeca_database.yaml (394 characters)
 > current x 39,270 ops/sec ±0.45% (96 runs sampled)


Sample: document_nodeca_logger.yaml (727 characters)
 > current x 9,787 ops/sec ±0.63% (97 runs sampled)

alex@y:/tmp/js-yaml/benchmark$ git checkout 3b92d922b
HEAD is now at 3b92d92... Process astral characters correctly
alex@y:/tmp/js-yaml/benchmark$ ./benchmark.js nodeca
Selected samples: (3 of 12)
 > document_nodeca_application
 > document_nodeca_database
 > document_nodeca_logger

Sample: document_nodeca_application.yaml (2055 characters)
 > current x 12,389 ops/sec ±1.22% (93 runs sampled)


Sample: document_nodeca_database.yaml (394 characters)
 > current x 40,647 ops/sec ±0.17% (99 runs sampled)


Sample: document_nodeca_logger.yaml (727 characters)
 > current x 12,859 ops/sec ±0.87% (98 runs sampled)

@dervus
Copy link
Copy Markdown
Collaborator

dervus commented May 13, 2015

OK, so is it ready to merge now?

According to YAML 1.2 spec, #xD800-#xDFFF unicode range is considered
non-printables. But unicode range #x10000-#x10FFFF, represented
in UTF-16 by the same characters is not.

NOTE: Non-printable characters are still allowed in quoted scalars,
added a pending test for this case.

Fixes: nodeca#191
puzrin pushed a commit that referenced this pull request May 13, 2015
@puzrin puzrin merged commit 65ad3b3 into nodeca:master May 13, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants