Fix handling of surrogate pseudocharacters under Python 3 #284

Closed · gnprice wants to merge 3 commits into ultrajson:main
Conversation
This is a situation where we have a Python unicode string which doesn't
consist entirely of genuine Unicode characters -- some of the codepoints
in the string are surrogate codepoints, which occur in a UTF-16 encoding
of a string and were also repurposed in PEP 383 for losslessly encoding
arbitrary mostly-UTF-8 bytestrings (like Unix filenames) in Python
strings. Currently, on Python 3, we raise a UnicodeEncodeError if we
try to encode such a string as JSON.
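To illustrate the PEP 383 mechanism mentioned above, here's a small sketch (not part of this PR) of how the `surrogateescape` error handler smuggles arbitrary, not-quite-UTF-8 bytes into a `str` and back losslessly:

```python
# PEP 383: "surrogateescape" maps each undecodable byte 0xXY to the
# lone surrogate U+DCXY, so arbitrary bytestrings (e.g. Unix
# filenames) round-trip through str without losing information.
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9"  # the 0xE9 byte became the surrogate U+DCE9

# The original bytes come back exactly.
assert name.encode("utf-8", "surrogateescape") == raw
```

A string produced this way is exactly the kind of "unicode string which doesn't consist entirely of genuine Unicode characters" at issue here.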
It's not 100% obvious what the right thing to do here is -- this
situation seems like it must reflect a bug somewhere else in the
program or its environment. But:

* one way we can get such a string is by loading a JSON document
  (perhaps an invalid JSON document? anyway, we load it without error):

  ```
  >>> ujson.dumps(ujson.loads('"\\udcff"'))
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed
  ```

* we already pass these strings through without complaint on Python 2;
* as the included test shows, passing these through matches the
  behavior of the stdlib's `json` module.

So it seems best to pass them through.
Fixes ultrajson#156.
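For comparison, a quick check of what the stdlib `json` module does with the same input (a sketch against CPython 3, not code from this PR): it loads the lone surrogate without error and dumps it back without raising, emitting it as a `\uXXXX` escape under the default `ensure_ascii=True`.

```python
import json

# Loading a lone surrogate from a JSON document succeeds...
s = json.loads('"\\udcff"')
assert s == "\udcff"

# ...and dumping it back also succeeds: with the default
# ensure_ascii=True it comes out as a \uXXXX escape.
assert json.dumps(s) == '"\\udcff"'
```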
gnprice added a commit to zulip/zulip that referenced this pull request on Aug 29, 2017
See my PR upstream: ultrajson/ultrajson#284 . Fixes #6332.
hartwork reviewed on Feb 20, 2020
I'm not sure if passing through is the best approach — stdlib json does not pass through but escapes (avoiding invalid characters in the output), see:

```
In [11]: list(sys.version_info)
Out[11]: [3, 6, 10, 'final', 0]

In [12]: json.dumps('\udcff')
Out[12]: '"\\udcff"'
```
hartwork suggested changes on Feb 25, 2020
Comment on lines +53 to +55:

```c
#define PyUnicode_AsUTF8String(o) \
    (PyUnicode_AsEncodedString((o), "utf-8", "surrogatepass"))
```
This code seems unused?

If you're aiming for surrogatepass as a generic solution, it's a recipe for producing invalid UTF-8:

```
In [6]: '\udcff'.encode('utf-8', 'surrogatepass').decode('utf-8')
[..]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
```

Are you aware?
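To expand on that point: `surrogatepass` willingly encodes lone surrogates, but the bytes it produces are not valid UTF-8 under a strict decoder; only `surrogatepass` itself can read them back. A minimal demonstration:

```python
# Encoding a lone surrogate with surrogatepass succeeds...
payload = "\udcff".encode("utf-8", "surrogatepass")
assert payload == b"\xed\xb3\xbf"  # a byte sequence strict UTF-8 forbids

# ...but a strict UTF-8 decoder rejects the result,
try:
    payload.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# while surrogatepass itself round-trips it.
assert payload.decode("utf-8", "surrogatepass") == "\udcff"
```

So any consumer that decodes the output strictly (which is what most UTF-8 tooling does) will reject it.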
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this pull request on Apr 17, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library. This also refactors two interfaces:

- The PyUnicode to char* conversion is moved into its own function, separated from the JSONTypeContext handling, so it can be reused for other things in the future.
- Converting the char* output to a Python string with surrogates intact requires the string length for PyUnicode_Decode (or any of its alternatives). While strlen could be used, the length is already known inside the encoder, so the encoder function now also takes an extra size_t pointer argument to return that. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's __json__ method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Supersedes ultrajson#284
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this pull request on Apr 17, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library. This also refactors two interfaces:

- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Supersedes ultrajson#284
JustAnotherArchivist added a commit to JustAnotherArchivist/ultrajson that referenced this pull request on May 30, 2022
This allows surrogates anywhere in the input, compatible with the json module from the standard library. This also refactors two interfaces:

- The `PyUnicode` to `char*` conversion is moved into its own function, separated from the `JSONTypeContext` handling, so it can be reused for other things in the future (e.g. indentation and separators) which don't have a type context.
- Converting the `char*` output to a Python string with surrogates intact requires the string length for `PyUnicode_Decode` & Co. While `strlen` could be used, the length is already known inside the encoder, so the encoder function now also takes an extra `size_t` pointer argument to return that and no longer NUL-terminates the string. This also permits output that contains NUL bytes (even though that would be invalid JSON), e.g. if an object's `__json__` method return value were to contain them.

Fixes ultrajson#156
Fixes ultrajson#447
Fixes ultrajson#537
Supersedes ultrajson#284
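The NUL-byte point above can be illustrated in Python (a sketch of the general issue, not the C code in that commit): a strlen-style scan stops at the first NUL byte, while decoding with an explicitly known length preserves the whole buffer.

```python
# Hypothetical encoder output whose payload contains an embedded NUL.
buf = b'{"a": "b\x00c"}'

# strlen-style handling sees only the bytes before the first NUL...
truncated = buf.split(b"\x00", 1)[0]
assert truncated == b'{"a": "b'

# ...whereas decoding the buffer with its known length keeps every byte.
assert buf.decode("utf-8") == '{"a": "b\x00c"}'
```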
Member

Superseded by #530. Thanks!