Skip to content

Inconsistent handling of surrogate pair characters #447

@TheShiftedBit

Description

@TheShiftedBit

Ultrajson does not correctly handle surrogate pair characters. Per RFC 7159, unmatched surrogate pair characters are permitted in JSON:

the ABNF in this specification allows member names and
string values to contain bit sequences that cannot encode Unicode
characters; for example, "\uDEAD" (a single unpaired UTF-16
surrogate).

This behavior also disagrees with the built-in json module, which can handle strings containing surrogates correctly.

Ultrajson will correctly parse strings containing surrogate pair characters, if they're written in an escaped form, but it will not parse unescaped surrogate pair characters, and it will never dump Python strings containing surrogate pairs to JSON. This means ujson fails to round-trip certain inputs; see the example below.

What did you do?

Attempted to .dumps() a string containing a surrogate pair character.

What did you expect to happen?

It is encoded into a string.

What actually happened?

An exception is raised:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udddd' in position 0: surrogates not allowed

What versions are you using?

  • OS: Debian 5.7.17-1rodete4 (2020-10-01) x86_64 GNU/Linux
  • Python: Python 3.8.6 (tags/v3.8.6:db455296be, Nov 22 2020, 18:14:31)
  • UltraJSON: 4.0.1 (latest)

Please include code that reproduces the issue.

import ujson
x = ujson.loads('"\\udead"')  # Works, produces a surrogate pair character
y = ujson.dumps(x)  # Raises an exception

Solving this problem is a bit annoying in native code. In Python, the solution is simply to add "surrogatepass" as a second parameter to str.encode(). That functionality isn't exposed in native code, however. The solution I used in Atheris (which found this bug) was to first try encoding the string using the normal method, and if that doesn't work, fall back to encoding it by dynamically making a call to the str.encode() function in Python. Since you only have to take the slow route if the fast route fails, this is not likely to meaningfully affect performance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions