ujson causes UnicodeEncodeError in email mirror #6332

@gnprice

We've gotten a few exceptions like this recently on zulipchat.com:

  File "./zerver/views/email_mirror.py", line 22, in email_mirror_message
    result = mirror_email_message(ujson.loads(request.POST['data']))
  File "./zerver/lib/email_mirror.py", line 382, in mirror_email_message
    lambda x: None
  File "./zerver/lib/queue.py", line 306, in queue_json_publish
    get_queue_client().json_publish(queue_name, event)
  File "./zerver/lib/queue.py", line 115, in json_publish
    self.publish(queue_name, ujson.dumps(body))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 6226: surrogates not allowed

The request data in this example looks like:

- POST: {'secret': ['**********'], 'data': ['{"msg_text": "Received: from [...]\\udcef\\udcbf\\udcbd[...]", "recipient": "***"}']}

The core of the issue appears to be that ujson will decode a JSON document that looks like this, but will then fail if you ask it to encode the result:

In [39]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[39]: '\udcef\udcbf\udcbd'

In [40]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-40-6b0b01834cc7> in <module>()
----> 1 ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 0: surrogates not allowed

(The result of loads there is a string of three characters, each of which is not a real character but rather a surrogate value. The surrogates were originally carved out for use in UTF-16, to encode Unicode in 16-bit elements, but in Python they're now used for losslessly encoding mostly-UTF-8 bytestrings, like Unix filenames, in Python text strings. This particular sequence of three surrogates doesn't seem to fit either of those uses, so it's a mystery where it came from. If you take the low 8 bits from each of these surrogates as a sequence of bytes, you do get the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER -- so there are probably multiple layers of wrongness involved here.)
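To make the low-8-bits observation concrete, here is a minimal sketch using only the stdlib. The helper name `low_bytes_of` is made up for illustration and is not anything in Zulip or ujson; it masks each code point down to its low byte and shows that the three bytes decode as U+FFFD.

```python
import json

def low_bytes_of(text):
    """Hypothetical helper: keep only the low 8 bits of each code point."""
    return bytes(ord(ch) & 0xFF for ch in text)

# The three lone surrogates from the failing request, decoded by stdlib json.
s = json.loads('"\\udcef\\udcbf\\udcbd"')
print([hex(ord(ch)) for ch in s])   # ['0xdcef', '0xdcbf', '0xdcbd']

raw = low_bytes_of(s)
print(raw)                          # b'\xef\xbf\xbd'

# Those bytes are the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER.
print(raw.decode("utf-8") == "\ufffd")   # True

# A strict UTF-8 encode of the surrogate string itself fails,
# which is the same error the queue publisher hit.
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("encode failed:", e.reason)
```

This only confirms the byte arithmetic; it does not explain how the surrogates got into the email data in the first place.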

Compare json from the stdlib, which handles the round-trip with no trouble:

In [41]: json.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[41]: '"\\udcef\\udcbf\\udcbd"'
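For what it's worth, the stdlib round-trip appears to work because `json.dumps` defaults to `ensure_ascii=True`, so each surrogate is written back out as a `\uXXXX` escape and the result is pure ASCII. A small sketch of that, assuming the output is later encoded as UTF-8 the way the traceback above suggests:

```python
import json

s = json.loads('"\\udcef\\udcbf\\udcbd"')

# Default ensure_ascii=True: surrogates are escaped, so the output is ASCII
# and encodes to UTF-8 without trouble.
escaped = json.dumps(s)
print(escaped)                        # "\udcef\udcbf\udcbd"
payload = escaped.encode("utf-8")

# With ensure_ascii=False the surrogates are passed through unescaped,
# and a later strict UTF-8 encode hits the same UnicodeEncodeError.
unescaped = json.dumps(s, ensure_ascii=False)
try:
    unescaped.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
print(strict_encode_ok)               # False
```

So switching to the stdlib would avoid the crash as long as nothing turns off `ensure_ascii`, though the bogus surrogates would still flow through to whatever consumes the queue.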

The ujson tracker has had this issue open since 2014.

The number of segfaults, crashes, and memory-corruption reports among ujson's open issues suggests it may not be a particularly reliable or vigorously maintained library. The benchmarks in its README are impressive, but if we can, we may be better off using the stdlib's json (which is derived from simplejson).
