ujson causes UnicodeEncodeError in email mirror

We've gotten a few exceptions like this recently on zulipchat.com:
```
  File "./zerver/views/email_mirror.py", line 22, in email_mirror_message
    result = mirror_email_message(ujson.loads(request.POST['data']))
  File "./zerver/lib/email_mirror.py", line 382, in mirror_email_message
    lambda x: None
  File "./zerver/lib/queue.py", line 306, in queue_json_publish
    get_queue_client().json_publish(queue_name, event)
  File "./zerver/lib/queue.py", line 115, in json_publish
    self.publish(queue_name, ujson.dumps(body))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 6226: surrogates not allowed
```

The request data in this example looks like this:
```
- POST: {'secret': ['**********'], 'data': ['{"msg_text": "Received: from [...]\\udcef\\udcbf\\udcbd[...]", "recipient": "***"}']}
```

The core of the issue appears to be that `ujson` will decode a JSON document that looks like this, but will then fail if you ask it to encode the result:
```
In [39]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[39]: '\udcef\udcbf\udcbd'

In [40]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-40-6b0b01834cc7> in <module>()
----> 1 ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 0: surrogates not allowed
```
(The result of `loads` there is a string of three code points, none of which is a real character; each is a lone surrogate. Surrogates were originally carved out for UTF-16, to encode Unicode in 16-bit units, but Python now also uses them to losslessly represent mostly-UTF-8 bytestrings, like Unix filenames, as text strings. This particular sequence of three surrogates doesn't seem to fit either of those uses, so it's a mystery where it came from. If you take the low 8 bits of each surrogate as a byte, you get the UTF-8 encoding of [U+FFFD REPLACEMENT CHARACTER](http://unicode.org/cldr/utility/character.jsp?a=FFFD), so there are probably multiple layers of wrongness involved here.)
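
To make the low-8-bits observation concrete, here is a small stdlib-only sketch. It also shows that Python's `surrogateescape` error handler, when paired with the ASCII codec, maps those same three bytes to exactly this surrogate sequence. That is an illustration of one mechanism that can produce such a string, not an established diagnosis of where the data in this report came from.

```python
# The three lone surrogates from the traceback, as a Python str.
s = "\udcef\udcbf\udcbd"

# Take the low 8 bits of each code point and collect them as bytes.
low_bytes = bytes(ord(ch) & 0xFF for ch in s)
print(low_bytes)  # b'\xef\xbf\xbd'

# Those bytes are the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER.
is_replacement_char = low_bytes.decode("utf-8") == "\ufffd"
print(is_replacement_char)  # True

# surrogateescape maps each undecodable byte 0xNN to U+DCNN. Under the
# ASCII codec every byte >= 0x80 is undecodable, so the same three bytes
# come out as the three surrogates above. (Assumption: this is only one
# possible origin; the report does not confirm it.)
escaped = low_bytes.decode("ascii", errors="surrogateescape")
print(escaped == s)  # True
```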

Compare `json` from the stdlib, which handles the round-trip with no trouble:
```
In [41]: json.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[41]: '"\\udcef\\udcbf\\udcbd"'
```
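
For anyone wanting to reproduce the stdlib side without `ujson` installed, the following sketch uses only `json`. It relies on `json.dumps` defaulting to `ensure_ascii=True`, which writes non-ASCII code points (lone surrogates included) as `\uXXXX` escapes rather than encoding them as UTF-8. The strict UTF-8 encode at the end raises the same kind of error as in the traceback above.

```python
import json

# The JSON document from the report: three escaped lone surrogates.
doc = '"\\udcef\\udcbf\\udcbd"'

# The stdlib decoder accepts lone surrogate escapes.
text = json.loads(doc)
print(len(text))  # 3

# With the default ensure_ascii=True, dumps re-escapes them, so the
# round-trip reproduces the original document.
round_tripped = json.dumps(text)
print(round_tripped == doc)  # True

# Encoding the decoded string as strict UTF-8 is what fails.
try:
    text.encode("utf-8")
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason
print(reason)  # surrogates not allowed
```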

The ujson tracker has [had this issue open](https://github.com/esnme/ultrajson/issues/156) since 2014.

The number of segfaults, crashes, and memory corruption issues in the [ujson open issues](https://github.com/esnme/ultrajson/issues) suggests this may not be a super reliable or vigorously-maintained library. The benchmarks in its README are impressive, but if we can, we may be better off using the stdlib's `json` (which is derived from `simplejson`).

