We've gotten a few exceptions like this recently on zulipchat.com:
File "./zerver/views/email_mirror.py", line 22, in email_mirror_message
result = mirror_email_message(ujson.loads(request.POST['data']))
File "./zerver/lib/email_mirror.py", line 382, in mirror_email_message
lambda x: None
File "./zerver/lib/queue.py", line 306, in queue_json_publish
get_queue_client().json_publish(queue_name, event)
File "./zerver/lib/queue.py", line 115, in json_publish
self.publish(queue_name, ujson.dumps(body))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 6226: surrogates not allowed
The request data in this example looks like
- POST: {'secret': ['**********'], 'data': ['{"msg_text": "Received: from [...]\\udcef\\udcbf\\udcbd[...]", "recipient": "***"}']}
The core of the issue appears to be that ujson will decode a JSON document that looks like this, but will then fail if you ask it to encode the result:
In [39]: ujson.loads('"\\udcef\\udcbf\\udcbd"')
Out[39]: '\udcef\udcbf\udcbd'
In [40]: ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-40-6b0b01834cc7> in <module>()
----> 1 ujson.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcef' in position 0: surrogates not allowed
(The result of loads there is a string of three characters, each of which is not a real character but rather a surrogate value. The surrogates were originally carved out for use in UTF-16, to encode Unicode in 16-bit elements, but in Python they're now used for losslessly encoding mostly-UTF-8 bytestrings, like Unix filenames, in Python text strings. This particular sequence of three surrogates doesn't seem to fit either of those uses, so it's a mystery where it came from. If you take the low 8 bits from each of these surrogates as a sequence of bytes, you do get the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER -- so there are probably multiple layers of wrongness involved here.)
Compare json from the stdlib, which handles the round-trip with no trouble:
In [41]: json.dumps(ujson.loads('"\\udcef\\udcbf\\udcbd"'))
Out[41]: '"\\udcef\\udcbf\\udcbd"'
The ujson tracker has had this issue open since 2014.
The number of segfaults, crashes, and memory corruption issues in the ujson open issues suggest this may not be a super reliable or vigorously-maintained library. The benchmarks in its README are impressive, but if we can we may be better off using the stdlib's json (aka simplejson).
We've gotten a few exceptions like this recently on zulipchat.com:
The request data in this example looks like
The core of the issue appears to be that
ujsonwill decode a JSON document that looks like this, but will then fail if you ask it to encode the result:(The result of
loadsthere is a string of three characters, each of which is not a real character but rather a surrogate value. The surrogates were originally carved out for use in UTF-16, to encode Unicode in 16-bit elements, but in Python they're now used for losslessly encoding mostly-UTF-8 bytestrings, like Unix filenames, in Python text strings. This particular sequence of three surrogates doesn't seem to fit either of those uses, so it's a mystery where it came from. If you take the low 8 bits from each of these surrogates as a sequence of bytes, you do get the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER -- so there are probably multiple layers of wrongness involved here.)Compare
jsonfrom the stdlib, which handles the round-trip with no trouble:The ujson tracker has had this issue open since 2014.
The number of segfaults, crashes, and memory corruption issues in the ujson open issues suggest this may not be a super reliable or vigorously-maintained library. The benchmarks in its README are impressive, but if we can we may be better off using the stdlib's
json(akasimplejson).