Add support for arbitrary size integers #548
Conversation
```c
static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
  // PyLong_FromString requires a NUL-terminated string in CPython, contrary
  // to the documentation: https://github.com/python/cpython/issues/59200
  char *buf = PyObject_Malloc(length + 1);
  if (buf == NULL)
    return PyErr_NoMemory();
  memcpy(buf, value, length);
  buf[length] = '\0';
  JSOBJ ret = PyLong_FromString(buf, NULL, 10);
  PyObject_Free(buf);
  return ret;
}
```
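For orientation, the constructor above does in C what a one-liner does in Python: parse a length-delimited decimal slice into an arbitrary-precision int. A minimal sketch (the function name is illustrative, not part of ujson's API):

```python
# Python equivalent of Object_newIntegerFromString: parse only the first
# `length` bytes of the buffer as a base-10 integer, just as
# PyLong_FromString does with the NUL-terminated copy.
def new_integer_from_string(value: bytes, length: int) -> int:
    return int(value[:length])
```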
The memcpy here is certainly not ideal. It would probably be faster to buffer the next byte, replace it with a NUL, feed the char array through PyLong_FromString, and then restore it. I think that should always be possible/safe without buffer overruns, but I haven't given it a great deal of thought yet. Do you think it's worth exploring that sort of hack?
Ooh, now that's a disgusting but tantalising thing to do. Two ways I imagine that causing damage are:
- The integer we're parsing is the last part of the JSON so the next byte would be out of bounds.
- Someone is reading the JSON string in multiple threads and the second thread happens to look at that byte when it's temporarily nullified.
1 is moot if the string we're parsing is NULL terminated (which I'm fairly certain that they already are). 2 feels like a very nasty way to derail something.
The relevant part of the decoder boils down to:
- If the argument is a string, call `PyUnicode_AsUTF8String` to convert to `bytes`.
- Call `PyBytes_AsString`, which returns a NUL-terminated `char *` that is the internal buffer of the `bytes` object.
- Iterate over that buffer.

So no out-of-bounds issue. As for multithreading, yes, that could be an issue (if the user passes bytes). But on the other hand, ujson already isn't thread-safe as far as I can tell. For example, modifying a dict from another thread while encoding it should crash the encoder.
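To illustrate the ownership argument above: when the decoder receives a `str`, the `bytes` buffer it iterates over is a fresh object the decoder owns, so temporarily mutating it (as the NUL trick would) cannot affect the caller; only a `bytes` argument shares its buffer with user code. A Python-level sketch (the helper name is made up):

```python
# Mirrors the decoder's entry path described above: str input is converted
# to a fresh, decoder-owned bytes object; bytes input is used directly and
# therefore shares its buffer with the caller (and any other threads).
def parse_buffer(arg):
    if isinstance(arg, str):
        return arg.encode("utf-8")  # PyUnicode_AsUTF8String: private copy
    return arg                      # caller's buffer, visible to other threads

s = "[18446744073709551616]"
b = b"[18446744073709551616]"
assert parse_buffer(s) == b   # same content as the bytes version...
assert parse_buffer(b) is b   # ...but only bytes input shares the buffer
```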
I ran some tests on this, parsing an array equivalent to `[2**64 + i for i in range(1000)]`, i.e. 1000 ints all using this code path. On my machine, parsing such a string averages 128 μs with memcpy and 108 μs with the suggested hack.
Compared to the other code paths for smaller ints (note that the number of digits obviously has a direct impact on throughput as well):
| input | time |
|---|---|
| `2**64 + i`, memcpy | 128 μs |
| `2**64 + i`, buffer hack | 108 μs |
| `2**62 + i` (newUnsignedLong) | 60 μs |
| `-2**63 + i` (newLong) | 64 μs |
| `2**30 + i` (newInt) | 45 μs |
| `-2**31 + i` (newInt) | 45 μs |
Command, for reference: `python3 -m timeit -s 'import ujson; s = "[" + ",".join(str(2**64 + i) for i in range(1000)) + "]"' 'ujson.loads(s)'`
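To make the benchmark rows concrete, here is one dispatch that is consistent with the table above (illustrative only; the actual control flow in ujson's C decoder differs in detail):

```python
# Which constructor a decoded integer lands in, per the ranges implied by
# the benchmark table. Boundaries assume a 32-bit int and 64-bit
# long long / unsigned long long.
def decode_path(i: int) -> str:
    if -2**31 <= i < 2**31:
        return "newInt"               # fits in a C int
    if 0 < i < 2**64:
        return "newUnsignedLong"      # positive, fits in unsigned long long
    if -2**63 <= i < 0:
        return "newLong"              # negative, fits in long long
    return "newIntegerFromString"     # this PR's arbitrary-size path
```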
Implementation of the hack:

```c
static JSOBJ Object_newIntegerFromString(void *prv, char *value, size_t length)
{
    char buf = value[length];
    value[length] = '\0';
    PyObject *out = PyLong_FromString(value, NULL, 10);
    value[length] = buf;
    return out;
}
```
That's a pretty small speedup really. On my machine it's even less (115 μs vs 110 μs). Definitely not worth the hack in my mind.
Yeah, agree. It's basically only relevant if all you ever do is parse huge integers. I don't think that's common enough to warrant the ugly hack.
@bwoodsend Small comment on the previous discussion in #440 just to avoid confusion: I realised that an extra type isn't needed.
Just to confirm, I did more performance checks on both encoding and decoding. As expected, throughput for ints within the previously supported range is unaffected.
Force-pushed from b33ec84 to 5c9cfa8
Codecov Report

```
@@           Coverage Diff           @@
##            main     #548    +/-  ##
======================================
- Coverage   91.83%   91.80%   -0.03%
======================================
  Files           6        6
  Lines        1824     1855      +31
======================================
+ Hits         1675     1703      +28
- Misses        149      152       +3
```

Continue to review full report at Codecov.
Apparently PyPy's

Edit: Issue filed: https://foss.heptapod.net/pypy/pypy/-/issues/3765
Force-pushed from 044701d to 655e146
Before merging, let's also test on the CI with PyPy on Windows. Obviously #552 is a blocker, so we can either wait for that to be fixed, or could temporarily skip its failing test on PyPy/Windows to enable testing here first?
Force-pushed from ab21a38 to 99709df
xfailing the offending test on Windows+PyPy, re-enabling CI for PyPy on either Windows or all platforms, then rebasing this PR on top of that would be my plan.
Either way is fine with me.
Sounds good: #553
#553 is merged, we should be good to merge this too. Thanks both! |
This implements the idea I mentioned in #440: when an int is too large to be handled with C integers, do the conversion with Python's C functions instead and treat it as a char array internally.
The code already works, but it's obviously not ready for merging. I'm submitting it now for feedback.
I only looked at performance briefly so far. As the code is now, it shouldn't noticeably impact throughput when no large ints are in the input (just one additional jump on encoding and a bigger jump on decoding, I think). Given how complex they are compared to ujson's implementation, I was surprised to find that the CPython functions actually perform very well. On encoding, with some simple `python3 -m timeit -s 'import ujson' 'ujson.dumps(i)'` tests, I found the following:

| `i` | notes |
|---|---|
| 0 | |
| 9 | |
| 10 | |
| 256 | |
| 257 | caching of small `int` objects is not relevant |
| 65536 | |
| 9223372036854775807 (2**63-1) | `long long` |
| 9223372036854775808 | |
| 18446744073709551615 (2**64-1) | `unsigned long long` |
| 18446744073709551616 | |

I also briefly looked into getting rid of the
`PyLong_AsLongLong` and `PyLong_AsUnsignedLongLong` calls entirely and just using this method for everything (though I didn't clean everything up, just commented out that code in `Object_beginTypeContext`). For ]-2**63, 2**63[, this is slower than the existing code. (The range [-9, 9] is marginally faster because CPython has optimised code for that.) Outside of that range, it performs better. Specifically, `ujson.dumps(9223372036854775808)` takes about 395 ns on my machine then, 20 % less than the existing code. I haven't thoroughly tested removing only the unsigned call, but it seems to not be of any value. In other words, if someone were to encode lots of large ints, this might be a considerable performance increase for them, and if most ints were single-digit, it would be about equal. For others, it would be significantly worse. I'm not sure what to do with this insight.
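The ranges discussed here can be sketched as an encoder-side dispatch (a hedged illustration of the idea, not the literal logic of `Object_beginTypeContext`):

```python
# Encoder-side decision implied by the measurements above: try the
# fixed-width conversions first, and fall back to the string-based path
# this PR adds only when the value does not fit in 64 bits.
def encode_path(i: int) -> str:
    if -2**63 <= i < 2**63:
        return "PyLong_AsLongLong"           # long long range
    if 0 <= i < 2**64:
        return "PyLong_AsUnsignedLongLong"   # unsigned long long range
    return "string fallback"                 # arbitrary-size path
```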