bpo-39087: Add _PyUnicode_GetUTF8Buffer()#17659
Conversation
Doc/c-api/unicode.rst
Outdated
| raised by the codec. | ||
|
|
||
|
|
||
| .. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char errors, Py_buffer *view) |
There was a problem hiding this comment.
I am not sure about the order of parameters. There is a logic in placing the output parameter at the end, but in PyObject_GetBuffer() (for which I modelled the name) it is not the last parameter. Also, if we will add more similar functions in future, with multiple additional parameters (like flags and errors), it is better to place them at the end.
Modules/_testcapimodule.c
Outdated
| } | ||
| Py_ssize_t refcnt = Py_REFCNT(str); | ||
|
|
||
| if (PyUnicode_GetUTF8Buffer(str, NULL, &buf) < 0) { |
There was a problem hiding this comment.
It is not possible that PyUnicode_GetUTF8Buffer() fails for an ASCII string.
| def test_getutf8buffer(self): | ||
| from _testcapi import unicode_getutf8buffer | ||
|
|
||
| ascii_ = "foo" |
There was a problem hiding this comment.
Why not inline these variables?
There was a problem hiding this comment.
- Follow other tests
- Variable name is a hint why this value is tested.
Objects/unicodeobject.c
Outdated
| return PyBuffer_FillInfo(view, unicode, | ||
| PyUnicode_UTF8(unicode), | ||
| PyUnicode_UTF8_LENGTH(unicode), | ||
| 1, PyBUF_SIMPLE); |
There was a problem hiding this comment.
| 1, PyBUF_SIMPLE); | |
| 1 /* readonly */, PyBUF_SIMPLE); |
|
IMHO it's a bad old habit that test_capi runs "all" tests. I prefer that test_unicode tests the C API of Unicode objects. So rename the method instead. |
Modules/_testcapimodule.c
Outdated
| { | ||
| PyObject *unicode; | ||
| const char *errors = NULL; | ||
| if(!PyArg_ParseTuple(args, "U|s", &unicode, &errors)) { |
There was a problem hiding this comment.
U calls PyUnicode_READY(). This can affect the testing. I suggest to use O.
Modules/_testcapimodule.c
Outdated
| "without exception set. (%s:%d)", | ||
| __FILE__, __LINE__); | ||
| } | ||
| Py_DECREF(str); |
| // Test 3: There is a UTF-8 cache | ||
| // Reuse str of the previoss test. | ||
|
|
||
| const char *cache = PyUnicode_AsUTF8(str); |
There was a problem hiding this comment.
Would it be too difficult to test also that there was no cache before calling PyUnicode_AsUTF8()?
Doc/c-api/unicode.rst
Outdated
| raised by the codec. | ||
|
|
||
|
|
||
| .. c:function: int PyUnicode_GetUTF8Buffer(PyObject *unicode, const char errors, Py_buffer *view) |
Doc/c-api/unicode.rst
Outdated
| :c:function:`PyUnicode_AsUTF8AndSize`, this function does not cache the | ||
| UTF-8 representation of the string in the *unicode* object. | ||
| So this API is faster and more efficient when the *unicode* object is | ||
| not ASCII string and it is encoded into UTF-8 only once. |
There was a problem hiding this comment.
It's non obvious why avoiding a cache is more efficient. It seems like these functions are slowly since they need internally to copy the UTF-8 encoded bytes string using malloc + memcpy, whereas PyUnicode_GetUTF8Buffer() doesn't. Maybe this sentence can be more vague about the rationale:
When the *unicode* object is not ASCII string and it is encoded into UTF-8 only once,
this API is more efficient than :c:function:`PyUnicode_AsUTF8` and
:c:function:`PyUnicode_AsUTF8AndSize`.
There was a problem hiding this comment.
It's non obvious why avoiding a cache is more efficient.
When the cache is never used, it consume memory with no benfit.
these functions are slowly since they need internally to copy the UTF-8 encoded bytes string using malloc + memcpy,
It is too detailed and fixable issue. See #18327.
| # Run tests wrtten in C. Raise an error when test failed. | ||
| unicode_test_getutf8buffer() | ||
|
|
||
| ascii_ = "foo" |
There was a problem hiding this comment.
I suggest to rename ascii_ to asciistr or ascii_str. The "_" suffix looks strange :-)
| { | ||
| Py_buffer buf; | ||
|
|
||
| // Test 1: ASCII string |
There was a problem hiding this comment.
Suggestion. Would it be possible to factorize the code of each test with an helper function? It seems like the code of each test is basically copy/pasted. I don't think that it matters to provide accurate error message.
Add a paremeter "utf8_cache" for test 3 to call or not PyUnicode_AsUTF8().
Co-Authored-By: Victor Stinner <[email protected]>
|
@methane: What's the status of this issue? |
|
@vstinner Now I have doubts about how this API is useful since PyUnicode_AsUTF8AndSize is as fast as this API. One of the merits of this API is efficiency in cross-interpreter. But HPy will provide better cross interpreter APIs. Will HPy have similar API? |
|
But HPy will provide better cross interpreter APIs. Will HPy have similar API?
|
|
What about adding this function first to private C API? If it significantly speeds up and simplifies the code we can rename it and make public. |
I did it. |
vstinner
left a comment
There was a problem hiding this comment.
LGTM. Let's add a private function and see if it's used or not :-)
My only worry is that it's not used in Python itself which makes me feel that it's not really worth it to use it.
I also added minor comments but I don't think that it's worth it to hold the PR for that. It's up to you to apply suggested changes or not.
GH-18984 is one example I found in the stdlib. I found cpython/Modules/_sqlite/connection.c Line 510 in c7ad974 This function used PyUnicode_AsUTF8 but utf8 cache of the py_val will not be used in the future. But the py_val will be freed soon after this function. So this example demonstrates this new API is not worth enough... I'm sorry about making garbage in commit log, but I will revert this pull request and abandon this new API. But I will spend some time to find more usage before the revert. |
Ref: bpo-39087