TSK: Follow-up things for stringdtype #25693
Open
Description
Proposed new feature or change:
A number of points arose in discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.
- The doc for `StringDType` has a size argument that was needed in development but not for actual release. It can be removed. This should be done before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses arena? See below. Probably can put it back if needed...)
- The `add` ufunc needs a promoter so addition with a python string works.
- Add a cython interface for the stringdtype C API.
- It is likely better to use a flag for "long strings" (stored outside of the numpy array proper) instead of one for short ones (stored inside), so that an all-zero entry correctly implies a zero-length string (see API: Introduce stringdtype [NEP 55] #25347 (comment))
- Refactor the flags in the vstring implementation to use bitfields. This will improve clarity and eliminate complicated bitflag math.
- Possibly, the arena should have more recoverable flags/size information (see API: Introduce stringdtype [NEP 55] #25347 (comment))
- Investigate refactoring `new_stringdtype_instance` into `tp_init`
- Replace the long `#define` in `casts.c` with templating (or `.c.src`)
- Replace ufunc wrappers with passing functions into `*auxdata` (see here, here, here, and here) [e.g., `minimum` and `maximum`; the various comparison functions; the templated unary functions; `find`, `rfind` and maybe `count`; `startswith` and `endswith`; `lstrip`, `rstrip` and `strip`, plus `whitespace` versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796
- Check `array2string` formatter overrides.
- Adjust error messages referring to "object array" (e.g., `a.view('2u8')` currently errors with `"Cannot change data-type for object array."`).
- Have some helper functions that make it easy for `StringDType` ufuncs to use `out` arguments, also for in-place operations.
- See whether null-handling code in ufunc loops and casts can be consolidated into a helper function to reduce code duplication.
- Add checks for very long strings, see MAINT: Ensure correct handling for very large unicode strings #27875
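To illustrate the point about flagging long strings rather than short ones, here is a minimal sketch. The 16-byte layout, the position of the flag byte, and all names (`LONG_FLAG`, `SIZE_MASK`, `pack_short`, `unpack_short`) are assumptions made up for this example and do not reflect the actual vstring layout. The only thing it shows is that when "long" is the flagged state, a zero-filled entry decodes as an empty short string without any special casing.

```python
# Hypothetical 16-byte packed string entry; NOT NumPy's actual layout.
ENTRY_SIZE = 16
MAX_SHORT = ENTRY_SIZE - 1   # bytes available for in-place string data
LONG_FLAG = 0x80             # assumed: set when data lives outside the array
SIZE_MASK = 0x0F             # assumed: low bits of the last byte hold the short size


def pack_short(data: bytes) -> bytes:
    """Store a short string in place: data, zero padding, then the size byte."""
    if len(data) > MAX_SHORT:
        raise ValueError("too long for in-place storage")
    return data.ljust(MAX_SHORT, b"\0") + bytes([len(data) & SIZE_MASK])


def is_long(entry: bytes) -> bool:
    """An entry is 'long' only if the flag bit is explicitly set."""
    return bool(entry[-1] & LONG_FLAG)


def unpack_short(entry: bytes) -> bytes:
    """Decode an in-place string; long entries are out of scope for this sketch."""
    if is_long(entry):
        raise NotImplementedError("long strings are not modelled here")
    size = entry[-1] & SIZE_MASK
    return entry[:size]


# A freshly zeroed buffer needs no initialization to be a valid empty string.
zeroed = bytes(ENTRY_SIZE)
empty = unpack_short(zeroed)
roundtrip = unpack_short(pack_short(b"numpy"))
```

With the opposite convention (a flag that marks short strings), the same zeroed entry would read as "not short", that is, a long string with a null location, which is the case the item above wants to avoid.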
Things possibly for longer term
- Support in structured arrays (perhaps not super useful, but could be seen as similar to `object`).
- Expose more of the currently private `NpyString` API. This will depend on feedback from users.
- Fix `longdouble` to string, which is listed as broken in a `TODO` item in `casts.c`. Isn't `dragon` able to deal with it?
- Add a DType API callback that triggers when the initial filling of a newly created array completes (e.g., after `PyArray_FromAny` finishes filling a new array). We could use this to trim the arena buffer to the exact size needed by the array. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25.
- Might it make sense on 64bit systems, where normally the size is 16 bytes, to have an 8-byte version (short strings up to 7, only arena allocations for long ones; might use the `size` argument...)?
- In principle, `.view(StringDType())` could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
- Dealing with array scalars vs `str` scalars - see also more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897. ENH: add a StringDType scalar type that wraps a UTF-8 string #28165
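The over-allocation mentioned in the arena item above can be estimated with a small simulation. This is only a sketch: the 1.25 growth factor comes from the text, while the starting capacity, the rounding, and the function name `arena_capacity_after` are assumptions for illustration and are not taken from the NumPy sources.

```python
import math

GROWTH_FACTOR = 1.25  # growth factor quoted in the item above


def arena_capacity_after(sizes, initial_capacity=64):
    """Return (capacity, used) after appending strings of the given byte sizes.

    `initial_capacity` is a made-up positive starting size for this sketch.
    """
    capacity = initial_capacity
    used = 0
    for size in sizes:
        used += size
        # Grow geometrically until the new string fits.
        while capacity < used:
            capacity = math.ceil(capacity * GROWTH_FACTOR)
    return capacity, used


# Fifty 100-byte strings appended one at a time.
capacity, used = arena_capacity_after([100] * 50)
slack = capacity - used  # bytes a "trim after fill" callback could give back
```

Once the buffer has grown at least once, the slack in this model stays below a quarter of the bytes in use. That is the amount a callback fired after the initial fill could reclaim by trimming the buffer to `used`.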
Small warts, possibly not solvable
- `StringDType` is added to `add_dtype_helper` late in the initialization of `multiarraymodule`; can this be easier?
- Can the cases of having and not having gil be factored out, so that one doesn't get the kind of hybrid stuff in `load_non_nullable_string` with its `has_gil` argument?
- Having `dtype.hasobject` be true is logical, but the name is not quite right.