TSK: Follow-up things for stringdtype #25693

@mhvk

Proposed new feature or change:

A number of points arose in discussion of #25347 that were deemed better done in follow-up, to not distract from the meat of that (very large) PR. This issue is to remind us of those.

  1. The doc for StringDType has a size argument that was needed in development but not for actual release. It can be removed. This should be done before the NumPy 2.0 release because it is public API. (Though might it be useful for a short version that only uses arena? see below. Probably can put it back if needed...)
  2. The add ufunc needs a promoter so addition with a python string works.
  3. Add a cython interface for the stringdtype C API.
  4. It is likely better to use a flag for "long strings" (stored outside of the numpy array proper) instead of one for short ones (stored inside), so that an all-zero entry correctly implies a zero-length string (see API: Introduce stringdtype [NEP 55] #25347 (comment)).
  5. Refactor the flags in the vstring implementation to use bitfields. This will improve clarity and eliminate complicated bitflag math.
  6. Possibly, the arena should have more recoverable flags/size information (see API: Introduce stringdtype [NEP 55] #25347 (comment))
  7. Investigate refactoring new_stringdtype_instance into tp_init
  8. Replace the long #define in casts.c with templating (or .c.src)
  9. Replace ufunc wrappers with passing functions into *auxdata (see here, here, here, and here) [e.g., minimum and maximum; the various comparison functions; the templated unary functions; find, rfind and maybe count; startswith and endswith; lstrip, rstrip and strip, plus whitespace versions]. Attempt in MAINT: combine string ufuncs by passing on auxilliary data #25796
  10. Check array2string formatter overrides.
  11. Adjust error messages referring to "object array" (e.g., a.view('2u8') currently errors with "Cannot change data-type for object array.").
  12. Have some helper functions that make it easy for StringDType ufuncs to use out arguments, also for in-place operations.
  13. See whether null-handling code in ufunc loops and casts can be consolidated into a helper function to reduce code duplication.
  14. Add checks for very long strings, see MAINT: Ensure correct handling for very large unicode strings #27875
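
The reasoning behind item 4 can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: the 16-byte entry layout, the `LONG_FLAG` bit, the `SIZE_MASK`, and the helper names `pack_short`, `is_long` and `unpack_short` are all hypothetical and are not taken from the actual vstring implementation. It only shows why flagging the long strings lets a zeroed entry decode as an empty short string.

```python
ENTRY_SIZE = 16   # assumed size of one array entry on a 64-bit system
LONG_FLAG = 0x80  # hypothetical flag bit: set means "stored outside the array"
SIZE_MASK = 0x0F  # hypothetical: low bits of the last byte hold a short string's length


def pack_short(data: bytes) -> bytes:
    """Pack a short string into one entry; the long-string flag stays clear."""
    if len(data) > ENTRY_SIZE - 1:
        raise ValueError("too long for in-entry storage")
    payload = data.ljust(ENTRY_SIZE - 1, b"\x00")
    return payload + bytes([len(data)])


def is_long(entry: bytes) -> bool:
    """Return True if the entry is flagged as living outside the array."""
    return bool(entry[-1] & LONG_FLAG)


def unpack_short(entry: bytes) -> bytes:
    """Read back a short string stored directly in the entry."""
    if is_long(entry):
        raise ValueError("entry refers to out-of-array storage")
    size = entry[-1] & SIZE_MASK
    return entry[:size]


# An all-zero entry has no flag set and size 0, so it reads as an empty string.
zeroed = bytes(ENTRY_SIZE)
empty = unpack_short(zeroed)
```

If the flag instead marked short strings, the same zeroed entry would be read as a long string with no valid storage behind it, which is the case the item wants to avoid.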

Things possibly for longer term

  • Support in structured arrays (perhaps not super useful, but could be seen as similar to object).
  • Expose more of the currently private NpyString API. This will depend on feedback from users.
  • Fix longdouble to string, which is listed as broken in a TODO item in casts.c. Isn't Dragon4 able to deal with it?
  • Add a DType API callback that triggers when the initial filling of a newly created array finishes (e.g. after PyArray_FromAny finishes filling a new array). We could use this to trim the arena buffer to the exact size needed by the array. Currently we over-allocate because the buffer grows exponentially with a growth factor of 1.25.
  • Might it make sense on 64-bit systems, where normally the size is 16 bytes, to have an 8-byte version (short strings up to 7 bytes, only arena allocations for long ones; might use the size argument...)?
  • In principle, .view(StringDType()) could be possible in some cases (e.g., to change the NA behaviour). Would need to share the allocator (and thus have reference counts for that...).
  • Dealing with array scalars vs str scalars - see also more general discussion about array scalars in ENH: No longer auto-convert array scalars to numpy scalars in ufuncs (and elsewhere?) #24897. ENH: add a StringDType scalar type that wraps a UTF-8 string #28165
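
To give a feel for the over-allocation mentioned in the arena-trimming item, here is a rough pure-Python simulation. Only the growth factor of 1.25 comes from the text above; the initial capacity, the rounding, the absence of per-string overhead, and the names `grow_to_fit` and `fill_arena` are assumptions made for illustration and do not describe the real allocator.

```python
import math

GROWTH_FACTOR = 1.25  # growth factor quoted in the issue


def grow_to_fit(capacity: int, needed: int) -> int:
    """Grow capacity in GROWTH_FACTOR steps until it can hold `needed` bytes."""
    while capacity < needed:
        # Always advance by at least one byte so tiny capacities still grow.
        capacity = max(capacity + 1, math.ceil(capacity * GROWTH_FACTOR))
    return capacity


def fill_arena(string_sizes, initial_capacity=64):
    """Return (used, capacity) after appending strings of the given sizes."""
    used = 0
    capacity = initial_capacity
    for size in string_sizes:
        used += size
        capacity = grow_to_fit(capacity, used)
    return used, capacity


used, capacity = fill_arena([100] * 10)
# Bytes that a post-fill callback could give back by trimming the buffer.
slack = capacity - used
```

In this model the slack left after filling is always below roughly a quarter of the used size, which is the amount a trim-to-fit callback could recover.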

Small warts, possibly not solvable

  • StringDType is added to add_dtype_helper late in the initialization of multiarraymodule; can this be easier?
  • Can the cases of having and not having the GIL be factored out, so that one doesn't get the kind of hybrid stuff in load_non_nullable_string with its has_gil argument?
  • Having dtype.hasobject be true is logical, but the name is not quite right.
