ENH: Half-sized StringDType? #26059
Description
Proposed new feature or change:
(Mostly for @ngoldbaum probably)
The new StringDType is quite beautiful but also fairly inefficient for arrays that contain mostly smaller strings, since every array element takes at least 16 bytes to store its information. In principle, an 8-byte version of StringDType exists, as that is used on 32 bit systems, but it cannot directly be used on 64 bit systems, since it cannot store the pointer to a long string stored outside the arena. However, this could be remedied if, instead, this pointer itself is stored inside the arena (and, thus, the actual array entry continues to contain the offset into the arena).
Overall, one could envision a string dtype whose logic for string replacements is also slightly changed, so that offsets into the arena are always kept:
- Initialization is exactly the same as is: short strings are stored in the array, everything else in the arena. The minimum length of an arena entry is thus 8 bytes.
- If a short string is replaced by a longer one (at least 8 bytes), a new entry in the arena is added.
- If an arena string is replaced by a shorter one, it is kept in the arena, even if short enough to be stored in the array, so that the arena offset is not lost (and repeated string replacements can thus not lead the arena to continue to grow; it can never have more entries than the size of the array).
- If an arena string is replaced by a string too long to be stored in the existing entry, it is stored on the heap, with the corresponding pointer stored in the arena entry (which is at least 8 bytes in size, so this fits). The offset into the arena is thus now used to find the pointer to the heap.
- If an entry on the heap is replaced by one that fits in the corresponding arena entry, it will be stored there.
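
To make the rules above concrete, here is a rough Python model of the bookkeeping. It is only a sketch and not how NumPy implements anything: the class `HalfSizedStringArray`, its `where` helper, and the 7-byte inline limit are made up for illustration (the limit is assumed from the 8-byte minimum arena entry), and arena entries are modelled as dicts rather than raw bytes.

```python
INLINE_MAX = 7  # assumed: longest string that fits inline in an 8-byte element


class HalfSizedStringArray:
    """Toy model of the proposed storage rules (hypothetical, not NumPy code)."""

    def __init__(self, strings):
        self.arena = []  # one dict per arena entry; entries are never removed
        self.elements = [self._store_new(s.encode("utf-8")) for s in strings]

    def _store_new(self, data):
        # Short strings live in the element itself; anything else gets a
        # new arena entry whose capacity is the length of that string.
        if len(data) <= INLINE_MAX:
            return ("inline", data)
        self.arena.append({"capacity": len(data), "data": data, "heap": None})
        return ("arena", len(self.arena) - 1)

    def __setitem__(self, i, s):
        data = s.encode("utf-8")
        kind, payload = self.elements[i]
        if kind == "inline":
            # Stays inline if still short, otherwise a new arena entry is added.
            self.elements[i] = self._store_new(data)
            return
        # The element keeps its arena offset from here on.
        entry = self.arena[payload]
        if len(data) <= entry["capacity"]:
            # Fits in the existing entry (even if short enough for inline).
            entry["data"], entry["heap"] = data, None
        else:
            # Too long: the entry now stands in for a pointer to the heap.
            entry["data"], entry["heap"] = None, data

    def __getitem__(self, i):
        kind, payload = self.elements[i]
        if kind == "inline":
            return payload.decode("utf-8")
        entry = self.arena[payload]
        data = entry["heap"] if entry["heap"] is not None else entry["data"]
        return data.decode("utf-8")

    def where(self, i):
        # Report where element i currently lives: inline, arena, or heap.
        kind, payload = self.elements[i]
        if kind == "inline":
            return "inline"
        return "heap" if self.arena[payload]["heap"] is not None else "arena"


arr = HalfSizedStringArray(["abc", "a longer string", "hi"])
arr[1] = "x"                           # short, but stays in its arena entry
arr[1] = "a string far too long for the old entry"  # moves to the heap
arr[1] = "fits again"                  # back into the same arena entry
```

Since an element never gives up its arena offset, `len(arr.arena)` cannot exceed the number of elements, no matter how often strings are replaced.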
Compared to the existing StringDType, this shorter type would be equally efficient for first-time entries, but slightly less efficient for short or long string replacements, as those both require an extra lookup. It also has the limitation that the arena cannot be larger than 4 GB. It would generally use quite a bit less memory if the array is mostly short strings, except if they are mostly 8-15 bytes long: those currently fit inside the 16-byte array element, but would need an arena entry in the shorter type.
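
As a back-of-the-envelope check of the memory argument, the per-string cost of the two layouts can be estimated as follows. This is a simplified estimate, not a measurement: the function names are made up, the inline limits of 15 and 7 bytes are assumed from the 16- and 8-byte element sizes, and any per-entry size header in the arena is ignored.

```python
def current_bytes(n):
    # 16-byte element; strings of up to 15 bytes assumed to be stored inline.
    return 16 if n <= 15 else 16 + n


def proposed_bytes(n):
    # 8-byte element; strings of up to 7 bytes assumed to be stored inline.
    return 8 if n <= 7 else 8 + n


for n in (3, 7, 8, 15, 16, 40):
    print(n, current_bytes(n), proposed_bytes(n))
```

Under these assumptions the shorter type saves 8 bytes per string for lengths up to 7 and from 16 upward, breaks even at 8 bytes, and costs up to 7 bytes more for lengths 9-15.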
Would this be worth implementing, perhaps with a different initialization option for StringDType?