[String] a new component for object-oriented strings management with an abstract unit system #33553
|
No way to unignore case? And no way to know whether the current object is ignoring case? This makes the API unusable for code wanting to deal with the string in a case-sensitive way while accepting an external string object. Also, should we merge this new component in 4.4, which would mean that its first release is already non-experimental? We are not allowed to have experimental components in LTS versions, per our LTS policy. |
That's something we need to decide indeed. On my side, I think we can make it non-experimental. |
|
and what happens for all methods accepting a string as argument, when passing non-UTF-8 strings to the method on a Regarding the naming, should Note that these comments are based purely on your PR description. I haven't looked at the code yet. |
an
I think UTF-8 is more common vocabulary. The previous PR used |
|
@nicolas-grekas but the whole component is about UTF-8 strings. AFAICT, even BinaryString is not meant to operate on other encodings, as it does not validate that the string is valid UTF-8 before converting it to other implementations. |
It does, dunno why you think otherwise. If you try to convert a random binary string to UTF-8/Grapheme, you'll get an
Thus the name of the component. |
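The validation step discussed above can be sketched in plain PHP. This is only an illustration, not the component's code: `isValidUtf8()` and `toUtf8OrFail()` are hypothetical helpers, and the exception class is an assumption. It relies on the fact that PCRE refuses subjects that are not well-formed UTF-8 when the `u` modifier is set.

```php
<?php

/**
 * Hypothetical helper, not part of the component's API.
 * Returns true when $bytes is a well-formed UTF-8 sequence.
 */
function isValidUtf8(string $bytes): bool
{
    // With the /u modifier, preg_match() returns false (not 0 or 1)
    // when the subject is not valid UTF-8.
    return 1 === preg_match('//u', $bytes);
}

/**
 * Hypothetical conversion step: refuse invalid input instead of guessing.
 * The exception class is an assumption chosen for illustration.
 */
function toUtf8OrFail(string $bytes): string
{
    if (!isValidUtf8($bytes)) {
        throw new InvalidArgumentException('Invalid UTF-8 string.');
    }

    return $bytes;
}

var_dump(isValidUtf8("caf\u{e9}")); // valid: U+00E9 encoded as two bytes
var_dump(isValidUtf8("caf\xE9"));   // invalid: lone Latin-1 byte
```

A binary string that fails this check would have to be transcoded from its real encoding before being treated as UTF-8.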
|
To me, this should go as experimental in 5.0. |
|
Definitely supporting this :-) Nice ! |
|
Sorry to sound naive, but I can't find in this pull request or the previous one, some brief explanation about when/where should developers use this. Why/when should we use these classes/methods instead of the normal str_ PHP functions or the mb_str UTF8 functions? Thanks! Note: I'm not questioning this ... I just want to know where this fits in Symfony developers and Symfony itself. Thanks a lot! edit: see #33553 (comment) |
|
For your consideration, we could turn these 4 methods:
function ensureLeft(string $prefix): self
function ensureRight(string $suffix): self
function padLeft(int $length, string $padStr = ' '): self
function padRight(int $length, string $padStr = ' '): self
into these 2 methods if we change the order of the arguments:
function padLeft(string $padStr = ' ', int $length = null): self
function padRight(string $padStr = ' ', int $length = null): self
Example:
// BEFORE
$s1 = u('lorem')->ensureLeft('abc');
// $s1 = 'abclorem'
$s2 = u('lorem')->ensureRight('abc');
// $s2 = 'loremabc'
$s3 = u('lorem')->padLeft(8, 'abc');
// $s3 = 'abcabcablorem'
$s4 = u('lorem')->padRight(8, 'abc');
// $s4 = 'loremabcabcab'
// AFTER
$s1 = u('lorem')->padLeft('abc');
// $s1 = 'abclorem'
$s2 = u('lorem')->padRight('abc');
// $s2 = 'loremabc'
$s3 = u('lorem')->padLeft('abc', 8);
// $s3 = 'abcabcablorem'
$s4 = u('lorem')->padRight('abc', 8);
// $s4 = 'loremabcabcab' |
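To compare the two families of methods, here is a rough sketch with native PHP functions. It rests on two assumptions that the thread does not confirm: that `ensureLeft()` prepends the prefix only when it is missing, and that `padLeft()` treats `$length` as the target total length, as `str_pad()` does. The `*Sketch` functions are hypothetical stand-ins and work on bytes only.

```php
<?php

// Hypothetical stand-in for ensureLeft(): prepend $prefix only if it is missing.
function ensureLeftSketch(string $s, string $prefix): string
{
    return 0 === strncmp($s, $prefix, strlen($prefix)) ? $s : $prefix.$s;
}

// Hypothetical stand-in for padLeft(): pad up to a total length, as str_pad() does.
function padLeftSketch(string $s, int $length, string $padStr = ' '): string
{
    return str_pad($s, $length, $padStr, STR_PAD_LEFT);
}

echo ensureLeftSketch('lorem', 'abc'), "\n";    // abclorem
echo ensureLeftSketch('abclorem', 'abc'), "\n"; // abclorem (prefix already present)
echo padLeftSketch('lorem', 8, 'abc'), "\n";    // abclorem
echo padLeftSketch('lorem', 10, 'abc'), "\n";   // abcablorem
echo padLeftSketch('abclorem', 8, 'abc'), "\n"; // abclorem (already 8 bytes long)
```

Under these assumptions the two operations agree only by coincidence: one depends on the content of the string, the other on its length.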
All the time would be fine. e.g. More specifically, I've observed ppl randomly add an |
This would be totally unexpected to me. I've seen no other libraries have this API and I'm not sure it works actually.
Absolutely! That's a critical design concern, not just an implementation detail :) I added a note about it in the description. Thanks for asking. |
azjezz left a comment
This is great. I believe this would make it easier for developers to deal with string encoding. Just a few notes about method naming :)
|
@nicolas-grekas thanks for the explanation. It's perfectly clear now! Another question: some methods are called "left", "right" instead of "prefix/suffix" or "start/end". What happens when the text is Arabic/Persian/Hebrew and uses right-to-left direction? For example, |
|
We have been using |
|
It looks amazing. I am curious how it would work together with the rest of the ecosystem. Let's say compatibility with doctrine, intl, symfony/validator, etc.? I am sure it will take time before it trickles down to other components, but the future seems bright ;) |
azjezz left a comment
I suggest adding AbstractString::contains(string ...$needles): bool, which returns true if the string contains at least one of the needles.
if ($text->contains(...$blacklisted)) {
echo 'nope!';
}
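The suggested variadic `contains()` could behave roughly like the following free function. `containsAny()` is a hypothetical name used for illustration, and the handling of empty needles is an assumption, not something the suggestion specifies.

```php
<?php

// Hypothetical free function mirroring the suggested contains(string ...$needles).
function containsAny(string $haystack, string ...$needles): bool
{
    foreach ($needles as $needle) {
        // Empty needles are skipped: strpos() treats them differently across PHP versions.
        if ('' !== $needle && false !== strpos($haystack, $needle)) {
            return true;
        }
    }

    return false;
}

$blacklisted = ['foo', 'bar'];

if (containsAny('some bar text', ...$blacklisted)) {
    echo 'nope!';
}
```

Returning on the first match keeps the check cheap when the list of needles is long.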
|
Now with more tests in the 3rd commit, courtesy of @gharlan, who spotted issues while working on it, all solved now. Thank you! |
xabbuh left a comment
I am not finished reviewing this PR, but here are some ideas I got so far.
    return \strlen($this->string) - \strlen($suffix) === ($this->ignoreCase ? strripos($this->string, $suffix) : strrpos($this->string, $suffix));
}

public function equalsTo($string): bool
Suggested change:
/**
 * @param AbstractString|string|string[] $string
 */
public function equalsTo($string): bool
Can be useful for autocompletion, static analysis and so on. Other methods could benefit from this doc.
Any object that implements __toString() is allowed actually. That's what string means already to me. What's the relation with autocompletion?
|
Thank you @nicolas-grekas. |
|
Thank you everyone for the reviews, it's been invaluable! |
I think this answer is incomplete, because there is nothing that stops someone from calling a function like This might also happen unintentionally when someone intends to do a lot of case-insensitive operations, so they do You might say "Use the code in a wrong way and you will get wrong results", but I still think that this behavior is kind of odd. Maybe have the |

[EDIT: classes have been renamed in #33816]
This is a reboot of #22184 (thanks @hhamon for working on it) and a generalization of my previous work on the topic (patchwork/utf8). Unlike existing libraries (including patchwork/utf8), this component provides a unified API for the 3 unit systems of strings: bytes, code points and grapheme clusters.
The unified API is defined by the AbstractString class. It has 2 direct child classes: BinaryString and AbstractUnicodeString, itself extended by Utf8String and GraphemeString.
All objects are immutable and provide clear edge-case semantics, using exceptions and/or (nullable) types!
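To make the three unit systems concrete, here is a small sketch in plain PHP that counts the same string in bytes, code points and grapheme clusters. It does not use the component. The counts are obtained with `strlen()` and PCRE only, to avoid depending on extensions; `mb_strlen()` and `grapheme_strlen()` are the usual functions for the last two.

```php
<?php

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
$s = "e\u{301}";

// Bytes: what strlen() counts.
$bytes = strlen($s);

// Code points: with the /u modifier, "." matches one UTF-8 character.
$codePoints = preg_match_all('/./su', $s);

// Grapheme clusters: \X matches one user-perceived character.
$graphemes = preg_match_all('/\X/u', $s);

printf("bytes=%d codePoints=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// bytes=3 codePoints=2 graphemes=1
```

The same text has three different lengths, which is why an operation such as a length or a slice needs to state which unit it works in.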
Two helper functions are provided to create such strings:
GraphemeString is the most linguistic-friendly variant of them, which means it's the one ppl should use most of the time when dealing with written text.
Future ideas:
- is*()?, *Encode()?, etc.)
- width() to improve truncate() and wordwrap()
- move method slug() to a dedicated locale-aware service class: see [String] Introduce a locale-aware Slugger in the String component #33768
Out of (current) scope:
Here is the unified API I'm proposing in this PR, borrowed from looking at many existing libraries, but also Java, Python, JavaScript and Go.
AbstractUnicodeString adds these:
and BinaryString:
Case-insensitive operations are done with the ignoreCase() method, e.g. b('abc')->ignoreCase()->indexOf('B') will return 1.
For reference, CLDR transliterations (used in the ascii() method) are defined here: https://github.com/unicode-org/cldr/tree/master/common/transforms
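For byte strings, the case-insensitive lookup described above corresponds to what native `stripos()` does. The sketch below is not the component's code: `indexOfSketch()` is a hypothetical helper, and returning `null` when nothing matches is an assumption based on the "(nullable) types" remark in the description.

```php
<?php

// Hypothetical helper, not the component's code: indexOf with an explicit flag.
function indexOfSketch(string $haystack, string $needle, bool $ignoreCase = false): ?int
{
    $pos = $ignoreCase ? stripos($haystack, $needle) : strpos($haystack, $needle);

    // strpos()/stripos() return false when nothing matches; map that to null.
    return false === $pos ? null : $pos;
}

var_dump(indexOfSketch('abc', 'B', true)); // int(1)
var_dump(indexOfSketch('abc', 'B'));       // NULL
```

Passing the flag per call, as this sketch does, is one alternative to carrying it on the object, which is the concern raised earlier in the thread.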