[String] a new component for object-oriented strings management with an abstract unit system by nicolas-grekas · Pull Request #33553 · symfony/symfony

nicolas-grekas · 2019-09-11T10:20:49Z

| Q | A |
| --- | --- |
| Branch? | master |
| Bug fix? | no |
| New feature? | yes |
| Deprecations? | no |
| Tickets | - |
| License | MIT |
| Doc PR | symfony/symfony-docs#12376 |

[EDIT: classes have been renamed in #33816]

This is a reboot of #22184 (thanks @hhamon for working on it) and a generalization of my previous work on the topic (patchwork/utf8). Unlike existing libraries (including patchwork/utf8), this component provides a unified API for the 3 unit systems of strings: bytes, code points and grapheme clusters.
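
To make the three unit systems concrete, here is a minimal sketch using only native PHP functions rather than the component itself. It assumes the mbstring and intl extensions are loaded; the sample string is just an illustration.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
$text = "e\u{301}";

$bytes = strlen($text);                   // unit: bytes
$codePoints = mb_strlen($text, 'UTF-8');  // unit: code points (ext-mbstring)
$graphemes = grapheme_strlen($text);      // unit: grapheme clusters (ext-intl)

printf("bytes=%d codePoints=%d graphemes=%d\n", $bytes, $codePoints, $graphemes);
// prints "bytes=3 codePoints=2 graphemes=1"
```

The same text has three different lengths depending on the unit, which is the ambiguity the unified API is meant to make explicit.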

The unified API is defined by the AbstractString class. It has 2 direct child classes: BinaryString and AbstractUnicodeString, itself extended by Utf8String and GraphemeString.

All objects are immutable and provide clear edge-case semantics, using exceptions and/or (nullable) types!

Two helper functions are provided to create such strings:

new GraphemeString('foo') == u('foo'); // when dealing with Unicode, prefer grapheme units
new BinaryString('foo') == b('foo');

GraphemeString is the most linguistically friendly of these variants, which means it's the one people should use most of the time when dealing with written text.

Future ideas:

improve tests
add more docblocks (only where they'd add value!)
consider adding more methods in the string API (is*()?, *Encode()?, etc.)
first class Emoji support
merge the Inflector component into this one
use width() to improve truncate() and wordwrap()
~~move method slug() to a dedicated locale-aware service class~~ see [String] Introduce a locale-aware Slugger in the String component #33768
propose your ideas (send PRs after merge)

Out of (current) scope:

what intl provides (collations, transliterations, confusables, segmentation, etc)

Here is the unified API I'm proposing in this PR, borrowed from looking at many existing libraries, but also Java, Python, JavaScript and Go.

function __construct(string $string = '');
static function unwrap(array $values): array;
static function wrap(array $values): array;
function after($needle, bool $includeNeedle = false, int $offset = 0): self;
function afterLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function append(string ...$suffix): self;
function before($needle, bool $includeNeedle = false, int $offset = 0): self;
function beforeLast($needle, bool $includeNeedle = false, int $offset = 0): self;
function camel(): self;
function chunk(int $length = 1): array;
function collapseWhitespace(): self;
function endsWith($suffix): bool;
function ensureEnd(string $suffix): self;
function ensureStart(string $prefix): self;
function equalsTo($string): bool;
function folded(): self;
function ignoreCase(): self;
function indexOf($needle, int $offset = 0): ?int;
function indexOfLast($needle, int $offset = 0): ?int;
function isEmpty(): bool;
function join(array $strings): self;
function jsonSerialize(): string;
function length(): int;
function lower(): self;
function match(string $pattern, int $flags = 0, int $offset = 0): array;
function padBoth(int $length, string $padStr = ' '): self;
function padEnd(int $length, string $padStr = ' '): self;
function padStart(int $length, string $padStr = ' '): self;
function prepend(string ...$prefix): self;
function repeat(int $multiplier): self;
function replace(string $from, string $to): self;
function replaceMatches(string $fromPattern, $to): self;
function slice(int $start = 0, int $length = null): self;
function snake(): self;
function splice(string $replacement, int $start = 0, int $length = null): self;
function split(string $delimiter, int $limit = null, int $flags = null): array;
function startsWith($prefix): bool;
function title(bool $allWords = false): self;
function toBinary(string $toEncoding = null): BinaryString;
function toGrapheme(): GraphemeString;
function toUtf8(): Utf8String;
function trim(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimEnd(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function trimStart(string $chars = " \t\n\r\0\x0B\x0C\u{A0}\u{FEFF}"): self;
function truncate(int $length, string $ellipsis = ''): self;
function upper(): self;
function width(bool $ignoreAnsiDecoration = true): int;
function wordwrap(int $width = 75, string $break = "\n", bool $cut = false): self;
function __clone();
function __toString(): string;

AbstractUnicodeString adds these:

static function fromCodePoints(int ...$codes): self;
function ascii(array $rules = []): self;
function codePoint(int $index = 0): ?int;
function folded(bool $compat = true): parent;
function normalize(int $form = self::NFC): self;
function slug(string $separator = '-'): self;

and BinaryString:

static function fromRandom(int $length = 16): self;
function byteCode(int $index = 0): ?int;
function isUtf8(): bool;
function toUtf8(string $fromEncoding = null): Utf8String;
function toGrapheme(string $fromEncoding = null): GraphemeString;

Case insensitive operations are done with the ignoreCase() method.
e.g. b('abc')->ignoreCase()->indexOf('B') will return 1.

For reference, CLDR transliterations (used in the ascii() method) are defined here:
https://github.com/unicode-org/cldr/tree/master/common/transforms

stof · 2019-09-11T10:29:39Z

No way to unignore case? And no way to know whether the current object is ignoring case? This makes the API unusable for code that wants to handle the string in a case-sensitive way while accepting an external string object.

Also, should we merge this new component in 4.4, which would mean that its first release is already non-experimental? We are not allowed to have experimental components in LTS versions, per our LTS policy.

nicolas-grekas · 2019-09-11T10:32:57Z

No way to unignore case? And no way to know whether the current object is ignoring case?

->ignoreCase() applies only to the very next call in the fluent API chain. This should answer both your questions. See AbstractString::__clone()
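
As a rough illustration of that mechanism, here is a minimal sketch in which the flag is reset whenever an object is cloned. `TinyString` is a hypothetical stand-in written for this example, not the component's actual code; it only assumes that string-returning methods work on a clone, as immutable objects would.

```php
<?php
// Hypothetical miniature of the idea: the case flag lives on the object
// and is cleared by __clone(), so it cannot leak into derived strings.
final class TinyString
{
    private $string;
    private $ignoreCase = false;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    public function __clone()
    {
        $this->ignoreCase = false; // every copy starts case-sensitive
    }

    public function ignoreCase(): self
    {
        $str = clone $this;        // __clone() runs first and resets the flag
        $str->ignoreCase = true;   // then the copy is flagged
        return $str;
    }

    public function upper(): self
    {
        $str = clone $this;        // the flag is dropped here
        $str->string = strtoupper($str->string);
        return $str;
    }

    public function indexOf(string $needle): ?int
    {
        $i = $this->ignoreCase ? stripos($this->string, $needle) : strpos($this->string, $needle);
        return false === $i ? null : $i;
    }
}

$s = new TinyString('abc');
var_dump($s->ignoreCase()->indexOf('B'));          // int(1)
var_dump($s->indexOf('B'));                        // NULL: $s itself was never flagged
var_dump($s->ignoreCase()->upper()->indexOf('b')); // NULL: upper() returned an unflagged clone
```

In this sketch the flag survives only on the object returned by ignoreCase() itself; any method that returns a new string drops it.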

Also should we merge this new component in 4.4

That's something we need to decide indeed. On my side, I think we can make it non-experimental.

stof · 2019-09-11T10:33:08Z

And what happens for all methods accepting a string as argument, when passing non-UTF-8 strings to the method on a Utf8String or GraphemeString?

Regarding the naming, should Utf8String be renamed to highlight it is about code points? AFAIK, GraphemeString also expects the string to be in UTF-8.

Note that these comments are based purely on your PR description. I haven't looked at the code yet.

nicolas-grekas · 2019-09-11T10:37:51Z

what happens for all methods accepting a string as argument, when passing non-UTF-8 strings to the method on a Utf8String or GraphemeString?

an InvalidArgumentException is thrown
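
A minimal sketch of what such a check could look like in plain PHP follows. `assertValidUtf8()` is a hypothetical helper named for this example, and the exception message is made up; the thread does not show how the component performs the validation.

```php
<?php
// Hypothetical helper: reject any input that is not valid UTF-8.
function assertValidUtf8(string $string): string
{
    // With the "u" modifier, PCRE refuses subjects that are not valid UTF-8,
    // so preg_match() returns false instead of 1 for the empty pattern.
    if (1 !== preg_match('//u', $string)) {
        throw new InvalidArgumentException('Invalid UTF-8 string.');
    }

    return $string;
}

assertValidUtf8("caf\u{e9}");      // valid: U+00E9 encoded as two UTF-8 bytes

try {
    assertValidUtf8("caf\xE9");    // a lone ISO-8859-1 byte, not valid UTF-8
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";   // prints "Invalid UTF-8 string."
}
```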

should Utf8String be renamed to highlight it is about code points?

I think UTF-8 is more common vocabulary. The previous PR used CodePoint indeed, but this is cryptic to many, and doesn't convey the technical encoding scheme (it could use UTF-16BE/LE, etc., nothing would tell). That's why I think Utf8String is better.

stof · 2019-09-11T10:40:47Z

@nicolas-grekas but the whole component is about UTF-8 strings. AFAICT, even BinaryString is not meant to operate on other encodings, as it does not validate that the string is valid UTF-8 before converting it to other implementations.
Btw, this means that naming the component String might also be too generic.

nicolas-grekas · 2019-09-11T10:43:36Z

BinaryString is not meant to operate on other encodings, as it does not validate that the string is valid UTF-8 before converting it to other implementations

It does, dunno why you think otherwise. If you try to convert a random binary string to UTF-8/Grapheme, you'll get an InvalidArgumentException too.

BinaryString does what its name says: it can handle any binary string and doesn't care about the encoding, like the native PHP string functions, just with an OOP API.

Thus the name of the component.

fabpot · 2019-09-11T10:44:32Z

To me, this should go as experimental in 5.0.

drupol · 2019-09-11T10:44:36Z

Definitely supporting this :-) Nice !

javiereguiluz · 2019-09-11T10:46:37Z

Sorry to sound naive, but I can't find in this pull request or the previous one, some brief explanation about when/where should developers use this.

Why/when should we use these classes/methods instead of the normal str_ PHP functions or the mb_str UTF8 functions? Thanks!

Note: I'm not questioning this ... I just want to know where this fits for Symfony developers and for Symfony itself. Thanks a lot!

edit: see #33553 (comment)

javiereguiluz · 2019-09-11T10:56:34Z

For your consideration, we could turn these 4 methods:

function ensureLeft(string $prefix): self
function ensureRight(string $suffix): self
function padLeft(int $length, string $padStr = ' '): self
function padRight(int $length, string $padStr = ' '): self

Into these 2 methods if we change the order of the arguments:

function padLeft(string $padStr = ' ', int $length = null): self
function padRight(string $padStr = ' ', int $length = null): self

Example:

// BEFORE
$s1 = u('lorem')->ensureLeft('abc');
// $s1 = 'abclorem'

$s2 = u('lorem')->ensureRight('abc');
// $s2 = 'loremabc'

$s3 = u('lorem')->padLeft(8, 'abc');
// $s3 = 'abcabcablorem'

$s4 = u('lorem')->padRight(8, 'abc');
// $s4 = 'loremabcabcab'


// AFTER
$s1 = u('lorem')->padLeft('abc');
// $s1 = 'abclorem'

$s2 = u('lorem')->padRight('abc');
// $s2 = 'loremabc'

$s3 = u('lorem')->padLeft('abc', 8);
// $s3 = 'abcabcablorem'

$s4 = u('lorem')->padRight('abc', 8);
// $s4 = 'loremabcabcab'

nicolas-grekas · 2019-09-11T11:08:59Z

when/where should developers use this.

All the time would be fine. e.g. $matches = $string->match('/some-regexp/') is a much more friendly API than preg_match('/some-regexp/', $string, $matches) (even more so if you consider error handling).
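
To illustrate the error-handling point, here is a minimal sketch of a wrapper that returns the matches and turns a preg_match() failure into an exception. `matchPattern()` is a hypothetical function written for this example; it is not the component's match() implementation.

```php
<?php
// Hypothetical wrapper: return the matches, throw when PCRE reports a failure.
function matchPattern(string $subject, string $pattern): array
{
    $matches = [];

    // @ silences the warning raised for a malformed pattern; the failure is
    // reported through the exception instead.
    if (false === @preg_match($pattern, $subject, $matches)) {
        throw new RuntimeException('Matching failed, PCRE error code '.preg_last_error());
    }

    return $matches; // an empty array when the pattern does not match
}

$matches = matchPattern('2019-09-11', '/(\d+)-(\d+)/');
// $matches is ['2019-09', '2019', '09']
```

With the native function, callers have to pass an output variable by reference and remember to compare the return value against false themselves.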

More specifically, I've observed people randomly add an mb_ prefix to string functions and magically expect this to fix their encoding issues. This is way too complex right now; doing it correctly is hard. e.g. the Console component deals with UTF-8 strings everywhere, and it's not pretty. This component would help a lot there. Twig is another place where strings are heavily manipulated and where grapheme support is actually missing. It would benefit from the component too.

nicolas-grekas · 2019-09-11T11:11:55Z

For your consideration, we could turn these 4 methods:
Into these 2 methods if we change the order of the arguments:

This would be totally unexpected to me. I've seen no other libraries have this API and I'm not sure it works actually.

Is it in your plans that the methods returning self return a new mutated reference, keeping the original one intact? If not/yes, why?

Absolutely! That's a critical design concern, not just an implementation detail :) I added a note about it in the description. Thanks for asking.

azjezz

This is great, I believe this would make it easier for developers to deal with string encoding. Just a few notes about method naming :)

javiereguiluz · 2019-09-11T11:19:58Z

@nicolas-grekas thanks for the explanation. It's perfectly clear now!

Another question: some methods are called "left", "right" instead of "prefix/suffix" or "start/end". What happens when the text is Arabic/Persian/Hebrew and uses right-to-left direction? For example, trimRight() removes things at the end of English text ... but at the beginning of Arabic text?

leofeyer · 2019-09-11T11:20:17Z

We have been using tchwork/utf8 in Contao for years and it really is essential if you work with multiple languages beyond the ASCII character range. So +1 for adding this in Symfony and keep up the good work @nicolas-grekas. 👍

Devristo · 2019-09-11T11:55:01Z

It looks amazing. I am curious how it would work together with the rest of the ecosystem. Let's say compatibility with doctrine, intl, symfony/validator, etc.? I am sure it will take time before it trickles down to other components, but the future seems bright ;)

azjezz

I suggest adding AbstractString::contains(string ...$needles): bool, which returns true if the string contains at least one of the needles.

if ($text->contains(...$blacklisted)) {
  echo 'nope!';
}
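
A minimal sketch of the suggested semantics, written as a free function over native strings, might look like this. `containsAny()` is a hypothetical name, and skipping empty needles is an assumption of this example, not something the suggestion specifies.

```php
<?php
// Hypothetical sketch: true as soon as one needle occurs in the haystack.
function containsAny(string $haystack, string ...$needles): bool
{
    foreach ($needles as $needle) {
        // Assumption: an empty needle never counts as a match.
        if ('' !== $needle && false !== strpos($haystack, $needle)) {
            return true;
        }
    }

    return false;
}

$blacklisted = ['spam', 'scam'];

if (containsAny('this is not spam', ...$blacklisted)) {
    echo "nope!\n";
}
```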

nicolas-grekas · 2019-09-23T10:27:34Z

Now with more tests in the 3rd commit, courtesy of @gharlan, who spotted issues while working on it, all solved now. Thank you!

xabbuh

I am not finished reviewing this PR, but here are some ideas I got so far.

tigitz · 2019-09-24T23:22:53Z

+        return \strlen($this->string) - \strlen($suffix) === ($this->ignoreCase ? strripos($this->string, $suffix) : strrpos($this->string, $suffix));
+    }
+
+    public function equalsTo($string): bool


Suggested change

-    public function equalsTo($string): bool
+    /**
+     * @param AbstractString|string|string[] $string
+     */
+    public function equalsTo($string): bool

Can be useful for autocompletion, static analysis and so on. Other methods could benefit from this doc.

Any object that implements __toString() is allowed actually. That's what string means already to me. What's the relation with autocompletion?

I mean that when working with PhpStorm it seems that it can help you autocomplete the variable to pass in a function given what's available in the scope if the function is properly documented:

As you can see above $id is not suggested here as it doesn't respect the documented type hint

fabpot · 2019-09-26T08:14:33Z

Thank you @nicolas-grekas.

nicolas-grekas · 2019-09-26T08:33:37Z

Thank you everyone for the reviews, it's been invaluable!
Please send PRs for any follow-ups now!

MaPePeR · 2019-11-21T12:53:23Z

No way to unignore case? And no way to know whether the current object is ignoring case?

->ignoreCase() applies only to the very next call in the fluent API chain. This should answer both your questions. See AbstractString::__clone()

I think this answer is incomplete, because there is nothing that stops someone from calling a function like foo(u("abc")->ignoreCase()), which will then result in every operation in that function being case insensitive unless the function creates a clone of the argument.

This might also happen unintentionally when someone intends to do a lot of case insensitive operations, so they do $a = u("abc")->ignoreCase(); and later a foo($a) is added and strange things might happen.

You might say "Use the code in a wrong way and you will get wrong results", but I still think that this behavior is kind of odd.
I like the style of writing/reading it - that part is ok - but because we cannot overload the assignment operator, like in C++, to clear the ignoreCase flag, this can cause some weird side effects.

Maybe have the ignoreCase() function return an object of a different class, that is not part of the AbstractString hierarchy and implements a subset of the string functions, so we can at least protect against this with type hints?
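
The idea in this last suggestion could be sketched roughly as follows. `PlainString` and `CaseInsensitiveOps` are hypothetical classes invented for this example; this is not how the component behaves, only an illustration of the proposed alternative.

```php
<?php
// Hypothetical: case-insensitive operations live in a separate class that is
// NOT part of the string hierarchy, so type hints keep it out of other code.
final class CaseInsensitiveOps
{
    private $string;

    public function __construct(string $string)
    {
        $this->string = $string;
    }

    public function indexOf(string $needle): ?int
    {
        $i = stripos($this->string, $needle);
        return false === $i ? null : $i;
    }
}

final class PlainString
{
    private $string;

    public function __construct(string $string = '')
    {
        $this->string = $string;
    }

    public function ignoreCase(): CaseInsensitiveOps
    {
        return new CaseInsensitiveOps($this->string);
    }

    public function indexOf(string $needle): ?int
    {
        $i = strpos($this->string, $needle);
        return false === $i ? null : $i;
    }
}

// A function that expects case-sensitive behaviour says so in its signature.
function firstB(PlainString $s): ?int
{
    return $s->indexOf('B');
}

var_dump((new PlainString('abc'))->ignoreCase()->indexOf('B')); // int(1)
var_dump(firstB(new PlainString('abc')));                       // NULL
// firstB((new PlainString('abc'))->ignoreCase()) would throw a TypeError.
```

The trade-off in this sketch is that the fluent chain ends at ignoreCase(): only the operations reimplemented on the second class remain available after it.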

nicolas-grekas added this to the next milestone Sep 11, 2019

nicolas-grekas mentioned this pull request Sep 11, 2019

[Utf8] New component with Bytes, CodePoints and Graphemes implementations of string objects #22184

Closed

nicolas-grekas force-pushed the string-component branch 2 times, most recently from 6a1137d to c613662 Compare September 11, 2019 10:25

nicolas-grekas force-pushed the string-component branch from c613662 to 8945735 Compare September 11, 2019 10:34

stof reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/CHANGELOG.md Outdated

ro0NL reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/AbstractString.php Outdated

ro0NL reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/AbstractString.php Outdated

nicolas-grekas force-pushed the string-component branch 2 times, most recently from 5ccd5a7 to f9b903b Compare September 11, 2019 10:54

azjezz reviewed Sep 11, 2019

ro0NL reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/GraphemeString.php Outdated

fancyweb reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/Exception/ExceptionInterface.php Outdated

BackEndTea reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/Exception/ExceptionInterface.php Outdated

Comment thread src/Symfony/Component/String/composer.json

azjezz reviewed Sep 11, 2019

Comment thread src/Symfony/Component/String/AbstractString.php Outdated

Comment thread src/Symfony/Component/String/AbstractString.php Outdated

nicolas-grekas force-pushed the string-component branch from f868dc1 to f1de9f7 Compare September 22, 2019 21:18

gharlan mentioned this pull request Sep 22, 2019

string component tests nicolas-grekas/symfony#32

Closed

nicolas-grekas force-pushed the string-component branch 2 times, most recently from f65aee2 to bf60a77 Compare September 23, 2019 06:47

terjebraten-certua reviewed Sep 23, 2019

Comment thread src/Symfony/Component/String/AbstractString.php

nicolas-grekas force-pushed the string-component branch from bf60a77 to 3d9c6e0 Compare September 23, 2019 09:58

nicolas-grekas force-pushed the string-component branch 4 times, most recently from 5676e2a to f398091 Compare September 23, 2019 14:07

jakzal reviewed Sep 23, 2019

Comment thread src/Symfony/Component/String/GraphemeString.php

xabbuh reviewed Sep 23, 2019

Comment thread src/Symfony/Component/String/AbstractString.php

Comment thread src/Symfony/Component/String/AbstractString.php Outdated

tigitz suggested changes Sep 24, 2019

xabbuh requested changes Sep 25, 2019

Comment thread src/Symfony/Component/String/AbstractString.php Outdated

Comment thread src/Symfony/Component/String/Utf8String.php

Comment thread src/Symfony/Component/String/Utf8String.php Outdated

nicolas-grekas and others added 3 commits September 25, 2019 16:38

[String] a new component for object-oriented strings management with …

012e92a

…an abstract unit system

[String] add tests

82a0095

[String] add more tests

dd8745a

fabpot approved these changes Sep 26, 2019

fabpot mentioned this pull request Sep 26, 2019

[String] a new component for object-oriented strings management with an… symfony/symfony-docs#12376

Closed

barryvdh mentioned this pull request Sep 26, 2019

[7.x] Use external ascii package laravel/framework#29947

Merged

tigitz mentioned this pull request Sep 26, 2019

Notifier Component #33687

Merged

nicolas-grekas mentioned this pull request Oct 2, 2019

[String] renamed core classes to Byte/CodePoint/UnicodeString #33816

Merged

alexeyshockov mentioned this pull request Oct 17, 2019

Is this repository unmaintained? danielstjules/Stringy#198

Open

fabpot mentioned this pull request Nov 12, 2019

Release v5.0.0-BETA1 #34339

Merged

nreynis mentioned this pull request Nov 22, 2019

Add an extension point to the inheritance chain cocur/chain#44

Closed
