Add ReadOnlySpan string-like ToLower/ToUpper API with globalization support #16379

ahsonkhan · 2018-02-14T03:44:34Z

Part of https://github.com/dotnet/corefx/issues/21395#issuecomment-359906138

ToLower
ToLowerInvariant
ToUpper
ToUpperInvariant

TODO:

~~Add the allocating counterpart for slow span in corefx and add tests there.~~

Related corefx PR: dotnet/corefx#27193

cc @jkotas, @stephentoub, @KrzysztofCwalina, @tarekgh, @JeremyKuhne

…upport

jkotas · 2018-02-14T06:40:41Z

src/mscorlib/shared/System/Span.NonGeneric.cs

    public static class Span
    {
        // s_invariantMode is defined for the perf reason as accessing the instance field is faster than access the static property GlobalizationMode.Invariant
        private static readonly bool s_invariantMode = GlobalizationMode.Invariant;


This cache does not make sense. GlobalizationMode.Invariant property should have the exact same perf as you local cache.

This was borrowed from CompareInfo:

coreclr/src/mscorlib/shared/System/Globalization/CompareInfo.cs

Line 77 in 34083fa

// _invariantMode is defined for the perf reason as accessing the instance field is faster than access the static property GlobalizationMode.Invariant

Should it be removed from there too?

Would calling the Invariant property result in this function call every time?

coreclr/src/mscorlib/src/System/Globalization/GlobalizationMode.cs

Line 14 in b60c91c

internal static bool GetInvariantSwitchValue()

Would calling the Invariant property result in this function call every time?

No, the syntax:

internal static bool Invariant { get; } = GetGlobalizationInvariantMode();

is equivalent to:

private readonly static bool backingFieldInvariant = GetGlobalizationInvariantMode(); internal static bool Invariant => backingFieldInvariant;

This was borrowed from CompareInfo:

The one in CompareInfo is instance field. This one is static field. The performance characteristics of instance and static fields are quite different (instance fields are faster).

Should it be removed from there too?

No, in CompareInfo it is an instance field and not static.

The performance characteristics of instance and static fields are quite different (instance fields are faster).

Good to know. I missed the detail of it being an instance field vs. a static field here.

No, the syntax... is equivalent to...

Ah I see. Thanks for clarifying.

jkotas · 2018-02-14T06:41:24Z

src/mscorlib/shared/System/Span.NonGeneric.cs

+
+            if (s_invariantMode)
+            {
+                CultureInfo.CurrentCulture.TextInfo.ToLowerAsciiInvariant(source, destination);


This should use the passed in culture; not CurrentCulture.

(Same in all other places.)

jkotas · 2018-02-14T06:42:17Z

src/mscorlib/shared/System/Span.NonGeneric.cs

+            if (s_invariantMode)
+            {
+                CultureInfo.CurrentCulture.TextInfo.ToLowerAsciiInvariant(source, destination);
+            }


You need to return here.

(Same in all other places.)

jkotas · 2018-02-14T22:02:48Z

src/mscorlib/shared/System/Globalization/TextInfo.Unix.cs

+            {
+                fixed (char* pResult = &MemoryMarshal.GetReference(destination))
+                {
+#if CORECLR


This CORECLR ifdef is not necessary.

jkotas · 2018-02-14T22:05:52Z

src/mscorlib/shared/System/Globalization/TextInfo.Unix.cs

+                    else
+#endif
+                    {
+                        ChangeCase(pSource, source.Length, pResult, destination.Length, toUpper);


Was there a conclusion on whether it is safe to depend on 1:1 mapping between lower and upper case?

https://github.com/dotnet/corefx/issues/21395#issuecomment-353233435

I know that the current String APIs make this assumption, but that assumption does not affect their public shape - if needed, we can fix them without changing their public signature.

It is not true for the Span APIs. If there is no 1:1 mapping, the Span APIs need to start returning number of characters copied to output buffer.

I think this is a reasonable suggestion for span (and maybe in the future we can support it for strings too).

@ahsonkhan could you discuss this with the design committee to update such APIs? I know we mentioned we are not supporting this today but this doesn't prevent us to start supporting it with the spans.

cc @KrzysztofCwalina

Good point. If we change our API to return the copied characters, though redundant atm, it will be flexible to possible future implementation changes.

To be clear, is the new proposal for the APIs as follows (including the Invariant versions)?

public static int ToLower(this ReadOnlySpan<char> source, Span<char> destination, CultureInfo culture); public static int ToLowerInvariant(this ReadOnlySpan<char> source, Span<char> destination); public static int ToUpper(this ReadOnlySpan<char> source, Span<char> destination, CultureInfo culture); public static int ToUpperInvariant(this ReadOnlySpan<char> source, Span<char> destination);

To be clear, is the new proposal for the APIs as follows (including the Invariant versions)?

Yes, Invariant is a culture :-)

Question, what we'll do if the destination span is not enough? we throw? or return -1?

I prefer that we throw just like we do atm for destination.Length < source.Length since the assumption is that the length required can be statically known at compile time prior to calling the method.

However, if the 1:1 mapping does not hold in the future, then my assumption breaks and throwing may not be the best solution.

do we throw only because of the length? if so, I then prefer not throw and return -1 in case not enough buffer.

For these APIs, should we enable partial processing (and ability to restart), like we do with some of our Base64 APIs, and return OperationStatus with charactersConsumed and charactersWritten out parameters? Or is that too much complexity for such a low level API, especially since there is a 1:1 mapping atm?

For example:

public static OperationStatus ToLower(this ReadOnlySpan<char>, Span<char> destination, CultureInfo culture, out char charactersConsumed, out char charactersWritten);

we may have a simple one which just returns -1 and we can have the overload that takes more parameters if we find a demand for that.

stephentoub · 2018-02-14T22:07:02Z

src/mscorlib/shared/System/Globalization/TextInfo.cs

+            destination[i] = (char)(source[i] | 0x20);
+            i++;
+
+            while (i < source.Length)


The string-based implementation of this has two separate loops like this because it wants to optimize the path where nothing needs to change and thus nothing needs to allocate. But here with span, we always need to look at every character and copy every character, potentially modified, to the output destination. Can we just consolidate it all to a single loop then? e.g.

internal void ToLowerAsciiInvariant(ReadOnlySpan<char> source, Span<char> destination) { Debug.Assert(destination.Length >= source.Length); for (int i = 0; i < source.Length; i++) { destination[i] = ToLowerAsciiInvariant(source[i]); } }

or is there some reason that's not as good?

What about the optimization which avoids having to do the or by 0x20 operation where not necessary (i.e when ((uint)(source[i] - 'A') > ('Z' - 'A'))?

I suspect the extra CopyTo probably hurts performance more, so I will stick to the simpler implementation. I will measure the perf and compare.

What about the optimization

That's in the char method as well:
https://github.com/dotnet/coreclr/pull/16379/files/59903ebba9bd23fa952f834e41c87bfe3b726b85#diff-64100f763acd6161cac733b7534a9301R381
We'd just want to make sure that method gets inlined.

stephentoub · 2018-02-15T14:20:09Z

src/mscorlib/shared/System/Span.NonGeneric.cs

+        {
+            // Assuming that changing case does not affect length
+            if (destination.Length < source.Length)
+                return -1;


Even in a hypothetical future, will invariant ever result in changing case affecting length?

This is possible. The sorting and casing can change for the Invariant :-(

stephentoub · 2018-02-15T14:22:42Z

src/mscorlib/shared/System/Span.NonGeneric.cs

+            if (GlobalizationMode.Invariant)
+                CultureInfo.InvariantCulture.TextInfo.ToLowerAsciiInvariant(source, destination);
+            else
+                CultureInfo.InvariantCulture.TextInfo.ChangeCase(source, destination, toUpper: false);


@tarekgh, is there a difference between these two operations?

Yes, there is a difference. ToLowerAsciiInvariant is just trying to be optimized for the Ascii but ChangeCase is not. for example, if you have a Turkish text would be handled wrong if you do it with ToLowerAsciiInvariant

Isn't this naming confusing then? Turning on the "Invariant" mode potentially produces different results than using the InvariantCulture?

well, Invariant mode means we'll use the Invariant culture data all the time (without going to call the underlying OS) this will include doing all sorting/casing operations as ordinal. I believe whoever introduced ToLowerAsciiInvariant meant to have this operation mainly for Ascii strings and at the same time not using a culture like Turkish to avoid the Turkish 'I' problem. but in general, you are right for the statement, the invariant mode can produce different results than invariant culture in the sorting and casing area.

Ok. Thanks.

stephentoub · 2018-02-15T14:27:26Z

src/mscorlib/src/System/Globalization/CompareInfo.Unix.cs

                char* b = bp;

-                while (length != 0 && (*a < 0x80) && (*b < 0x80) && (!HighCharTable[*a]) && (!HighCharTable[*b]))
+                while (length != 0 && (*a < 0x80) && (*b < 0x80) && (!s_highCharTable[*a]) && (!s_highCharTable[*b]))


Does the JIT successfully avoid bounds checks on s_highCharTable on each iteration of the loop?

It doesn't look like it is (tested on windows). We have two compares to 80h.

while (length != 0 && (*a < 0x80) && (*b < 0x80) && (!s_highCharTable[*a]) && (!s_highCharTable[*b])) 00007FF993402A25 test edx,edx 00007FF993402A27 je 00007FF993402A7C 00007FF993402A29 movzx r10d,word ptr [rcx] 00007FF993402A2D cmp r10d,80h 00007FF993402A34 jge 00007FF993402A7C 00007FF993402A36 movzx r11d,word ptr [rax] 00007FF993402A3A cmp r11d,80h 00007FF993402A41 jge 00007FF993402A7C 00007FF993402A43 mov rsi,23530CE28B0h 00007FF993402A4D mov rsi,qword ptr [rsi] 00007FF993402A50 mov rdi,rsi 00007FF993402A53 mov ebx,r10d 00007FF993402A56 mov ebp,dword ptr [rdi+8] 00007FF993402A59 cmp ebx,ebp 00007FF993402A5B jae 00007FF993402AC0 00007FF993402A5D movsxd rbx,ebx 00007FF993402A60 cmp byte ptr [rdi+rbx+10h],0 00007FF993402A65 jne 00007FF993402A7C 00007FF993402A67 mov edi,r11d 00007FF993402A6A cmp edi,ebp 00007FF993402A6C jae 00007FF993402AC0 00007FF993402A6E movsxd rdi,edi 00007FF993402A71 cmp byte ptr [rsi+rdi+10h],0 00007FF993402A76 je 00007FF9934029DD

cc @AndyAyersMS

I'm wondering if the code would result in better codegen if s_highCharTable were stored into a local before the loop...?

It doesn't get rid of the bounds checks.
We save a couple of instructions at the prologue/epilogue with pushing to/popping off the stack.

stephentoub · 2018-02-15T19:07:30Z

src/mscorlib/shared/System/Globalization/TextInfo.cs

+
+            for (int i = 0; i < source.Length; i++)
+            {
+                destination[i] = ToLowerAsciiInvariant(source[i]);


Were you able to confirm that ToLower/UpperAsciiInvariant gets inlined? Seems like it should, but it'd be good to double check.

Yes it does.

for (int i = 0; i < source.Length; i++) 00007FF9934127D0 sub rsp,28h 00007FF9934127D4 mov rax,qword ptr [rcx] 00007FF9934127D7 mov ecx,dword ptr [rcx+8] 00007FF9934127DA xor r8d,r8d 00007FF9934127DD test ecx,ecx 00007FF9934127DF jle 00007FF993412814 { destination[i] = ToLowerAsciiInvariant(source[i]); 00007FF9934127E1 cmp r8d,dword ptr [rdx+8] 00007FF9934127E5 jae 00007FF993412819 00007FF9934127E7 mov r9,qword ptr [rdx] 00007FF9934127EA movsxd r10,r8d 00007FF9934127ED lea r9,[r9+r10*2] 00007FF9934127F1 movzx r10d,word ptr [rax+r10*2] 00007FF9934127F6 lea r11d,[r10-41h] 00007FF9934127FA cmp r11d,19h 00007FF9934127FE ja 00007FF993412808 00007FF993412800 or r10d,20h 00007FF993412804 movzx r10d,r10w 00007FF993412808 mov word ptr [r9],r10w for (int i = 0; i < source.Length; i++) 00007FF99341280C inc r8d 00007FF99341280F cmp r8d,ecx 00007FF993412812 jl 00007FF9934127E1 00007FF993412814 add rsp,28h 00007FF993412818 ret 00007FF993412819 call 00007FF9F3069860 00007FF99341281E int 3

for (int i = 0; i < source.Length; i++) 00007FF993412A50 sub rsp,28h 00007FF993412A54 mov rax,qword ptr [rcx] 00007FF993412A57 mov ecx,dword ptr [rcx+8] 00007FF993412A5A xor r8d,r8d 00007FF993412A5D test ecx,ecx 00007FF993412A5F jle 00007FF993412A94 { destination[i] = ToUpperAsciiInvariant(source[i]); 00007FF993412A61 cmp r8d,dword ptr [rdx+8] 00007FF993412A65 jae 00007FF993412A99 00007FF993412A67 mov r9,qword ptr [rdx] 00007FF993412A6A movsxd r10,r8d 00007FF993412A6D lea r9,[r9+r10*2] 00007FF993412A71 movzx r10d,word ptr [rax+r10*2] 00007FF993412A76 lea r11d,[r10-61h] 00007FF993412A7A cmp r11d,19h 00007FF993412A7E ja 00007FF993412A88 00007FF993412A80 and r10d,0FFFFFFDFh 00007FF993412A84 movzx r10d,r10w 00007FF993412A88 mov word ptr [r9],r10w for (int i = 0; i < source.Length; i++) 00007FF993412A8C inc r8d 00007FF993412A8F cmp r8d,ecx 00007FF993412A92 jl 00007FF993412A61 00007FF993412A94 add rsp,28h 00007FF993412A98 ret 00007FF993412A99 call 00007FF9F3069860 00007FF993412A9E int 3

stephentoub

LGTM. Thanks, @ahsonkhan.

ahsonkhan · 2018-02-15T20:15:01Z

src/mscorlib/shared/System/Span.NonGeneric.cs

+        /// Copies the characters from the source span into the destination, converting each character to lowercase
+        /// using the casing rules of the invariant culture.
+        /// </summary>
+        public static int ToLowerInvariant(this ReadOnlySpan<char> source, Span<char> destination)


Should this call the ToLower with culture = CultureInfo.InvariantCulture passed in? Same with ToUpper.

It avoids code duplication at the cost of an extra null check.

Add ReadOnlySpan string-like ToLower/ToUpper API with globalization s…

59903eb

…upport

ahsonkhan self-assigned this Feb 14, 2018

ahsonkhan added the area-System.Memory label Feb 14, 2018

jkotas reviewed Feb 14, 2018

View reviewed changes

stephentoub reviewed Feb 14, 2018

View reviewed changes

Address PR feedback.

b88075e

stephentoub reviewed Feb 15, 2018

View reviewed changes

stephentoub approved these changes Feb 15, 2018

View reviewed changes

ahsonkhan commented Feb 15, 2018

View reviewed changes

ahsonkhan mentioned this pull request Feb 16, 2018

Add ROSpan ToLower and ToUpper string-like APIs with CultureInfo dotnet/corefx#27193

Merged

ahsonkhan merged commit 1494667 into dotnet:master Feb 16, 2018

ahsonkhan deleted the ToLowerUpper branch February 16, 2018 02:43

This was referenced Feb 22, 2018

Fix impl of ReadOnlySpan ToLower/ToUpper for Unix. #16496

Merged

Fix ascii range check #16535

Merged

Add ReadOnlySpan string-like ToLower/ToUpper API with globalization support #16379

Add ReadOnlySpan string-like ToLower/ToUpper API with globalization support #16379

Uh oh!

Conversation

ahsonkhan commented Feb 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jkotas Feb 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahsonkhan Feb 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

ahsonkhan commented Feb 14, 2018 •

edited

Loading

jkotas Feb 14, 2018 •

edited

Loading

ahsonkhan Feb 15, 2018 •

edited

Loading

ahsonkhan Feb 15, 2018 •

edited

Loading