Skip to content

Support UTF-16BE#154

Merged
lemire merged 30 commits intosimdutf:masterfrom
NicolasJiaxin:support-be
Jul 25, 2022
Merged

Support UTF-16BE#154
lemire merged 30 commits intosimdutf:masterfrom
NicolasJiaxin:support-be

Conversation

@NicolasJiaxin
Copy link
Copy Markdown
Collaborator

Fix #3
Adds support of UTF-16BE by adding all analogous functions that we have for UTF-16LE. For example, validate_utf16 now comes in two flavours: validate_utf16le and validate_utf16be. However, the functions that implement the actual algorithms are now template-based on whether the input/output is little endian or big endian. If big endian, we swap the bytes of using a single shuffle when we load/store data. I also added a function to swap endianness called change_endianness_utf16.

@NicolasJiaxin
Copy link
Copy Markdown
Collaborator Author

I have not tested the new functions yet. I am also unsure of what change_endianness_utf16 should return. Could it just be a void function? For now, it returns something silly.

@lemire
Copy link
Copy Markdown
Member

lemire commented Jul 22, 2022

I am also unsure of what change_endianness_utf16 should return. Could it just be a void function?

Definitively. I think it should return void.

@lemire
Copy link
Copy Markdown
Member

lemire commented Jul 22, 2022

@NicolasJiaxin This code looks very good to me. I just think that we ought to test it (at least minimally).

@NicolasJiaxin
Copy link
Copy Markdown
Collaborator Author

I added all tests and I did have a few bugs. But it is all working now, except that the debug tests on Visual Studio are getting stuck and I don't know why.

@lemire
Copy link
Copy Markdown
Member

lemire commented Jul 25, 2022

I added all tests and I did have a few bugs. But it is all working now, except that the debug tests on Visual Studio are getting stuck and I don't know why.

Let me investigate.

@lemire
Copy link
Copy Markdown
Member

lemire commented Jul 25, 2022

@NicolasJiaxin When you do (for example)

    std::vector<char16_t> utf16be;
    utf16be.reserve(size);
    implementation.change_endianness_utf16(utf16le, size, utf16be.data());

You will have that utf16be.size() is zero. Your code is technically unsafe. The reserve call is just an optimization hint, it may not allocate the memory the way you think (though it will in practice).

So you want to go through and do the following...

    std::vector<char16_t> utf16be(size);
    implementation.change_endianness_utf16(utf16le, size, utf16be.data());

That should solve the issue. You can also replace reserve with resize.

Note that your code is 'practically correct' but memory checkers might complain so better fix it.

One problem with using std::vector is that the input is guaranteed to be zeroed. If you'd want to avoid that... You could also do the following...

std::unique_ptr<char16_t[]> utf16be_unique(new char16_t[size]);
char16_t * utf16be = utf16be_unique.get();
implementation.change_endianness_utf16(utf16le, size, utf16be);

But I am not sure what the benefit would be.

@NicolasJiaxin
Copy link
Copy Markdown
Collaborator Author

Ahh, thanks!

@lemire
Copy link
Copy Markdown
Member

lemire commented Jul 25, 2022

@NicolasJiaxin Try adding some bits of documentation to the README. And then I think we will merge.

@lemire
Copy link
Copy Markdown
Member

lemire commented Jul 25, 2022

Great stuff. Merging.

@lemire lemire merged commit e53dece into simdutf:master Jul 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support of UTF-16BE

2 participants