First draft of utf-8 to utf-16 conversion #5

Merged
lemire merged 14 commits into master from dlemire/portingsseutf8decoder
Feb 19, 2021
Conversation

@lemire
Member

@lemire lemire commented Feb 18, 2021

  • This makes it so that UTF-16 data is on char16_t pointers, while UTF-8 data is on char pointers. Technically, we should use char8_t, but it is a C++20 feature.
  • It includes an ugly copy-paste of a fast UTF-8 to UTF-16 converter for SSE.
  • It brings back runtime dispatch on x64 hardware (hopefully).
  • It changes some class names from CamelCase to lower_case_with_underscore style.
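
To illustrate the pointer convention from the first bullet (UTF-8 input as `const char*`, UTF-16 output as `char16_t*`), here is a minimal scalar sketch. It is not the SSE code from this PR; the function name `sketch_convert_utf8_to_utf16` is hypothetical and the return convention (0 on invalid input) is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of this PR: scalar UTF-8 -> UTF-16 conversion.
// Returns the number of char16_t units written, or 0 if the input is invalid.
// The caller must provide room for at least `len` char16_t units, which is
// always enough because each UTF-8 byte yields at most one UTF-16 unit.
size_t sketch_convert_utf8_to_utf16(const char* in, size_t len, char16_t* out) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(in);
  // Smallest code point allowed for each sequence length (rejects overlongs).
  static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
  size_t i = 0, w = 0;
  while (i < len) {
    uint32_t b = p[i];
    uint32_t cp;
    size_t n;  // number of bytes in this sequence
    if (b < 0x80) { cp = b; n = 1; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 2; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 3; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; n = 4; }
    else return 0;                       // stray continuation or invalid lead
    if (i + n > len) return 0;           // truncated sequence
    for (size_t k = 1; k < n; k++) {
      if ((p[i + k] & 0xC0) != 0x80) return 0;  // not a continuation byte
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (n > 1 && cp < min_cp[n]) return 0;                 // overlong encoding
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) {
      out[w++] = static_cast<char16_t>(cp);
    } else {
      cp -= 0x10000;                     // encode as a surrogate pair
      out[w++] = static_cast<char16_t>(0xD800 + (cp >> 10));
      out[w++] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    i += n;
  }
  return w;
}
```

Note that in this sketch an empty input also returns 0, so a caller that needs to distinguish "empty" from "invalid" would have to check `len` first.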

Comment thread benchmarks/CMakeLists.txt Outdated
Comment thread src/haswell/implementation.cpp
Comment thread tests/helpers/transcode_test_base.h Outdated
@lemire lemire changed the title from "(draft) Upcoming changes" to "First draft of utf-8 to utf-16 conversion" Feb 19, 2021
@lemire
Member Author

lemire commented Feb 19, 2021

@WojciechMula This is ready for review, we could merge this. I will open related issues about what I do not like...

This is not meant to be good... just a first step.

Collaborator

@WojciechMula WojciechMula left a comment

Apart from minor comments, everything looks nice. Good job.

Comment thread benchmarks/src/cmdline.cpp Outdated
Comment thread tests/helpers/transcode_test_base.h Outdated
* transcode_utf8_to_utf16_test_base can be used to test UTF8 => UTF16 transcoding.
*/

transcode_utf8_to_utf16_test_base::transcode_utf8_to_utf16_test_base(GenerateCodepoint generate,
Collaborator

Just a naming convention remark: I personally use CamelCase for classes and enums, and lowercase_with_underscore for methods/variables. Thus this class may be named: TranscodeUTF8toUTF16TestBase. Do we want to have a consistent naming style?

Member Author

Yes. I do not care much, but as you may notice much_of_the_code_follows_the_lower_case_with_underscore_convention. :-)

It is common in C++... The standard library has from_chars and to_string and unordered_set.

The convention you seem to favour seems more common in Java/C#.

Collaborator

Ok, I'll follow this lower-case convention.

@lemire
Member Author

lemire commented Feb 19, 2021

@WojciechMula If you are happy with this PR, we can merge. I have opened issues regarding the components that need work.

I am not in a hurry to merge per se, as I am now working on the NEON port and algorithmic issues, but if you want to do some work, we would prefer to avoid conflicts.

@WojciechMula
Collaborator

Please merge the PR. Everything is fine.

@lemire
Member Author

lemire commented Feb 19, 2021

Merging.

Please do not hesitate to criticize my code!

@lemire lemire merged commit 2c9bf3a into master Feb 19, 2021
@lemire lemire deleted the dlemire/portingsseutf8decoder branch July 7, 2021 19:34
pauldreik added a commit that referenced this pull request Oct 8, 2024
==17876==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x517000000307 at pc 0x561dd0419320 bp 0x7ffd11808360 sp 0x7ffd11808358
WRITE of size 16 at 0x517000000307 thread T0
    #0 0x561dd041931f in simdutf::haswell::(anonymous namespace)::convert_masked_utf8_to_latin1(char const*, unsigned long, char*&) /home/pauldreik/code/delaktig/simdutf/src/haswell/avx2_convert_utf8_to_latin1.cpp:28:5
    #1 0x561dd037a5d7 in simdutf::haswell::(anonymous namespace)::utf8_to_latin1::validating_transcoder::convert(char const*, unsigned long, char*) /home/pauldreik/code/delaktig/simdutf/src/generic/utf8_to_latin1/utf8_to_latin1.h:178:29
    #2 0x561dd037a06b in simdutf::haswell::implementation::convert_utf8_to_latin1(char const*, unsigned long, char*) const /home/pauldreik/code/delaktig/simdutf/src/haswell/implementation.cpp:309:20
    #3 0x561dd0286e12 in test_impl_ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:93:33
    #4 0x561dd028516f in ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:20:1
    #5 0x561dd0299782 in simdutf::test::test_entry::operator()(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.h:18:58
    #6 0x561dd02926ca in (anonymous namespace)::run((anonymous namespace)::CommandLine const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:179:9
    #7 0x561dd0290fc5 in simdutf::test::main(int, char**) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:207:3
    #8 0x561dd028d611 in main /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:293:1
    #9 0x7f31b415adb9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #10 0x7f31b415ae74 in __libc_start_main csu/../csu/libc-start.c:360:3
    #11 0x561dd01a9650 in _start (/home/pauldreik/code/delaktig/simdutf/build/paul_clang_18-Debug/tests/convert_utf8_to_latin1_tests+0x35650) (BuildId: ab54037bea01a246225d9f551b72ba81eb4f6416)

0x517000000307 is located 0 bytes after 647-byte region [0x517000000080,0x517000000307)
allocated by thread T0 here:
    #0 0x561dd0282be1 in operator new(unsigned long) (/home/pauldreik/code/delaktig/simdutf/build/paul_clang_18-Debug/tests/convert_utf8_to_latin1_tests+0x10ebe1) (BuildId: ab54037bea01a246225d9f551b72ba81eb4f6416)
    #1 0x561dd028f28e in std::__new_allocator<char>::allocate(unsigned long, void const*) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/new_allocator.h:151:27
    #2 0x561dd0290023 in std::allocator<char>::allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/allocator.h:196:32
    #3 0x561dd0290023 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/alloc_traits.h:478:20
    #4 0x561dd0290023 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:380:20
    #5 0x561dd028fe20 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:398:33
    #6 0x561dd028f9d1 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:334:9
    #7 0x561dd028e4d8 in std::vector<char, std::allocator<char>>::vector(unsigned long, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:557:9
    #8 0x561dd0286d70 in test_impl_ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:92:21
    #9 0x561dd028516f in ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:20:1
    #10 0x561dd0299782 in simdutf::test::test_entry::operator()(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.h:18:58
    #11 0x561dd02926ca in (anonymous namespace)::run((anonymous namespace)::CommandLine const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:179:9
    #12 0x561dd0290fc5 in simdutf::test::main(int, char**) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:207:3
    #13 0x561dd028d611 in main /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:293:1
    #14 0x7f31b415adb9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/pauldreik/code/delaktig/simdutf/src/haswell/avx2_convert_utf8_to_latin1.cpp:28:5 in simdutf::haswell::(anonymous namespace)::convert_masked_utf8_to_latin1(char const*, unsigned long, char*&)
Shadow bytes around the buggy address:
  0x517000000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x517000000100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x517000000180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x517000000200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x517000000280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x517000000300:[07]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x517000000380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x517000000400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x517000000480: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x517000000500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x517000000580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==17876==ABORTING
lemire pushed a commit that referenced this pull request Oct 8, 2024
2 participants