First draft of utf-8 to utf-16 conversion#5
Conversation
…dutf into dlemire/portingsseutf8decoder
|
@WojciechMula This is ready for review, we could merge this. I will open related issues about what I do not like... This is not meant to be good... just a first step. |
WojciechMula
left a comment
There was a problem hiding this comment.
Apart from minor comments, everything looks nice. Good job.
| * transcode_utf8_to_utf16_test_base can be used to test UTF8 => UTF16 transcoding. | ||
| */ | ||
|
|
||
| transcode_utf8_to_utf16_test_base::transcode_utf8_to_utf16_test_base(GenerateCodepoint generate, |
There was a problem hiding this comment.
Just a naming convention remark: I personally use CamelCase for classes and enums, lowercase_with_undescore for methods/variables.Thus this class may be named: TranscodeUTF8toUTF16TestBase. Do we want to have a consistent naming style?
There was a problem hiding this comment.
Yes. I do not care much, but as you may notice much_of_the_code_follows_the_lower_case_with_underscore_convention. :-)
It is common in C++... The standard library has from_chars and to_string and unordered_set.
The convention you seem to favour seems more common in Java/C#.
There was a problem hiding this comment.
Ok, I'll follow this lower-case convention.
|
@WojciechMula If you are happy with this PR, we can merge. I have opened issue regarding the components that need work. I am not in a hurry to merge per se, as I am working now on the NEON port and algorithmic issue, but if you want to do some work, we would prefer to avoid conflicts. |
|
Please merge the PR. Everything is fine. |
|
Merging. Please do not hesitate to criticize my code! |
==17876==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x517000000307 at pc 0x561dd0419320 bp 0x7ffd11808360 sp 0x7ffd11808358
WRITE of size 16 at 0x517000000307 thread T0
#0 0x561dd041931f in simdutf::haswell::(anonymous namespace)::convert_masked_utf8_to_latin1(char const*, unsigned long, char*&) /home/pauldreik/code/delaktig/simdutf/src/haswell/avx2_convert_utf8_to_latin1.cpp:28:5
#1 0x561dd037a5d7 in simdutf::haswell::(anonymous namespace)::utf8_to_latin1::validating_transcoder::convert(char const*, unsigned long, char*) /home/pauldreik/code/delaktig/simdutf/src/generic/utf8_to_latin1/utf8_to_latin1.h:178:29
#2 0x561dd037a06b in simdutf::haswell::implementation::convert_utf8_to_latin1(char const*, unsigned long, char*) const /home/pauldreik/code/delaktig/simdutf/src/haswell/implementation.cpp:309:20
#3 0x561dd0286e12 in test_impl_ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:93:33
#4 0x561dd028516f in ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:20:1
#5 0x561dd0299782 in simdutf::test::test_entry::operator()(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.h:18:58
#6 0x561dd02926ca in (anonymous namespace)::run((anonymous namespace)::CommandLine const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:179:9
#7 0x561dd0290fc5 in simdutf::test::main(int, char**) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:207:3
#8 0x561dd028d611 in main /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:293:1
#9 0x7f31b415adb9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#10 0x7f31b415ae74 in __libc_start_main csu/../csu/libc-start.c:360:3
#11 0x561dd01a9650 in _start (/home/pauldreik/code/delaktig/simdutf/build/paul_clang_18-Debug/tests/convert_utf8_to_latin1_tests+0x35650) (BuildId: ab54037bea01a246225d9f551b72ba81eb4f6416)
0x517000000307 is located 0 bytes after 647-byte region [0x517000000080,0x517000000307)
allocated by thread T0 here:
#0 0x561dd0282be1 in operator new(unsigned long) (/home/pauldreik/code/delaktig/simdutf/build/paul_clang_18-Debug/tests/convert_utf8_to_latin1_tests+0x10ebe1) (BuildId: ab54037bea01a246225d9f551b72ba81eb4f6416)
#1 0x561dd028f28e in std::__new_allocator<char>::allocate(unsigned long, void const*) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/new_allocator.h:151:27
#2 0x561dd0290023 in std::allocator<char>::allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/allocator.h:196:32
#3 0x561dd0290023 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/alloc_traits.h:478:20
#4 0x561dd0290023 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:380:20
#5 0x561dd028fe20 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:398:33
#6 0x561dd028f9d1 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:334:9
#7 0x561dd028e4d8 in std::vector<char, std::allocator<char>>::vector(unsigned long, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:557:9
#8 0x561dd0286d70 in test_impl_ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:92:21
#9 0x561dd028516f in ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:20:1
#10 0x561dd0299782 in simdutf::test::test_entry::operator()(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.h:18:58
#11 0x561dd02926ca in (anonymous namespace)::run((anonymous namespace)::CommandLine const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:179:9
#12 0x561dd0290fc5 in simdutf::test::main(int, char**) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:207:3
#13 0x561dd028d611 in main /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:293:1
#14 0x7f31b415adb9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/pauldreik/code/delaktig/simdutf/src/haswell/avx2_convert_utf8_to_latin1.cpp:28:5 in simdutf::haswell::(anonymous namespace)::convert_masked_utf8_to_latin1(char const*, unsigned long, char*&)
Shadow bytes around the buggy address:
0x517000000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x517000000100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x517000000180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x517000000200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x517000000280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x517000000300:[07]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x517000000380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x517000000400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x517000000480: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x517000000500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x517000000580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==17876==ABORTING
) ==17876==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x517000000307 at pc 0x561dd0419320 bp 0x7ffd11808360 sp 0x7ffd11808358 WRITE of size 16 at 0x517000000307 thread T0 #0 0x561dd041931f in simdutf::haswell::(anonymous namespace)::convert_masked_utf8_to_latin1(char const*, unsigned long, char*&) /home/pauldreik/code/delaktig/simdutf/src/haswell/avx2_convert_utf8_to_latin1.cpp:28:5 #1 0x561dd037a5d7 in simdutf::haswell::(anonymous namespace)::utf8_to_latin1::validating_transcoder::convert(char const*, unsigned long, char*) /home/pauldreik/code/delaktig/simdutf/src/generic/utf8_to_latin1/utf8_to_latin1.h:178:29 #2 0x561dd037a06b in simdutf::haswell::implementation::convert_utf8_to_latin1(char const*, unsigned long, char*) const /home/pauldreik/code/delaktig/simdutf/src/haswell/implementation.cpp:309:20 #3 0x561dd0286e12 in test_impl_ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:93:33 #4 0x561dd028516f in ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:20:1 #5 0x561dd0299782 in simdutf::test::test_entry::operator()(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.h:18:58 #6 0x561dd02926ca in (anonymous namespace)::run((anonymous namespace)::CommandLine const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:179:9 #7 0x561dd0290fc5 in simdutf::test::main(int, char**) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:207:3 #8 0x561dd028d611 in main /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:293:1 #9 0x7f31b415adb9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #10 0x7f31b415ae74 in __libc_start_main csu/../csu/libc-start.c:360:3 #11 0x561dd01a9650 in _start (/home/pauldreik/code/delaktig/simdutf/build/paul_clang_18-Debug/tests/convert_utf8_to_latin1_tests+0x35650) (BuildId: ab54037bea01a246225d9f551b72ba81eb4f6416) 0x517000000307 is located 0 bytes after 647-byte region [0x517000000080,0x517000000307) allocated by thread T0 here: #0 0x561dd0282be1 in operator new(unsigned long) (/home/pauldreik/code/delaktig/simdutf/build/paul_clang_18-Debug/tests/convert_utf8_to_latin1_tests+0x10ebe1) (BuildId: ab54037bea01a246225d9f551b72ba81eb4f6416) #1 0x561dd028f28e in std::__new_allocator<char>::allocate(unsigned long, void const*) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/new_allocator.h:151:27 #2 0x561dd0290023 in std::allocator<char>::allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/allocator.h:196:32 #3 0x561dd0290023 in std::allocator_traits<std::allocator<char>>::allocate(std::allocator<char>&, unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/alloc_traits.h:478:20 #4 0x561dd0290023 in std::_Vector_base<char, std::allocator<char>>::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:380:20 #5 0x561dd028fe20 in std::_Vector_base<char, std::allocator<char>>::_M_create_storage(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:398:33 #6 0x561dd028f9d1 in std::_Vector_base<char, std::allocator<char>>::_Vector_base(unsigned long, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:334:9 #7 0x561dd028e4d8 in std::vector<char, std::allocator<char>>::vector(unsigned long, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/14/../../../../include/c++/14/bits/stl_vector.h:557:9 #8 0x561dd0286d70 in test_impl_ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:92:21 #9 0x561dd028516f in ossfuzz_372067232(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:20:1 #10 0x561dd0299782 in simdutf::test::test_entry::operator()(simdutf::implementation const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.h:18:58 #11 0x561dd02926ca in (anonymous namespace)::run((anonymous namespace)::CommandLine const&) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:179:9 #12 0x561dd0290fc5 in simdutf::test::main(int, char**) /home/pauldreik/code/delaktig/simdutf/tests/helpers/test.cpp:207:3 #13 0x561dd028d611 in main /home/pauldreik/code/delaktig/simdutf/tests/convert_utf8_to_latin1_tests.cpp:293:1 #14 0x7f31b415adb9 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 SUMMARY: AddressSanitizer: heap-buffer-overflow /home/pauldreik/code/delaktig/simdutf/src/haswell/avx2_convert_utf8_to_latin1.cpp:28:5 in simdutf::haswell::(anonymous namespace)::convert_masked_utf8_to_latin1(char const*, unsigned long, char*&) Shadow bytes around the buggy address: 0x517000000080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x517000000100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x517000000180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x517000000200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0x517000000280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 =>0x517000000300:[07]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x517000000380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x517000000400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x517000000480: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x517000000500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x517000000580: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==17876==ABORTING
Uh oh!
There was an error while loading. Please reload this page.