-
Notifications
You must be signed in to change notification settings - Fork 8k
Add punycode encode decode functions #34439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add punycode encode decode functions #34439
Conversation
|
@Mergifyio update |
✅ Branch has been successfully updated |
|
Hi everybody! |
cea602c to
6130d33
Compare
| /* | ||
| punycode-sample.c 2.0.0 (2004-Mar-21-Sun) | ||
| http://www.nicemice.net/idn/ | ||
| Adam M. Costello |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved Adam's code to the contrib folder and added a license there (because it's a third-party code and it's better to separate it from our code).
| } | ||
|
|
||
| constexpr bool isASCII(const uint32_t u32) noexcept | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too complex, just
bool isASCII(const char32_t c) { return c <= 127; }
is enough
| { | ||
| return (input.length() >= (start + 4)) && (input[start] == 'x' ||input[start] == 'X') && | ||
| (input[start + 1] == 'n' || input[start + 1] == 'N') && | ||
| (input[start + 2] == '-' || input[start + 4] == '-'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't seem completely correct. "xn--" is the prefix of IDNA encoding and not punycode. I think it's better to have 4 functions instead of 2:
punycodeEncode('ввв') -> 'b1aaa'
punycodeDecode()
idnaEncode('ввв.яндекс.рф') -> 'xn--b1aaa.xn--d1acpjx3f.xn--p1ai'
idnaDecode()
| if (std::string::npos == start) | ||
| return {0, input.size() }; | ||
| std::string::size_type last = input.find('/', start + 3); | ||
| return {start + 3, std::string::npos == last ? input.size() : last}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is related to URL processing and not punycode or idna encoding. I believe punycode/idna mustn't try to extract a domain and suppose instead that the domain is already extracted. Anyway we already have the function domain() in ClickHouse which can be easily combined with our function:
SELECT idnaDecode(domain(...))
| } | ||
|
|
||
| size_t punycodeDecode(const std::string_view input, | ||
| UInt8 * decoded_string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better and more safe to pass ColumnString::Chars & res_data to this function instead of UInt8 * decoded_string because this way you can just push back new characters to res_data without thinking about how much memory you need to reserve before.
| std::for_each(s32.begin(), s32.end(), [](auto & ch) { ch = Poco::Unicode::toLower(ch);}); | ||
| size_t decoded_length = punycodeEncodeInternal(s32, encoded_string + start); | ||
| if (0 == decoded_length) | ||
| throw std::runtime_error("Failed to encode"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If something is wrong with arguments of functions usually we throw Exception(ErrorCodes::BAD_ARGUMENTS). Exception is our exception class which provides much more information than std::runtime_error and normally can caught and handled.
|
This is an automated comment for commit 6e1911a with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page Successful checks
|
|
We are highly interested in being able to use those 2 functions (on ClickHouse Cloud we can't use executable UDFs, and this can't be done with SQL UDFs). |
|
I took the freedom and quickly re-implemented this in #57969 |
Changelog category:
Changelog entry:
Implement functions for punycode encoding/decoding:
punycodeEncodeandpunycodeDecode.Implemented as a wrappers around the pure C version of the implementation:
RFC 3492
SELECT punycodeEncode('правда')
SELECT punycodeDecode('xn--mxacd')