-
Notifications
You must be signed in to change notification settings - Fork 8k
Implement IDNA and punycode encoding/decoding (2nd attempt) #58454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This is an automated comment for commit dd2d9ff with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page Successful checks
|
src/Functions/punycode.cpp
Outdated
|
|
||
| /// Implementation of | ||
| /// - punycode(En|De)code[OrNull](), see [1] | ||
| /// - idna(En|De)code[OrNull](), see [2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand about punycodeDecode but how punycodeEncode can fail? I mean why do we need punycodeEncodeOrNull?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I see, the only way for punycode encoding to fail currently is the (weird) edge that the input is not UTF-8. But from a ClickHouse POV, the ada-idna library API simply returns "failed" and there could be further sources of errors in future (e.g. a failure to allocate memory). For these reasons, I guess we should keep punycodeEncodeOrNull(). I can make the docs more concise tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From looking at the code of ada-idna it seems right now it returns "failed" statuses only when there are some coding errors, and not on allocation failures. If it can't allocate it will probably just throw std::bad_alloc.
We don't usually handle too much if the input is not a proper UTF-8 string, see upperUTF8 or base64Encode. I think it's better to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I changed the code so that these functions remain:
punycodeEncode(),punycodeDecode()andtryPunycodeDecode()idnaEncode(),tryIdnaEncode()andidnaDecode()
This is more consistent with other SQL functions.
Yes, it is still a bit unexpected why in the case of idna encoding may fail but this is required by the algorithm and I'll document it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be idnaEncode(), tryIdnaDecode() and idnaDecode()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It is a bit confusing but it is really idnaEncode(), tryIdnaEncode() and idnaDecode().
There are actually strings that cannot be idna-encoded (in particular ones with xn-- prefix as far as I understand). idnaEncode() throws an exception then, tryIdnaEncode() returns an empty string.
And there is idnaDecode() but no tryIdnaDecode() because the IDNA decoding algorithm (as specified by the standard) requires that invalid input values are returned as is. There is no point in other error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. It will probably also fail for long domains, for example
idnaEncode('1234567890123456789012345678901234567890123456789012345678901234')
|
|
||
| const size_t utf8_length = ada::idna::utf8_length_from_utf32(value_utf32.data(), value_utf32.size()); | ||
| value_utf8.resize(utf8_length); | ||
| ada::idna::utf32_to_utf8(value_utf32.data(), value_utf32.size(), value_utf8.data()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it's better to reassign utf8_length to the result of ada::idna::utf32_to_utf8 here.
2161f01 to
8169b3e
Compare
| const size_t value_length = offsets[row] - prev_offset - 1; | ||
|
|
||
| std::string_view value_view(value, value_length); | ||
| if (!value_view.empty()) /// to_ascii() expects non-empty input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't idnaEncode throw an exception for an empty input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty input is an edge case which is generally legal. ada::idna::to_ascii() handles empty inputs gracefully (it doesn't throw, it returns an empty output). The comment on l. 65 is only there because the API docs say so and I wanted to program against the API, not the implementation. In any case, it slightly simplifies the actual error handling (l. 68ff) which would otherwise be ambiguous in the empty string case.
This PR reverts #58277 which reverted #57969 and addresses the feedback.
Original PR description:
Inspired by #34439, especially #34439 (comment), which was dormant for too long. This PR also uses the ada-idna library instead of a 15 year old homegrown implementation.
More specifically, this PR
OrNullvariants of functionspunycodeEncode/Decode(): b5d7ea2idnaEncode/Decode[OrNull](): f652c60Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added functions
punycodeEncode(),punycodeDecode(),idnaEncode()andidnaDecode()which are useful for translating international domain names to an ASCII representation according to the IDNA standard.