Skip to content

Conversation

@AndreyTokmakov
Copy link

@AndreyTokmakov AndreyTokmakov commented Feb 9, 2022

Changelog category:

  • New Feature

Changelog entry:
Implement functions for punycode encoding/decoding: punycodeEncode and punycodeDecode.

Implemented as a wrappers around the pure C version of the implementation:
RFC 3492

SELECT punycodeEncode('правда')

  • Parameters
    • Input string (string to be encoded to punycode)
  • Return:
    • encoded string

SELECT punycodeDecode('xn--mxacd')

  • Parameters
    • Input string (string to be decoded from punycode)
  • Return:
    • decoded string

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Feb 9, 2022
@CLAassistant
Copy link

CLAassistant commented Feb 9, 2022

CLA assistant check
All committers have signed the CLA.

@KochetovNicolai KochetovNicolai added the can be tested Allows running workflows for external contributors label Feb 9, 2022
@alesapin
Copy link
Member

alesapin commented Feb 9, 2022

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Feb 9, 2022

update

✅ Branch has been successfully updated

@AndreyTokmakov AndreyTokmakov changed the title WIP: Add punycode encode decode functions Add punycode encode decode functions Feb 11, 2022
@AndreyTokmakov AndreyTokmakov changed the title Add punycode encode decode functions WIP: Add punycode encode decode functions Feb 16, 2022
@novikd novikd self-assigned this Apr 19, 2022
@vitlibar vitlibar self-assigned this Jul 4, 2022
@OrlovDiga
Copy link

Hi everybody!
Is there any information when it will be released?

@vitlibar vitlibar force-pushed the add_punycode_encode_decode_functions branch from cea602c to 6130d33 Compare July 31, 2022 12:25
@robot-ch-test-poll robot-ch-test-poll added the submodule changed At least one submodule changed in this PR. label Jul 31, 2022
/*
punycode-sample.c 2.0.0 (2004-Mar-21-Sun)
http://www.nicemice.net/idn/
Adam M. Costello
Copy link
Member

@vitlibar vitlibar Jul 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved Adam's code to the contrib folder and added a license there (because it's a third-party code and it's better to separate it from our code).

}

constexpr bool isASCII(const uint32_t u32) noexcept
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too complex, just

bool isASCII(const char32_t c) { return c <= 127; }

is enough

{
return (input.length() >= (start + 4)) && (input[start] == 'x' ||input[start] == 'X') &&
(input[start + 1] == 'n' || input[start + 1] == 'N') &&
(input[start + 2] == '-' || input[start + 4] == '-');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem completely correct. "xn--" is the prefix of IDNA encoding and not punycode. I think it's better to have 4 functions instead of 2:

punycodeEncode('ввв') -> 'b1aaa'
punycodeDecode()
idnaEncode('ввв.яндекс.рф') -> 'xn--b1aaa.xn--d1acpjx3f.xn--p1ai'
idnaDecode()

if (std::string::npos == start)
return {0, input.size() };
std::string::size_type last = input.find('/', start + 3);
return {start + 3, std::string::npos == last ? input.size() : last};
Copy link
Member

@vitlibar vitlibar Jul 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is related to URL processing and not punycode or idna encoding. I believe punycode/idna mustn't try to extract a domain and suppose instead that the domain is already extracted. Anyway we already have the function domain() in ClickHouse which can be easily combined with our function:

SELECT idnaDecode(domain(...))

}

size_t punycodeDecode(const std::string_view input,
UInt8 * decoded_string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better and more safe to pass ColumnString::Chars & res_data to this function instead of UInt8 * decoded_string because this way you can just push back new characters to res_data without thinking about how much memory you need to reserve before.

std::for_each(s32.begin(), s32.end(), [](auto & ch) { ch = Poco::Unicode::toLower(ch);});
size_t decoded_length = punycodeEncodeInternal(s32, encoded_string + start);
if (0 == decoded_length)
throw std::runtime_error("Failed to encode");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If something is wrong with arguments of functions usually we throw Exception(ErrorCodes::BAD_ARGUMENTS). Exception is our exception class which provides much more information than std::runtime_error and normally can caught and handled.

@pufit pufit self-assigned this Jun 28, 2023
@alexey-milovidov alexey-milovidov changed the title WIP: Add punycode encode decode functions Add punycode encode decode functions Sep 15, 2023
@robot-clickhouse-ci-2
Copy link
Contributor

robot-clickhouse-ci-2 commented Sep 28, 2023

This is an automated comment for commit 6e1911a with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
Style CheckRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
Docs CheckBuilds and tests the documentation❌ failure

@romaincointepas
Copy link

We are highly interested in being able to use those 2 functions (on ClickHouse Cloud we can't use executable UDFs, and this can't be done with SQL UDFs).

@rschu1ze
Copy link
Member

I took the freedom and quickly re-implemented this in #57969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.