Skip to content

Replace mySQL dateformatter M behaviour from minutes to month name#46536

Closed
JakeBamrah wants to merge 2 commits intoClickHouse:masterfrom
JakeBamrah:master
Closed

Replace mySQL dateformatter M behaviour from minutes to month name#46536
JakeBamrah wants to merge 2 commits intoClickHouse:masterfrom
JakeBamrah:master

Conversation

@JakeBamrah
Copy link
Copy Markdown
Contributor

@JakeBamrah JakeBamrah commented Feb 17, 2023

Fixes: #46184

Changelog category:

  • Backward Incompatible Change

Changelog entry:

Replace mySQL date-formatter %M to return full-month name rather than minutes.
Note: This is a backwards incompatible change.

Documentation entry for user-facing changes

  • Documentation has been written / updated as part of change

Motivation
Our current behaviour for mySQL date-formatter %M does not match the mySQL standard as it returns minutes [0-59] rather than the full month name February.

  • Example use:
select DATE_FORMAT(now(), '%M');
>>> February

TODO

This change is close to being complete, however, I'm running into memory leaks when formatting the output as full-month name multiple times. For example (leaks highlighted between <>):

SELECT DATE_FORMAT(toDate32('2022-03-01'), '%M%%M%M%M%M%M%M%%M%M%M%M%M%M%%M%M%M')
>>> March<ry>MarchMarchMarchMarchMarch<nu>MarchMarchMarchMarchMarch<yJ>MarchMarch

I'm not entirely sure how to resolve this, I would be super grateful to any guidance given! :)

@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-improvement Pull request with some product improvements label Feb 17, 2023
@JakeBamrah
Copy link
Copy Markdown
Contributor Author

@alexey-milovidov Hey Alexey! I'm trying to resolve a small issue that is a continuation of #46184, however, I'm running into an issue with memory leakage when trying to implement full-month formatting for dates (see TODO in description above).

I'm relatively new to C++ and I am looking to improving my ability to work with the language. Do you have any guidance / resources that I can study to help me solve this issue? Thanks and have a good weekend!

@qoega qoega added the can be tested Allows running workflows for external contributors label Feb 18, 2023
@alexey-milovidov
Copy link
Copy Markdown
Member

The code of formatDateTime expects a fixed-size string in the result. There are two options:

  • make a second code path for non-fixed size strings, and fall back to it if there are non-fixed size substitutions;
  • pad month names with whitespace to the longest month, so it will be May instead of May;

@JakeBamrah
Copy link
Copy Markdown
Contributor Author

The code of formatDateTime expects a fixed-size string in the result. There are two options:

- make a second code path for non-fixed size strings, and fall back to it if there are non-fixed size substitutions;

- pad month names with whitespace to the longest month, so it will be `May      ` instead of `May`;

Thanks! I attempted the padding approach but thought the output looked inconsistent compared to all of the other mysql date formatter outputs (no padding). I'll have a go at the first option :)

@rschu1ze rschu1ze self-assigned this Feb 23, 2023
@JakeBamrah JakeBamrah marked this pull request as ready for review February 24, 2023 17:06
@JakeBamrah JakeBamrah force-pushed the master branch 3 times, most recently from eb9a309 to 9bfb07d Compare February 25, 2023 16:30
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added pr-backward-incompatible Pull request with backwards incompatible changes and removed pr-improvement Pull request with some product improvements labels Feb 25, 2023
@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Feb 25, 2023

I imagine it should be a completely separate code path - the logic will be the following:

  • do all substitutions produce fixed-size output?
  • then use the existing fast code path with the template;
  • otherwise, use an entirely separate code path;

@JakeBamrah
Copy link
Copy Markdown
Contributor Author

I imagine it should be a completely separate code path - the logic will be the following:

* do all substitutions produce fixed-size output?

* then use the existing fast code path with the template;

* otherwise, use an entirely separate code path;

Sorry for the delay, I was making this way more complicated than it needed to be. I needed to understand the code more (I spent hours trying to think of complex solutions because I didn't understand the code properly). I have made another attempt and I will keep an eye on the tests and update from here!

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Mar 2, 2023

Thanks for your great effort, highly appreciated. In general: the less code the better and the last iteration seems to be quite small 😉 ... I am taking a look now.

@JakeBamrah
Copy link
Copy Markdown
Contributor Author

Thanks for your great effort, highly appreciated. In general: the less code the better and the last iteration seems to be quite small 😉 ... I am taking a look now.

Thank you and thanks for you patience! I think this will pass tests but I still think the behavior is not quite right. I'm off of work tomorrow so I will take another look at it. The formatting appears fine in isolation (e.g. %M%M) but is still incorrect when combined with other strings formatters 😞

constexpr std::string_view monthsShort[]
= {"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};

constexpr std::string_view dynamicDateFormatters[2] = {"M", "W"};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you want to check against %M and %W, otherwise this will match the format string Month: %m Weekday: %w (which does not contain dynamic formatters)

dst_offsets.resize(vec.size());

if constexpr (format_syntax == FormatDateTimeTraits::FormatSyntax::MySQL)
if (format_syntax == FormatDateTimeTraits::FormatSyntax::MySQL
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jake,

I thought about this a bit more ... in l. 861, the string output column is resized according to the computed (fixed) output width. String columns in ClickHouse are organized in two parts: 1. the "data part" which is an array of the 0-terminated raw strings (i.e. strings are separated by a zero byte). 2. the "offset part" which is an array of relative integer offsets into the data part.

In l. 904, the code writes for each row and each instruction into the previously determined position which is passed as char * dest into each instruction (e.g. static size_t mysqlMonth(char * dest, Time source, UInt64, UInt32, const DateLUTImpl & timezone), l. 342). Instructions return how many characters were written into the output (and thus by how many characters dest is advanced before the instruction executes). For fixed-length formatters, this value will be the same across all rows. For variable length-formatters, it won't. This creates two problems:

  1. Formatters %M and %W return "January" and "Monday" in the initial calculation of the width (l. 1053, l. 1053). Now imagine a simple example with format string %M %W and a single input row toDate('2023-11-09'). We will allocate 7 (January) + 6 (Monday) + 1 (trailing zero byte) bytes in the output string data part array. However, both instructions will actually write 8 (November) + 8 (Thursday) + 1 (trailing zero byte) into the output structure. That's a memory corruption. To solve this, the output width for %M and %W must be calculated based on the longest possible month/weekday (September, Wednesday).
  2. Doing so creates another problem: If the actual input dates create output strings shorter than the previously calculated "worst-case" width (and that will be very often the case), then the result array will have a part at the end which was allocated but not written by any of the instructions. The data in this area will depend on whether the initial resize (l. 861) initializes the result array with zeros or not (in which case the data will be arbitrary garbage). To avoid that, we should shrink/resize the result array once more after all instructions ran to the end of the actually written data. EDIT: I am surprised to see that this happens already in l. 911 even though so far only fixed-size formatters were used.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Robert, (is it okay if I call you by that name?). Thanks for the detailed explanation 😊 That makes a lot of sense, my thoughts for the last few days have been to try and find a way to write the dynamic strings directly to the out_template but this led to a lot of code duplication and was just quite messy overall.

So I need to first ensure that I am using the largest possible string for both the M/W formatters to guarantee that we have enough memory allocated before we call our instructions that write to the output (l.904).

Despite the resizing (at l. 911) we are still getting garbage out for any fixed formatters adjacent to a dynamic formatter. For example:

select DATE_FORMAT(toDate32('2022-09-02'), '%r-%W' 
>>> 7;53Friday  // should be 12:00 AM-Friday

Does this mean that we need to find a way to resize as we write each instruction? Resizing the array after the all the results have been written will lead to data-loss somewhere along the entire string or do we look at moving up a layer and defining a new method altogether (alternative to executeType)? This would lead to quite a lot of duplicated code I imagine.

Lastly, this also means that the W change (already released) has introduced a bug into the codebase which wasn't caught by our tests, so I when/if I get this resolved I need to make sure I write new tests specifically for checking dynamic-formatters.

I must admit I am struggling a bit here but I'm learning a lot. Thanks as always for your patience 😊

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Jake (yes, Robert is okay of course),

I don't have an explanation for the weird output, but I can perhaps debug tomorrow or so and check what goes wrong with %r-%W. In any case, a new top-level method next executeType would be quite ugly / add too much duplication, and I also don't think it is necessary. As far as I see, it is safe if each instruction (fixed or variable length) simply writes its output into the output structure which was allocated based on a worst-case length estimation.

Good point on W ... yes, there should definitely be a test for that, thanks for noticing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and please don't feel pressured to finish this PR. Just let me/us know if we should continue, after all we are payed $ (in my case €) to work on ClickHouse.

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Mar 4, 2023

I took the freedom to continue/finish this PR in #47246

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Mar 5, 2023

The missing piece which led to garbage in the output was the handling of characters different than formatters, e.g. the whitespaces in %M %i %q. If all formatters are fixed-width, then instructions assume such characters are already part of the result column (which is initialized based on the template) and skip them. If there is at least one variable-width formatter, this assumption is no longer true and we need extra instructions which copy the characters verbatim. If you are interested in the details, I added a comment to the code that explains this in more detail.

My PR blew a bit up because verbatim copy instructions need state to hold the characters to copy, and the previous Action class implementation based on pointers to static functions could not provide that. So I had to change that to pointers to member functions, which in turn led to some troubles with choosing the right Action constructor overload. As a result, the Action instance is now build incrementally. Shouldn't be any slower (but it is for sure uglier ^^).

Feel free to close this PR ... in any case, thanks for your effort!!

@JakeBamrah
Copy link
Copy Markdown
Contributor Author

The missing piece which led to garbage in the output was the handling of characters different than formatters, e.g. the whitespaces in %M %i %q. If all formatters are fixed-width, then instructions assume such characters are already part of the result column (which is initialized based on the template) and skip them. If there is at least one variable-width formatter, this assumption is no longer true and we need extra instructions which copy the characters verbatim. If you are interested in the details, I added a comment to the code that explains this in more detail.

My PR blew a bit up because verbatim copy instructions need state to hold the characters to copy, and the previous Action class implementation based on pointers to static functions could not provide that. So I had to change that to pointers to member functions, which in turn led to some troubles with choosing the right Action constructor overload. As a result, the Action instance is now build incrementally. Shouldn't be any slower (but it is for sure uglier ^^).

Feel free to close this PR ... in any case, thanks for your effort!!

Oh man, this is really awesome work, well done! I have just read through your P.R and I think it's great, the documentation and renamed variables make a surprisingly large difference in readability (especially to a newb like myself 😉 ). Unfortunately, I wouldn't have come up with a solution like this anytime soon, my cpp is just not there yet 😞 But, it is a good experience to see you tackle problems... it motivates me to improve. Thank you for your guidance and patience, I would love to attempt some more work for Clickhouse. If there are any tickets that you think would be a good fit, feel free to tag me in them (I will keep a lookout too).

@JakeBamrah
Copy link
Copy Markdown
Contributor Author

Resolved by #47246

@JakeBamrah JakeBamrah closed this Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-backward-incompatible Pull request with backwards incompatible changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add function DATE_FORMAT as a compatibility alias.

6 participants