Replace mySQL dateformatter M behaviour from minutes to month name#46536
Replace mySQL dateformatter M behaviour from minutes to month name#46536JakeBamrah wants to merge 2 commits intoClickHouse:masterfrom
Conversation
|
@alexey-milovidov Hey Alexey! I'm trying to resolve a small issue that is a continuation of #46184, however, I'm running into an issue with memory leakage when trying to implement full-month formatting for dates (see TODO in description above). I'm relatively new to C++ and I am looking to improving my ability to work with the language. Do you have any guidance / resources that I can study to help me solve this issue? Thanks and have a good weekend! |
|
The code of
|
Thanks! I attempted the padding approach but thought the output looked inconsistent compared to all of the other mysql date formatter outputs (no padding). I'll have a go at the first option :) |
eb9a309 to
9bfb07d
Compare
|
I imagine it should be a completely separate code path - the logic will be the following:
|
Sorry for the delay, I was making this way more complicated than it needed to be. I needed to understand the code more (I spent hours trying to think of complex solutions because I didn't understand the code properly). I have made another attempt and I will keep an eye on the tests and update from here! |
|
Thanks for your great effort, highly appreciated. In general: the less code the better and the last iteration seems to be quite small 😉 ... I am taking a look now. |
Thank you and thanks for you patience! I think this will pass tests but I still think the behavior is not quite right. I'm off of work tomorrow so I will take another look at it. The formatting appears fine in isolation (e.g. |
| constexpr std::string_view monthsShort[] | ||
| = {"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}; | ||
|
|
||
| constexpr std::string_view dynamicDateFormatters[2] = {"M", "W"}; |
There was a problem hiding this comment.
I guess you want to check against %M and %W, otherwise this will match the format string Month: %m Weekday: %w (which does not contain dynamic formatters)
| dst_offsets.resize(vec.size()); | ||
|
|
||
| if constexpr (format_syntax == FormatDateTimeTraits::FormatSyntax::MySQL) | ||
| if (format_syntax == FormatDateTimeTraits::FormatSyntax::MySQL |
There was a problem hiding this comment.
Hi Jake,
I thought about this a bit more ... in l. 861, the string output column is resized according to the computed (fixed) output width. String columns in ClickHouse are organized in two parts: 1. the "data part" which is an array of the 0-terminated raw strings (i.e. strings are separated by a zero byte). 2. the "offset part" which is an array of relative integer offsets into the data part.
In l. 904, the code writes for each row and each instruction into the previously determined position which is passed as char * dest into each instruction (e.g. static size_t mysqlMonth(char * dest, Time source, UInt64, UInt32, const DateLUTImpl & timezone), l. 342). Instructions return how many characters were written into the output (and thus by how many characters dest is advanced before the instruction executes). For fixed-length formatters, this value will be the same across all rows. For variable length-formatters, it won't. This creates two problems:
- Formatters
%Mand%Wreturn "January" and "Monday" in the initial calculation of the width (l. 1053, l. 1053). Now imagine a simple example with format string%M %Wand a single input rowtoDate('2023-11-09'). We will allocate 7 (January) + 6 (Monday) + 1 (trailing zero byte) bytes in the output string data part array. However, both instructions will actually write 8 (November) + 8 (Thursday) + 1 (trailing zero byte) into the output structure. That's a memory corruption. To solve this, the output width for%Mand%Wmust be calculated based on the longest possible month/weekday (September, Wednesday). - Doing so creates another problem: If the actual input dates create output strings shorter than the previously calculated "worst-case" width (and that will be very often the case), then the result array will have a part at the end which was allocated but not written by any of the instructions. The data in this area will depend on whether the initial resize (l. 861) initializes the result array with zeros or not (in which case the data will be arbitrary garbage). To avoid that, we should shrink/resize the result array once more after all instructions ran to the end of the actually written data. EDIT: I am surprised to see that this happens already in l. 911 even though so far only fixed-size formatters were used.
There was a problem hiding this comment.
Hey Robert, (is it okay if I call you by that name?). Thanks for the detailed explanation 😊 That makes a lot of sense, my thoughts for the last few days have been to try and find a way to write the dynamic strings directly to the out_template but this led to a lot of code duplication and was just quite messy overall.
So I need to first ensure that I am using the largest possible string for both the M/W formatters to guarantee that we have enough memory allocated before we call our instructions that write to the output (l.904).
Despite the resizing (at l. 911) we are still getting garbage out for any fixed formatters adjacent to a dynamic formatter. For example:
select DATE_FORMAT(toDate32('2022-09-02'), '%r-%W'
>>> 7;53Friday // should be 12:00 AM-FridayDoes this mean that we need to find a way to resize as we write each instruction? Resizing the array after the all the results have been written will lead to data-loss somewhere along the entire string or do we look at moving up a layer and defining a new method altogether (alternative to executeType)? This would lead to quite a lot of duplicated code I imagine.
Lastly, this also means that the W change (already released) has introduced a bug into the codebase which wasn't caught by our tests, so I when/if I get this resolved I need to make sure I write new tests specifically for checking dynamic-formatters.
I must admit I am struggling a bit here but I'm learning a lot. Thanks as always for your patience 😊
There was a problem hiding this comment.
Hi Jake (yes, Robert is okay of course),
I don't have an explanation for the weird output, but I can perhaps debug tomorrow or so and check what goes wrong with %r-%W. In any case, a new top-level method next executeType would be quite ugly / add too much duplication, and I also don't think it is necessary. As far as I see, it is safe if each instruction (fixed or variable length) simply writes its output into the output structure which was allocated based on a worst-case length estimation.
Good point on W ... yes, there should definitely be a test for that, thanks for noticing.
There was a problem hiding this comment.
Oh, and please don't feel pressured to finish this PR. Just let me/us know if we should continue, after all we are payed $ (in my case €) to work on ClickHouse.
|
I took the freedom to continue/finish this PR in #47246 |
|
The missing piece which led to garbage in the output was the handling of characters different than formatters, e.g. the whitespaces in My PR blew a bit up because verbatim copy instructions need state to hold the characters to copy, and the previous Feel free to close this PR ... in any case, thanks for your effort!! |
Oh man, this is really awesome work, well done! I have just read through your P.R and I think it's great, the documentation and renamed variables make a surprisingly large difference in readability (especially to a newb like myself 😉 ). Unfortunately, I wouldn't have come up with a solution like this anytime soon, my cpp is just not there yet 😞 But, it is a good experience to see you tackle problems... it motivates me to improve. Thank you for your guidance and patience, I would love to attempt some more work for Clickhouse. If there are any tickets that you think would be a good fit, feel free to tag me in them (I will keep a lookout too). |
|
Resolved by #47246 |
Fixes: #46184
Changelog category:
Changelog entry:
Replace mySQL date-formatter
%Mto return full-month name rather than minutes.Note: This is a backwards incompatible change.
Documentation entry for user-facing changes
Motivation
Our current behaviour for mySQL date-formatter
%Mdoes not match the mySQL standard as it returns minutes[0-59]rather than the full month nameFebruary.TODO
This change is close to being complete, however, I'm running into memory leaks when formatting the output as full-month name multiple times. For example (leaks highlighted between
<>):I'm not entirely sure how to resolve this, I would be super grateful to any guidance given! :)