
add Window Function ExponentialSmoothingAlpha #1

Open
aleks5d wants to merge 10 commits into master from exponential-smoothing

Conversation

@aleks5d
Owner

@aleks5d aleks5d commented May 6, 2023

Changelog category:

  • New Feature

Changelog entry:

Added a new aggregate function: exponentialSmoothingAlpha. It can be used to calculate exponential smoothing with a given alpha parameter.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@aleks5d aleks5d self-assigned this May 6, 2023

@rschu1ze rschu1ze left a comment


Thanks! I hope that the comments will be useful. Please add some functional SQL tests which demonstrate the new SQL functions.















@aleks5d aleks5d force-pushed the exponential-smoothing branch from 3bd0def to c3b6553 Compare May 15, 2023 21:49

@aleks5d aleks5d force-pushed the exponential-smoothing branch from c3b6553 to e3c0986 Compare May 16, 2023 14:18
@aleks5d aleks5d force-pushed the exponential-smoothing branch from 9c3b434 to 4ba5009 Compare May 16, 2023 17:42
@aleks5d aleks5d changed the title add AggregateFunctionExponentialSmoothingAlpha add Window Function ExponentialSmoothingAlpha May 18, 2023
name, maximal_arity);
}

template<std::size_t minimal_arity>

(Somewhat related: performance is not a concern for assertArityAtLeast() and assertArityAtMost(), since these functions are called only once during some initialization. Templatization is therefore overkill: we could just pass the min/max arity as a normal size_t parameter and reduce the code size a little.)

#include <limits>

#include <stdexcept>
#include <optional>

(Cosmetic: let's keep sorted things sorted (l. 3-6))

};

/// Helper struct contains functions for all Counters
struct DataHelper

The name is too generic; what about ExponentiallySmoothedAlphaBase?

Or maybe even better: don't use virtual inheritance at all (especially the rather uncommon private inheritance, l. 204). Instead, name this struct ExponentialSmoothingHelper and call the static methods directly, e.g. ExponentialSmoothingHelper::scale_one_minus_value(...).

while (count)
{
if (count & 1)
{

(cosmetic: it is common in ClickHouse to omit parentheses in single-line if/for/while statements)

struct DataHelper
{
/// equivalent of pow(value, count).
/// using binary power for better precision

Minor: "binary power" --> "binary exponentiation"

And: is this trick really used for "better precision" or also for performance reasons? (I am actually not sure how std::pow is implemented, maybe it uses a similar technique?)

@@ -0,0 +1,42 @@
/* exponentialSmoothingAlpha tests */
SELECT exponentialSmoothingAlpha()(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 2 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99; -- { serverError BAD_ARGUMENTS }

Instead of generating the same table data over and over, we could create them once at the beginning of the test (use INSERT SELECT syntax) and then only reference these tables by the queries.

(also, numbers(100) will do the job just fine, the amount of data is too small to require multithreading)

SELECT exponentialSmoothingAlpha(0.5) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 2 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99; -- { serverError BAD_ARGUMENTS }
SELECT exponentialSmoothingAlpha(0)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 2 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;
SELECT exponentialSmoothingAlpha(0.2)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 2 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;
SELECT exponentialSmoothingAlpha(0.5)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 3 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;

Why are we using different values here (number * 2 in l. 9 vs. number * 3 in l. 10)? I think the results will be easier to verify when all queries use the same data.

EDIT: Thinking about it, maybe some manual test data will be sufficient already and more intuitive. E.g. CREATE a table, insert (e.g.) 6 rows manually, then run the queries with different alpha against that data. We would not need an OFFSET clause because there are only a few rows. And because of that, it will also be easy to verify that everything is correct by looking at the expected results file.

SELECT exponentialSmoothingAlpha(0)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 2 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;
SELECT exponentialSmoothingAlpha(0.2)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 2 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;
SELECT exponentialSmoothingAlpha(0.5)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 3 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;
SELECT exponentialSmoothingAlpha(0.8)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 4 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;

I would find one or two tests with more "advanced" window function syntax interesting: a PARTITION BY clause, and a limited frame (ROWS BETWEEN ... PRECEDING AND CURRENT ROW), just to check that it works as expected.

SELECT exponentialSmoothingAlpha(0.8)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 4 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;
SELECT exponentialSmoothingAlpha(1)(value) over (ORDER BY timestamp ASC) from (SELECT number as timestamp, number * 4 - (number % 7 - 3) / 7 as value from numbers_mt(100)) OFFSET 99;

/* exponentialSmoothingAlphaWithTime tests */

Tip: use SELECT 'exponentialSmoothingAlphaWithTime' instead of comments (l. 14). This will conveniently separate the expected results for the different sections in the reference file.

* The exponentially smoothed value is a weighted average with weights proportional to some function of the time passed.
* In this class timestamps exist, so the time passed is the largest timestamp minus the value's timestamp.
* Skipped timestamps are filled with the current value of the counter.
* For example, if alpha = 1/3 and values with timestamps (x0, 0), (x1, 2), (x2, 4) are added, the result will be x0 * 36/81 + x1 * 18/81 + x2 * 27/81.

Sorry, the example is not clear. How are 36, 18 and 27 calculated?
