Skip to content

add corrMatrix,covarSampMatrix,covarPopMatrix AggregateFunction#44680

Merged
pufit merged 10 commits intoClickHouse:masterfrom
FFFFFFFHHHHHHH:corrMatrix
Feb 2, 2023
Merged

add corrMatrix,covarSampMatrix,covarPopMatrix AggregateFunction#44680
pufit merged 10 commits intoClickHouse:masterfrom
FFFFFFFHHHHHHH:corrMatrix

Conversation

@FFFFFFFHHHHHHH
Copy link
Copy Markdown
Contributor

@FFFFFFFHHHHHHH FFFFFFFHHHHHHH commented Dec 28, 2022

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add corrMatrix Aggregatefunction, calculating each two columns.
In addition, since Aggregatefunctions covarSamp and covarPop are similar to corr, I add covarSampMatrix, covarPopMatrix by the way.
@alexey-milovidov closes #44587

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Dec 28, 2022

CLA assistant check
All committers have signed the CLA.

@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-feature Pull request with new product feature label Dec 28, 2022
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Dec 28, 2022
INSERT INTO fh(a_value, b_value, c_value, d_value) VALUES (1, 5.6,-4.4, 2.6),(2, -9.6, 3, 3.3),(3, -1.3,-4, 1.2),(4, 5.3,9.7,2.3),(5, 4.4,0.037,1.222),(6, -8.6,-7.8,2.1233),(7, 5.1,9.3,8.1222),(8, 7.9,-3.6,9.837),(9, -8.2,0.62,8.43555),(10, -3,7.3,6.762);

SELECT corrMatrix(a_value) FROM (select a_value from fh limit 0);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just select corrMatrix(a_value) from fh?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are unstable and have some randomness.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the randomness is probably a floating-point numbers error. This can be dealt with a round function in the test query.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll add some test later.


SELECT corrMatrix(a_value) FROM (select a_value from fh limit 0);

SELECT corrMatrix(a_value) FROM (select a_value from fh limit 1);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why limit 1?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the data has only zero row or one rows, the output is stable.

{
assertNoParameters(name, parameters);
for (const auto & argument_type : argument_types)
if (!isNumber(argument_type))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Decimal data type?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just isNativeNumber is enough?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I'll fix this later.

template <StatisticsFunctionKind _kind>
struct StatFuncArbitraryArgData
{
using DataType = std::conditional_t<_kind == StatisticsFunctionKind::corr, CorrMoments<Float64>, CovarMoments<Float64>>;
Copy link
Copy Markdown
Collaborator

@ucasfl ucasfl Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all data type of the arguments are Float32, then result type is better be Float32.

Copy link
Copy Markdown
Contributor Author

@FFFFFFFHHHHHHH FFFFFFFHHHHHHH Jan 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forwarding all argument types to template is a problem.

template <typename StatFuncData>
class AggregateFunctionVarianceSimpleMatrix final
: public IAggregateFunctionDataHelper<StatFuncData, AggregateFunctionVarianceSimpleMatrix<StatFuncData>>
{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe no need to add a new class, we can add it to AggregateFunctionVarianceSimple, and control by a template parameter.

@ucasfl
Copy link
Copy Markdown
Collaborator

ucasfl commented Jan 30, 2023

Add functions to tests/fuzz/dictionaries/functions.dict

@FFFFFFFHHHHHHH
Copy link
Copy Markdown
Contributor Author

Add functions to tests/fuzz/dictionaries/functions.dict

OK👌

@ucasfl
Copy link
Copy Markdown
Collaborator

ucasfl commented Feb 1, 2023

@alexey-milovidov Hi, could you take a look at his first pull request?

@pufit pufit self-requested a review February 1, 2023 02:24
@@ -0,0 +1,24 @@
[[nan]]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are all test results nan?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because corr use a unstable algorithm, you could have a look at corr's test query in 00181_aggregate_functions_statistics.sql and 00181_aggregate_functions_statistics.reference.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those tests look like very old ones. But, I suppose something like SELECT arrayMap(x -> arrayMap(y -> round(y, 5), x), corrMatrix(...)) FROM fh; should do the trick and return consistent results. And it will actually validate that inner math logic works correctly in this case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Copy Markdown
Member

@pufit pufit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pufit pufit merged commit 0c13870 into ClickHouse:master Feb 2, 2023
@pufit pufit self-assigned this Feb 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A function corrMatrix to calculate correlation across all the pairs of arguments.

7 participants