Skip to content

groupArraySample#8286

Merged
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
amosbird:grouparraysample
Jan 13, 2020
Merged

groupArraySample#8286
alexey-milovidov merged 3 commits intoClickHouse:masterfrom
amosbird:grouparraysample

Conversation

@amosbird
Copy link
Copy Markdown
Collaborator

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):

support reservior sampling for groupArray -- groupArraySample.

groupArraySample(10)(v) -- sample at most 10 records per group

groupArraySample(10, 1)(v) -- with a different seed (default 123456)

Currently the implementation is not uniform, and all groups have the RNG seeded in the same way. More variants can be added in the future.

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov Dec 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, time to do some micro benchmarks :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just using pcg64?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be79166 was just a modernization of code base. Without much focus on performance on this particular function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate rng for every state looks too much. We can use thread_local_rng.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we need deterministic result?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can have a very unrandom sampler and a truely uniform sampler.

But do we need deterministic result?

I don't know and I'm always curious about the MergeTree's sample clause and it's deterministic characteristic.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thread_local_rng will be as random as separate rngs.

I'm always curious about the MergeTree's sample clause and it's deterministic characteristic.

Yes, but for groupArraySample it's not enough to just use deterministic rng, you should also run in single thread to get deterministic result. But it's not practical.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have a thought about "groupArrayLast" :)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate? what elements should be victimized in such case

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groupArrayLast is intended to select "last" elements. It can be implemented as a cycle buffer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it useful for one of my use cases, so here it is - #44521

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add performance test?

It will be very interesting to test various sampling methods.
For example, I still prefer something more simple and non-uniform like:
fill array, then replace every element with 1/2 probability, then replace every element with 1/3 probability, etc...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'll try adding more samplers and make perf comp tests

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do it on create.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, missed that interface :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very painful. Maybe rng is POD and we can just writePODBinary?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately it's not POD

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it's not POD but it should behave this way.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused code?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, testing code

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code duplication.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants