Parallel parsing data formats #6553

Merged
nikitamikhaylov merged 36 commits into ClickHouse:master from nikitamikhaylov:parallel_parsing
Nov 21, 2019

Conversation

@nikitamikhaylov
Member

@nikitamikhaylov nikitamikhaylov commented Aug 19, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • New feature

Changelog entry (up to few sentences, not needed for non-significant PRs):

(!) This feature is enabled by default. (!)
Parallel parsing is implemented by the ParallelParsingBlockInputStream class. It has three roles: Segmentator, Parser, and Reader. Only the Parser is multithreaded. The Segmentator cuts the original file (or anything else read from a ReadBuffer) into small pieces; the piece size is controlled by the min_chunk_size_for_parallel_parsing setting. Several parsers (their number can be tuned with the max_threads_for_parallel_parsing setting) then turn these pieces into Blocks. Finally, the Blocks are inserted into the table without any shuffling, because the parallel parsing is order-preserving.
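
The roles described above can be illustrated with a minimal Python sketch. It is not the ClickHouse implementation: the function names `segment`, `parse_chunk`, and `parallel_parse` are hypothetical, the input is assumed to be newline-delimited, tab-separated text, and the two parameters only mirror the intent of min_chunk_size_for_parallel_parsing and max_threads_for_parallel_parsing. Because CPython threads do not parse in true parallel, the sketch shows the structure (segment, parse concurrently, emit in original order), not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List

Block = List[List[str]]  # stand-in for a parsed block: a list of rows


def segment(data: bytes, min_chunk_size: int) -> Iterator[bytes]:
    """Segmentator role: cut the input into pieces of at least
    min_chunk_size bytes, extending each piece to the next newline
    so that no row is split between two pieces."""
    if min_chunk_size < 1:
        raise ValueError("min_chunk_size must be at least 1")
    start = 0
    while start < len(data):
        end = start + min_chunk_size
        # Look for the first row boundary at or after the minimum size.
        newline = data.find(b"\n", end - 1) if end < len(data) else -1
        if newline == -1:
            yield data[start:]  # last piece: take everything that is left
            return
        yield data[start:newline + 1]
        start = newline + 1


def parse_chunk(chunk: bytes) -> Block:
    """Parser role: turn one piece into a block of rows."""
    lines = chunk.decode("utf-8").splitlines()
    return [line.split("\t") for line in lines if line]


def parallel_parse(data: bytes, min_chunk_size: int, max_threads: int) -> List[Block]:
    """Reader role: collect parsed blocks in the original order.
    Executor.map returns results in the order the pieces were submitted,
    regardless of which worker finishes first."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(parse_chunk, segment(data, min_chunk_size)))
```

A call such as `parallel_parse(b"1\ta\n2\tb\n3\tc\n4\td\n", min_chunk_size=5, max_threads=3)` returns blocks whose rows, read in sequence, match the row order of the input, which is the order-preserving property the description relies on.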

Old PR:
#5372

@nikitamikhaylov nikitamikhaylov added the "do not test" (disable testing on pull request) label Aug 19, 2019
@akuzm
Contributor

akuzm commented Aug 20, 2019

Do we really have to disable parallel parsing for so many tests? For the several-line examples that are usually tested, parallel parsing doesn't make any difference. And I'm not sure what difference it makes for big files, so maybe we just have to leave it always on.

If we do have to disable it for the tests, we can disable it globally in the configuration of the server that is used for the tests.

@alexey-milovidov
Member

> And I'm not sure what difference it makes for big files, so maybe we just have to leave it always on.

It will make the order of data non-deterministic.

Task №2 for @nikitamikhaylov is to add an option for order-preserving parallel parsing of data formats.

@nikitamikhaylov nikitamikhaylov added the "can be tested" and "pr-feature" (Pull request with new product feature) labels and removed the "do not test" (disable testing on pull request) label Sep 17, 2019
@nikitamikhaylov nikitamikhaylov changed the title from "[WIP] Parallel parsing data formats" to "Parallel parsing data formats" Sep 24, 2019
Contributor

My mind just goes blank when I see this function and how it's used. At the very least, it should have a sane name, and no bool flag that modifies behavior.
