Parallel parsing data formats by nikitamikhaylov · Pull Request #6553 · ClickHouse/ClickHouse

nikitamikhaylov · 2019-08-19T18:54:27Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Category: New feature

Changelog entry:

(!) This feature is enabled by default. (!)
Parallel parsing is implemented by the ParallelParsingBlockInputStream class. It has three different roles: Segmentator, Parser and Reader. Only the Parser is multithreaded. The class works as follows. The Segmentator cuts the original file (or any other data coming from a ReadBuffer) into small pieces; you can control the piece size with the min_chunk_size_for_parallel_parsing setting. Then several parsers (their number can be tuned with the max_threads_for_parallel_parsing setting) turn these pieces into Blocks. After that, the Blocks are inserted into the table without any shuffling, because the parallel parsing is order-preserving.
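The three roles described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the names `segment`, `parse`, `parallelParse` and the `Block` alias are hypothetical, the "format" is just one integer per line, and the thread limit set by max_threads_for_parallel_parsing is not modeled (one task is started per piece).

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a Block: the rows parsed from one piece.
using Block = std::vector<int>;

// Segmentator role (sketch): cut the input into pieces of at least
// min_chunk_size bytes, extending each piece to the next newline so
// that no row is split between two pieces.
std::vector<std::string> segment(const std::string & input, std::size_t min_chunk_size)
{
    std::vector<std::string> pieces;
    std::size_t begin = 0;
    while (begin < input.size())
    {
        std::size_t end = input.find('\n', begin + min_chunk_size);
        end = (end == std::string::npos) ? input.size() : end + 1;
        assert(end > begin); // every piece is non-empty, so the loop makes progress
        pieces.push_back(input.substr(begin, end - begin));
        begin = end;
    }
    return pieces;
}

// Parser role (sketch): turn one piece into a Block. This is the only
// part that runs on several threads.
Block parse(const std::string & piece)
{
    Block block;
    std::istringstream in(piece);
    int value;
    while (in >> value)
        block.push_back(value);
    return block;
}

// Reader role (sketch): start one parsing task per piece, then collect
// the Blocks in the order the pieces were cut, not in completion order.
std::vector<int> parallelParse(const std::string & input, std::size_t min_chunk_size)
{
    std::vector<std::string> pieces = segment(input, min_chunk_size);

    std::vector<std::future<Block>> futures;
    futures.reserve(pieces.size());
    for (const std::string & piece : pieces)
        futures.push_back(std::async(std::launch::async, parse, piece));

    std::vector<int> rows;
    for (auto & future : futures)
    {
        Block block = future.get(); // waits for this piece even if later ones finished first
        rows.insert(rows.end(), block.begin(), block.end());
    }
    return rows;
}
```

In this sketch the order is preserved only because the futures are consumed in submission order. Collecting Blocks as soon as any parser finishes would be simpler to schedule but would make the row order depend on thread timing.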

Old PR:
#5372

akuzm · 2019-08-20T09:29:02Z

Do we really have to disable parallel parsing for so many tests? For the several-line examples that are usually tested, parallel parsing doesn't make any difference. And I'm not sure what difference it makes for big files, so maybe we just have to leave it always on.

If we do have to disable it for the tests, we can disable it globally in the configuration of the server that is used for the tests.

alexey-milovidov · 2019-08-21T03:34:39Z

> And I'm not sure what difference it makes for big files, so maybe we just have to leave it always on.

It will make the order of data non-deterministic.

The task №2 for @nikitamikhaylov is to make an option for order-preserving parallel parsing of data formats.

dbms/src/DataStreams/ParallelParsingBlockInputStream.h

dbms/src/DataStreams/ParallelParsingBlockInputStream.cpp

dbms/src/Formats/FormatFactory.cpp

dbms/src/Processors/Formats/Impl/CSVRowInputFormat.cpp

akuzm · 2019-10-02T10:16:35Z

dbms/src/IO/ReadHelpers.h

My mind just goes blank when I see this function and how it's used. At the very least, it should have a sane name, and no bool flag that modifies behavior.

dbms/tests/queries/0_stateless/00395_nullable.sql

dbms/src/IO/SharedReadBuffer.h

dbms/src/Formats/FormatFactory.h

dbms/src/Core/Settings.h

…el_parsing

…refix

nikitamikhaylov added the "do not test" (disable testing on pull request) label on Aug 19, 2019

nikitamikhaylov requested a review from KochetovNicolai on August 19, 2019 18:54

nikitamikhaylov force-pushed the parallel_parsing branch from 17a7d05 to 504566a on September 16, 2019 16:16

nikitamikhaylov added the "can be tested" and "pr-feature" (Pull request with new product feature) labels and removed the "do not test" (disable testing on pull request) label on Sep 17, 2019

nikitamikhaylov changed the title from "[WIP] Parallel parsing data formats" to "Parallel parsing data formats" on Sep 24, 2019

parallel parsing

d47d4cd

nikitamikhaylov force-pushed the parallel_parsing branch from a326910 to d47d4cd on October 1, 2019 10:49

lost files

d9c12f7