Parallel parsing of data formats. by overshov · Pull Request #5372 · ClickHouse/ClickHouse

overshov · 2019-05-21T20:49:37Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

New Feature
Performance Improvement

Short description (up to few sentences):
Parallel parsing of input data formats.

akuzm · 2019-05-24T10:21:30Z

dbms/src/Formats/CSVRowInputStream.cpp


+void registerChunkGetterCSV(FormatFactory & factory)
+{
+    factory.registerChunkGetter("CSV", [](


Could you make these just plain functions instead of anonymous lambdas? That's easier to navigate and also has less indentation.
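
A minimal sketch of what this suggestion could look like, assuming a toy registry. `ToyFactory`, `SegmentFn`, `registerSegmenter`, and `segmentCSV` are hypothetical stand-ins for illustration, not the classes or signatures from this PR, and the segmentation rule is deliberately trivial.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for FormatFactory; only the registration shape matters here.
using SegmentFn = std::function<std::size_t(const std::string &)>;

class ToyFactory
{
public:
    void registerSegmenter(const std::string & name, SegmentFn fn)
    {
        segmenters[name] = std::move(fn);
    }

    const SegmentFn & get(const std::string & name) const { return segmenters.at(name); }

private:
    std::map<std::string, SegmentFn> segmenters;
};

// A named free function: it has a searchable name and its body
// sits at a single level of indentation.
// Returns the length of the first line including '\n', or the whole input if there is none.
std::size_t segmentCSV(const std::string & data)
{
    const std::size_t pos = data.find('\n');
    return pos == std::string::npos ? data.size() : pos + 1;
}

// The registration function shrinks to one line.
void registerSegmenterCSV(ToyFactory & factory)
{
    factory.registerSegmenter("CSV", segmentCSV);
}
```

The named function can also be unit-tested directly, without going through the factory.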

akuzm · 2019-05-24T10:22:40Z

dbms/src/Formats/FormatFactory.cpp

    target = output_creator;
 }

+void FormatFactory::registerChunkGetter(const String & name, ChunkCreator chunk_creator)


So is it a getter or a creator? I'd name it something like FileSegmentationEngine.

akuzm · 2019-05-24T10:33:01Z

dbms/src/Formats/FormatFactory.h

 class FormatFactory final : public ext::singleton<FormatFactory>
 {
+public:
+    using ChunkCreator = std::function<bool(


This needs a comment describing the requirements for the implementation of a file segmentation function, i.e., what it should do, what the arguments mean, and when and in what context it is called.
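
As an illustration of the kind of comment being requested, here is a sketch of a documented contract. The alias `SegmentationFn`, its parameters, and the rules in the comment are assumptions made up for this example; the actual signature in the PR is truncated above and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

/// Hypothetical contract for a file segmentation function (illustrative names).
///
/// Called by the reader that owns the input, with `data` holding the unconsumed bytes.
/// Returns the size in bytes of a chunk at the start of `data` that
///   - ends exactly on a row boundary, so a parser can process it independently;
///   - is at least `min_bytes` long.
/// Returns 0 if no such chunk can be formed from `data` yet (the caller should read more).
using SegmentationFn = std::function<std::size_t(const std::string & data, std::size_t min_bytes)>;

/// Example for a newline-delimited format without escaping (an assumption made for brevity).
std::size_t segmentByNewline(const std::string & data, std::size_t min_bytes)
{
    // A chunk of length >= min_bytes must end at index >= min_bytes - 1.
    const std::size_t from = min_bytes > 0 ? min_bytes - 1 : 0;
    const std::size_t pos = data.find('\n', from);
    return pos == std::string::npos ? 0 : pos + 1;
}
```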

akuzm · 2019-05-24T10:37:41Z

dbms/src/IO/ReadHelpers.cpp

    }
 }

+bool safeInBuffer(ReadBuffer & in, DB::Memory<> & memory, char * & begin_pos, bool force)


What does this do, and what does "force" mean? It needs a comment and probably a better name.

KochetovNicolai · 2019-05-24T11:17:17Z

dbms/src/IO/ReadHelpers.cpp

+    {
+        size_t old_size = memory.size();
+        memory.resize(old_size + static_cast<size_t>(in.position() - begin_pos));
+        memcpy(memory.data() + old_size, begin_pos, in.position() - begin_pos);


Note: maybe it's possible to use memcpySmallAllowReadWriteOverflow15 here.
DB::Memory has 15 extra bytes. As for ReadBuffer, I can't be sure.

KochetovNicolai · 2019-05-24T11:24:17Z

dbms/src/IO/ReadHelpers.h

 /// Skip to next character after next unescaped \n. If no \n in stream, skip to end. Does not throw on invalid escape sequences.
 void skipToUnescapedNextLineOrEOF(ReadBuffer & buf);

+bool safeInBuffer(ReadBuffer & in, DB::Memory<> & memory, char * & begin_pos, bool force = false);


Wrong name:

safe -> save

We save data from the buffer to memory, but the name says the opposite.

The name doesn't indicate that the function may return false and do nothing (use a "try" prefix for that).

It's not clear in which cases the function won't copy data.

A comment is missing.
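
One way these points might be addressed is sketched below. The name `trySaveBufferToMemory`, the pointer-based parameters, and above all the rule for when copying happens are assumptions for illustration; the thread does not show the body of the original function, so this is not its actual behavior.

```cpp
#include <cassert>
#include <vector>

/// Appends the bytes in [begin_pos, pos) to `memory` and advances `begin_pos` to `pos`.
///
/// Copies only when the buffer is exhausted (pos == buf_end), because its contents are
/// about to be replaced, or when `force` is set, because the caller has finished a chunk.
/// Otherwise does nothing and returns false; the "try" prefix signals this to the caller.
bool trySaveBufferToMemory(const char * pos, const char * buf_end,
                           std::vector<char> & memory, const char *& begin_pos,
                           bool force = false)
{
    assert(begin_pos <= pos && pos <= buf_end);
    if (!force && pos != buf_end)
        return false;
    memory.insert(memory.end(), begin_pos, pos);
    begin_pos = pos;
    return true;
}
```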

KochetovNicolai · 2019-05-24T11:45:48Z

dbms/src/Formats/CSVRowInputStream.cpp


+void registerChunkGetterCSV(FormatFactory & factory)
+{
+    factory.registerChunkGetter("CSV", [](


The code is difficult to understand (I can't be sure it's correct).
Maybe it's possible to create a class that maintains the invariant for begin_pos and copies data to memory when needed?
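
A rough sketch of such a class, under the assumption that the scanner works on one buffer at a time and reports offsets into it. `ChunkAccumulator` and its methods are hypothetical names, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

/// Hypothetical helper that owns the position where unsaved data starts.
/// Invariant: every byte the scanner has consumed is either already in `saved`
/// or lies in [begin_pos, current scanner offset) of the current buffer.
class ChunkAccumulator
{
public:
    /// Call when a new buffer becomes current.
    void attach(const char * buffer, std::size_t size)
    {
        buf = buffer;
        buf_size = size;
        begin_pos = 0;
    }

    /// Call before the buffer is replaced, or when the chunk is complete.
    /// `pos` is the scanner's offset in the current buffer.
    void flush(std::size_t pos)
    {
        assert(begin_pos <= pos && pos <= buf_size);
        if (pos > begin_pos)
            saved.append(buf + begin_pos, pos - begin_pos);
        begin_pos = pos;
    }

    const std::string & data() const { return saved; }

private:
    const char * buf = nullptr;
    std::size_t buf_size = 0;
    std::size_t begin_pos = 0;
    std::string saved;
};
```

With the bookkeeping in one place, the scanning loop only has to decide where a row ends; it never touches `begin_pos` directly.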

KochetovNicolai · 2019-05-24T11:53:45Z

dbms/src/Formats/FormatFactory.h

-    using Creators = std::pair<InputCreator, OutputCreator>;
+    struct Creators
+    {
+        InputCreator first;


Better to rename first and second to input_creator and output_creator.
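
A small sketch of the effect of the rename at a call site. The creator signatures are simplified stand-ins (the real ones are not shown in this thread), and `hasInputFormat` is a hypothetical helper.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Simplified stand-ins: the real creators take streams and settings.
using InputCreator = std::function<std::string(const std::string &)>;
using OutputCreator = std::function<std::string(const std::string &)>;

struct Creators
{
    InputCreator input_creator;
    OutputCreator output_creator;
};

// With named members the call site says which creator is being checked,
// which `it->second.first` would not.
bool hasInputFormat(const std::map<std::string, Creators> & dict, const std::string & name)
{
    auto it = dict.find(name);
    return it != dict.end() && static_cast<bool>(it->second.input_creator);
}
```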

KochetovNicolai · 2019-05-24T12:01:21Z

dbms/src/IO/SharedReadBuffer.h

+namespace DB
+{
+
+class SharedReadBuffer : public BufferWithOwnMemory<ReadBuffer>


Missing comment.

KochetovNicolai · 2019-05-24T12:02:59Z

dbms/src/Formats/FormatFactory.cpp

+        auto buf_mutex = std::make_shared<std::mutex>();
+        for (size_t i = 0; i < max_threads_to_use; ++i)
+        {
+            buffers.emplace_back(std::make_unique<SharedReadBuffer>(buf, buf_mutex, chunk_getter, settings.min_bytes_in_chunk));


Needs a comment explaining why SharedReadBuffer is used.

ivan-v-kush · 2019-05-24T20:39:41Z

you're making changes in your master
It would be more convenient to create a feature branch, work in it, and then merge it.
http://nvie.com/posts/a-successful-git-branching-model/

alexey-milovidov · 2019-05-24T21:25:36Z

you're making changes in your master

It's totally fine, IMO.

ivan-v-kush · 2019-05-24T22:13:12Z

dbms/src/IO/SharedReadBuffer.h

+        if (eof)
+            return false;
+
+        std::lock_guard<std::mutex> lock(*mutex);


You need to check somewhere that mutex is not nullptr.
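
One possible place for that check is the constructor, so that every later use can rely on the mutex being present. The sketch below uses a hypothetical `SharedReader` class reduced to the locking part; it is not the `SharedReadBuffer` from this PR.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <utility>

/// Hypothetical reduction of a reader that shares one mutex with its siblings.
class SharedReader
{
public:
    explicit SharedReader(std::shared_ptr<std::mutex> mutex_) : mutex(std::move(mutex_))
    {
        // Reject a null mutex once, at construction, instead of on every read.
        if (!mutex)
            throw std::invalid_argument("SharedReader requires a non-null mutex");
    }

    /// Increments the shared counter under the lock and returns the new value.
    int next(int & shared_counter)
    {
        std::lock_guard<std::mutex> lock(*mutex);
        return ++shared_counter;
    }

private:
    std::shared_ptr<std::mutex> mutex;
};
```

Failing loudly at construction also avoids silently reporting end of input when the object was set up incorrectly.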

overshov · 2019-05-24T22:41:25Z

I have checked the stateless test logs from the sandbox. Most diffs are caused by reordering of output lines due to parallel reading. The exception is 00418_input_format_allow_errors, where there is a logical error.
Yes, each ReadBuffer has its own error counter, so the parameter input_format_allow_errors_num does not work.

alexey-milovidov · 2019-05-25T12:57:21Z

I have checked the stateless test logs from the sandbox. Most diffs are caused by reordering of output lines due to parallel reading.

This is the expected result (it is fine).

BTW, we have discussed an option for order-preserving parallel parsing of formats. It is a little bit more difficult.

Yes, each ReadBuffer has its own error counter, so the parameter input_format_allow_errors_num does not work.

This is also the expected result. Local counters for parallel processing are fine.
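
To make the consequence of local counters concrete, here is a toy model. `ChunkParser`, its fields, and the rule that an empty row counts as an error are all assumptions for the demonstration; the parsers run one after another here, since threading is not needed to show the counting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

/// Hypothetical per-parser state: each parallel parser owns its counter.
struct ChunkParser
{
    std::size_t allow_errors_num;
    std::size_t errors = 0;

    /// Returns false once this parser alone has exceeded its limit.
    /// A row counts as bad here if it is empty (an assumption for the demo).
    bool parse(const std::vector<std::string> & rows)
    {
        for (const auto & row : rows)
            if (row.empty() && ++errors > allow_errors_num)
                return false;
        return true;
    }
};
```

With a limit of 1, two parsers that each see one bad row both succeed, while a single parser that sees the same two bad rows fails. That difference is why tests depending on the exact error limit need parallel parsing turned off.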

alexey-milovidov · 2019-05-25T12:58:30Z

Now let's explicitly turn off the setting in those tests that depend on order or error counters.

amosbird · 2019-05-27T04:27:25Z

FYI https://github.com/yandex/ClickHouse/blob/96e3574/dbms/src/Interpreters/Aggregator.cpp#L1772
for order-preserving parallel processing.
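
For readers unfamiliar with the idea, one common way to preserve order is to number the chunks and release results strictly by number. The sketch below shows only that reordering step; `ReorderBuffer` is a hypothetical name, the class is not thread-safe as written, and it is not necessarily the approach taken at the linked line.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

/// Hypothetical reorder buffer: workers may finish in any order,
/// but results are released strictly by chunk sequence number.
class ReorderBuffer
{
public:
    /// Accepts the result for chunk `seq` and returns every result
    /// that has become ready, in sequence order.
    std::vector<std::string> push(std::size_t seq, std::string result)
    {
        pending.emplace(seq, std::move(result));
        std::vector<std::string> ready;
        for (auto it = pending.find(next_seq); it != pending.end(); it = pending.find(next_seq))
        {
            ready.push_back(std::move(it->second));
            pending.erase(it);
            ++next_seq;
        }
        return ready;
    }

private:
    std::map<std::size_t, std::string> pending;
    std::size_t next_seq = 0;
};
```

The cost is that a slow chunk holds back all later ones, which is one reason the order-preserving variant is harder to make fast.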

alexey-milovidov · 2019-05-27T19:24:17Z

dbms/src/Core/Settings.h

    M(SettingBool, distributed_group_by_no_merge, false, "Do not merge aggregation states from different servers for distributed query processing - in case it is for certain that there are different keys on different shards.") \
    M(SettingBool, optimize_skip_unused_shards, false, "Assumes that data is distributed by sharding_key. Optimization to skip unused shards if SELECT query filters by sharding_key.") \
    \
+    M(SettingBool, enable_parallel_reading, true, "Enable parallel_reading for several data formats (JSON, TSV, TKSV, Values, CSV).") \


Rename to input_format_parallel_parsing. Otherwise the name is extremely stupid.

alexey-milovidov · 2019-05-27T19:26:44Z

dbms/src/Formats/CSVRowInputStream.cpp

+                }
+            } else
+            {
+                in.position() = find_first_symbols<'"','\r', '\n'>(in.position(), in.buffer().end());


Inconsistent whitespace.

alexey-milovidov · 2019-05-27T19:26:59Z

dbms/src/Formats/CSVRowInputStream.cpp

+                {
+                    quotes = true;
+                    ++in.position();
+                } else if (*in.position() == '\n')


ivan-v-kush · 2019-05-30T08:46:27Z

dbms/src/IO/SharedReadBuffer.h

+    bool nextImpl() override
+    {
+        if (eof || !mutex)
+            return false;


Maybe also write a LOG_ERROR in the case of a null mutex?
What do you think, @KochetovNicolai?

kreuzerkrieg · 2019-06-24T16:10:52Z

dbms/src/Core/Settings.h

    \
+    M(SettingBool, input_format_parallel_parsing, true, "Enable parallel parsing for several data formats (JSON, TSV, TKSV, Values, CSV).") \
+    M(SettingUInt64, max_threads_for_parallel_reading, 0, "The maximum number of threads to parallel reading. By default, it is set to max_threads.") \
+    M(SettingUInt64, min_chunk_size_for_parallel_reading, 4, "The minimum chunk size, which each thread tries to parse under mutex in parallel reading.") \


Just curious, what does this "4" mean? It looks like a number of chunks, but judging by the while loop in CSVRowInputStream.cpp it looks like 4 bytes. Am I missing something?
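
If the value is a byte count, as the commenter suspects, its effect can be illustrated with a simplified CSV segmenter. The function `csvChunkSize` and its rules are assumptions for this example (no backslash escapes, no '\r' handling), not the loop from CSVRowInputStream.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

/// Returns the length of the shortest prefix of `data` that is at least `min_bytes`
/// long and ends right after a '\n' outside double quotes; 0 if there is none.
/// A doubled quote ("") inside a quoted field toggles the state twice,
/// so the state is still correct after it.
std::size_t csvChunkSize(const std::string & data, std::size_t min_bytes)
{
    bool in_quotes = false;
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        if (data[i] == '"')
            in_quotes = !in_quotes;
        else if (data[i] == '\n' && !in_quotes && i + 1 >= min_bytes)
            return i + 1;
    }
    return 0;
}
```

Under this reading, a minimum of 4 bytes would let a chunk end after almost any single row, so each acquisition of the shared lock would hand out very little work.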

stale · 2019-10-20T09:25:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tavplubix · 2019-11-26T17:09:44Z

Continued in #6553

overshov added 3 commits May 13, 2019 17:38

MVP commit

35d18b5

Better

86ad0c1

Better

dcc9599

alexey-milovidov changed the title ~~MVP for hse project~~ Parallel parsing of data formats. May 21, 2019

alexey-milovidov added can be tested pr-feature Pull request with new product feature labels May 21, 2019

overshov added 5 commits May 22, 2019 15:52

Merge branch 'master' into master

3824d22

Style fixes

d157ac1

Style fixes

393bd14

Remove wrong file

df22ada

Build fixed

abd3bc2

akuzm reviewed May 24, 2019

KochetovNicolai requested changes May 24, 2019

overshov added 10 commits May 24, 2019 16:42

Rename ChunkGetter

0c4dbdd

Verify pointer before reading from it

d8df12f

Better settings names and descriptions

5e127ae

Better name and comment for safeInBuffer

205cd79

Build fixes

78951c1

Improved descriptions

f7a28f0

Enable parallel reading by default, check what happens

9534f1c

Fix build

3efbf88

Style fix

c154115

Style fix

dd16a6f

Empty chunk size is broken

f67da3f

ivan-v-kush reviewed May 24, 2019

Disable paralle reading in tests, which depends on order or error cou…

a9cd1af

…nters

alexey-milovidov reviewed May 27, 2019

overshov added 11 commits May 28, 2019 15:05

Style fixed

828f993

Rename setting

955cc27

Disable parallel parsing for more tests

3dd151d

Update setting new name

73a081e

Remove debug code

c9d2a03

Fix style check

521eaeb

Check mutex equals to nullptr

7662693

More tests fixed

571acc7

Update repo

8f600e5

More tests fixed

a5210eb

Disable parallel parsing in more tests

863517e

ivan-v-kush reviewed May 30, 2019

kreuzerkrieg reviewed Jun 24, 2019

alexey-milovidov assigned nikitamikhaylov Aug 13, 2019

nikitamikhaylov mentioned this pull request Aug 20, 2019

Parallel parsing data formats #6553

Merged

stale bot added the not planned Known issue, no plans to fix it currenlty label Oct 20, 2019

blinkov removed the not planned Known issue, no plans to fix it currenlty label Oct 20, 2019

tavplubix closed this Nov 26, 2019

filimonov mentioned this pull request May 11, 2020

When I import a large data set, I get an error "Allocator: Cannot mremap memory chunk from 2.00 GiB to 16.00 EiB., errno: 12, strerror: Cannot allocate memory" #10817

Closed


10 participants
