Read Parquet files faster #47964
Conversation
Force-pushed from 88777cd to 570ae57
src/Core/BackgroundSchedulePool.cpp (outdated)
(This is unrelated to the rest of the PR. Without this, the server gives you a quiet SIGABRT if you set max_thread_pool_size too small. Now it'll log an error.)
Some speed numbers. Better than before, worse than what is possible.
I will start reviewing the changes. Looks very promising, great work!
src/IO/ParallelReadBuffer.cpp (outdated)
```cpp
        current_position += next_segment.size();
    }
    chassert(false);
```
Just to make it obvious that this branch always returns, when skimming the code. But looks like it's more confusing than helpful, removing.
Looks good in general. I will go deep into details later.
Couldn't actually remove any logic from FormatFactory; each bit seems useful, and almost all of it existed already, in different places. Moved things around a little to hopefully make it more clear. Lmk if you have better ideas, especially for how to organize the whole thing better; I don't like how awkward the whole

Same, couldn't think of a simpler way to do it (especially if we're going to make it decode columns in parallel too). Reorganized a little and added comments, open to suggestions.
src/IO/ReadWriteBufferFromHTTP.h (outdated)
This changes behavior slightly: previously StorageURL would send a HEAD request with "Range: 0-" for this check, now it's without "Range" (except if this parseFileInfo() is called from nextImpl(), which is not the important call site). I expect it's ok because HTTP servers are supposed to report "Accept-Ranges: bytes" anyway (?), but I don't actually know what HTTP servers are out there and what headers they send. For S3 this works, for althttpd this didn't work even before the change, didn't try anything else.
Still looking at the test failures.
Ok, fixed. There were two pre-existing bugs in ReadBufferFromS3::{seek,setReadUntilPosition} that I didn't notice, which previously weren't being hit because these methods were mostly unused.
```cpp
off_t ReadBufferFromS3::seek(off_t offset_, int whence)
{
    if (offset_ == offset && whence == SEEK_SET)
        impl.reset();
}

read_until_position = position;
impl.reset();
```
Two bugs: impl was being destroyed without resetting working_buffer, which points into impl.
Tests still fail, at least some of them because the order of rows is not preserved by default anymore. Should we make
Set it to true by default in this pull request. Then we will merge. Then we will change it to false in another pull request. That will allow us to highlight it better in the changelog.
Would be nice to add a virtual column for row index, then detect |
Force-pushed from 8f55cfd to 8dc28ad
kssenii left a comment:
Couldn't reproduce the CachedOnDiskReadBufferFromFile assertion failure locally: https://s3.amazonaws.com/clickhouse-test-reports/47964/8dc28ad500c79074bb7d638b5ec5f06b3c6ed30e/stress_test__debug_.html .
I've never seen this assertion fail before; I'd think it is related to the changes. Another CachedOnDiskReadBufferFromFile wants to read the same segment: it takes the remote_file_reader, probably seeks it to 0 (within working_buffer), then fails an assert when getFileOffsetOfBufferEnd() (732) is different from downloaded_size (0).
I'd think the problem is in another place, because of these lines in the logs:
```
2023.04.07 07:44:39.755076 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> CachedOnDiskReadBufferFromFile(00170_test/yaq/fqcczxfqquhmvacyzmwqwnoglwagg): Read 732 bytes, read type REMOTE_FS_READ_AND_PUT_IN_CACHE, position: 0, offset: 732
2023.04.07 07:44:39.777827 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Updated state from DOWNLOADING to PARTIALLY DOWNLOADED
2023.04.07 07:44:39.777959 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Resetting downloader from befe85ec-9b42-4d3c-b454-c75455a15a04:3363
2023.04.07 07:44:39.778193 [ 3363 ] {befe85ec-9b42-4d3c-b454-c75455a15a04} <Test> FileSegment(b2c2cee274e671de0e4265cbb005a440) : [0, 731]: Complete batch. (File segment: [0, 731], key: b2c2cee274e671de0e4265cbb005a440, state: PARTIALLY_DOWNLOADED, downloaded size: 0, reserved size: 732, downloader id: None, current write offset: 0, first non-downloaded offset: 0, caller id: befe85ec-9b42-4d3c-b454-c75455a15a04:3363, detached: 0, kind: Regular, unbound: 0)
```
732 bytes were read, then we returned from nextImpl without writing anything to the cache, and there is no error in the log; this is not normal. Probably the reason is that we passed working_buffer.begin() instead of position() (therefore it could silently write nothing; let's add an assertion to FileSegment::write like assert(data != nullptr)). But this is a bit strange: there was a guarantee that the impl buffer is not seekable (the restricted_seek check guaranteed that) and we always passed the external buffer forward into impl, so it was fine to use working_buffer.begin().
Undid most of the nonsense refactoring in CachedOnDiskReadBufferFromFile. Switched from making it accept a pre-existing reader in any state to making sure that the pre-existing reader is always in the correct state.
Force-pushed from 46e4ad8 to 3fd74ca
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Reading files in Parquet format is now much faster. IO and decoding are parallelized (controlled by the max_threads setting), and only the required data ranges are read. This is still not very efficient. Possible future improvements: