Conversation
Doing this, see the new methods in SeekableReadBuffer. It turned out ok, probably better than the SeekableReadBufferFactory nonsense. Also added session reuse in ReadWriteBufferFromHTTP, it sped up
I wonder if it would be possible to break the PR into smaller gradual PRs (according to the description, it addresses many issues at once, so that looks feasible). Big PRs are quite inconvenient for reviewers. Generally, smaller gradual changes are better.
Ok, will split.
@devcrafter, is there a good way to do dependent PRs? For #49539, #49540, and #49541 I've set the base branch to the parent PR's branch. The list of commits in each PR is correct, but CI didn't run any of the checks.
No good way in GitHub. But if you merge a PR that incorporates some other PRs, they are marked as merged. So you can test and review them separately and merge all at once.
We rarely ask to split PRs. There is an option to review changes per commit if the history is split into separate commits.
As mentioned already, unfortunately there is no good way for dependent PRs. Gerrit does it much better, but it's a different story. |
imo, it makes sense here. To what degree, the author can decide (and it's nice if the author considers the reviewer's PoV).
It doesn't work since most of the time authors don't put effort into separating commits logically (it's also not necessary). Just tackling different issues in different PRs already helps.
I'm ok with it here, it was a bit messy. But it did take 2-3 hours.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Parquet's parallel random reads from a local file are now actually parallel (`pread()` calls or `memcpy` from mmap), instead of locking two mutexes for no reason.
Added some limited HTTP session pooling for StorageURL. Sessions are now reused by all (non-concurrent) read operations on the same ReadBuffer. This makes `SELECT count(*)` from remote (different-region) parquet files 2.5x faster! We should add similar pooling for StorageS3, but that'll take more work: there's no simple way to attach the pool to a ReadBufferFromS3 instance, so it's either a global pool (with more care for eviction policy and correctness) or some hacks to indirectly pass a pool pointer to PocoHTTPClient.
Fixed a confusing error message when getting an HTTP 404 error: it used to say "Not a Parquet file" because asArrowFileLoadIntoMemory() was catching exceptions too eagerly.
The behavior for skipping not-found URLs in globs was somewhat inconsistent, and in part broken by me in #47964: it skipped files by pretending they were empty. This didn't work when the file is compressed: the decompression would error out trying to decompress an empty string. Fixed by explicitly skipping the URL if the initial HTTP request failed, instead of pretending that the file is empty.
`input_format_parquet_preserve_order` now implies `parallelize_output_from_storages`.

In #47964 I refactored things to use SeekableReadBufferFactory in more places, as the abstraction for random-access "files". I think that was a mistake.
This PR re-refactors everything in the opposite direction: it adds a thread-safe positioned-read method in SeekableReadBuffer and uses it whenever parallel random reading is needed (ParallelReadBuffer and ParquetBlockInputFormat). This turned out better than the factory stuff, I think: less (?) code, fewer boilerplate objects, less useless extra CPU work (like maintaining a pool of ReadBuffers to reuse), and less awkward code like `if`s everywhere for buffer vs factory. The factory is removed altogether.

Inverted how
`attachSessionData()` is used for marking whether HTTP sessions can be reused. Sessions were considered reusable by default, and ReadWriteBufferFromHTTP had to mark them as non-reusable on any error; it was missing some cases because there are lots of them. This PR flips it: sessions are not reusable unless marked as reusable, which ReadWriteBufferFromHTTP does after receiving the full response successfully. It seems silly that any of this needs to be done at all, instead of HTTPClientSession doing this tracking internally, but this was less work than changing HTTPClientSession.

Split ReadWriteBufferFromHTTP from header-only into .h and .cpp files. This had no (easily-)measurable effect on build time (I ran a clean build 3 times with and without the change; the difference was within the noise, which was ~0.1%). So I probably won't waste time on this type of refactoring in the future :)