Conversation
As I understand, you have to change the base branch to master to run CI.
This is an automated comment for commit 9a9a117 with a description of existing statuses. It is updated for the latest CI run.
A submodule was committed by mistake.
/// The pool must outlive all session pointers created by it.
///
/// Session is only reused if it has HTTPSessionReuseTag attached, see comment in HTTPCommon.h
class LocallyPooledSessionFactory
Could we replace it with the HTTPSessionPool used for the S3 read buffer? It has a mode where new sessions are allocated without waiting once the limit is exceeded, but these excess sessions are removed when the request finishes. IMO that is better behaviour, because there is not much sense in keeping a bursted session forever: it is likely to expire and reconnect on the next use anyway.
I was too afraid to use it here:
- It never evicts endpoints from endpoints_pool. If the user touches a million different hosts (even invalid ones) without restarting the server, it'll get quite big.
- For each endpoint, it only evicts sessions when new requests come in. If the user touches 100K different hosts with 30 sessions each, we'll be holding 3M fds indefinitely, if I'm reading the code right.
- I don't know enough about HTTP to be confident that matching the tuple <host, port, https, proxy_host, proxy_port, proxy_https, max_connections_per_endpoint, resolve_host, wait_on_pool_size_limit> is sufficient for reuse in all cases. Maybe a URL can have some other connection settings not listed here? Maybe I'm being too paranoid.

These are all fine for S3, where the URLs are mostly under our control and not very diverse, but for arbitrary URLs from the user I chickened out. Guess I'll change it to use the global pool and add a setting to disable pooling altogether.
"If the user touches a million different hosts" — you're talking about the url() function? Indeed, with S3 I don't think that is possible.
@al13n321, please check what is going on with the test.
That PR is already old. I do not see a reason to continue it.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Added some limited HTTP session pooling for StorageURL. Sessions are now reused by all (non-concurrent) read operations on the same ReadBuffer. This makes SELECT count(*) from remote (different-region) parquet files 2.5x faster!

Inverted how attachSessionData() is used for marking whether HTTP sessions can be reused. Sessions were considered reusable by default, and ReadWriteBufferFromHTTP had to mark them as non-reusable on any error. It was missing some cases because there are lots of them. This PR flips it: sessions are not reusable unless marked as reusable, which ReadWriteBufferFromHTTP does after receiving the full response successfully. It seems silly that any of this needs to be done at all, instead of HTTPClientSession doing this tracking internally; but this was less work than changing HTTPClientSession.

We should add similar pooling for StorageS3, but that'll take more work: there's no simple way to attach the pool to a ReadBufferFromS3 instance, so it's either a global pool (with more care for eviction policy and correctness) or some hacks to indirectly pass a pool pointer to PocoHTTPClient.