Use urllib handler for s3:// and gs://, improve url_exists through HEAD requests#34324
haampie merged 4 commits into spack:develop
Force-pushed from 0a3c15d to fac0cce
Make `url_exists` do a HEAD request for http/https/s3 protocols. Rework the opener: construct it once and only once, and dynamically dispatch to the right one based on config.
Force-pushed from df02ab8 to ad9f686
scottwittenburg left a comment
This looks great, thanks @haampie! One thing I didn't understand from the description. You mentioned "Keep using standard HTTPRedirectHandler for HEAD requests", can you point me to where that is happening? I didn't find it in here or in the code right off the bat.
```python
with_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl.create_default_context()))

# One opener with HTTPS ssl disabled
without_ssl = build_opener(s3, gcs, HTTPSHandler(context=ssl._create_unverified_context()))
```
This is really nice, compared to how it was done in those blocks you removed!
One thing about these web/s3 PRs is that they don't change any hashes and never rebuild anything. Additionally, I don't think the unit tests do a very good job of testing the s3 bits.
Yeah, some more tests around here would help (it would also force fewer globals...). I honestly think Python has a bug in urllib (a HEAD request becomes a GET on redirect), for which I also submitted a PR to CPython, but I haven't gotten a response yet. It causes some issues with Bitbucket redirects to an AWS bucket, so in this PR I'm not touching it.
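For context, the behavior described above comes from `HTTPRedirectHandler.redirect_request`, which builds the redirected `Request` without carrying over the original method, so a HEAD is re-issued as a GET. A subclass like the following sketch would preserve HEAD across redirects; this is roughly what an earlier iteration of the PR did (and what tripped up Bitbucket-to-S3 redirects), not what the merged code does. The class name is made up for illustration.

```python
import urllib.request


class HeadPreservingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Re-issue redirected HEAD requests as HEAD instead of GET."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # The base class returns a new Request without a method,
        # which defaults to GET when the request carries no body.
        new = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new is not None and req.get_method() == "HEAD":
            new = urllib.request.Request(
                new.full_url,
                headers=dict(new.header_items()),
                method="HEAD",
            )
        return new
```

Calling `redirect_request` directly (no network needed) shows the difference: the stock handler hands back a GET request, while this subclass keeps the HEAD method.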
… `url_exists` through HEAD requests (spack#34324)"" This reverts commit 8035eeb.
… `url_exists` through HEAD requests (#34324)"" (#34498) This reverts commit 8035eeb. And also removes logic around an additional HEAD request to prevent a more expensive GET request on wrong content-type. Since large files are typically an attachment and only downloaded when reading the stream, it's not an optimization that helps much, and in fact the logic was broken since the GET request was done unconditionally.
…rough HEAD requests (spack#34324) * `url_exists` improvements (take 2) Make `url_exists` do HEAD request for http/https/s3 protocols Rework the opener: construct it once and only once, dynamically dispatch to the right one based on config.
…ists` through HEAD requests (spack#34324)" This reverts commit db8f115.
The idea of `urllib` is to add handlers for custom protocols through `<protocol>_open(req)`, so let's do that for `s3://` and `gs://`.

- Use `head_object` for `Request("s3://...", method="HEAD")` requests
- Use `HEAD` requests in `url_exists` to avoid expensive downloads from s3.
- Create a "custom" `urlopen` that instantiates an opener once and only once.
- Keep using the standard `HTTPRedirectHandler` for `HEAD` requests; in a previous iteration of this PR, I redirected HEAD requests as HEAD requests, which led to issues when downloading from BitBucket.

Before:
After:
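The `<protocol>_open(req)` mechanism described above can be demonstrated end to end with a toy scheme. This is a hedged sketch of the hook urllib provides, not Spack's handler: the `demo://` scheme and the in-memory body are made up, and a real `s3://` handler would map a HEAD request to `head_object` and a GET to `get_object` instead.

```python
import io
import urllib.request
from email.message import Message
from urllib.response import addinfourl


class DemoHandler(urllib.request.BaseHandler):
    """Toy handler: urllib routes demo:// requests to ``demo_open``."""

    def demo_open(self, req):
        # A HEAD request gets an empty body, mimicking how an s3 handler
        # would answer from object metadata without downloading content.
        if req.get_method() == "HEAD":
            body = io.BytesIO(b"")
        else:
            body = io.BytesIO(b"contents of " + req.full_url.encode())
        return addinfourl(body, Message(), req.full_url, 200)


# build_opener scans the handler for *_open methods and registers
# ``demo_open`` for the "demo" protocol automatically.
opener = urllib.request.build_opener(DemoHandler())
response = opener.open("demo://bucket/key")
```

Because registration is purely name-based, adding `s3_open` and `gs_open` methods on handler objects is enough for `opener.open("s3://...")` and `opener.open("gs://...")` to work.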