persistent cache for link parsing and interpreter compatibility #12258

Draft

cosmicexplorer wants to merge 9 commits into pypa:main from cosmicexplorer:interpreter-compatibility-cache
Conversation
When performing `install --dry-run` and PEP 658 `.metadata` files are available to guide the resolve, do not download the associated wheels. Rather, use the distribution information directly from the `.metadata` files when reporting the results on the CLI and in the `--report` file.

- describe the new `--dry-run` behavior
- finalize linked requirements immediately after resolve
- introduce `is_concrete`
- funnel `InstalledDistribution` through `_get_prepared_distribution()` too
- add test for new `install --dry-run` functionality (no downloading)
- catch an exception when parsing metadata which only occurs in CI
- handle `--no-cache-dir`
- call `os.makedirs()` before writing to cache too
- catch `InvalidSchema` when attempting git urls with `BatchDownloader`
- fix other test failures
- reuse `should_cache(req)` logic
- gzip compress link metadata for a slight reduction in disk space
- only cache built sdists
- don't check `should_cache()` when fetching
- cache lazy wheel dists
- add news
- turn debug logs in fetching from cache into exceptions
- use `scandir` over `listdir` when searching the normal wheel cache
- handle metadata email parsing errors
- correctly handle mutable cached requirement
- use bz2 over gzip for an extremely slight improvement in disk usage
- handle new `google_paste` encoding breakage
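One bullet above swaps gzip for bz2 when compressing the cached link metadata. The trade-off can be seen with a quick self-contained sketch; the sample data below is made up for illustration and is not pip's actual cache schema:

```python
import bz2
import gzip
import json

# Hypothetical serialized link metadata; illustrative only, not pip's schema.
links = [
    {
        "url": f"https://files.example.org/pkg-{i}.0-py3-none-any.whl",
        "requires_python": ">=3.7",
    }
    for i in range(200)
]
raw = json.dumps(links).encode("utf-8")

gz = gzip.compress(raw)
bz = bz2.compress(raw)
print(f"raw={len(raw)} gzip={len(gz)} bz2={len(bz)}")

# Both compressors round-trip losslessly; the choice only affects disk usage.
assert gzip.decompress(gz) == raw
assert bz2.decompress(bz) == raw
```

On repetitive JSON like an index page's link list, bz2 often yields a slightly smaller file, consistent with the "extremely slight improvement" the commit message describes; either way the cached data round-trips unchanged.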
- pipe in headers arg
- provide full context in `Link.comes_from`
- pull in etag and date and cache the outputs
- handle `--no-cache-dir`
- add NEWS
- remove quotes from etag and use binary checksum to save a few bytes
- parse http modified date to compress the cached representation
- fix cache-control clobbering
- also compress the link parsing
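The etag/date bullets above shrink the stored HTTP validators before caching them. A hedged sketch of that idea, using a hypothetical helper name (pip's actual representation may differ): strip the ETag's quotes and, when it is a hex checksum, store its raw bytes; parse the RFC 2822 modified-date text into a compact numeric timestamp.

```python
import binascii
from email.utils import parsedate_to_datetime


def compact_validators(etag, date):
    """Illustrative sketch: shrink HTTP cache validators before storing them.

    Assumes (as the commits describe) that the quoted ETag is really a hex
    checksum, so stripping the quotes and decoding to raw bytes halves it,
    and that an RFC 2822 date header packs down to a numeric timestamp.
    """
    packed_etag = None
    if etag:
        stripped = etag.strip('"')
        try:
            packed_etag = binascii.unhexlify(stripped)  # hex text -> raw bytes
        except binascii.Error:
            packed_etag = stripped.encode("utf-8")      # fall back to the text
    timestamp = None
    if date:
        timestamp = parsedate_to_datetime(date).timestamp()
    return packed_etag, timestamp


etag, ts = compact_validators('"65f0d1c2"', "Wed, 21 Oct 2015 07:28:00 GMT")
print(etag, ts)
```

Here the 10-character quoted ETag becomes 4 raw bytes, and the 29-character date string becomes a float that serializes in far less space.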
Closes #12184.
This change is on top of #12257; see the +264/-116 diff against it at https://github.com/cosmicexplorer/pip/compare/link-parsing-cache...cosmicexplorer:pip:interpreter-compatibility-cache?expand=1.

## Background: Caching Link Parsing In-Memory

In #7729, we cached the result of parsing an index page for `Link`s in memory, to avoid redoing it more than once per run. Given that this result changes as packages upload new versions to PyPI, it didn't seem feasible to cache it persistently the way we can with #12256.

## Proposal: Cache Link Parsing, then Interpreter Compatibility

See #12184 for an investigation into caching strategies.

If (as per #12257) we recognize that we have a cached HTTP response (which means no new versions of a package have been uploaded since we last checked), we can then implement two more levels of caching:

1. Parse the response into `Link`s, and write that to a JSON file in the cache directory.
2. After filtering those `Link`s to find the ones that match the current interpreter/platform, we can write that result to a cache file as well (we have to ensure these are all invalidated when the HTTP response changes).

## Result: Some Performance Improvement
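The two cache levels just described can be sketched as follows. This is a minimal illustration under assumed names and a made-up on-disk layout, not pip's real code; the key point is that both levels share the response checksum in their keys, so a changed HTTP response invalidates them together.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Invented cache location for the demo; pip would use its own cache directory.
CACHE_DIR = Path(tempfile.mkdtemp())


def _entry(index_url, response_checksum, *extra):
    # Hashing the response checksum into every key means a changed HTTP
    # response misses both cache levels at once.
    digest = hashlib.sha256(
        "\x00".join((index_url, response_checksum) + extra).encode()
    ).hexdigest()
    return CACHE_DIR / f"{digest}.json"


def store(index_url, response_checksum, links, interpreter_tag=None):
    # Level 1 (no tag): all parsed links for the page, as JSON.
    # Level 2 (with tag): the subset compatible with one interpreter/platform.
    extra = (interpreter_tag,) if interpreter_tag else ()
    _entry(index_url, response_checksum, *extra).write_text(json.dumps(links))


def load(index_url, response_checksum, interpreter_tag=None):
    extra = (interpreter_tag,) if interpreter_tag else ()
    path = _entry(index_url, response_checksum, *extra)
    return json.loads(path.read_text()) if path.exists() else None


url = "https://pypi.org/simple/example/"
store(url, "abc123", ["example-1.0.tar.gz", "example-1.0-py3-none-any.whl"])
store(url, "abc123", ["example-1.0-py3-none-any.whl"], interpreter_tag="cp311")

hit = load(url, "abc123", interpreter_tag="cp311")
miss = load(url, "def456")  # new response checksum -> both levels miss
```

A fresh response checksum (`"def456"` above) misses every entry written under the old one, which is exactly the invalidation property the proposal requires.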
This all turns out to be possible, but with some caveats.

## Discussion
I've made this a draft because I'd like to gather opinions on whether this would be worth merging.