joinpath() and exists() are linear-time

The time complexity of `exists()` and `joinpath()` is linear in the number of entries in the .zip file. This is pretty unintuitive and unexpected, as the equivalent operations in pathlib are constant time. I'm assuming this is also unintended.

This leads to pathologies. For example, the search() function in importlib_metadata calls joinpath() on every entry in the zipfile, making it quadratic in the number of entries: https://gitlab.com/python-devs/importlib_metadata/blob/master/importlib_metadata/__init__.py#L440 . 

On a zipfile with ~6K entries, we're seeing it take 2-3 seconds to do a single joinpath() operation (!).

The underlying reason that `joinpath()` and `exists()` are linear time is that they call `_names()`, which computes an expanded entries list (adding implied directories to it) on the fly, every time.
It does so in order to support zipfiles that are mutating. However in the importlib_metadata use case at least, the zipfiles are read-only and this expanded entries list could be cached.

Options to fix this include:

1. Don't support mutating zipfiles. Always cache the expanded entries list.
2. Detect the case of mutating zipfiles, cache the expanded entries list in all other cases.
3. Don't worry about implied dirs. Only consider entries that are explicitly listed in the zipfile index. This may break client code expectations, I don't know enough about those to be sure.
4. For `joinpath()`, just mechanically join the argument, don't worry about normalizing trailing slashes depending on what the path actually is. Note that the join method in pathlib doesn't do any existence checking, it is perfectly reasonable to generate paths that don't exist. It always strips trailing slashes and has no opinion on whether the resulting path represents a file or a directory (because again, it may not exist at all).  Whereas here `joinpath()` has the odd behavior that what it returns changes based on whether the underlying entry exists and is a file or a dir. 

Note that option 4 doesn't solve the problem for `exists()`.
 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

joinpath() and exists() are linear-time #32

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

joinpath() and exists() are linear-time #32

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions