The time complexity of exists() and joinpath() is linear in the number of entries in the .zip file. This is pretty unintuitive and unexpected, as the equivalent operations in pathlib are constant time. I'm assuming this is also unintended.
This leads to pathologies. For example, the search() function in importlib_metadata calls joinpath() on every entry in the zipfile, making it quadratic in the number of entries: https://gitlab.com/python-devs/importlib_metadata/blob/master/importlib_metadata/__init__.py#L440 .
On a zipfile with ~6K entries, we're seeing it take 2-3 seconds to do a single joinpath() operation (!).
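The pattern is easy to reproduce. A rough sketch (entry names and count are arbitrary, and recent CPython versions may cache the index scan, so the cost only shows up on affected versions):

```python
import io
import zipfile

# Build an in-memory zip with many entries; the real report used ~6K.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(1000):
        zf.writestr(f"pkg/mod_{i}.py", "x = 1")

zf = zipfile.ZipFile(buf)
root = zipfile.Path(zf)
# On affected versions, each joinpath() re-walks the whole entry list,
# so calling it once per entry is quadratic overall.
children = [root.joinpath(info.filename) for info in zf.infolist()]
```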
The underlying reason that joinpath() and exists() are linear time is that they call _names(), which computes an expanded entries list (adding implied directories to it) on the fly, every time.
It does so in order to support zipfiles that are mutating. However in the importlib_metadata use case at least, the zipfiles are read-only and this expanded entries list could be cached.
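For the read-only case, the expanded set could be built once and reused, turning exists() into a set lookup. A minimal sketch (the helper below is hypothetical, not the actual _names() implementation):

```python
import posixpath

def _implied_dirs(names):
    """Hypothetical helper: directory names implied by file entries,
    mirroring the expansion that _names() currently redoes on every call."""
    dirs = set()
    for name in names:
        parent = posixpath.dirname(name)
        while parent and parent + "/" not in dirs:
            dirs.add(parent + "/")
            parent = posixpath.dirname(parent)
    return dirs

# Compute once for a read-only zipfile, then exists() is O(1):
names = ["pkg/mod.py", "pkg/sub/data.txt"]
entries = set(names) | _implied_dirs(names)
```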
Options to fix this include:
- Don't support mutating zipfiles. Always cache the expanded entries list.
- Detect the case of mutating zipfiles, cache the expanded entries list in all other cases.
- Don't worry about implied dirs. Only consider entries that are explicitly listed in the zipfile index. This may break client code expectations, I don't know enough about those to be sure.
- For joinpath(), just mechanically join the argument; don't normalize trailing slashes based on what the path actually is. Note that the join method in pathlib does no existence checking: it is perfectly reasonable to generate paths that don't exist. It always strips trailing slashes and has no opinion on whether the resulting path represents a file or a directory (because, again, it may not exist at all). Here, by contrast, joinpath() has the odd behavior that what it returns changes depending on whether the underlying entry exists and is a file or a directory.
Note that option 4 doesn't solve the problem for exists().
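Option 4 could be as simple as a pure string join, in the spirit of pathlib (a sketch, with a hypothetical function name, not the actual API):

```python
import posixpath

def mechanical_joinpath(at, *parts):
    """Hypothetical option-4 join: pure string manipulation with no lookup
    against the zip index, so it runs in O(len(path)) like pathlib's '/'.
    It never appends a trailing slash, because it has no opinion on whether
    the result names a file, a directory, or nothing at all."""
    return posixpath.join(at, *parts)

# Joining never touches the entry list, and nonexistent paths are fine:
p = mechanical_joinpath("pkg", "sub", "missing.txt")
```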