Skip to content

joinpath() and exists() are linear-time #32

@benjyw

Description

@benjyw

The time complexity of exists() and joinpath() is linear in the number of entries in the .zip file. This is pretty unintuitive and unexpected, as the equivalent operations in pathlib are constant time. I'm assuming this is also unintended.

This leads to pathologies. For example, the search() function in importlib_metadata calls joinpath() on every entry in the zipfile, making it quadratic in the number of entries: https://gitlab.com/python-devs/importlib_metadata/blob/master/importlib_metadata/__init__.py#L440 .

On a zipfile with ~6K entries, we're seeing it take 2-3 seconds to do a single joinpath() operation (!).

The underlying reason that joinpath() and exists() are linear time is that they call _names(), which computes an expanded entries list (adding implied directories to it) on the fly, every time.
It does so in order to support zipfiles that are mutating. However in the importlib_metadata use case at least, the zipfiles are read-only and this expanded entries list could be cached.

Options to fix this include:

  1. Don't support mutating zipfiles. Always cache the expanded entries list.
  2. Detect the case of mutating zipfiles, cache the expanded entries list in all other cases.
  3. Don't worry about implied dirs. Only consider entries that are explicitly listed in the zipfile index. This may break client code expectations, I don't know enough about those to be sure.
  4. For joinpath(), just mechanically join the argument, don't worry about normalizing trailing slashes depending on what the path actually is. Note that the join method in pathlib doesn't do any existence checking, it is perfectly reasonable to generate paths that don't exist. It always strips trailing slashes and has no opinion on whether the resulting path represents a file or a directory (because again, it may not exist at all). Whereas here joinpath() has the odd behavior that what it returns changes based on whether the underlying entry exists and is a file or a dir.

Note that option 4 doesn't solve the problem for exists().

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions