Skip to content

Implement lazy submodule importing#5101

Merged
grlee77 merged 41 commits intoscikit-image:mainfrom
stefanv:lazy-imports
Oct 29, 2021
Merged

Implement lazy submodule importing#5101
grlee77 merged 41 commits intoscikit-image:mainfrom
stefanv:lazy-imports

Conversation

@stefanv
Copy link
Copy Markdown
Member

@stefanv stefanv commented Dec 4, 2020

Note: This is an experimental PR, so that we can discuss the idea and get feedback on its implementation.

It allows scikit-image to be used as follows:

import skimage as ski
ski.filters.gaussian(...)

The lazy loading strategy could also be implemented inside submodules,
to speed up imports of that submodule. E.g., we could do:

sla = lazy.require('scipy.linalg')

The linalg module is only loaded once it is used in the code, avoiding the
standard workaround of having to move the import inside of a function
call.

UPDATE: On a suggestion from @jni, I also implemented lazy importing of functions. Now, if you only ever use one function from skimage.filters, only that function will be imported (instead of all the other functions in the skimage.filters submodule). For this proof-of-concept, I've only done skimage.filters.

@stefanv stefanv marked this pull request as draft December 4, 2020 07:12
@scikit-image scikit-image deleted a comment from pep8speaks Dec 4, 2020
@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Dec 4, 2020

Our test runners are unhappy, but I swear it works locally ;) Clearly some details to sort out around what numpy.distutils expects.

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Dec 4, 2020

Tests passed for the previous commit; now added lazy functions for skimage.filters.

@scikit-image scikit-image deleted a comment from pep8speaks Dec 4, 2020
@scikit-image scikit-image deleted a comment from pep8speaks Dec 4, 2020
@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Dec 4, 2020

3.6 is expected to fail. 3.8 failures are unrelated. Still waiting on the others.

@hmaarrfk
Copy link
Copy Markdown
Member

hmaarrfk commented Dec 4, 2020

Does PyPy support __getattr__? I think that is why some people have been keeping 3.6 alive. is there a way to upgrade your lazy module to fallback on Python 3.6 behavior?

@jarrodmillman
Copy link
Copy Markdown
Contributor

jarrodmillman commented Dec 4, 2020

I haven't tested PyPy's 3.7 implementation yet, since I've only been using it as part of our CI testing. Fortunately, GH actions will soon support testing with PyPy3.7:

Are you still planning to support Python 3.6 for the next release? Or is the concern mostly related to CI testing? Will this still be an issue once GH actions sorts things out or do you need another CI system to support PyPy3.7?

Out of curiosity, do you have many users using PyPy? We try to support it for NetworkX, but we've never had a user ask us anything about it. And none of our developers use it. Do you have a way to tell if scikit-image is being run on PyPy in the wild? If so, I would love to know so I can track it for NetworkX.

(I spend a lot of time trying to sort out CI stuff for PyPy and we've spent some time jumping through hoops trying to ensure we mostly work with PyPy. It would be nice to know how many people (if any) are benefiting from the extra work.)

@jarrodmillman
Copy link
Copy Markdown
Contributor

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Dec 4, 2020

This only works with 3.7 up. Python 3.6 will no longer be supported in the next release. I don't know whether this works on PyPy---I haven't tested. But, I also don't know whether scikit-image works on PyPy.

Copy link
Copy Markdown
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really promising. The "demo" case in skimage/filters/__init__.py is great for showing how it's a radical departure from the way imports are currently specified (and arguably looks even better).

One thing that came to mind when thinking about how other libraries could adopt this is how modules that already define a __getattr__ could be accommodated. Huge 👍 on the idea and thanks so much for working this up, I'm looking forward to learning more!

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Dec 6, 2020

@rossbar, good question. The easiest way would be to implement __getattr__ as follows:

lazy_getattr, lazy_dir = install_lazy(...)

def __getattr__(name):
    if name in some_collection_I_want_to_handle:
        ...
    else:
        lazy_getattr(name)

def __dir__():
    my_list = [...]
    return lazy_dir() + my_list

Copy link
Copy Markdown
Contributor

@Erotemic Erotemic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are my 2cents.

@Erotemic
Copy link
Copy Markdown
Contributor

Erotemic commented Dec 22, 2020

@jarrodmillman just pointed me here, and I'm liking the idea of this PR a lot. I left some comments about the implementation of lazy_init.

I'm going to probably spend a day or so working on adding my own variant of this functionality to my mkinit package. I'm hoping to add a formater that generates a __getattr__ function or something like the way the lazy_init function is used here for a module. Assuming that works out, that should make it very easy to port to a new lazy init style. Ideally you will just write mkinit <path-to-init.py> --lazy and it will write the appropriate code to stdout. I'll comment again after I take a look at that.

EDIT: Note: I could configure mkinit to generate an alternate code path that used explicit imports for testing / Python <3.7 compatibility. That might also be interesting to explore while 3.6 is still widely used.

@Erotemic
Copy link
Copy Markdown
Contributor

Update on mkinit

I've started playing around with this in mkinit. I have a working version. I've update the README that walks through a simple example that illustrates what mkinit does: https://github.com/Erotemic/mkinit/tree/dev/0.3.0#the-pitch This ends with a discussion of lazy importing. Note that I haven't finalized this feature yet and its progress is being tracked in Erotemic/mkinit#10

With these new changes I ran: mkinit skimage/filters/__init__.py --relative --lazy which generated this file:
https://gist.github.com/Erotemic/2b2b38bc87386d97ef91ede3b349f5d6

The tests seem to work, but it exposes many more attributes than the current init.py file. This can be controlled. By specifying the __submodules__ : List[str] attribute in the __init__.py, only modules specified in that variable will have themselves and their attributes exposed. Also, if the __all__ variable in a submodule is specified, then only those attributes are exposed via mkinit.

I find it slightly weird that the only module currently exposed is rank and all other exposed names are attributes of submodules. There are mkinit CLI flags to only expose modules or only expose attributes, but mixing requires a bit of manual specification in the header of the __init__.py file (mkinit only clobbers the end of the file after all comments / special directives are read).

General note on __getattr__ efficiency

On a more general note, inside the __getattr__ function I'm specifying globals()[name] = attr, which will replace the more expensive function call with a dictionary lookup on all subsequent times lazy-attributes are accessed. Note, I do believe this only works in the case where __getattr__ is defined in the module itself. Of course we could pass globals() to lazy_install or get fancy with inspect and stack frames if lazy_install is defined outside of this module.

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Dec 24, 2020

@Erotemic Thanks for your feedback! I like the optimizations you propose and will integrate them as soon as I get back after the break. Generalizing this for use in other packages is in scope too.

@Erotemic
Copy link
Copy Markdown
Contributor

Erotemic commented Dec 27, 2020

I've released the new version of mkinit 0.3.0 on pypi that contains the option to generate lazy imports.

For now, you have to use the in-file boilerplate (i.e. every __init__.py defines its own lazy_import function). However, I have infrastructure setup so you will be able to pass in an option --lazy_boilerplate, which is a string that can contain custom code that overwrites how lazy_import is defined. (Thus if you had an internal definition you could just specify an import statement to get rid of all of the extra boilerplate).

Again, the current version might expose more names than the existing API, but that can be fixed by specifying the __all__ attribute in each submodule to define what the user-facing API should be. (IMO that's better practice).

If you only want certain submodules exposed you can specify special global attributes:

    `__submodules__` - indicates the list of submodules to be
        introspected, if unspecified all submodules are introspected.

    `__protected__` -  Protected modules are exposed, but their attributes are not.

    `__private__` - Private modules and their attributes are not exposed.

Each of these global attributes should be a list of strings: (e.g. ['{modname1'}, ...] )

Effectively, __submodules__ is a passlist, __private__ is a blocklist, and __protected__ will add a submodule to the API, but not its attributes. (Useful for forcing things like numpy.linalg).

Note, if you use mkinit with -w specified (i.e. it will overwrite the file), it will only overwrite contents AFTER all lines involved in defining these variables and ALL comments have been parsed. So, you can perform any custom logic you want, just place a comment after it and the autogenerated code will always be written after it. (you can also specify --diff) to look at what mkinit would do.

With the current dev/0.3.1 branch I semi-autogenerated this:

__private__ = ['setup']
__protected__ = ['rank']

__submodules__ = [
    'rank',
    'lpi_filter',
    '_gaussian',
    'edges',
    '_rank_order',
    '_gabor',
    'thresholding',
    'ridges',
    '_median',
    '_sparse',
    '_unsharp_mask',
    '_window',
]
# End of custom init code
from ..util.lazy import lazy_import


__getattr__ = lazy_import(
    __name__,
    submodules={
        'rank',
    },
    submod_attrs={
        '_gabor': [
            'gabor',
            'gabor_kernel',
        ],
        '_gaussian': [
            'difference_of_gaussians',
            'gaussian',
        ],
        '_median': [
            'median',
        ],
        '_rank_order': [
            'rank_order',
        ],
        '_sparse': [
            'correlate_sparse',
        ],
        '_unsharp_mask': [
            'unsharp_mask',
        ],
        '_window': [
            'window',
        ],
        'edges': [
            'HFARID_WEIGHTS',
            'HPREWITT_WEIGHTS',
            'HSCHARR_WEIGHTS',
            'HSOBEL_WEIGHTS',
            'PREWITT_EDGE',
            'PREWITT_SMOOTH',
            'ROBERTS_ND_WEIGHTS',
            'ROBERTS_PD_WEIGHTS',
            'SCHARR_EDGE',
            'SCHARR_SMOOTH',
            'SOBEL_EDGE',
            'SOBEL_SMOOTH',
            'VFARID_WEIGHTS',
            'VPREWITT_WEIGHTS',
            'VSCHARR_WEIGHTS',
            'VSOBEL_WEIGHTS',
            'd1',
            'farid',
            'farid_h',
            'farid_v',
            'laplace',
            'p',
            'prewitt',
            'prewitt_h',
            'prewitt_v',
            'roberts',
            'roberts_neg_diag',
            'roberts_pos_diag',
            'scharr',
            'scharr_h',
            'scharr_v',
            'sobel',
            'sobel_h',
            'sobel_v',
        ],
        'lpi_filter': [
            'LPIFilter2D',
            'constrained_least_squares',
            'eps',
            'forward',
            'inverse',
            'wiener',
        ],
        'ridges': [
            'compute_hessian_eigenvalues',
            'frangi',
            'hessian',
            'meijering',
            'sato',
        ],
        'thresholding': [
            'apply_hysteresis_threshold',
            'threshold_isodata',
            'threshold_li',
            'threshold_local',
            'threshold_mean',
            'threshold_minimum',
            'threshold_multiotsu',
            'threshold_niblack',
            'threshold_otsu',
            'threshold_sauvola',
            'threshold_triangle',
            'threshold_yen',
            'try_all_threshold',
        ],
    },
)


def __dir__():
    return __all__

__all__ = ['HFARID_WEIGHTS', 'HPREWITT_WEIGHTS', 'HSCHARR_WEIGHTS',
           'HSOBEL_WEIGHTS', 'LPIFilter2D', 'PREWITT_EDGE', 'PREWITT_SMOOTH',
           'ROBERTS_ND_WEIGHTS', 'ROBERTS_PD_WEIGHTS', 'SCHARR_EDGE',
           'SCHARR_SMOOTH', 'SOBEL_EDGE', 'SOBEL_SMOOTH', 'VFARID_WEIGHTS',
           'VPREWITT_WEIGHTS', 'VSCHARR_WEIGHTS', 'VSOBEL_WEIGHTS',
           'apply_hysteresis_threshold', 'compute_hessian_eigenvalues',
           'constrained_least_squares', 'correlate_sparse', 'd1',
           'difference_of_gaussians', 'eps', 'farid', 'farid_h', 'farid_v',
           'forward', 'frangi', 'gabor', 'gabor_kernel', 'gaussian', 'hessian',
           'inverse', 'laplace', 'median', 'meijering', 'p', 'prewitt',
           'prewitt_h', 'prewitt_v', 'rank', 'rank_order', 'roberts',
           'roberts_neg_diag', 'roberts_pos_diag', 'sato', 'scharr',
           'scharr_h', 'scharr_v', 'sobel', 'sobel_h', 'sobel_v',
           'threshold_isodata', 'threshold_li', 'threshold_local',
           'threshold_mean', 'threshold_minimum', 'threshold_multiotsu',
           'threshold_niblack', 'threshold_otsu', 'threshold_sauvola',
           'threshold_triangle', 'threshold_yen', 'try_all_threshold',
           'unsharp_mask', 'wiener', 'window']

This is the exact lazy analog of this:

__private__ = ['setup']
__protected__ = ['rank']

__submodules__ = [
    'rank',
    'lpi_filter',
    '_gaussian',
    'edges',
    '_rank_order',
    '_gabor',
    'thresholding',
    'ridges',
    '_median',
    '_sparse',
    '_unsharp_mask',
    '_window',
]
# End of custom init code
import rank

from .lpi_filter import (LPIFilter2D, constrained_least_squares, eps, forward,
                         inverse, wiener,)
from ._gaussian import (difference_of_gaussians, gaussian,)
from .edges import (HFARID_WEIGHTS, HPREWITT_WEIGHTS, HSCHARR_WEIGHTS,
                    HSOBEL_WEIGHTS, PREWITT_EDGE, PREWITT_SMOOTH,
                    ROBERTS_ND_WEIGHTS, ROBERTS_PD_WEIGHTS, SCHARR_EDGE,
                    SCHARR_SMOOTH, SOBEL_EDGE, SOBEL_SMOOTH, VFARID_WEIGHTS,
                    VPREWITT_WEIGHTS, VSCHARR_WEIGHTS, VSOBEL_WEIGHTS, d1,
                    farid, farid_h, farid_v, laplace, p, prewitt, prewitt_h,
                    prewitt_v, roberts, roberts_neg_diag, roberts_pos_diag,
                    scharr, scharr_h, scharr_v, sobel, sobel_h, sobel_v,)
from ._rank_order import (rank_order,)
from ._gabor import (gabor, gabor_kernel,)
from .thresholding import (apply_hysteresis_threshold, threshold_isodata,
                           threshold_li, threshold_local, threshold_mean,
                           threshold_minimum, threshold_multiotsu,
                           threshold_niblack, threshold_otsu,
                           threshold_sauvola, threshold_triangle,
                           threshold_yen, try_all_threshold,)
from .ridges import (compute_hessian_eigenvalues, frangi, hessian, meijering,
                     sato,)
from ._median import (median,)
from ._sparse import (correlate_sparse,)
from ._unsharp_mask import (unsharp_mask,)
from ._window import (window,)

__all__ = ['HFARID_WEIGHTS', 'HPREWITT_WEIGHTS', 'HSCHARR_WEIGHTS',
           'HSOBEL_WEIGHTS', 'LPIFilter2D', 'PREWITT_EDGE', 'PREWITT_SMOOTH',
           'ROBERTS_ND_WEIGHTS', 'ROBERTS_PD_WEIGHTS', 'SCHARR_EDGE',
           'SCHARR_SMOOTH', 'SOBEL_EDGE', 'SOBEL_SMOOTH', 'VFARID_WEIGHTS',
           'VPREWITT_WEIGHTS', 'VSCHARR_WEIGHTS', 'VSOBEL_WEIGHTS',
           'apply_hysteresis_threshold', 'compute_hessian_eigenvalues',
           'constrained_least_squares', 'correlate_sparse', 'd1',
           'difference_of_gaussians', 'eps', 'farid', 'farid_h', 'farid_v',
           'forward', 'frangi', 'gabor', 'gabor_kernel', 'gaussian', 'hessian',
           'inverse', 'laplace', 'median', 'meijering', 'p', 'prewitt',
           'prewitt_h', 'prewitt_v', 'rank', 'rank_order', 'roberts',
           'roberts_neg_diag', 'roberts_pos_diag', 'sato', 'scharr',
           'scharr_h', 'scharr_v', 'sobel', 'sobel_h', 'sobel_v',
           'threshold_isodata', 'threshold_li', 'threshold_local',
           'threshold_mean', 'threshold_minimum', 'threshold_multiotsu',
           'threshold_niblack', 'threshold_otsu', 'threshold_sauvola',
           'threshold_triangle', 'threshold_yen', 'try_all_threshold',
           'unsharp_mask', 'wiener', 'window']

with this: mkinit skimage/filters/__init__.py --relative --lazy --nomods -w --lazy_boilerplate="from ..util.lazy import lazy_import". I believe this is a slightly tighter superset better overlap of current functionality than what I had previously. I could likely find ways to filter out the global attrs, or I could just set __all__ in each submodule.

Given this demo, is there any interest in using mkinit with scikit-learn (or networkx)? If so I can make modifications to improve the generated code based on suggestions, but if not, then I'm glad my prototyping was helpful.

@pep8speaks
Copy link
Copy Markdown

pep8speaks commented Feb 12, 2021

Hello @stefanv! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 97:1: E402 module level import not at top of file

Line 18:80: E501 line too long (82 > 79 characters)

Comment last updated at 2021-10-26 07:02:43 UTC

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Feb 12, 2021

@Erotemic Thank you again for your feedback; I've incorporated it, and simplified the importing mechanism to avoid one level of indirection. I'd appreciate it if you could take another look.


def __getattr__(name):
if name in submodules:
return importlib.import_module(f'{module_name}.{name}')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a previous version of this there was a "require" logic. It did stuff like importlib.util.find_spec and making importlib.util.LazyLoader(spec.loader). I'm not sure if that is necessary or not. If its not necessary, great, I like getting rid of superfluous code, but I'm curious why it was there to begin with and what made you remove it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was primarily because of the evolution of the idea—I started with lazy module imports, and then added the other layers on top.

With __dir__ in place, you can inspect the module, and only on __getattr__ should the module be loaded. The loaded module has the same setup, so it won't be expensive to import either.

You have a keen eye, I must say!

@Erotemic
Copy link
Copy Markdown
Contributor

@stefanv I took a look at this and added a few comments.

Something else I saw is that you aren't adding the returned attribute to the globals dictionary. I would modify install_lazy to accept globals() as a parameter, and then when you return a lazy attribute, add it to the globals() dictionary so you don't incur function overhead when you subsequently access the variable. E.g:

def install_lazy(module_name, submodules=None, submod_attrs=None, globals_=None):
    ...
    def __getattr__(name):
        ...
           attr = importlib.import_module(f'{module_name}.{name}')
           ...
           attr = getattr(submod, name
       ... 
       globals_[name] = attr
       return attr
    ...

replace the returns by assigning the variable to attr.

You could even use the inspect module to grab the globals from the parent frame if they aren't passed in:

def get_stack_frame(n=0, strict=True):
    """
    Gets the current stack frame or any of its ancestors dynamically

    Args:
        n (int): n=0 means the frame you called this function in.
                 n=1 is the parent frame.
        strict (bool): (default = True)

    Returns:
        frame: frame_cur

    Example:
        >>> frame_cur = get_stack_frame(n=0)
        >>> print('frame_cur = %r' % (frame_cur,))
        >>> assert frame_cur.f_globals['frame_cur'] is frame_cur
    """
    frame_cur = inspect.currentframe()
    # Use n+1 to always skip the frame of this function
    for ix in range(n + 1):
        frame_next = frame_cur.f_back
        if frame_next is None:  # nocover
            if strict:
                raise AssertionError('Frame level %r is root' % ix)
            else:
                break
        frame_cur = frame_next
    return frame_cur

...

if globals_ is None:
    globals_ = get_stack_frame(n=1).f_globals

(Side note I'm going to release 0.3.1 of mkinit later today Erotemic/mkinit#11 is there any sort of integration with that that we should consider here?)

Copy link
Copy Markdown
Member

@jni jni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ready as far as you're concerned @stefanv? It's 👍 from me! (though we should talk through @hmaarrfk's concerns before merging)

@hmaarrfk
Copy link
Copy Markdown
Member

hmaarrfk commented Jul 6, 2021

i think there is a simpler way to get the one benefit you listed. but i need to think about it more.

this shouldn't stop others from merging but i would be a little more concerned in propagating this pattern too far.

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Jul 6, 2021

It might be worth reading through the SPEC for motivation. I am definitely not trying to force this onto scikit-image, but I think the benefits are:

a) Easy interactive exploration of the entire API without special imports
b) Developers don't have to jump through special hoops to ensure that modules are written to have small import times (see also the provided lazy external module importer, which I haven't yet used in this PR)

This all happens transparently to the user (if not entirely to developers who add new submodules).

@hmaarrfk The external module importer may well address the pooch import times.

@jni I'm pretty happy with this for now. @dschult made some great suggestions which I incorporated. There were some documentation language fixes that I wasn't sure about; maybe you can help me cast an eye over those?

@hmaarrfk
Copy link
Copy Markdown
Member

hmaarrfk commented Jul 6, 2021

Thanks for the reading material @stefanv very informative!

I can't say i love how the file skimage/filters/__init__.py reads, but i've tried to use lazy importing before and the other organization I came up with before read even more poorly.

👍🏾

@imagesc-bot
Copy link
Copy Markdown

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/why-is-data-dir-a-default-import/57081/3

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Oct 22, 2021

@grlee77 @hmaarrfk @jni I've fixed this up again, so should be ready to go.

@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Oct 26, 2021

@grlee77 I've updated the nomenclature. This should also satisfy @dschult's concern.

Also merged in the latest master.

@grlee77
Copy link
Copy Markdown
Contributor

grlee77 commented Oct 27, 2021

Great, this seems ready to merge. @stefanv, do you want to do the Squash+Merge and trim some of the intermediate commit messages at that time? (I could do it, but thought you may have a clearer view of what you prefer to trim/summarize)

@grlee77 grlee77 merged commit efbb21b into scikit-image:main Oct 29, 2021
@stefanv
Copy link
Copy Markdown
Member Author

stefanv commented Oct 29, 2021

Thanks, Greg, sorry that I didn't get to the editing of the commit message quickly enough!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.