Skip to content

Conversation

@embray
Copy link
Member

@embray embray commented Jul 9, 2015

As an alternative to #3929, create copies of BaseColumn and subclasses on an as-needed basis when using those columns to store multi-dimensional data. The new classes have the slower __getitem__ implementation, while the stock classes go straight through ndarray.__getitem__ and retain its performance (the shimmed classes are also cached, so this not have much performance impact on Column creation either).

On my laptop, before this patch (on current master)

In [1]: from astropy.table import Column

In [2]: c = Column([1, 2, 3])

In [3]: %timeit c[1]
1000000 loops, best of 3: 1.8 µs per loop

After this patch

In [1]: from astropy.table import Column

In [2]: c = Column([1, 2, 3])

In [3]: %timeit c[1]
1000000 loops, best of 3: 195 ns per loop

Note, multi-dimensional columns still have to use the slower __getitem__, but exhibit the correct behavior:

In [4]: c = Column([[1, 2], [3, 4], [5, 6]])

In [5]: c[1]
Out[5]: array([3, 4])

In [6]: %timeit c[1]
100000 loops, best of 3: 1.97 µs per loop

In [7]: c[1:]
Out[7]: 
<Column dtype='int64' shape=(2,) length=2>
3 .. 4
5 .. 6

I think this is probably an acceptable balance, given that the multi-dimensional case is far less common. All the tests passed locally.

@embray embray added table Affects-dev PRs and issues that do not impact an existing Astropy release Performance labels Jul 9, 2015
@mhvk
Copy link
Contributor

mhvk commented Jul 9, 2015

Looks good! Though while I like the generality of generating possibly more types of classes on demand, I wonder if for code clarity it wouldn't be simpler to just define the two cases we need explicitly. I guess one issue is that the names would clash, correct? ... ... ... Hmm, thinking more, your approach also has the advantage that any subclasses automatically do the right thing. So, would seem all fine.

@embray
Copy link
Member Author

embray commented Jul 9, 2015

Right, it had to be done this way to work for any BaseColumn subclasses that come along as well (for example, it works transparently for the Column class too).

One disadvantage is that it could affect subclasses that for whatever reason do not need this workaround, or implement a different __getitem__ entirely. It's a corner case but it could happen. I think I'll add one tweak to enable that...

@embray
Copy link
Member Author

embray commented Jul 9, 2015

Actually, nevermind, doesn't need any tweak. Since _get_nd_proxy_class is a classmethod, any subclass can still override it. They could even do:

@classmethod
def _get_nd_proxy_class(cls):
    return cls

to make it effectively a no-op, if they don't need this work-around.

…bclasses on an as-needed basis when using those columns to store multi-dimensional data. The new classes have the slower __getitem__ implementation, while the stock classes go straight through ndarray.__getitem__ and retain its performance.
@embray embray force-pushed the table/column-getitem branch from 3f3af0a to 4dea3de Compare July 9, 2015 14:34
@taldcroft
Copy link
Member

This looks cool, I learned something new! Just one question and one request.

First, is this MRO at all a problem (that it includes Column twice)?

In [7]: c.__class__.__mro__
Out[7]: 
(astropy.table.column.Column,
 astropy.table.column._NDColumnMixin,
 astropy.table.column.Column,
 astropy.table.column.BaseColumn,
 numpy.ndarray,
 object)

Second, can you just name this_NDColumn instead of _NDColumnMixin? Within Table we have co-opted the Mixin suffix to mean something specific and different (in particular that it doesn't behave entirely like a Column).

@embray
Copy link
Member Author

embray commented Jul 9, 2015

First, is this MRO at all a problem (that it includes Column twice)?

No, both of those "Columns" are actually different classes. The first is the new class that gets generated with the bases (_NDColumnMixin, Column), where the second "Column" is the original Column class. I ping-ponged on whether or not to make it take the same name, because it is obviously confusing, but for this to work entirely transparently it has to (otherwise you'll get funky results on any code that uses self.__class__.__name__ for example). It's just something to be aware of I guess.

Second, can you just name this_NDColumn instead of _NDColumnMixin?

It really is a mixin class though. It's composed with other, more concrete classes through multiple-inheritance to add/override one piece of functionality. Does it matter so long as it's an internal implementation detail most people will never see?

@mdboom
Copy link
Contributor

mdboom commented Jul 9, 2015

Seems like a nice solution.

Don't hold the PR up for this, but I wonder if it would benefit from more generality if _get_nd_proxy_class took the data as an argument and could act accordingly (moving the if self_data.ndim > 1: into _get_nd_proxy_class). That way subclasses (or future necessary workarounds like this) could act on more features of the data than just the fact that it's n-dimensional.

@taldcroft
Copy link
Member

It really is a mixin class though. It's composed with other, more concrete classes through multiple-inheritance to add/override one piece of functionality. Does it matter so long as it's an internal implementation detail most people will never see?

I agree that this is a technically accurate name in the broader context, but for anyone that starts looking at Table code it will be confusing because of the ubiquitous and conflicting usage of mixin.

@embray
Copy link
Member Author

embray commented Jul 9, 2015

@taldcroft Okay, this PR is confusing enough as it is I don't want to make it worse. What about _NDColumnShim? :) (I just don't want to give the impression that it's a column type itself).

@embray
Copy link
Member Author

embray commented Jul 9, 2015

Don't hold the PR up for this, but I wonder if it would benefit from more generality if _get_nd_proxy_class took the data as an argument

I thought of that, and then didn't do it. Not sure why though--now that I think about it again I like it.

embray added 2 commits July 9, 2015 13:30
@taldcroft
Copy link
Member

Looks good to me @embray, thanks!

@mdboom
Copy link
Contributor

mdboom commented Jul 9, 2015

👍 from me!

embray added a commit that referenced this pull request Jul 13, 2015
Speed up single-element Column.__getitem__ for 1-D columns.
@embray embray merged commit 331ef59 into astropy:master Jul 13, 2015
@embray embray deleted the table/column-getitem branch July 13, 2015 14:37
embray added a commit to embray/astropy that referenced this pull request Aug 13, 2015
… base classes. This eliminates most of the overhead associated with calling pure-Python __getitem__, and speeds up the body of the method as well.
embray added a commit to embray/astropy that referenced this pull request Aug 13, 2015
… base classes. This eliminates most of the overhead associated with calling pure-Python __getitem__, and speeds up the body of the method as well.
embray added a commit to embray/astropy that referenced this pull request Aug 27, 2015
… base classes. This eliminates most of the overhead associated with calling pure-Python __getitem__, and speeds up the body of the method as well.
embray added a commit to embray/astropy that referenced this pull request Sep 30, 2015
… base classes. This eliminates most of the overhead associated with calling pure-Python __getitem__, and speeds up the body of the method as well.
dhomeier pushed a commit to dhomeier/astropy that referenced this pull request Dec 17, 2015
… base classes. This eliminates most of the overhead associated with calling pure-Python __getitem__, and speeds up the body of the method as well.
dhomeier pushed a commit to dhomeier/astropy that referenced this pull request Jun 12, 2016
… base classes. This eliminates most of the overhead associated with calling pure-Python __getitem__, and speeds up the body of the method as well.
@dhomeier dhomeier mentioned this pull request Oct 30, 2023
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Affects-dev PRs and issues that do not impact an existing Astropy release Performance table

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants