Skip to content

Conversation

@mhvk
Copy link
Contributor

@mhvk mhvk commented Nov 29, 2013

I made another attempt at getting Column's to hold Quantity and other ndarray subclasses, one which I think is mostly complementary to @taldcroft's #1833, as I on purpose did not (yet) include Time here. It is also simpler than my other attempt (#1835), a bit more in the spirit of #1583, in that the essence is to create a new Column subclass whose other subclass is Quantity (or whatever; note that I did not adjust MaskedColumn yet, since we do not really have other MaskedArray subclasses yet). The new piece is that new subclasses are defined automatically (with Quantity input -> QuantityColumn class, etc.)

With this PR:

from astropy.table import Table; from astropy.coordinates import Angle, Longitude, Latitude; import numpy as np; import astropy.units as u
q = np.arange(3.)*u.m
l = Longitude(np.arange(10.,13.), 'deg')
b = Table([q,l], names=['q','l'])

In [7]: print(b)
  q       l    
----- ---------
0.0 m 10d00m00s
1.0 m 11d00m00s
2.0 m 12d00m00s

In [8]: b['q'], b['l']
Out[8]: 
(QuantityColumn name='q' unit='m' format=None description=None>
<Quantity [ 0., 1., 2.] m>
<LongitudeColumn name='l' unit='deg' format=None description=None>
<Longitude [ 10., 11., 12.] deg>)

In [9]: b['q'] += 1.*u.cm

In [10]: print(b)
  q        l    
------ ---------
0.01 m 10d00m00s
1.01 m 11d00m00s
2.01 m 12d00m00s

@taldcroft
Copy link
Member

@mhvk - This is looking promising! I think we can effectively separate the problems of supporting ndarray subclasses like Quantity from supporting arbitrary array-like container classes like Time or other generalized column concepts like FunctionColumn. So perhaps you should rename this PR something like "Support Quantity as Table column", and then stay within that agenda. I'd like to continue #1833 for generalized non-ndarray columns, taking inspiration as needed from #1835.

You might also consider closing #1583 as being superceded by this PR.

@mhvk
Copy link
Contributor Author

mhvk commented Nov 29, 2013

@taldcroft - yes, the separation works out quite well; I'm still somewhat amazed that by just using __class__ and __dict__ from an ndarray subclass, one can completely recover Quantity (subclass) objects.

I renamed the PR in the spirit of your suggestion and closed #1583.

@mhvk
Copy link
Contributor Author

mhvk commented Nov 29, 2013

p.s. I would have loved to have a more general Column also deal with MaskedArray subclasses, but unfortunately it is not quite as easy as the masked arrays are effectively two arrays, which are not directly tied. In that sense, they still more similar to Time to me, in that a near-trivial conversion between some two-element dtype (with second bool) could be used for MaskedArray<->Column conversion. Anyway, since this subclass does not work, that is for another PR.

Relatedly, currently, Column still uses np.asarray for any data it does not recognize, so masks get stripped, which is fine, but this also means only Quantity and its subclasses can be held for now. It would be nice to generalize this, but that needs some thought about how: the most important requirement would seem to be that slicing does not nothing more than slice the underlying ndarray (i.e., does not also slice metadata like the mask). Anyway, probably best also for another PR!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you take these out of BaseColumn? The behavior that boolean comparisons return a plain ndarray (or MaskedArray) was discussed and implemented in #1446 and #1685.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it because the comparison should not be by the raw ndarray data for, e.g., Quantity columns. But I can instead try to rewrite these to use the content property if it exists..

Aside: this would be a very good example of why PR should not be considered complete without a test case that ensures the issue does not recur! More mundanely, mentioning those issue numbers in the comments would have been helpful... Would you have time to do a quick PR to add tests, and I'll rebase and make sure I pass them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both those PRs had tests.

@taldcroft
Copy link
Member

Right now this doesn't really play with masking, which is perfectly fine. So what about raising an exception for any attempt to have a generalized ndarray column in a masked table?

@astrofrog
Copy link
Member

This will need rebasing now that #1842 has been merged (it makes Travis work again)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overriding __getitem__ here is pretty worrisome, especially with this implementation that requires going through the content property which does a dict update. I see about a factor of 20 performance hit for:

c = Column([1, 2])
timeit c[1]

Also, without having carefully thought through the code here, I'm suspicious that this will end up changing the API behavior somewhere since it's not just falling through to the base __getitem__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, factor 20 is definitely too much; I'll see if I can rewrite it so it spends most time only when really needed.

(I'm afraid some of this are the effects of putting out the PR late at night...)

@mhvk
Copy link
Contributor Author

mhvk commented Nov 29, 2013

@taldcroft - thanks for all the detailed comments. Most I can just implement; I'll think a bit more about how to ensure __getitem__ remains fast.

I'll also see if I can add some checks for MaskedColumn (and will see if I can move a little more common code to BaseColumn)

@taldcroft
Copy link
Member

I've looked at this more today and I think it's going to work. This is pretty impressive stuff. Some of this is somewhat intricate in terms of the logic and use of numpy subclassing techniques, so it wouldn't hurt to add comments to help future readers (both of us included) to understand the flow.

@taldcroft
Copy link
Member

One question we're going to have to answer is how to introduce this API change. Right now if you set a table column with a Quantity you get a Column with the right units. After this PR you will get a QuantityColumn. This is a non-trivial change for anyone that was actually using the former behavior. I'm not sure of the best answer now, just wanted to get this question out there.

@mhvk
Copy link
Contributor Author

mhvk commented Nov 29, 2013

OK, thanks! I'll add comments, documentation, and tests... (sure way to find some problems too :-)

Small aside on those comparison functions: while comparing to how we dealt with it in Quantity (through the quantity_helper functions), I noticed just overwriting the comparison functions does not give complete coverage. Currently, one has the same problem that a Column of bool is produced for some other numpy functions. E.g.,

In [5]: np.isfinite(Table([[1., 3., 5.]])['col0'])
Out[5]: 
<Column name='col0' unit=None format=None description=None>
array([ True,  True,  True], dtype=bool)

I'll try to catch this in __array_wrap__ instead; this may even be somewhat faster.

@taldcroft
Copy link
Member

The great thing about this approach is that we can expect success for introducing Quanity columns for most of the other tests, but still, tests are needed... I'm curious if there is any breakage for the high-level table operations like join and grouping.

These are looming in my mind for the non-ndarray columns since they need to go by a different path. I'm pretty sure (at least initially) that the non-ndarray cols won't work as key columns in any high level operations because I'm tied to ndarray methods in the implementation.

@mhvk
Copy link
Contributor Author

mhvk commented Nov 29, 2013

On the API change: you're right, I think this requires thought and at the very least a deprecation warning.

Similarly, right now, if you give a plain array and a unit, the result is not a Quantity, which I do not particularly like.

And finally, we have to decide what happens if you set the unit. This is not allowed for a Quantity, but is allowed for a Column. One option would be to allow setting it only if it is None.

I guess this already brings up the larger question touched upon earlier, whether one should have columns in which the unit is only some otherwise meaning piece of metadata (I feel no; that is what meta is for... but I'm greatly helped by being relatively new and unburdened with code which would assume otherwise!).

@mhvk
Copy link
Contributor Author

mhvk commented Nov 29, 2013

Agreed that the non-ndarray columns are interesting; ideally, they behave as close as possible to columns. But e.g. on grouping by Time, one may need to tell uses that it only works on, say, the string representation.

@eteq
Copy link
Member

eteq commented May 14, 2014

@mhvk - what is the status of this? I see it got pushed to 1.0 at some point... Is that definite, or is this actually nearly ready?

Or is there another PR that subsumes this or otherwise makes Table and Quantity work better together?

@mhvk
Copy link
Contributor Author

mhvk commented May 18, 2014

@eteq - this is being held up mostly by getting things to work with masked columns as well; I've been thinking to try to separate that out better, but have not spent much time recently. I think it is alright to have this for the next major release (with what seems to quite a few other improvements to Table!).

@mhvk mhvk force-pushed the table-generalized-columns-subclasses branch from 1b14e0f to ac89546 Compare September 17, 2014 20:19
@mhvk
Copy link
Contributor Author

mhvk commented Sep 17, 2014

@eteq - I rebased this. Would you like to try it and see if it works in your code? So far, it is rather light on testing (just your examples from #2000)...

@eteq
Copy link
Member

eteq commented Sep 24, 2014

@mhvk - this is fantastic! Exactly what I want to see! I think just a few standard tests that they behave like Quantitys should suffice at this point. (and your question on astropy-dev should probably get resolved? Maybe that can be in a later PR, though.)

BTW, is this still a stepping stone to something like #1833 ? I'm particularly interested in trying to stuff SkyCoords into tables...

Also, looks like it needs another rebase, probably due to #2963 ...?

@mhvk
Copy link
Contributor Author

mhvk commented Sep 24, 2014

@eteq - thanks! I'll try to add further tests soonish. I'd definitely still want something along the lines of #1833, #1835 or #2790, but I think this is independent of all those.

My real hesitation is that it will not work with MaskedColumn, which is likely to bite the unwary. It may be wise to explicitly raise an exception if one tries.

@taldcroft
Copy link
Member

@mhvk - I've been thinking about tables and revisiting the previous efforts on these lines. I'll be looking at stuff this week.

@embray embray removed this from the v1.0.0 milestone Jan 20, 2015
@mhvk mhvk deleted the table-generalized-columns-subclasses branch June 5, 2015 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants