Proof of concept for Table mixins - Time and FunctionColumn #1833

taldcroft · 2013-11-27T02:55:44Z

This is an initial demonstration of the idea of Table mixin columns. These are table columns that are not subclassed from Column or ndarray but which implement a protocol that allows them to be part of a Table. Essentially they have a sufficiently close API and can do the operations required by the Table API.

The implementation for Time is the primary demonstration.

>>> from astropy.table import Table, Column
>>> from astropy.time import Time
>>> a = Column([3, 4, 5], name='a')
>>> time_obj = Time([1, 2, 3], format='unix').replicate(format='iso')
>>> t1 = Table([a])
>>> t1['b'] = time_obj
>>> print t1
 a             b           
--- -----------------------
  3 1970-01-01 00:00:01.000
  4 1970-01-01 00:00:02.000
  5 1970-01-01 00:00:03.000

>>> t1['b']  # This is really a Time object
<Time object: scale='utc' format='iso' value=['1970-01-01 00:00:01.000' '1970-01-01 00:00:02.000'
 '1970-01-01 00:00:03.000']>

>>> t1._data  # Internal JD1 and JD2 stored in the array but not visible in table
array([(3, 2440588.0, -0.4999884259259259),
       (4, 2440588.0, -0.49997685185185187),
       (5, 2440588.0, -0.4999652777777778)], 
      dtype=[('a', '<i8'), ('b__jd1', '<f8'), ('b__jd2', '<f8')])

There is also a FunctionColumn class that demonstrates defining a virtual column which is the functional output of any table columns. This also shows roughly the minimal implementation of the "mixin protocol".

Lastly there is the beginnings of a Coordinates mixin. Unfortunately there is a technical snag that I don't understand.

One of the key concepts here is that (where applicable, e.g. in Time) the internal representation of the object should be stored within the ndarray that is the basis of the Table instance. But now as I'm writing this, maybe that isn't really a hard requirement and everything might just be simpler without doing that. Hmm.

In any case here it is for discussion.

mhvk · 2013-11-27T04:20:50Z

@taldcroft - as you'll see, @wkerzendorf and I had been working on the same problem, but from quite a different angle (#1835). I'll look at your implementation in more detail tomorrow.

wkerzendorf · 2013-11-27T15:48:07Z

@taldcroft so basically your Column is a time object. I'm was wondering - what do we need Columns for then? We have Quantity to store numbers and units, NDData to store numbers, units and metadata, time, coordinates ... . The table could just deal with the formatting. in #1835 there's still a small shim for Columns, but I'm wondering if we couldn't drop columns completely (table would know the column name).

taldcroft · 2013-11-27T16:36:30Z

@wkerzendorf - currently Quantity only handles numbers, not strings, or objects. Both of these need representation in a table. Since NDData is a container class (not inherited from ndarray) it is unsuitable for use as a column class (see below).

Basically my view on any fundamental changes to Table is that they satisfy:

All current tests pass
The behavior of every allowed numpy operation on table columns is unchanged from current. Having deep numpy compatibility is crucial. We decided early on to inherit from ndarray and I think this has proved a good choice.
Full support for all numpy array data types.
Full support for masking (again with no API changes). Even though numpy.ma has its problems, it basically works quite well and there are no competing alternatives. We do not want to roll our own.

Even though Quantity is really great in a lot of contexts, I don't think it is ideal for a base column class. As we learned from Coordinates, there can be substantial overheads that aren't apparent in simple testing. But more importantly it is sufficiently different from ndarray that it would almost certainly create a lot of API breakage.

One thing I have taken away from #1835 is the question of whether we might do away with the underlying structured array that contains the table data in favor of a list of Column or MaskedColumn objects (where I mean the current versions). That might actually simplify things a fair bit, at the expense of making the row access code more complex and slower. A lot of the subtlety in the current code is maintaining the columns as views into the structured array. That might not be needed and would make support for generalized columns much easier.

mhvk · 2013-11-27T16:49:17Z

@taldcroft - While I agree generally with your conditions, I think it is important to note that some current behaviour of Table is wrong. In particular, I would argue that if a Column has a unit, this should be taken into account, and all numpy routines should behave as if this Column held a Quantity. (This is not to say that Column should be a Quantity; string-valued columns are an obvious countercase.)

taldcroft · 2013-11-27T17:14:22Z

I think that treating a column with units as Quantity should be an opt-in choice, not the default. There is a very long legacy of people writing computational code where the units are handled manually and we should not presume to change that tradition over night. I for one do not want columns automatically turning into Quantity. So I could imagine a table method like enable_quantity() that converts Column to Quantity where possible.

wkerzendorf · 2013-11-27T18:17:50Z

@taldcroft Maybe, I should clarify - I didn't mean that a Column would be Quantity. I think a column can be Quantity. A column can also be an ndarray (of the right dimension) that can hold for example strings (objects) and no unit in that case. I wanna do completely away with columns (wondering if that works) and not bring about a new column construct that just inherits from Quantity. Similar where you say now t['b'] == TimeObject. I want that actually to be a time object stored in the table. I envision that for most columns a Quantity would be used (just a few numbers and a unit).

mhvk · 2013-11-27T18:18:28Z

@taldcroft - I'm not sure I agree on the ability to have columns with a unit that are not Quantity, I don't think one should be able to add wrong units in astropy altogether; e.g., do you really want to allow one to join two Column's that have different units?
That said, as far as I currently implemented it, as long as a Column holds an ndarray, its unit attribute is essentially treated as part of meta-data that carries no meaning in doing any calculation.

Anyway, I'm not sure we need to decide this now;; I really need to look at what you do in some detail, and think it would be more productive to focus first on, e.g., the Time-side of things, or, more generally, what does a class need to provide to enable it to become a column in a Table. As you'll see from the TimeArrayRepr class I wrote, our idea was that the only requirement would be for the class to present itself as a single, though possibly structured array, so that it can be view'd as different classes (i.e., column/table data, or back to Time), modulo some trivial resetting of properties. The nice part about this is that Time does not have to know anything about how tables work.

taldcroft · 2013-11-27T19:14:30Z

@mhvk - on the second point of what to decide now, I think we are actually not so far apart in the general view. The idea is that a class like Time needs to be able to participate in all the supported operations of a Table. You did this with a wrapper class TimeArrayRepr while I did it with a protocol that says "to participate in a Table a class must provide certain behaviors (methods, attributes)". The key points in my version are the two methods that basically answer two questions:

Give me the columns corresponding to the "internal" data to put in the table ndarray.
Create on object of this class from specified "internal" data from the table ndarray.

Then there is the data property that gives access to the representation of the object as a single column of values. Note that both of our implementations give you the actual class object when accessing a column. I.e. type(t['time_col']) is Time.

Having done all this, I'm not convinced my approach of putting the internal data into the table _data ndarray is the best path. It works pretty nicely for Time, but I ran into problems for Coordinates where the internal data were segmented in memory (for reasons I don't understand). The added complexity might not be worth it, though I haven't entirely given up.

For these mixin columns there is perhaps no reason to store the internal data in the table _data at all. That simplifies the protocol and implementation considerably and starts to look like your version. The important part of the protocol idea is that when writing table code you need to make certain assumptions about how a column behaves and what attributes are available.

As mentioned earlier one can go further and not have a _data container at all! In this case all the Column objects would hold their own data independently. This is more extreme, and even though _data is nominally private I bet anything there is plenty of code out there that uses it...

mhvk · 2013-11-27T21:36:33Z

@taldcroft - indeed, the background ideas are definitely similar, with the class simply needing to provide the ability to be turned into correctly sequentiable ndarray and the reverse. The one (small) difference is that in my implementation type(t['time_col'] is still a Column, though every attribute including all operators that are not defined in the Column class get interpreted as attributes/methods of the underlying Time object, i.e., the column behaves as if it is a virtual TimeColumn(Column, Time) (reminds me -- I should ensure that isinstance(t['time_col'], Time)==True). The possible advantage is that one still has access to self.parent_table, BaseColumn.__repr__, etc.

The part that I perhaps like most about what I did, is that any ndarray subclass can trivially be made into a Column, with no changes whatsoever needed (at least for the ones we have so far, Quantity, Angle, etc.); I maybe should try to split this out.

But better back to some astronomy now...

taldcroft · 2013-11-28T18:44:12Z

Passing tests now (at least locally).

astrofrog · 2013-11-29T17:50:34Z

This will need rebasing now that #1842 has been merged (it makes Travis work again)

embray · 2013-12-02T17:41:27Z

I need to take a closer look at this, but 👍 in principle. This might end up being very, very useful for FITS tables.

embray · 2013-12-02T18:06:48Z

astropy/table/table.py

Perhaps rather than do all this, it would be better to make a TableMixin abstract base class with this all marked as abstract methods/properties as appropriate. With the appropriate TableMixin.__subclasshook__ classes that we want to use as Table columns don't even need to subclass TableMixin directly (as is currently the case).

taldcroft · 2014-10-21T14:51:03Z

Closing because this PR is incompatible with #2790 and is superceded by #3011.

mhvk mentioned this pull request Nov 27, 2013

Table with generalized columns #1835

Closed

taldcroft mentioned this pull request Nov 27, 2013

Sample implementation allowing Quantity initialisation from Columns #1540

Closed

taldcroft mentioned this pull request Nov 27, 2013

Refactor table Column code into a separate file #1837

Merged

taldcroft added 17 commits November 28, 2013 09:36

Initial proof of concept for ViewColumn

e77f34a

Add basic ViewColumn support in Row (not efficient though)

5481819

Remove units setter and fix problem adding column

93497c4

Change columns to use __print_format__ protocol instead of format attr

5f0ea41

Move ViewColumn into mixins.py

2cafdb8

Add initial column mixin functionality to time

1789503

Add FunctionColumn mixin

4f3e5b4

Pass table to mixin functions instead of just the ndarray data

616f315

Make FunctionColumn use cols not col names

e4f6bd3

Minor additional fixes to get Time mixin working after rebase

02272a6

Rework to make mixins work for typical Table copy and init syntax

53a36cc

Add initial tests of Time mixin

bd13df6

Add test of mixin item access

3794152

Make the FunctionColumn mixin class work and add a test

09f8de7

WIP (broken) Coordinates mixin column

71ceb3c

Fix problem from rebase

181432f

Fix test regressions created by mixin changes

9f54b45

taldcroft added 2 commits November 28, 2013 19:19

Add _has_mixins attr to allow original row code when possible

bf26a43

Fix regression in TimeDelta behavior

9ec5702

mhvk mentioned this pull request Nov 29, 2013

Supporting ndarray subclasses in table columns #1839

Closed

embray reviewed Dec 2, 2013
View reviewed changes

embray mentioned this pull request Dec 11, 2013

Easier subclassing for Table #1888

Merged

astrofrog mentioned this pull request Mar 5, 2014

Release final 0.4.0 version #1802

Closed

24 tasks

astrofrog added this to the v1.0.0 milestone May 3, 2014

taldcroft mentioned this pull request May 14, 2014

Handling nested columns in astropy.table #2485

Closed

astrofrog mentioned this pull request Jul 17, 2014

Release final 1.0 version #2758

Closed

14 tasks

taldcroft mentioned this pull request Jul 24, 2014

Remove numpy structured array as data container in Table #2790

Merged

embray force-pushed the master branch from cd9dfdf to 5d7acd6 Compare August 29, 2014 21:15

taldcroft mentioned this pull request Oct 2, 2014

Support Table mixin columns #3011

Merged

taldcroft self-assigned this Oct 21, 2014

taldcroft added the table label Oct 21, 2014

taldcroft closed this Oct 21, 2014

taldcroft deleted the table/mixin branch October 21, 2014 14:51

embray removed this from the v1.0.0 milestone Oct 21, 2014

Uh oh!

Proof of concept for Table mixins - Time and FunctionColumn #1833

Proof of concept for Table mixins - Time and FunctionColumn #1833

Uh oh!

Conversation

taldcroft commented Nov 27, 2013

Uh oh!

mhvk commented Nov 27, 2013

Uh oh!

wkerzendorf commented Nov 27, 2013

Uh oh!

taldcroft commented Nov 27, 2013

Uh oh!

mhvk commented Nov 27, 2013

Uh oh!

taldcroft commented Nov 27, 2013

Uh oh!

wkerzendorf commented Nov 27, 2013

Uh oh!

mhvk commented Nov 27, 2013

Uh oh!

taldcroft commented Nov 27, 2013

Uh oh!

mhvk commented Nov 27, 2013

Uh oh!

taldcroft commented Nov 28, 2013

Uh oh!

astrofrog commented Nov 29, 2013

Uh oh!

embray commented Dec 2, 2013

Uh oh!

embray Dec 2, 2013

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Oct 21, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants