Skip to content

Conversation

@taldcroft
Copy link
Member

This is an initial demonstration of the idea of Table mixin columns. These are table columns that are not subclassed from Column or ndarray but which implement a protocol that allows them to be part of a Table. Essentially they have a sufficiently close API and can do the operations required by the Table API.

The implementation for Time is the primary demonstration.

>>> from astropy.table import Table, Column
>>> from astropy.time import Time
>>> a = Column([3, 4, 5], name='a')
>>> time_obj = Time([1, 2, 3], format='unix').replicate(format='iso')
>>> t1 = Table([a])
>>> t1['b'] = time_obj
>>> print t1
 a             b           
--- -----------------------
  3 1970-01-01 00:00:01.000
  4 1970-01-01 00:00:02.000
  5 1970-01-01 00:00:03.000

>>> t1['b']  # This is really a Time object
<Time object: scale='utc' format='iso' value=['1970-01-01 00:00:01.000' '1970-01-01 00:00:02.000'
 '1970-01-01 00:00:03.000']>

>>> t1._data  # Internal JD1 and JD2 stored in the array but not visible in table
array([(3, 2440588.0, -0.4999884259259259),
       (4, 2440588.0, -0.49997685185185187),
       (5, 2440588.0, -0.4999652777777778)], 
      dtype=[('a', '<i8'), ('b__jd1', '<f8'), ('b__jd2', '<f8')])

There is also a FunctionColumn class that demonstrates defining a virtual column which is the functional output of any table columns. This also shows roughly the minimal implementation of the "mixin protocol".

Lastly there is the beginnings of a Coordinates mixin. Unfortunately there is a technical snag that I don't understand.

One of the key concepts here is that (where applicable, e.g. in Time) the internal representation of the object should be stored within the ndarray that is the basis of the Table instance. But now as I'm writing this, maybe that isn't really a hard requirement and everything might just be simpler without doing that. Hmm.

In any case here it is for discussion.

@mhvk
Copy link
Contributor

mhvk commented Nov 27, 2013

@taldcroft - as you'll see, @wkerzendorf and I had been working on the same problem, but from quite a different angle (#1835). I'll look at your implementation in more detail tomorrow.

@wkerzendorf
Copy link
Member

@taldcroft so basically your Column is a time object. I'm was wondering - what do we need Columns for then? We have Quantity to store numbers and units, NDData to store numbers, units and metadata, time, coordinates ... . The table could just deal with the formatting. in #1835 there's still a small shim for Columns, but I'm wondering if we couldn't drop columns completely (table would know the column name).

@taldcroft
Copy link
Member Author

@wkerzendorf - currently Quantity only handles numbers, not strings, or objects. Both of these need representation in a table. Since NDData is a container class (not inherited from ndarray) it is unsuitable for use as a column class (see below).

Basically my view on any fundamental changes to Table is that they satisfy:

  1. All current tests pass
  2. The behavior of every allowed numpy operation on table columns is unchanged from current. Having deep numpy compatibility is crucial. We decided early on to inherit from ndarray and I think this has proved a good choice.
  3. Full support for all numpy array data types.
  4. Full support for masking (again with no API changes). Even though numpy.ma has its problems, it basically works quite well and there are no competing alternatives. We do not want to roll our own.

Even though Quantity is really great in a lot of contexts, I don't think it is ideal for a base column class. As we learned from Coordinates, there can be substantial overheads that aren't apparent in simple testing. But more importantly it is sufficiently different from ndarray that it would almost certainly create a lot of API breakage.

One thing I have taken away from #1835 is the question of whether we might do away with the underlying structured array that contains the table data in favor of a list of Column or MaskedColumn objects (where I mean the current versions). That might actually simplify things a fair bit, at the expense of making the row access code more complex and slower. A lot of the subtlety in the current code is maintaining the columns as views into the structured array. That might not be needed and would make support for generalized columns much easier.

@mhvk
Copy link
Contributor

mhvk commented Nov 27, 2013

@taldcroft - While I agree generally with your conditions, I think it is important to note that some current behaviour of Table is wrong. In particular, I would argue that if a Column has a unit, this should be taken into account, and all numpy routines should behave as if this Column held a Quantity. (This is not to say that Column should be a Quantity; string-valued columns are an obvious countercase.)

@taldcroft
Copy link
Member Author

I think that treating a column with units as Quantity should be an opt-in choice, not the default. There is a very long legacy of people writing computational code where the units are handled manually and we should not presume to change that tradition over night. I for one do not want columns automatically turning into Quantity. So I could imagine a table method like enable_quantity() that converts Column to Quantity where possible.

@wkerzendorf
Copy link
Member

@taldcroft Maybe, I should clarify - I didn't mean that a Column would be Quantity. I think a column can be Quantity. A column can also be an ndarray (of the right dimension) that can hold for example strings (objects) and no unit in that case. I wanna do completely away with columns (wondering if that works) and not bring about a new column construct that just inherits from Quantity. Similar where you say now t['b'] == TimeObject. I want that actually to be a time object stored in the table. I envision that for most columns a Quantity would be used (just a few numbers and a unit).

@mhvk
Copy link
Contributor

mhvk commented Nov 27, 2013

@taldcroft - I'm not sure I agree on the ability to have columns with a unit that are not Quantity, I don't think one should be able to add wrong units in astropy altogether; e.g., do you really want to allow one to join two Column's that have different units?
That said, as far as I currently implemented it, as long as a Column holds an ndarray, its unit attribute is essentially treated as part of meta-data that carries no meaning in doing any calculation.

Anyway, I'm not sure we need to decide this now;; I really need to look at what you do in some detail, and think it would be more productive to focus first on, e.g., the Time-side of things, or, more generally, what does a class need to provide to enable it to become a column in a Table. As you'll see from the TimeArrayRepr class I wrote, our idea was that the only requirement would be for the class to present itself as a single, though possibly structured array, so that it can be view'd as different classes (i.e., column/table data, or back to Time), modulo some trivial resetting of properties. The nice part about this is that Time does not have to know anything about how tables work.

@taldcroft
Copy link
Member Author

@mhvk - on the second point of what to decide now, I think we are actually not so far apart in the general view. The idea is that a class like Time needs to be able to participate in all the supported operations of a Table. You did this with a wrapper class TimeArrayRepr while I did it with a protocol that says "to participate in a Table a class must provide certain behaviors (methods, attributes)". The key points in my version are the two methods that basically answer two questions:

  • Give me the columns corresponding to the "internal" data to put in the table ndarray.
  • Create on object of this class from specified "internal" data from the table ndarray.

Then there is the data property that gives access to the representation of the object as a single column of values. Note that both of our implementations give you the actual class object when accessing a column. I.e. type(t['time_col']) is Time.

Having done all this, I'm not convinced my approach of putting the internal data into the table _data ndarray is the best path. It works pretty nicely for Time, but I ran into problems for Coordinates where the internal data were segmented in memory (for reasons I don't understand). The added complexity might not be worth it, though I haven't entirely given up.

For these mixin columns there is perhaps no reason to store the internal data in the table _data at all. That simplifies the protocol and implementation considerably and starts to look like your version. The important part of the protocol idea is that when writing table code you need to make certain assumptions about how a column behaves and what attributes are available.

As mentioned earlier one can go further and not have a _data container at all! In this case all the Column objects would hold their own data independently. This is more extreme, and even though _data is nominally private I bet anything there is plenty of code out there that uses it...

@mhvk
Copy link
Contributor

mhvk commented Nov 27, 2013

@taldcroft - indeed, the background ideas are definitely similar, with the class simply needing to provide the ability to be turned into correctly sequentiable ndarray and the reverse. The one (small) difference is that in my implementation type(t['time_col'] is still a Column, though every attribute including all operators that are not defined in the Column class get interpreted as attributes/methods of the underlying Time object, i.e., the column behaves as if it is a virtual TimeColumn(Column, Time) (reminds me -- I should ensure that isinstance(t['time_col'], Time)==True). The possible advantage is that one still has access to self.parent_table, BaseColumn.__repr__, etc.

The part that I perhaps like most about what I did, is that any ndarray subclass can trivially be made into a Column, with no changes whatsoever needed (at least for the ones we have so far, Quantity, Angle, etc.); I maybe should try to split this out.

But better back to some astronomy now...

@taldcroft
Copy link
Member Author

Passing tests now (at least locally).

@astrofrog
Copy link
Member

This will need rebasing now that #1842 has been merged (it makes Travis work again)

@embray
Copy link
Member

embray commented Dec 2, 2013

I need to take a closer look at this, but 👍 in principle. This might end up being very, very useful for FITS tables.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps rather than do all this, it would be better to make a TableMixin abstract base class with this all marked as abstract methods/properties as appropriate. With the appropriate TableMixin.__subclasshook__ classes that we want to use as Table columns don't even need to subclass TableMixin directly (as is currently the case).

@taldcroft
Copy link
Member Author

Closing because this PR is incompatible with #2790 and is superceded by #3011.

@taldcroft taldcroft closed this Oct 21, 2014
@taldcroft taldcroft deleted the table/mixin branch October 21, 2014 14:51
@embray embray removed this from the v1.0.0 milestone Oct 21, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants