Create ColumnInfo class for mixin columns #3731

taldcroft · 2015-04-26T23:14:28Z

This is an initial implementation for #3518 to create ColumnInfo (aka ColumnAttribute) class for mixin columns.

One point is that for classes with built-in mixin support the info attribute is a class property wrapping _info as a ColumnInfo object, while for external classes (e.g. pandas Series), info is just a plain attribute which is a ColumnInfo object.

Currently all the calls to col_setattr and col_getattr have been left in place, but those methods are now trivial calls to inspect the col.info object. Will get around to doing the replacement once @mhvk is happy with the direction in this PR.

Closes #3518.

mhvk · 2015-05-03T19:22:10Z

@taldcroft - just so you know: this is definitely on my radar and now that teaching/exams are over, I hope to get to it soonish.

taldcroft · 2015-05-03T19:27:40Z

@mhvk - copy. I took off the WIP in the title since I have now entirely removed col_getattr and col_setattr. This is definitely a good thing in terms of the way code now looks. The trade is that any Table column must have the info attribute/property. I think this is worth it.

embray · 2015-05-04T17:50:45Z

This looks super useful 👍

embray · 2015-05-04T17:51:32Z

astropy/coordinates/sky_coordinate.py

Could just use

@lazyproperty def info(self): return ColumnInfo()

:)

@taldcroft - a first comment only yet, but I thought that rather than duplicate code over all classes, one could do something like:

info = property(get_column_info())

where get_column_info is from utils as well and takes a class as an argument, i.e., something like:

def get_column_info(cls=ColumnInfo): def column_info(self): if not hasattr(self, '_info'): self._info = cls(self) return self._info return column_info

For Quantity, one would then have

info = property(get_column_info(QuantityColumnInfo))

But I need to think through this a bit more. In particular, I worry info is a sufficiently general name that for some out-of-astropy classes it is going to be used already.

@mhvk That's what @lazyproperty is. No need for that extra complexity. If you wanted to be able to have different ColumnInfo subclasses that could probably live as an attribute on the Column class or something.

OK, I need to think better. Though we do have a different subclass in Quantity, but that could just be defined like you did above.

@embray - yup, the @lazyproperty looks nice here.

@mhvk - I've also been a little worried about info as being too generic. attrs is probably less common (and also less informative in code). Anyway col_info would be OK by me, though you end up with a lot of col.col_info. Other suggestions?

If I understand in quantity we would just have

@lazyproperty def col_info(self): return QuantityColumnInfo()

That would do it. If you really wanted to go wild, if QuantityColumn is a subclass of Column you could do something in the base class like like

ColumnInfoClass = ColumnInfo @lazyproperty def info(self): return self.ColumnInfoClass()

while QuantityColumn could set ColumnInfoClass = QuantityColumnInfo if need be.

I also agree "info" is vague, but I don't think "col_info" is particularly elucidating since if you're using that attribute you would generally already know that the object you're looking it up on is a column. I think this is a job for documentation just specifying what goes into the ColumnInfo object. This could be documented both in the docs for the .info property, and for the ColumnInfo object itself.

cdeil · 2015-05-24T18:03:20Z

👍 , thanks!

@taldcroft – see https://travis-ci.org/astropy/astropy/jobs/63857641#L1971

cdeil · 2015-05-25T09:47:53Z

astropy/table/info.py

I think this docstring needs to be updated:

ValueError: option=meta is not an allowed information type

Should change to attributes?

taldcroft · 2015-05-25T13:29:18Z

@cdeil - thanks for the useful testing and feedback.

The latest commits do:

Add a shape column to the attributes option if there is any column with non-trivial shape. This shows up in your test file.
Add a n_bad column at the end if there are any columns with at least one "bad" element. For a masked column "bad" means masked. For a non-masked column "bad" means any nan or inf elements. If you have different ideas for defining the number of bad I'm open to suggestion.
I chose "bad" instead of "good" because I personally think that is easier to interpret since bad values are more often in the minority (zero or a few).
Note: if there are no bad elements in the table then the bad column (with all zeros) is not included. I was wondering if this might be confusing, but it follows the idea of this tool that outputs are suppressed if they are all trivial.
Reordered stats per your suggestion.

I'm going to hold off on updating tests and docs until there is agreement that the outputs are OK.

cdeil · 2015-05-25T15:02:57Z

+1 to n_bad instead of n_good and to show the shape.

Can you change the default stats functions so that one gets e.g. a min / max for Conf_68_PosAng in the Fermi catalog example table? Currently I always get nan if there's a nan in a float column:

 Conf_68_SemiMajor           nan           nan         nan         nan    28
   Conf_68_SemiMinor           nan           nan         nan         nan    28
      Conf_68_PosAng           nan           nan         nan         nan    28
   Conf_95_SemiMajor           nan           nan         nan         nan    28
   Conf_95_SemiMinor           nan           nan         nan         nan    28
      Conf_95_PosAng           nan           nan         nan         nan    28

taldcroft · 2015-05-25T15:43:33Z

@mhvk - the last commit is a possible implementation for formatting mixins properly within tables:

In [2]: t = QTable([[1,2]*u.m, SkyCoord([1,2], [1,2], unit='deg,deg')])

In [3]: t
Out[3]: 
<QTable masked=False length=2>
  col0    col1 
   m    deg,deg
float64  object
------- -------
    1.0 1.0 1.0
    2.0 2.0 2.0

taldcroft · 2015-05-25T16:10:41Z

@cdeil - I think 49af58f does the trick. This gives a stats output that looks similar to stilts, except for having nan in some places where stilts has a blank.

cdeil · 2015-05-25T17:37:36Z

astropy/table/info.py

@taldcroft – I think the None option to return the OrderedDict isn't implemented yet, right?

In [23]: t.info(out=None) --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-23-ec9644a5a55b> in <module>() ----> 1 t.info(out=None) /Users/deil/code/astropy/astropy/table/info.py in info(self, option, out) 86 87 outlines.extend(info.pformat(max_width=-1, max_lines=-1, show_unit=False)) ---> 88 out.writelines(outline + os.linesep for outline in outlines) AttributeError: 'NoneType' object has no attribute 'writelines'

I think it would be useful to be able to get a Table or OrderedDict back, so that it's possible to e.g. store the stats.

A table would seem best, since this is created anyway!

Done in 0455842.

mhvk · 2015-05-25T19:07:17Z

@taldcroft - one nice thing about meetings is that sometimes one gets to skip a session and do some astropy... I now had a fairly detailed look and I like this ColumnInfo PR very much. Indeed, so nice that I have no bigger comments than that I wonder why column_info.py is in astropy.utils rather than astropy.table. (More detailed comments will follow in-line.)

p.s. With the combination of getting information from columns, info now is, I think, just the right name.

mhvk · 2015-05-25T19:10:12Z

astropy/units/quantity.py

This call should not be necessary, right?

Ah yes, it's a leftover from when __init__ here was doing something. I just blindly deleted that one line.

taldcroft · 2015-05-25T19:22:13Z

I wonder why column_info.py is in astropy.utils rather than astropy.table

That allows Quantity, Time et al. to import ColumnInfo without importing all of astropy.table. I struggled with this a bit but as far as I know you can't get astropy.table.whatever without astropy.table.__init__ running.

mhvk · 2015-05-25T19:25:24Z

astropy/coordinates/sky_coordinate.py

Here, clearly a function is useful, but I'm a bit confused about the need for a cache. Is this for the case that one changes the representation half-way through, and then perhaps change it back? Since this is just a list comprehension loop, is this worth it? Could one just calculate it on the fly?

Separately, I think it would in principle be nice if col.info.default_format gave a string that would give a good clue on how to set format to do something else. Could one do this using my default_format suggestion and implement it as a property as follows?

@property def default_format(self): if self.parent_col is None: raise ValueError("Default format is undefined for a SkyCoordColumnInfo " "instance not attached to a SkyCoordObject.") formats = ['{0.' + compname + '.value:}' for compname in self._parent_col().data.components] return ' '.join(formats)

p.s. Is .value is necessarily the best option? Since the components could be Angle, one could also get hhmmss etc... But that probably is best left for later.

taldcroft · 2015-06-11T17:02:11Z

@mhvk - I need to add a test of the code that removes the class column, but ran out of time before some meetings today. Will add the test this evening. In the meantime you can look at a900297.

taldcroft · 2015-06-11T17:04:11Z

Ack, rushing and made one mistake that is fixed in bbb76e4.

mhvk · 2015-06-11T17:06:01Z

OK, looks good so far!

taldcroft · 2015-06-12T18:28:18Z

@mhvk - added 4 more commits, including 41416c4 that makes Table.info into an InfoDescriptor to give more symmetry with column info. Fingers crossed on tests, I ran a number of configurations locally.

mhvk · 2015-06-12T18:46:28Z

@taldcroft - travis fails since Column().info now gives the class among its properties (correctly, I think, so the test needs adjusting).

But I was surprised by your expectation that a QTable of two quantity columns should list the class. I think this is actually a bug in _is_mixin_column right now:

In [24]: Table()._is_mixin_column([10.]*u.m)
Out[24]: False

In [25]: QTable()._is_mixin_column([10.]*u.m)
Out[25]: True

This should be the other way around, right? If so, just swap True to False in the two definitions.

While you are at it: maybe also redefine has_mixin_columns to:

    @property
    def has_mixin_columns(self):
        """
        True if table has any mixin columns (defined as columns that are not Column
        subclasses)
        """
        return any(self._is_mixin_column(col) for col in self.columns.values())

taldcroft · 2015-06-12T19:22:14Z

@mhvk - tl;dr: the current logic is correct and lots of stuff breaks with the swap you suggest.

Granted, it's always confusing even to me. The _is_mixin_column method does this: Determine if col meets the protocol for a mixin Table column for this table.

For a QTable, a Quantity instance will stay as Quantity when added to the table, hence it IS a mixin column. Conversely, for a Table, a Quantity instance will be converted to a plain Column, so it IS NOT a mixin column.

Maybe the method should be named _will_be_a_mixin_column_when_added_to_table().

mhvk · 2015-06-12T20:03:15Z

@taldcroft - OK, that makes sense, though indeed the name is confusing.

More to the point: I think for a QTable, one should not list the Quantity class if all columns are Quantity, since that is really the "expected" column. I'll make a suggestion in-line on how to change it...

mhvk · 2015-06-12T20:09:08Z

astropy/table/info.py

OK, the simplest would just bet:

if len(uniq_types) == 1 and not tbl._is_mixin_column(cols[0], quantity_is_mixin=False):

It is sufficiently hard to get a Quantity into a regular table that I think it is OK that we do not do the right thing for a regular table with only Quantity columns....

Here is a logically equivalent version that is less obscure, but still relies on the fact that one cannot have a Quantity-mixin in a Table (it will always be converted to Column):

if 'class' in info.colnames: # Remove 'class' info column if all table columns are the same class # and they are either the default class for that table or else Quantity. uniq_types = set(type(col) for col in cols) if len(uniq_types) == 1 and isinstance(cols[0], (tbl.ColumnClass, u.Quantity)): del info['class']

When I wrote this I started thinking "why is Quantity so special?". It's just another mixin class, and has some of the same failings that other mixins have, most notably lack of masking. The fact that there is a QTable is only to differentiate what to do with a thingy-with-units when added to a table (Quantity or Column).

But in the end Quantity is just another mixin along with the rest, and I think it's important to always highlight this distinction by explicitly showing the class just as we would for a QTable full of Coordinates.

I'm inclined to go with:

if len(uniq_types) == 1 and isinstance(cols[0], tbl.ColumnClass):

In the end if it's just horrible this is a soft API (just an informational output) so we can change it.

mhvk · 2015-06-12T21:38:14Z

@taldcroft - OK, am convinced! I like the logic that if a table has just the type of columns it expects, we remove the class entry.

mhvk · 2015-06-12T23:56:50Z

@taldcroft - I had a final look and it is great! I would have merged, too, had it not been for the appveyor. It is something in access_table.rst, but I must admit I do not understand what is wrong...

taldcroft · 2015-06-13T00:23:27Z

Oh my, frustration. Will try once more skipping the doc tests that might be causing the problem. Hopefully we can revert to running the doc tests later, but at least we know they did pass on linux and mac.

Create ColumnInfo class for mixin columns

mhvk · 2015-06-13T14:46:02Z

Weird bug. But 🎆 for having this!

cdeil · 2015-06-13T15:44:44Z

@taldcroft – Thanks! This is great, I'll use this all the time for interactive work with tables.

If I try the Fermi catalog example again, I get a bunch or warnings:
https://gist.github.com/cdeil/97a7be42096cfe71f5c5#file-gistfile1-txt-L95
Is it important to have these warnings for this table for some reason?
Or would it be better to catch them and not show them to the end-user?

embray · 2015-08-19T19:53:18Z

astropy/utils/data_info.py

I left a comment on some commit somewhere questioning the need for this class, versus just making DataInfo itself the descriptor. But now I can't find that original comment anymore because GitHub.

Anyways, I see now why it was done this way--it seems to me you wanted to be able to instantiate a DataInfo only when this attribute is accessed via an instance of the class that this descriptor belongs to, and then to attach the _parent_ref.

Instead, if DataInfo (and friends) are themselves descriptors, you can still have an instance of the descriptor that lives in the class definition like:

class Table(object): info = TableInfo()

this TableInfo is referred to as an unbound descriptor similarly to "unbound method". It's like a template for later descriptors that will be bound to individual instances of the class. Accessing most attributes or methods of the unbound descriptor should probably fail (for example by checking if self._parent_ref is None). Though the unbound descriptor could still have some use--for example its __repr__ can display what attributes the info object carries.

The __get__ method can then return a bound version of the descriptor of the same class, the only difference being that it has its ._parent_ref set. Incidentally the __get__ implemented below will break when you try to look at the descriptor through the class:

In [1]: from astropy.table import Table In [2]: Table.info ERROR: AttributeError: 'NoneType' object has no attribute '__dict__' [astropy.utils.data_info] --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-2-915a8e227a9a> in <module>() ----> 1 Table.info /internal/1/root/src/astropy/astropy/astropy/utils/data_info.py in __get__(self, instance, owner_cls) 154 155 def __get__(self, instance, owner_cls): --> 156 if 'info' not in instance.__dict__: 157 instance.__dict__['info'] = self.info_cls() 158 instance.__dict__['info']._parent = instance AttributeError: 'NoneType' object has no attribute '__dict__'

This is because accessing a descriptor through the class goes through __get__ too, but instance is None in this case. It's typical for this case to just:

if instance is None: return self

where self is the unbound descriptor.

Then, to make the bound descriptor you do something like:

info = instance.__dict__.get('info') if info is None: info = self.__class__() # Or copy or whatever info._parent_ref = weakref.ref(instance) instance.__dict__['info'] = info return info

The one downside to the above approach is that the __init__ of the info class can't rely on _parent_ref. If this is undesirable you could also use the ugly but perfectly valid approach of:

info = self.__new__(self.__class__) info._parent_ref = weakref.ref(instance) info.__init__()

@embray - it would seem more elegant to make DataInfo itself a descriptor, following the procedure you describe. I must admit, though, I still do not fully understand it. In particular, would it also solve the large speed problems we had when info is present? Eg., in Quantity.__array_finalize__, can we still recognize that the descriptor is as yet uninitialized? (From the above it does seem so...). If so, to avoid loosing this again, could you repost your comment as a new issue, so it doesn't get lost again?

The one downside to the above approach is that the init of the info class can't rely on _parent_ref.

I'm not sure what you mean exactly by "can't rely on", but definitely _parent_ref needs to be set every time __get__ is called. I had some issues along the way copying / transferring info and doing this on-the-fly parent referencing made things work better. Apart from the bug with accessing info on the class object (I knew about this but forgot to address it), is there any problem in your approach of moving the info._parent_ref = weakref.ref(instance) statement outside the if block?

I also don't understand how the "ugly but perfectly valid" approach is different from the first. I'm expecting another lesson in advanced Python. 😄

I guess it will be more elegant to be able to say info = QuantifyInfo() in the class definition, but we'll definitely need good code comments so mortals (including myself) of the future can understand what's happening. I think my implementation is more easily understood by being very obvious about what is the info container and what is the descriptor.

What I mean is that with the bound/unbound descriptor pattern, if you want to be able to something in the class's __init__, that __init__ needs to handle being called in two different contexts:

class definition, in which case you're initializing an unbound instance of the descriptor--this is the one that lives on the class itself and provides the descriptor protocol:

class MyClass: info = DataInfo() # <-- DataInfo.__init__ called here

Creating a bound instance--when __get__ is called via the descriptor protocol, the idea is that you then create a new DataInfo instance that is bound to some instance of the parent class:

def __get__(self, inst, type=None): if inst is not None: bound_info = self.__class__() bound_info._parent_ref = inst # Or alternatively if the __init__ should be able to do something with _parent_ref: # bound_info = self.__new__(self.__class__) # bound_info._parent_ref = inst # This is what makes it 'bound' # bound_info.__init__() return bound_info

So if you want the DataInfo.__init__ to be able to do something different for bound instances you have to check in the __init__ for if _parent_ref is set or not.

An alternative is to use different subclasses for the Unbound and Bound versions. That might be simpler to see.

See how the astropy.modeling.Parameter class is implemented for reference. I think ultimately I'm trying to point out that since we already use this pattern elsewhere in Astropy it would make sense to use generally where appropriate. Maybe I can figure out a way to make it easy to implement this pattern generically, like I did with OrderedDescriptor

Maybe a simple way to state this pattern is:

Say what you mean.

So instead of the indirection of:

class MyClass: info = InfoDescriptor(DataInfo)

which says "info is a thing that somehow when I access it will be a DataInfo object". It forces the reader to have to think about descriptors and how that's all implemented. Where what you really want is to be able to write:

class MyClass: info = DataInfo()

which you can read as "info is a DataInfo".

taldcroft added table time units coordinates Affects-release Package-expert Effort-medium labels Apr 26, 2015

taldcroft added this to the v1.1.0 milestone Apr 26, 2015

taldcroft assigned mhvk Apr 26, 2015

taldcroft changed the title ~~WIP: create ColumnInfo class for mixin columns~~ Create ColumnInfo class for mixin columns May 3, 2015

embray reviewed May 4, 2015
View reviewed changes

taldcroft mentioned this pull request May 24, 2015

Add table info() method along with test and doc updates #3455

Closed

taldcroft force-pushed the table-colinfo branch from 4b7c821 to a6b28f5 Compare May 24, 2015 15:14

taldcroft mentioned this pull request May 24, 2015

Allow reshape, ravel, transpose etc. of Time objects #3224

Merged

cdeil reviewed May 25, 2015
View reviewed changes

mhvk reviewed May 25, 2015
View reviewed changes

Remove leftover code [skip ci]

bbb76e4

taldcroft added 4 commits June 12, 2015 13:23

Add test that class info col is suppressed correctly

a43a32d

Change col -> dat and column -> data object

1dc0bcc

Make Table.info be InfoDescriptor and update/expand docs

41416c4

Update CHANGES.rst

9ae9c69

mhvk reviewed Jun 12, 2015
View reviewed changes

Use simpler criterion for deleting info['class']; fix docstring fail

ed8a3e1

taldcroft mentioned this pull request Jun 13, 2015

Problem with #doctest: +FLOAT_CMP on Windows Py 2.6 ? #3853

Closed

Skip some doc tests to avoid issue astropy#3853

8c850ef

taldcroft added a commit that referenced this pull request Jun 13, 2015

Merge pull request #3731 from taldcroft/table-colinfo

fba810d

Create ColumnInfo class for mixin columns

taldcroft merged commit fba810d into astropy:master Jun 13, 2015

taldcroft deleted the table-colinfo branch June 13, 2015 02:37

taldcroft mentioned this pull request Jun 14, 2015

Suppress numpy Nan-related warnings in table info generation #3855

Merged

taldcroft mentioned this pull request Jul 15, 2015

POC: tables within tables within tables! #3963

Closed

taldcroft mentioned this pull request Jul 24, 2015

Retry --open-files check once after garbage collection #4002

Closed

embray removed the Ready-for-final-review label Aug 19, 2015

embray reviewed Aug 19, 2015
View reviewed changes

taldcroft mentioned this pull request Jun 28, 2016

Add method to summarize table column info #2999

Closed

Uh oh!

Create ColumnInfo class for mixin columns #3731

Create ColumnInfo class for mixin columns #3731

Uh oh!

Conversation

taldcroft commented Apr 26, 2015

Uh oh!

mhvk commented May 3, 2015

Uh oh!

taldcroft commented May 3, 2015

Uh oh!

embray commented May 4, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cdeil commented May 24, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented May 25, 2015

Uh oh!

cdeil commented May 25, 2015

Uh oh!

taldcroft commented May 25, 2015

Uh oh!

taldcroft commented May 25, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mhvk commented May 25, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented May 25, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Jun 11, 2015

Uh oh!

taldcroft commented Jun 11, 2015

Uh oh!

mhvk commented Jun 11, 2015

Uh oh!

taldcroft commented Jun 12, 2015

Uh oh!

mhvk commented Jun 12, 2015

Uh oh!

taldcroft commented Jun 12, 2015

Uh oh!

mhvk commented Jun 12, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mhvk commented Jun 12, 2015

Uh oh!

mhvk commented Jun 12, 2015