Skip to content

Conversation

@taldcroft
Copy link
Member

This is an initial implementation for #3518 to create ColumnInfo (aka ColumnAttribute) class for mixin columns.

One point is that for classes with built-in mixin support the info attribute is a class property wrapping _info as a ColumnInfo object, while for external classes (e.g. pandas Series), info is just a plain attribute which is a ColumnInfo object.

Currently all the calls to col_setattr and col_getattr have been left in place, but those methods are now trivial calls to inspect the col.info object. Will get around to doing the replacement once @mhvk is happy with the direction in this PR.

Closes #3518.

@taldcroft taldcroft added this to the v1.1.0 milestone Apr 26, 2015
@taldcroft taldcroft changed the title WIP: create ColumnInfo class for mixin columns Create ColumnInfo class for mixin columns May 3, 2015
@mhvk
Copy link
Contributor

mhvk commented May 3, 2015

@taldcroft - just so you know: this is definitely on my radar and now that teaching/exams are over, I hope to get to it soonish.

@taldcroft
Copy link
Member Author

@mhvk - copy. I took off the WIP in the title since I have now entirely removed col_getattr and col_setattr. This is definitely a good thing in terms of the way code now looks. The trade is that any Table column must have the info attribute/property. I think this is worth it.

@embray
Copy link
Member

embray commented May 4, 2015

This looks super useful 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just use

@lazyproperty
def info(self):
    return ColumnInfo()

:)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taldcroft - a first comment only yet, but I thought that rather than duplicate code over all classes, one could do something like:

info = property(get_column_info())

where get_column_info is from utils as well and takes a class as an argument, i.e., something like:

def get_column_info(cls=ColumnInfo):
    def column_info(self):
        if not hasattr(self, '_info'):
            self._info = cls(self)
         return self._info
    return column_info

For Quantity, one would then have

    info = property(get_column_info(QuantityColumnInfo))

But I need to think through this a bit more. In particular, I worry info is a sufficiently general name that for some out-of-astropy classes it is going to be used already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk That's what @lazyproperty is. No need for that extra complexity. If you wanted to be able to have different ColumnInfo subclasses that could probably live as an attribute on the Column class or something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I need to think better. Though we do have a different subclass in Quantity, but that could just be defined like you did above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embray - yup, the @lazyproperty looks nice here.

@mhvk - I've also been a little worried about info as being too generic. attrs is probably less common (and also less informative in code). Anyway col_info would be OK by me, though you end up with a lot of col.col_info. Other suggestions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand in quantity we would just have

@lazyproperty
def col_info(self):
    return QuantityColumnInfo()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would do it. If you really wanted to go wild, if QuantityColumn is a subclass of Column you could do something in the base class like like

ColumnInfoClass = ColumnInfo

@lazyproperty
def info(self):
    return self.ColumnInfoClass()

while QuantityColumn could set ColumnInfoClass = QuantityColumnInfo if need be.

I also agree "info" is vague, but I don't think "col_info" is particularly elucidating since if you're using that attribute you would generally already know that the object you're looking it up on is a column. I think this is a job for documentation just specifying what goes into the ColumnInfo object. This could be documented both in the docs for the .info property, and for the ColumnInfo object itself.

@cdeil
Copy link
Member

cdeil commented May 24, 2015

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this docstring needs to be updated:

ValueError: option=meta is not an allowed information type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should change to attributes?

@taldcroft
Copy link
Member Author

@cdeil - thanks for the useful testing and feedback.

The latest commits do:

  • Add a shape column to the attributes option if there is any column with non-trivial shape. This shows up in your test file.
  • Add a n_bad column at the end if there are any columns with at least one "bad" element. For a masked column "bad" means masked. For a non-masked column "bad" means any nan or inf elements. If you have different ideas for defining the number of bad I'm open to suggestion.
  • I chose "bad" instead of "good" because I personally think that is easier to interpret since bad values are more often in the minority (zero or a few).
  • Note: if there are no bad elements in the table then the bad column (with all zeros) is not included. I was wondering if this might be confusing, but it follows the idea of this tool that outputs are suppressed if they are all trivial.
  • Reordered stats per your suggestion.

I'm going to hold off on updating tests and docs until there is agreement that the outputs are OK.

@cdeil
Copy link
Member

cdeil commented May 25, 2015

+1 to n_bad instead of n_good and to show the shape.

Can you change the default stats functions so that one gets e.g. a min / max for Conf_68_PosAng in the Fermi catalog example table? Currently I always get nan if there's a nan in a float column:

 Conf_68_SemiMajor           nan           nan         nan         nan    28
   Conf_68_SemiMinor           nan           nan         nan         nan    28
      Conf_68_PosAng           nan           nan         nan         nan    28
   Conf_95_SemiMajor           nan           nan         nan         nan    28
   Conf_95_SemiMinor           nan           nan         nan         nan    28
      Conf_95_PosAng           nan           nan         nan         nan    28

@taldcroft
Copy link
Member Author

@mhvk - the last commit is a possible implementation for formatting mixins properly within tables:

In [2]: t = QTable([[1,2]*u.m, SkyCoord([1,2], [1,2], unit='deg,deg')])

In [3]: t
Out[3]: 
<QTable masked=False length=2>
  col0    col1 
   m    deg,deg
float64  object
------- -------
    1.0 1.0 1.0
    2.0 2.0 2.0

@taldcroft
Copy link
Member Author

@cdeil - I think 49af58f does the trick. This gives a stats output that looks similar to stilts, except for having nan in some places where stilts has a blank.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taldcroft – I think the None option to return the OrderedDict isn't implemented yet, right?

In [23]: t.info(out=None)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-ec9644a5a55b> in <module>()
----> 1 t.info(out=None)

/Users/deil/code/astropy/astropy/table/info.py in info(self, option, out)
     86 
     87     outlines.extend(info.pformat(max_width=-1, max_lines=-1, show_unit=False))
---> 88     out.writelines(outline + os.linesep for outline in outlines)

AttributeError: 'NoneType' object has no attribute 'writelines'

I think it would be useful to be able to get a Table or OrderedDict back, so that it's possible to e.g. store the stats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A table would seem best, since this is created anyway!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0455842.

@mhvk
Copy link
Contributor

mhvk commented May 25, 2015

@taldcroft - one nice thing about meetings is that sometimes one gets to skip a session and do some astropy... I now had a fairly detailed look and I like this ColumnInfo PR very much. Indeed, so nice that I have no bigger comments than that I wonder why column_info.py is in astropy.utils rather than astropy.table. (More detailed comments will follow in-line.)

p.s. With the combination of getting information from columns, info now is, I think, just the right name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call should not be necessary, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, it's a leftover from when __init__ here was doing something. I just blindly deleted that one line.

@taldcroft
Copy link
Member Author

I wonder why column_info.py is in astropy.utils rather than astropy.table

That allows Quantity, Time et al. to import ColumnInfo without importing all of astropy.table. I struggled with this a bit but as far as I know you can't get astropy.table.whatever without astropy.table.__init__ running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, clearly a function is useful, but I'm a bit confused about the need for a cache. Is this for the case that one changes the representation half-way through, and then perhaps change it back? Since this is just a list comprehension loop, is this worth it? Could one just calculate it on the fly?

Separately, I think it would in principle be nice if col.info.default_format gave a string that would give a good clue on how to set format to do something else. Could one do this using my default_format suggestion and implement it as a property as follows?

@property
def default_format(self):
    if self.parent_col is None:
       raise ValueError("Default format is undefined for a SkyCoordColumnInfo "
                        "instance not attached to a SkyCoordObject.")
    formats = ['{0.' + compname + '.value:}' for compname
               in self._parent_col().data.components]
    return ' '.join(formats)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. Is .value is necessarily the best option? Since the components could be Angle, one could also get hhmmss etc... But that probably is best left for later.

@taldcroft
Copy link
Member Author

@mhvk - I need to add a test of the code that removes the class column, but ran out of time before some meetings today. Will add the test this evening. In the meantime you can look at a900297.

@taldcroft
Copy link
Member Author

Ack, rushing and made one mistake that is fixed in bbb76e4.

@mhvk
Copy link
Contributor

mhvk commented Jun 11, 2015

OK, looks good so far!

@taldcroft
Copy link
Member Author

@mhvk - added 4 more commits, including 41416c4 that makes Table.info into an InfoDescriptor to give more symmetry with column info. Fingers crossed on tests, I ran a number of configurations locally.

@mhvk
Copy link
Contributor

mhvk commented Jun 12, 2015

@taldcroft - travis fails since Column().info now gives the class among its properties (correctly, I think, so the test needs adjusting).

But I was surprised by your expectation that a QTable of two quantity columns should list the class. I think this is actually a bug in _is_mixin_column right now:

In [24]: Table()._is_mixin_column([10.]*u.m)
Out[24]: False

In [25]: QTable()._is_mixin_column([10.]*u.m)
Out[25]: True

This should be the other way around, right? If so, just swap True to False in the two definitions.

While you are at it: maybe also redefine has_mixin_columns to:

    @property
    def has_mixin_columns(self):
        """
        True if table has any mixin columns (defined as columns that are not Column
        subclasses)
        """
        return any(self._is_mixin_column(col) for col in self.columns.values())

@taldcroft
Copy link
Member Author

@mhvk - tl;dr: the current logic is correct and lots of stuff breaks with the swap you suggest.

Granted, it's always confusing even to me. The _is_mixin_column method does this: Determine if col meets the protocol for a mixin Table column for this table.

For a QTable, a Quantity instance will stay as Quantity when added to the table, hence it IS a mixin column. Conversely, for a Table, a Quantity instance will be converted to a plain Column, so it IS NOT a mixin column.

Maybe the method should be named _will_be_a_mixin_column_when_added_to_table().

@mhvk
Copy link
Contributor

mhvk commented Jun 12, 2015

@taldcroft - OK, that makes sense, though indeed the name is confusing.

More to the point: I think for a QTable, one should not list the Quantity class if all columns are Quantity, since that is really the "expected" column. I'll make a suggestion in-line on how to change it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the simplest would just bet:

if len(uniq_types) == 1 and not tbl._is_mixin_column(cols[0], quantity_is_mixin=False):

It is sufficiently hard to get a Quantity into a regular table that I think it is OK that we do not do the right thing for a regular table with only Quantity columns....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a logically equivalent version that is less obscure, but still relies on the fact that one cannot have a Quantity-mixin in a Table (it will always be converted to Column):

    if 'class' in info.colnames:
        # Remove 'class' info column if all table columns are the same class
        # and they are either the default class for that table or else Quantity.
        uniq_types = set(type(col) for col in cols)
        if len(uniq_types) == 1 and isinstance(cols[0], (tbl.ColumnClass, u.Quantity)):
            del info['class']

When I wrote this I started thinking "why is Quantity so special?". It's just another mixin class, and has some of the same failings that other mixins have, most notably lack of masking. The fact that there is a QTable is only to differentiate what to do with a thingy-with-units when added to a table (Quantity or Column).

But in the end Quantity is just another mixin along with the rest, and I think it's important to always highlight this distinction by explicitly showing the class just as we would for a QTable full of Coordinates.

I'm inclined to go with:

        if len(uniq_types) == 1 and isinstance(cols[0], tbl.ColumnClass):

In the end if it's just horrible this is a soft API (just an informational output) so we can change it.

@mhvk
Copy link
Contributor

mhvk commented Jun 12, 2015

@taldcroft - OK, am convinced! I like the logic that if a table has just the type of columns it expects, we remove the class entry.

@mhvk
Copy link
Contributor

mhvk commented Jun 12, 2015

@taldcroft - I had a final look and it is great! I would have merged, too, had it not been for the appveyor. It is something in access_table.rst, but I must admit I do not understand what is wrong...

@taldcroft
Copy link
Member Author

Oh my, frustration. Will try once more skipping the doc tests that might be causing the problem. Hopefully we can revert to running the doc tests later, but at least we know they did pass on linux and mac.

taldcroft added a commit that referenced this pull request Jun 13, 2015
Create ColumnInfo class for mixin columns
@taldcroft taldcroft merged commit fba810d into astropy:master Jun 13, 2015
@taldcroft taldcroft deleted the table-colinfo branch June 13, 2015 02:37
@mhvk
Copy link
Contributor

mhvk commented Jun 13, 2015

Weird bug. But 🎆 for having this!

@cdeil
Copy link
Member

cdeil commented Jun 13, 2015

@taldcroft – Thanks! This is great, I'll use this all the time for interactive work with tables.

If I try the Fermi catalog example again, I get a bunch or warnings:
https://gist.github.com/cdeil/97a7be42096cfe71f5c5#file-gistfile1-txt-L95
Is it important to have these warnings for this table for some reason?
Or would it be better to catch them and not show them to the end-user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment on some commit somewhere questioning the need for this class, versus just making DataInfo itself the descriptor. But now I can't find that original comment anymore because GitHub.

Anyways, I see now why it was done this way--it seems to me you wanted to be able to instantiate a DataInfo only when this attribute is accessed via an instance of the class that this descriptor belongs to, and then to attach the _parent_ref.

Instead, if DataInfo (and friends) are themselves descriptors, you can still have an instance of the descriptor that lives in the class definition like:

class Table(object):
    info = TableInfo()

this TableInfo is referred to as an unbound descriptor similarly to "unbound method". It's like a template for later descriptors that will be bound to individual instances of the class. Accessing most attributes or methods of the unbound descriptor should probably fail (for example by checking if self._parent_ref is None). Though the unbound descriptor could still have some use--for example its __repr__ can display what attributes the info object carries.

The __get__ method can then return a bound version of the descriptor of the same class, the only difference being that it has its ._parent_ref set. Incidentally the __get__ implemented below will break when you try to look at the descriptor through the class:

In [1]: from astropy.table import Table

In [2]: Table.info
ERROR: AttributeError: 'NoneType' object has no attribute '__dict__' [astropy.utils.data_info]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-915a8e227a9a> in <module>()
----> 1 Table.info

/internal/1/root/src/astropy/astropy/astropy/utils/data_info.py in __get__(self, instance, owner_cls)
    154 
    155     def __get__(self, instance, owner_cls):
--> 156         if 'info' not in instance.__dict__:
    157             instance.__dict__['info'] = self.info_cls()
    158         instance.__dict__['info']._parent = instance

AttributeError: 'NoneType' object has no attribute '__dict__'

This is because accessing a descriptor through the class goes through __get__ too, but instance is None in this case. It's typical for this case to just:

if instance is None:
    return self

where self is the unbound descriptor.

Then, to make the bound descriptor you do something like:

info = instance.__dict__.get('info')
if info is None:
    info = self.__class__()  # Or copy or whatever
    info._parent_ref = weakref.ref(instance)
    instance.__dict__['info'] = info

return info

The one downside to the above approach is that the __init__ of the info class can't rely on _parent_ref. If this is undesirable you could also use the ugly but perfectly valid approach of:

info = self.__new__(self.__class__)
info._parent_ref = weakref.ref(instance)
info.__init__()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@embray - it would seem more elegant to make DataInfo itself a descriptor, following the procedure you describe. I must admit, though, I still do not fully understand it. In particular, would it also solve the large speed problems we had when info is present? Eg., in Quantity.__array_finalize__, can we still recognize that the descriptor is as yet uninitialized? (From the above it does seem so...). If so, to avoid loosing this again, could you repost your comment as a new issue, so it doesn't get lost again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one downside to the above approach is that the init of the info class can't rely on _parent_ref.

I'm not sure what you mean exactly by "can't rely on", but definitely _parent_ref needs to be set every time __get__ is called. I had some issues along the way copying / transferring info and doing this on-the-fly parent referencing made things work better. Apart from the bug with accessing info on the class object (I knew about this but forgot to address it), is there any problem in your approach of moving the info._parent_ref = weakref.ref(instance) statement outside the if block?

I also don't understand how the "ugly but perfectly valid" approach is different from the first. I'm expecting another lesson in advanced Python. 😄

I guess it will be more elegant to be able to say info = QuantifyInfo() in the class definition, but we'll definitely need good code comments so mortals (including myself) of the future can understand what's happening. I think my implementation is more easily understood by being very obvious about what is the info container and what is the descriptor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that with the bound/unbound descriptor pattern, if you want to be able to something in the class's __init__, that __init__ needs to handle being called in two different contexts:

  1. class definition, in which case you're initializing an unbound instance of the descriptor--this is the one that lives on the class itself and provides the descriptor protocol:
class MyClass:
    info = DataInfo()  # <-- DataInfo.__init__ called here
  1. Creating a bound instance--when __get__ is called via the descriptor protocol, the idea is that you then create a new DataInfo instance that is bound to some instance of the parent class:
def __get__(self, inst, type=None):
    if inst is not None:
        bound_info = self.__class__()
        bound_info._parent_ref = inst
        # Or alternatively if the __init__ should be able to do something with _parent_ref:
        # bound_info = self.__new__(self.__class__)
        # bound_info._parent_ref = inst  # This is what makes it 'bound'
        # bound_info.__init__()
        return bound_info

So if you want the DataInfo.__init__ to be able to do something different for bound instances you have to check in the __init__ for if _parent_ref is set or not.

An alternative is to use different subclasses for the Unbound and Bound versions. That might be simpler to see.

See how the astropy.modeling.Parameter class is implemented for reference. I think ultimately I'm trying to point out that since we already use this pattern elsewhere in Astropy it would make sense to use generally where appropriate. Maybe I can figure out a way to make it easy to implement this pattern generically, like I did with OrderedDescriptor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a simple way to state this pattern is:

Say what you mean.

So instead of the indirection of:

class MyClass:
    info = InfoDescriptor(DataInfo)

which says "info is a thing that somehow when I access it will be a DataInfo object". It forces the reader to have to think about descriptors and how that's all implemented. Where what you really want is to be able to write:

class MyClass:
    info = DataInfo()

which you can read as "info is a DataInfo".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create ColumnAttribute class (or similar) for mixin columns

4 participants