-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Create ColumnInfo class for mixin columns #3731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@taldcroft - just so you know: this is definitely on my radar and now that teaching/exams are over, I hope to get to it soonish. |
|
@mhvk - copy. I took off the WIP in the title since I have now entirely removed |
|
This looks super useful 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could just use
@lazyproperty
def info(self):
return ColumnInfo():)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taldcroft - a first comment only yet, but I thought that rather than duplicate code over all classes, one could do something like:
info = property(get_column_info())
where get_column_info is from utils as well and takes a class as an argument, i.e., something like:
def get_column_info(cls=ColumnInfo):
def column_info(self):
if not hasattr(self, '_info'):
self._info = cls(self)
return self._info
return column_info
For Quantity, one would then have
info = property(get_column_info(QuantityColumnInfo))
But I need to think through this a bit more. In particular, I worry info is a sufficiently general name that for some out-of-astropy classes it is going to be used already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhvk That's what @lazyproperty is. No need for that extra complexity. If you wanted to be able to have different ColumnInfo subclasses that could probably live as an attribute on the Column class or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I need to think better. Though we do have a different subclass in Quantity, but that could just be defined like you did above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand in quantity we would just have
@lazyproperty
def col_info(self):
return QuantityColumnInfo()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would do it. If you really wanted to go wild, if QuantityColumn is a subclass of Column you could do something in the base class like like
ColumnInfoClass = ColumnInfo
@lazyproperty
def info(self):
return self.ColumnInfoClass()while QuantityColumn could set ColumnInfoClass = QuantityColumnInfo if need be.
I also agree "info" is vague, but I don't think "col_info" is particularly elucidating since if you're using that attribute you would generally already know that the object you're looking it up on is a column. I think this is a job for documentation just specifying what goes into the ColumnInfo object. This could be documented both in the docs for the .info property, and for the ColumnInfo object itself.
astropy/table/info.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this docstring needs to be updated:
ValueError: option=meta is not an allowed information type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should change to attributes?
|
@cdeil - thanks for the useful testing and feedback. The latest commits do:
I'm going to hold off on updating tests and docs until there is agreement that the outputs are OK. |
|
+1 to Can you change the default stats functions so that one gets e.g. a min / max for |
|
@mhvk - the last commit is a possible implementation for formatting mixins properly within tables: |
astropy/table/info.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taldcroft – I think the None option to return the OrderedDict isn't implemented yet, right?
In [23]: t.info(out=None)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-ec9644a5a55b> in <module>()
----> 1 t.info(out=None)
/Users/deil/code/astropy/astropy/table/info.py in info(self, option, out)
86
87 outlines.extend(info.pformat(max_width=-1, max_lines=-1, show_unit=False))
---> 88 out.writelines(outline + os.linesep for outline in outlines)
AttributeError: 'NoneType' object has no attribute 'writelines'
I think it would be useful to be able to get a Table or OrderedDict back, so that it's possible to e.g. store the stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A table would seem best, since this is created anyway!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 0455842.
|
@taldcroft - one nice thing about meetings is that sometimes one gets to skip a session and do some astropy... I now had a fairly detailed look and I like this p.s. With the combination of getting information from columns, |
astropy/units/quantity.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call should not be necessary, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, it's a leftover from when __init__ here was doing something. I just blindly deleted that one line.
That allows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, clearly a function is useful, but I'm a bit confused about the need for a cache. Is this for the case that one changes the representation half-way through, and then perhaps change it back? Since this is just a list comprehension loop, is this worth it? Could one just calculate it on the fly?
Separately, I think it would in principle be nice if col.info.default_format gave a string that would give a good clue on how to set format to do something else. Could one do this using my default_format suggestion and implement it as a property as follows?
@property
def default_format(self):
if self.parent_col is None:
raise ValueError("Default format is undefined for a SkyCoordColumnInfo "
"instance not attached to a SkyCoordObject.")
formats = ['{0.' + compname + '.value:}' for compname
in self._parent_col().data.components]
return ' '.join(formats)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p.s. Is .value is necessarily the best option? Since the components could be Angle, one could also get hhmmss etc... But that probably is best left for later.
|
Ack, rushing and made one mistake that is fixed in bbb76e4. |
|
OK, looks good so far! |
|
@taldcroft - travis fails since But I was surprised by your expectation that a This should be the other way around, right? If so, just swap While you are at it: maybe also redefine |
|
@mhvk - tl;dr: the current logic is correct and lots of stuff breaks with the swap you suggest. Granted, it's always confusing even to me. The For a QTable, a Quantity instance will stay as Quantity when added to the table, hence it IS a mixin column. Conversely, for a Table, a Quantity instance will be converted to a plain Column, so it IS NOT a mixin column. Maybe the method should be named |
|
@taldcroft - OK, that makes sense, though indeed the name is confusing. More to the point: I think for a |
astropy/table/info.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the simplest would just bet:
if len(uniq_types) == 1 and not tbl._is_mixin_column(cols[0], quantity_is_mixin=False):
It is sufficiently hard to get a Quantity into a regular table that I think it is OK that we do not do the right thing for a regular table with only Quantity columns....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a logically equivalent version that is less obscure, but still relies on the fact that one cannot have a Quantity-mixin in a Table (it will always be converted to Column):
if 'class' in info.colnames:
# Remove 'class' info column if all table columns are the same class
# and they are either the default class for that table or else Quantity.
uniq_types = set(type(col) for col in cols)
if len(uniq_types) == 1 and isinstance(cols[0], (tbl.ColumnClass, u.Quantity)):
del info['class']
When I wrote this I started thinking "why is Quantity so special?". It's just another mixin class, and has some of the same failings that other mixins have, most notably lack of masking. The fact that there is a QTable is only to differentiate what to do with a thingy-with-units when added to a table (Quantity or Column).
But in the end Quantity is just another mixin along with the rest, and I think it's important to always highlight this distinction by explicitly showing the class just as we would for a QTable full of Coordinates.
I'm inclined to go with:
if len(uniq_types) == 1 and isinstance(cols[0], tbl.ColumnClass):
In the end if it's just horrible this is a soft API (just an informational output) so we can change it.
|
@taldcroft - OK, am convinced! I like the logic that if a table has just the type of columns it expects, we remove the |
|
@taldcroft - I had a final look and it is great! I would have merged, too, had it not been for the appveyor. It is something in |
|
Oh my, frustration. Will try once more skipping the doc tests that might be causing the problem. Hopefully we can revert to running the doc tests later, but at least we know they did pass on linux and mac. |
Create ColumnInfo class for mixin columns
|
Weird bug. But 🎆 for having this! |
|
@taldcroft – Thanks! This is great, I'll use this all the time for interactive work with tables. If I try the Fermi catalog example again, I get a bunch or warnings: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment on some commit somewhere questioning the need for this class, versus just making DataInfo itself the descriptor. But now I can't find that original comment anymore because GitHub.
Anyways, I see now why it was done this way--it seems to me you wanted to be able to instantiate a DataInfo only when this attribute is accessed via an instance of the class that this descriptor belongs to, and then to attach the _parent_ref.
Instead, if DataInfo (and friends) are themselves descriptors, you can still have an instance of the descriptor that lives in the class definition like:
class Table(object):
info = TableInfo()this TableInfo is referred to as an unbound descriptor similarly to "unbound method". It's like a template for later descriptors that will be bound to individual instances of the class. Accessing most attributes or methods of the unbound descriptor should probably fail (for example by checking if self._parent_ref is None). Though the unbound descriptor could still have some use--for example its __repr__ can display what attributes the info object carries.
The __get__ method can then return a bound version of the descriptor of the same class, the only difference being that it has its ._parent_ref set. Incidentally the __get__ implemented below will break when you try to look at the descriptor through the class:
In [1]: from astropy.table import Table
In [2]: Table.info
ERROR: AttributeError: 'NoneType' object has no attribute '__dict__' [astropy.utils.data_info]
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-915a8e227a9a> in <module>()
----> 1 Table.info
/internal/1/root/src/astropy/astropy/astropy/utils/data_info.py in __get__(self, instance, owner_cls)
154
155 def __get__(self, instance, owner_cls):
--> 156 if 'info' not in instance.__dict__:
157 instance.__dict__['info'] = self.info_cls()
158 instance.__dict__['info']._parent = instance
AttributeError: 'NoneType' object has no attribute '__dict__'This is because accessing a descriptor through the class goes through __get__ too, but instance is None in this case. It's typical for this case to just:
if instance is None:
return selfwhere self is the unbound descriptor.
Then, to make the bound descriptor you do something like:
info = instance.__dict__.get('info')
if info is None:
info = self.__class__() # Or copy or whatever
info._parent_ref = weakref.ref(instance)
instance.__dict__['info'] = info
return infoThe one downside to the above approach is that the __init__ of the info class can't rely on _parent_ref. If this is undesirable you could also use the ugly but perfectly valid approach of:
info = self.__new__(self.__class__)
info._parent_ref = weakref.ref(instance)
info.__init__()There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@embray - it would seem more elegant to make DataInfo itself a descriptor, following the procedure you describe. I must admit, though, I still do not fully understand it. In particular, would it also solve the large speed problems we had when info is present? Eg., in Quantity.__array_finalize__, can we still recognize that the descriptor is as yet uninitialized? (From the above it does seem so...). If so, to avoid loosing this again, could you repost your comment as a new issue, so it doesn't get lost again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one downside to the above approach is that the init of the info class can't rely on _parent_ref.
I'm not sure what you mean exactly by "can't rely on", but definitely _parent_ref needs to be set every time __get__ is called. I had some issues along the way copying / transferring info and doing this on-the-fly parent referencing made things work better. Apart from the bug with accessing info on the class object (I knew about this but forgot to address it), is there any problem in your approach of moving the info._parent_ref = weakref.ref(instance) statement outside the if block?
I also don't understand how the "ugly but perfectly valid" approach is different from the first. I'm expecting another lesson in advanced Python. 😄
I guess it will be more elegant to be able to say info = QuantifyInfo() in the class definition, but we'll definitely need good code comments so mortals (including myself) of the future can understand what's happening. I think my implementation is more easily understood by being very obvious about what is the info container and what is the descriptor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that with the bound/unbound descriptor pattern, if you want to be able to something in the class's __init__, that __init__ needs to handle being called in two different contexts:
- class definition, in which case you're initializing an unbound instance of the descriptor--this is the one that lives on the class itself and provides the descriptor protocol:
class MyClass:
info = DataInfo() # <-- DataInfo.__init__ called here- Creating a bound instance--when
__get__is called via the descriptor protocol, the idea is that you then create a newDataInfoinstance that is bound to some instance of the parent class:
def __get__(self, inst, type=None):
if inst is not None:
bound_info = self.__class__()
bound_info._parent_ref = inst
# Or alternatively if the __init__ should be able to do something with _parent_ref:
# bound_info = self.__new__(self.__class__)
# bound_info._parent_ref = inst # This is what makes it 'bound'
# bound_info.__init__()
return bound_infoSo if you want the DataInfo.__init__ to be able to do something different for bound instances you have to check in the __init__ for if _parent_ref is set or not.
An alternative is to use different subclasses for the Unbound and Bound versions. That might be simpler to see.
See how the astropy.modeling.Parameter class is implemented for reference. I think ultimately I'm trying to point out that since we already use this pattern elsewhere in Astropy it would make sense to use generally where appropriate. Maybe I can figure out a way to make it easy to implement this pattern generically, like I did with OrderedDescriptor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a simple way to state this pattern is:
Say what you mean.
So instead of the indirection of:
class MyClass:
info = InfoDescriptor(DataInfo)which says "info is a thing that somehow when I access it will be a DataInfo object". It forces the reader to have to think about descriptors and how that's all implemented. Where what you really want is to be able to write:
class MyClass:
info = DataInfo()which you can read as "info is a DataInfo".
This is an initial implementation for #3518 to create
ColumnInfo(akaColumnAttribute) class for mixin columns.One point is that for classes with built-in mixin support the
infoattribute is a class property wrapping_infoas aColumnInfoobject, while for external classes (e.g. pandas Series),infois just a plain attribute which is aColumnInfoobject.Currently all the calls to
col_setattrandcol_getattrhave been left in place, but those methods are now trivial calls to inspect thecol.infoobject. Will get around to doing the replacement once @mhvk is happy with the direction in this PR.Closes #3518.