Skip to content

Conversation

@taldcroft
Copy link
Member

@taldcroft taldcroft commented Oct 12, 2017

This is a WIP of moving all the column formatting infrastructure into Info. In particular this makes Quantity much more like Column. This would also apply to other mixins where it makes sense.

In [3]: q = [1,2,3]*u.m

In [4]: q.info.format = '.3f'

In [5]: q.info.pprint()  # Even for a bare quantity, but also within a Table
 None
-----
1.000
2.000
3.000

In [6]: q.info.format = 'asdf'  # Instant validation
ValueError: unable to parse format string asdf for its column.

But.. this re-introduced some weird regression, possibly related to MaskedArray and __array_finalize__ that was fixed in #3023. Not clear at this point. @mhvk if you have ideas let me know!

In [7]: c1 = MaskedColumn([1,2], format='%d')

In [8]: c1.format
Out[8]: '%d'

In [9]: c2 = c1[:]

In [11]: print(c1.format)  # YIKES!!
None

In [12]: print(c2.format)
None

@taldcroft taldcroft added this to the v3.0.0 milestone Oct 12, 2017
@taldcroft taldcroft requested a review from mhvk October 12, 2017 13:20
@astropy-bot
Copy link

astropy-bot bot commented Oct 12, 2017

Hi there @taldcroft 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

I see this is a work in progress pull request. I'll report back on the checks once the PR is ready for review.

If there are any issues with this message, please report them here.

@taldcroft
Copy link
Member Author

See also #6721

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general idea very much - see one question about possible problems with masked columns.


try:
# test whether it formats without error exemplarily
self.pformat(max_lines=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this try/except work also for a MaskedColumn in which the first item is masked? I recall having to work around that case in the printing routines... It could either fail on the masked item, or fail to raise on something that would fail on a real number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a problem (though not one introduced by this PR). But there are mitigating factors:

  • The actual smallest allowed value of max_lines is 7. This allows the internal code to be a bit simpler and not deal with the problems that would ensue from formatting only 1 line. So for a typical table you actually get the first two and last two lines checked.
  • In the event that all these are masked, the worst case is that later on one gets the same exception you would have seen when setting the format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. and I found a bug, see #6812. This should be a very quick review.

lines, outs = _pformat_col(self, max_lines, show_name=show_name,
show_unit=show_unit, show_dtype=show_dtype,
html=html)
# _pformat_col = self._formatter._pformat_col
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case: should be removed not commented in a final PR.

@taldcroft
Copy link
Member Author

@mhvk - I've been thinking about going the whole way and putting all the column attributes into info. For instance in a mixin col you can currently set meta to anything, while in a column it is a MetaData descriptor and only allows dict-like inputs. Not good.

So then Info would be the universal container of column attributes. From initial look it seems technically feasible. I'm worried a bit about performance since every single column attribute access (name, format, etc) will have to do the descriptor dance of actually getting the info object from the instance __dict__. Hopefully it won't be a big hit. Overall I think the code will be cleaner because then the Column class no longer cares about its particular attributes and copy or pickle get simpler (since they get delegated to info).

@mhvk
Copy link
Contributor

mhvk commented Nov 6, 2017

@taldcroft - I think moving everything to info would make life a lot easier - separation of concerns at its best! My sense would be that the additional overhead will not be much trouble since mostly those attributes are needed in cases where one is about to do something more costly (like copying a column).

@bsipocz bsipocz modified the milestones: v3.0.0, v3.1 Jan 18, 2018
@astropy-bot
Copy link

astropy-bot bot commented Mar 22, 2018

Hi humans 👋 - this pull request hasn't had any new commits for approximately 5 months. I plan to close this in a month if the pull request doesn't have any new commits by then.

In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary.

If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock. You may also consider sending a reminder e-mail about it to the astropy-dev mailing list.

If you believe I commented on this pull request incorrectly, please report this here.

@mhvk
Copy link
Contributor

mhvk commented Mar 22, 2018

@taldcroft - I liked this PR a lot. Do you want to revive it? I'm not sure any more what exactly the problems were with masked arrays, but can look into that.

@astropy-bot
Copy link

astropy-bot bot commented Apr 19, 2018

⏰ Time's up! ⏰

I'm going to close this pull request as per my previous message. If you think what is being added/fixed here is still important, please remember to open an issue to keep track of it. Thanks!

If this is the first time I am commenting on this issue, or if you believe I closed this issue incorrectly, please report this here.

@taldcroft
Copy link
Member Author

This would help with #7481, which adds an attribute to MaskedColumnInfo to control serialization of that class. I'm re-opening this.

@taldcroft taldcroft reopened this May 20, 2018
@mhvk
Copy link
Contributor

mhvk commented May 20, 2018

As I wrote above, I thought this was a great idea (and with classes over I have a bit more time to do reviews...)

@taldcroft taldcroft force-pushed the table-info-format branch from 94cbac1 to 2f5d6f7 Compare June 29, 2018 19:28
@taldcroft
Copy link
Member Author

taldcroft commented Jun 29, 2018

@mhvk - I've rediscovered why I dropped this the first time around, and the problems related to MaskedArray and #6721. Basically I just don't understand what the heck is going on in numpy and MaskedArray. The output of In [6] demonstrates a current test failure of this branch, where the format attribute isn't making it through the 16 (!!) calls to _copy_attrs() for this one slice operation. Every one of those is copying all the attributes and making a deep copy of meta.

This is beyond my numpy-fu, but maybe you could make sense of this? Definitely all this nonsense is going to make MaskedColumn slicing really slow.

In [2]: mc = table.MaskedColumn(data=[1, 2], format="%i")
HERE in _copy_attrs(): <class 'numpy.ndarray'> 4610665728 None
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4782921096 %i
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920984 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920984 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920984 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920984 None

In [3]: mc[0:1].format
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921768 None
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4782920984 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920648 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921432 None
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4782920872 %i
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921432 None
Out[3]: '%i'

In [4]: x = mc[0]
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921992 None

In [6]: print(mc.take([0, 1]).format)
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4782920872 %i
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4782921544 %i
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4782921880 %i
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4782920424 %i
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921992 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921992 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921992 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921992 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782921992 None
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4782920984 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920536 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920312 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920312 None
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4782921432 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920312 None
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4782920312 None
None

In [7]: x = mc.take(0)
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4782920872 %i

@mhvk
Copy link
Contributor

mhvk commented Jun 29, 2018

OK, that seems like a nice puzzle.... I'll have a look. Just for information: I presume it is not as bad for regular Column?

@taldcroft
Copy link
Member Author

For a regular column this behaves like I would expect:

In [10]: c = table.Column(data=[1, 2], format="%i")
HERE in _copy_attrs(): <class 'numpy.ndarray'> 4556411168 None

In [11]: x = c.take(0)

In [12]: x = c[0]

In [13]: print(c[0:1].format)
HERE in _copy_attrs(): <class 'astropy.table.column.Column'> 4556680888 %i
%i

In [14]: print(c.take([0, 1]).format)
HERE in _copy_attrs(): <class 'astropy.table.column.Column'> 4556680888 %i
%i

@mhvk
Copy link
Contributor

mhvk commented Jun 30, 2018

@taldcroft - I think the ultimate problem is like in #7394 (why __array_finalize__ is called so often), that we override data, which MaskedArray uses to refer to whatever it is masking. So MaskedArray expects type(mc.data) is Column, while we want type(mc.data) is MaskedArray. The result is that when we call MaskedArray.__getitem__(self, item), self.data gets indexed, which is a MaskedArray(BaseColumn), and thus again indexes the column.

So, one solution is the hack I proposed in #7394 - with that one only gets 2 copy_attrs (where I'm not sure the _copy_attrs_slice is still needed; I didn't check). Another one would be to change the class definition to MaskedColumn(MaskedArray, Column), and let data be what MaskedArray wants it to be.

p.s. I did check and it seems all table tests still pass with the hack in #7394
p.s.2 A similar hack in a new MaskedColumn.__getitem__ - where one temporarily just sets self.__class__ to MaskedArray - would probably work too.

@taldcroft taldcroft force-pushed the table-info-format branch from 4901e59 to 19df182 Compare July 5, 2018 15:29
@taldcroft
Copy link
Member Author

@mhvk - In a5ac8a6 I did the hack from #7394 => #6721 => https://github.com/astropy/astropy/compare/master...mhvk:table-masked-column-array-finalize?expand=1

I might have done something wrong because tests are not passing for me, in particular the test on masked_col.take([0, 1]) preserving the format. (Right now I'm seeing other failures too, but they may be unrelated).

In [2]: mc = table.MaskedColumn(data=np.arange(100), format="%i")
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 4634270632 obj: <class 'numpy.ndarray'> 4539845248
HERE in _copy_attrs(): <class 'numpy.ndarray'> 4539845248 None
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4634270072 obj: <class 'astropy.table.column.BaseColumn'> 4634270632
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4634270632 %i
Setting format = %i

In [3]: x = mc[0:1]  # OK
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4634488904 obj: <class 'numpy.ma.core.MaskedArray'> 4634270744
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4634270744 None
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4634270072 %i
Setting format = %i

In [4]: print(mc.take([0, 1]).format)  # FAIL
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 4634490360 obj: <class 'astropy.table.column.MaskedColumn'> 4634270072
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4634270072 %i
Setting format = %i
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 4634490248 obj: <class 'astropy.table.column.BaseColumn'> 4634490360
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4634490360 %i
Setting format = %i
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 4634489912 obj: <class 'astropy.table.column.BaseColumn'> 4634490248
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4634490248 %i
Setting format = %i
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4634489576 obj: <class 'astropy.table.column.BaseColumn'> 4634489912
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 4634489912 %i
Setting format = %i
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4634489352 obj: <class 'numpy.ma.core.MaskedArray'> 4634489464
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 4634489464 None
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4634489576 None
None

In [5]: print(np.sin(mc).format)  # OK
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4709447832 obj: <class 'astropy.table.column.MaskedColumn'> 4709447384
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4709447384 %i
Setting format = %i
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4709448280 obj: <class 'astropy.table.column.MaskedColumn'> 4709447384
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4709447384 %i
Setting format = %i
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 4709447832 obj: <class 'astropy.table.column.MaskedColumn'> 4709448280
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 4709448280 %i
Setting format = %i
%i

@mhvk
Copy link
Contributor

mhvk commented Jul 5, 2018

I rebased the hack on master, and ./setup.py test passes, so I think the problem must lie with a later commit here (when I was playing around with this PR, the tests never all passed).

Now specifically on the problem you noted: looking at the MaskedArray code, I see that in __getitem__, the indexing of data is done very simply, self.data[inx], while in take, it is done on self._data, which will be different. Still, it is not immediately obvious why things go awry. The five __array_finalize__ you are seeing can be followed in MaskedArray.take:

  1. _data = self._data # get underlying BaseColumn
  2. _data.take(indices...)... # do actual taking BaseColumn
  3. [...] # convert possible scalars to array BaseColumn
  4. .view(cls) # view back as MaskedColumn
  5. out[()] # convert back to scalar, if needed MaskedColumn

Now, I just tried on your PR all the steps above, and steps 1-4 are all OK, but step 5 not only does not set format, it also removes it from the two previous ones:

In [30]: mc = table.MaskedColumn(data=np.arange(100), format="%i")
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 140144500296840 obj: <class 'numpy.ndarray'> 140144844979136
HERE in _copy_attrs(): <class 'numpy.ndarray'> 140144844979136 None
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 140144844391032 obj: <class 'astropy.table.column.BaseColumn'> 140144500296840
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 140144500296840 %i
Setting format = %i

In [45]: _data = mc._data
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 140144354097528 obj: <class 'astropy.table.column.MaskedColumn'> 140144844391032
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 140144844391032 %i
Setting format = %i

In [46]: step1 = mc._data
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 140144844391368 obj: <class 'astropy.table.column.MaskedColumn'> 140144844391032
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 140144844391032 %i
Setting format = %i

In [47]: step1.format
Out[47]: '%i'

In [48]: step2 = step1.take([0, 1])
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 140144500295048 obj: <class 'astropy.table.column.BaseColumn'> 140144844391368
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 140144844391368 %i
Setting format = %i

In [49]: step2.format
Out[49]: '%i'

In [50]: step3 = step2[...]
HERE in __array_finalize__ self: <class 'astropy.table.column.BaseColumn'> 140144844390808 obj: <class 'astropy.table.column.BaseColumn'> 140144500295048
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 140144500295048 %i
Setting format = %i

In [51]: step3.format
Out[51]: '%i'

In [52]: step4 = step3.view(table.MaskedColumn)
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 140144844390472 obj: <class 'astropy.table.column.BaseColumn'> 140144844390808
HERE in _copy_attrs(): <class 'astropy.table.column.BaseColumn'> 140144844390808 %i
Setting format = %i

In [53]: step4.format
Out[53]: '%i'

In [54]: step5 = step4[()]
HERE in __array_finalize__ self: <class 'astropy.table.column.MaskedColumn'> 140144844390584 obj: <class 'numpy.ma.core.MaskedArray'> 140144844392600
HERE in _copy_attrs(): <class 'numpy.ma.core.MaskedArray'> 140144844392600 None
HERE in _copy_attrs(): <class 'astropy.table.column.MaskedColumn'> 140144844390472 None

In [55]: step5.format

In [56]: step4.format

In [57]: step3.format

In [58]: step2.format
Out[58]: '%i'

@mhvk
Copy link
Contributor

mhvk commented Jul 5, 2018

Much thinking... I think the problem is that MaskedArray.__array_finalize__ also transfers __dict__['info']. At least, in the above chain, I get

step3.__dict__['info'] is step2.__dict__['info']
# False
step4.__dict__['info'] is step3.__dict__['info']
# True (and also for 5)

So, I can solve this problem (and get the format transferred also in step 5), by changing as follows:

diff --git a/astropy/table/column.py b/astropy/table/column.py
index 7ea7986ef..624b99cc0 100644
--- a/astropy/table/column.py
+++ b/astropy/table/column.py
@@ -350,6 +350,8 @@ class BaseColumn(_ColumnGetitemShim, np.ndarray):
 
         if callable(super().__array_finalize__):
             super().__array_finalize__(obj)
+            # This happens for MaskedArray, which copies 'info'. Undo that.
+            self.__dict__.pop('info', None)

But: this is not very logical. More logical would be to ensure that info is always copied, like we do in Quantity, etc. I think it is cleaner if we changed to something like:

--- a/astropy/table/column.py
+++ b/astropy/table/column.py
@@ -345,7 +345,7 @@ class BaseColumn(_ColumnGetitemShim, np.ndarray):
         print('HERE in __array_finalize__ self: {} {} obj: {} {}'
               .format(type(self), id(self), type(obj), id(obj)))
         # Obj will be none for direct call to Column() creator
-        if obj is None:
+        if obj is None or obj.__class__ is np.ndarray:
             return
 
         if callable(super().__array_finalize__):
@@ -708,7 +708,13 @@ class BaseColumn(_ColumnGetitemShim, np.ndarray):
             import pdb
             pdb.set_trace()
 
-        for attr in ('name', 'unit', 'format', 'description'):
+        # Copy info if the original had `info` defined.  Because of the way the
+        # DataInfo works, `'info' in obj.__dict__` is False until the
+        # `info` attribute is accessed or set.
+        if 'info' in obj.__dict__:
+            self.info = obj.info
+
+        for attr in ('name', 'unit', 'description'):
             val = getattr(obj, attr, None)
             if attr == 'format':
                 print('HERE in _copy_attrs():', type(obj), id(obj), val)

p.s. I think the whole loop looking for attrs can be skipped if info in obj.__dict__ does not hold.

@astropy-bot
Copy link

astropy-bot bot commented Jun 15, 2019

Hi humans 👋 - this pull request hasn't had any new commits for approximately 11 months. I plan to close this in a month if the pull request doesn't have any new commits by then.

In lieu of a stalled pull request, please consider closing this and open an issue instead if a reminder is needed to revisit in the future. Maintainers may also choose to add keep-open label to keep this PR open but it is discouraged unless absolutely necessary.

If this PR still needs to be reviewed, as an author, you can rebase it to reset the clock.

If you believe I commented on this pull request incorrectly, please report this here.

@mhvk
Copy link
Contributor

mhvk commented Jun 15, 2019

@taldcroft - still keen to move this forward. Hopefully, #8855 will make it easier.

@taldcroft
Copy link
Member Author

I just don't know what to do about this:

In [18]: c = MaskedColumn([1,2])

In [19]: timeit c.info.name
8.81 µs ± 364 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [20]: timeit c.name
342 ns ± 6.59 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Throwing away 8 µs for every single column attribute would really be a bummer. Maybe the info access could be improved somewhat, but I don't see how get under a 1 µs if info is a descriptor. There is just a bunch of code that needs to be navigated, but I haven't profiled this recently.

@mhvk
Copy link
Contributor

mhvk commented Jul 8, 2019

@taldcroft - definitely a good point about the speed, though not all of it counts. For instance, while I confirm your 26-times slower result,

c = MaskedColumn([1,2])
%timeit c.name
# 10000000 loops, best of 5: 87.7 ns per loop
%timeit c.info.name
# 100000 loops, best of 5: 2.25 µs per loop

One should really imagine the name is on info itself, i.e., we don't go through __getattr__. That reduces it to a factor 14.

c.info._info_name = c.name
%timeit c.info._info_name
# 1000000 loops, best of 5: 1.24 µs per loop

Profiling just doing c.info, it seems a lot of time is spent in __setattr__ - there should definitely be room for improvement...

@mhvk
Copy link
Contributor

mhvk commented Jul 8, 2019

See #8968 for a substantial speed-up with a minimal change.

@taldcroft taldcroft closed this Oct 1, 2019
@taldcroft taldcroft deleted the table-info-format branch October 1, 2019 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants