Change behavior of Table regarding masked columns #8789

taldcroft · 2019-06-03T13:25:26Z

Aka allow MaskedColumn and Column in the same table. This is a WIP but would like initial big-picture (API) feedback before going too much further. From the #7496 description (and see also #3665).

Currently if any column in a table is a MaskedColumn, then all other columns are "upgraded" to be masked as well. This happens on initialization and if a MaskedColumn is added or any column in an existing table is converted from Column to MaskedColumn. This is entirely a holdover from the original implementation of table using a numpy structured array.

The implementation of this is relatively straightforward, the main reservation is about stability and breaking code.

The main API changes are:

In order to automatically add new columns as MaskedColumn, one must supply masked=True when creating the table. Currently this happens automatically if any of the columns are masked.
Adding a MaskedColumn to an unmasked table would not automatically "upgrade" other columns. This is a feature of the new implementation.
The Table.masked attribute reflects only the behavior when adding a new column. So if masked is true then non-mixin columns are always converted to MaskedColumn [1], but with masked=False then the column is added as whatever it is.

Previous there was discussion about issuing a warning if column would have been upgraded. I think this would be an annoyance because it will happen in situations that are not actually a problem, but instead just different. My feeling is that the small cross section of code that relied on the old behavior will break, which is not ideal but not as bad as silently doing something wrong (which is where warnings make sense). But discussion would be good!

One thing to consider for this PR is allowing:

t = Table.read(filename, format=format, masked=True)  # MaskedColumns needed

This is a nicer version of the current workaround:

t  = Table.read(filename, format=format)
t = Table(t, masked=True)

Closes #7496

[1] - this is not always true, see code, but in most cases that will be true.

To Do:

Change log
Docs explaining the change
Some additional tests
Question: does masked=True round-trip through I/O? Answer: NO, see comment later.

taldcroft · 2019-06-03T14:07:18Z

One important thing to note is that this PR is a pre-cursor to other Table optimizations. Right now a lot of table updates (e.g. replacing a column) are slow and/or awkward because they are forced to go through what amounts to full table initialization. This is driven by the requirement to potentially upgrade a table to masked and "upgrade" every other column in the table accordingly.

So with this PR something like replace_column would really be as simple as updating the Table.columns dictionary (#8702). Fast and simple, yay. Part of that would be factoring out much of _init_from_list into a separate method akin to _convert_col_for_table that converts column-like thingy into an acceptable table column object.

mhvk · 2019-06-04T14:12:53Z

I'm in favour about the change in general, and in particular like that normal Table and QTable will both be able to have masked columns mixed in.

One question about the API: as we are changing it anyway, would it be an idea to introduce a new MaskedTable subclass? It would behave a bit like QTable, in that it would change some columns automatically to MaskedColumn (instead of to Quantity).

The main advantage would be that the table code can be much simpler. But it might also help in making the masked property something a bit more logical (like "are there any masked columns" or, if we are willing to break API that far, "are there any masked values?") - when I started reviewing this PR I was quite confused seeing tests that had table.masked is False even though there were masked columns with masked elements in it.

In terms of implementation, initially it would need a Table.__new__ method that checks for masked and passes on to MaskedTable if set (perhaps with a deprecation warning).

taldcroft · 2019-06-04T14:54:16Z

Indeed I thought about MaskedTable, and it is a reasonable option to discuss. This would immediately solve the table reading issue I mentioned. However, at the end I suspect we are better off not going that route. This bifurcation would extend to other Table subclasses (starting with QTable) and perhaps lead to more complexity instead of simplicity.

One important thing to remember is that the masked attribute has never been about the contents of the table, but rather defining how to handle adding columns. So this PR does not change that fundamental meaning, and I would argue that is a good thing. It does change the behavior associated with masked=False.

The meaning of masked=True for a Table has not changed at all. It means always add MaskedColumn except for mixins, and for mixins add a mask attribute (as a FalseArray instance) if it doesn't already exist.

Previously, masked=False meant to not add a MaskedColumn, but if that was requested then convert to masked=True and change all existing columns. Now it means add the column using the "most appropriate" class and do not touch other columns.

Definitely the name masked is not the most helpful, compared to say require_masked_columns, but it isn't clear if changing the attribute name at this point is worth it. In the end what I imagine is a user migration toward never using masked=True.

I do think something like a has_masked_columns attribute would be helpful. The definition of this is slightly ambiguous though in terms of whether that means columns that can be masked or columns that have masked elements. The latter is generally more useful, and for mixins like Time you don't know without looking, but there is a question of whether MaskedColumn counts for has_masked_columns if none of the elements are actually masked.

mhvk · 2019-06-04T16:29:10Z

@taldcroft - I see your point about masked - and agree we should change the meaning as little as possible.

This also means the question of whether to use a subclass or not is independent, and at some level an implementation detail: unless the user goes out of their way to check what the actual class of the table instance is, the behaviour of the instance will be as defined by the masked attribute. Which in turn means the decision whether or not to do it should be mostly whether it makes things a lot easier.

I do like the subclass by its analogy with QTable, though, and think that perhaps a more general Table.__new__ with masked and always_apply_units (or something better) and arguments that tell what kind of table behaviour you get.

Or perhaps whether we can push the "agnostic" model a bit further, and also let Quantity just remain itself in a regular Table by default. If so, that would push the question of whether something should be a Quantity or Column to the readers.

Anyway, all that somewhat orthogonal to the current PR, which I think is a good idea regardless of how it is implemented.

hamogu · 2019-06-09T11:13:25Z

I think that's a reasonable API change. In that past, I've found masked columns to be cumbersome in some cases, because they forced the entire table into masked-mode. I don't remember if it was a speed-issue or if some downstream code could not handle the masks, but in any case, having only columns masked that need to be masked is a good idea.

taldcroft · 2019-06-21T16:24:44Z

About explicitly round-tripping the table masked attribute, I started down this path and got a partly working solution, but realized it was problematic. Doing this requires updating (hacking) the serialization code to add that specific attribute into the serialized data via a special key in the table meta. It is really not nice.

After doing this I also had the realization that existing table files from masked tables would not have this special meta in them anyway, so they will be read as unmasked, while tables written with astropy >= 4.0 would have that there. So it's just a mess, and the only answer is to accept that if you want a table that is read from file to be masked, then that needs to be done as an explicit step outside of the read process.

All this actually makes astropy I/O more consistent and explicit. Before it was using implicit rules to set masked, and now that will be explicitly handled by the user.

mhvk · 2019-06-21T16:29:21Z

@taldcroft - makes sense; I agree with moving towards a state where masked just becomes irrelevant.

codecov · 2019-06-21T17:33:02Z

Codecov Report

Merging #8789 into master will decrease coverage by <.01%.
The diff coverage is 95.12%.

@@            Coverage Diff             @@
##           master    #8789      +/-   ##
==========================================
- Coverage   86.72%   86.72%   -0.01%     
==========================================
  Files         405      404       -1     
  Lines       59994    59972      -22     
  Branches     1100     1100              
==========================================
- Hits        52032    52012      -20     
+ Misses       7321     7319       -2     
  Partials      641      641

Impacted Files	Coverage Δ
astropy/utils/iers/iers.py	`92.93% <100%> (+0.02%)`	⬆️
astropy/io/ascii/core.py	`93.27% <100%> (-0.03%)`	⬇️
astropy/table/table.py	`93.23% <94.73%> (+0.3%)`	⬆️
astropy/utils/data.py	`81.15% <0%> (-0.45%)`	⬇️
astropy/units/equivalencies.py	`96.18% <0%> (-0.02%)`	⬇️
astropy/constants/astropyconst20.py	`100% <0%> (ø)`	⬆️
astropy/constants/iau2015.py	`100% <0%> (ø)`	⬆️
astropy/io/votable/tree.py	`85.98% <0%> (ø)`	⬆️
astropy/constants/astropyconst40.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9d4d95...d856ad7. Read the comment docs.

taldcroft · 2019-06-21T16:57:49Z

astropy/table/table.py

-            columns = self.TableColumns()
-
-            for name in names:
-                columns[name] = self.ColumnClass(name=name, data=newdata[name])


This was hardwired previously to fast-track the logic in _init_from_list assuming that self.ColumnClass is set appropriately. Since we no longer have that guarantee we just dump the list of columns (masked or unmasked) into _init_from_list.

taldcroft · 2019-06-21T17:29:53Z

astropy/utils/iers/iers.py

        # Read in as a regular Table, including possible masked columns.
        # Columns will be filled and converted to Quantity in cls.__init__.
        iers_a = Table.read(file, format='cds', readme=readme)
+        iers_a = Table(iers_a, masked=True, copy=False)


This is a good example of what can break in application code and the suggested workaround. It is not that elegant, but at some level I just want to discourage use of masked=True altogether. But for the sake of completeness, options would be:

Table.read(file, format='cds', readme=readme, masked=True) # High-impact API change, prevents readers from having `masked` kwarg # or iers_a = Table.read(file, format='cds', readme=readme) iers_a.masked = True # Make `masked` mutable and upgrade Columns to MaskedColumn. # But this violates "don't do radical things to object with a property change" # or iers_a.convert_to_masked()

So just to make sure I fully understand, the value in setting masked=True is essentially auto-conversion from Column to MaskedColumn later, right?

Yes. The final version of iers_a is forced to have all MaskedColumn, and any new column will get added as MaskedColumn.

taldcroft · 2019-06-21T17:41:51Z

This is ready for final review.

mhvk

This looks nearly ready to do: only two comments, of which one is a nitpick. Overall, I feel the code has been simplified, which suggests this is really a change for the better.

One question for the future: all the tests for not hasattr(col, 'mask') are somewhat annoying. One possible suggestion would be to move mask to col.info - for non-masked columns, it could create a FalseArray on the fly. But no reason to hold this up; more something to consider for #6720

mhvk · 2019-06-21T17:58:21Z

astropy/io/ascii/tests/test_c_reader.py

        table = read_rdb(text.replace('\n', newline), parallel=parallel)
        assert_table_equal(table, expected)
-        assert np.all(table == expected)
+        # TO DO : deal with this


Was this done?

No, but it is now.

mhvk · 2019-06-21T18:06:15Z

astropy/table/tests/test_masked.py

 class TestTableInit(SetupData):
    """Initializing a table"""

    def test_mask_true_if_any_input_masked(self):


Change the name of the test?

astrofrog

Just a few small things but overall I'm in favor of this. The auto-upgrading of whole tables has always really bugged me, and even though this may break a few cases out there, I think most users won't be affected.

astrofrog · 2019-06-21T21:04:10Z

astropy/utils/iers/iers.py

        # Read in as a regular Table, including possible masked columns.
        # Columns will be filled and converted to Quantity in cls.__init__.
        iers_a = Table.read(file, format='cds', readme=readme)
+        iers_a = Table(iers_a, masked=True, copy=False)


So just to make sure I fully understand, the value in setting masked=True is essentially auto-conversion from Column to MaskedColumn later, right?

astrofrog · 2019-06-21T21:05:04Z

docs/table/masking.rst

+.. Note::
+
+   For most applications we now recommend using the default |Table| behavior with
+   ``masked=False`` to allow heterogenous column types.


This sounds like you recommend explicitly setting masked=False, but the best is to just not specify masked=, right?

Ah, the wording was not ideal. I'm changing to:

For most applications, even those with masked column data, we now recommend using the default |Table| behavior which allows heterogenous column types. This implies creating tables *without* specifying the ``masked`` keyword argument.

astrofrog · 2019-06-21T21:06:54Z

docs/table/masking.rst

+.. doctest-skip::
+
+  >> dat = Table.read('data.fits')
+  >> dat['new_column'] = MaskedColumn([1, 2, 3, 4, 5])


And what about giving dat['new_column'] = np.ma.array([1, 2, 3, 4, 5]) as an alternative? (just to avoid the MaskedColumn import)

👍, but maybe np.ma.MaskedArray instead

astrofrog · 2019-06-21T21:12:41Z

I think the HTML build is something that's actually fixed on master, so it should go away if you rebase.

bsipocz · 2019-06-21T21:12:52Z

@taldcroft - while addressing the review, could you also rebase to get rid of the unrelated CI failure?

taldcroft · 2019-06-21T22:48:37Z

Passing all tests now, comments addressed (I think).

mhvk · 2019-06-21T22:53:45Z

At least the memory aspect of FalseArray is easily addressed: it is readonly, so can just be a single element, np.broadcast_to(False, col.shape).

Anyway, for another time!

mhvk

All looks good now, so merging. Very nice1

mhvk · 2019-06-21T22:57:27Z

@astrofrog - I guess I cannot merge without dismissing your request for changes - do you agree the change in wording is good? If so, just go ahead and merge.

review has been addressed

bsipocz · 2019-06-22T01:14:57Z

I'm certain that the review has been adequately addressed, and thus go ahead and merge, so other PRs based on are not being held up.

bsipocz · 2019-06-22T01:15:12Z

Thank you @taldcroft!

Correctly handle masking in add_row() after #8789

PR astropy#8789 allowed a `Table` to contain both `MaskedColumn` and `Column` instances, meaning that in many cases there is no longer any need to set `Table.masked=True`, but not all of the relevant documentation was updated.

taldcroft added table Work in progress API change PRs and issues that change an existing API, possibly requiring a deprecation period labels Jun 3, 2019

taldcroft added this to the v4.0 milestone Jun 3, 2019

taldcroft requested review from astrofrog and mhvk June 3, 2019 13:25

taldcroft mentioned this pull request Jun 19, 2019

Fix replace_column for table sub-classes that have non-conventional initializers #8702

Closed

taldcroft force-pushed the table-mask branch from 5b077b8 to 2501cf8 Compare June 21, 2019 14:09

taldcroft added the whatsnew-needed label Jun 21, 2019

taldcroft added io.ascii io.fits io.misc Ready-for-final-review and removed Work in progress labels Jun 21, 2019

taldcroft commented Jun 21, 2019

View reviewed changes

mhvk changed the title ~~WIP: change behavior of Table regarding masked columns~~ Change behavior of Table regarding masked columns Jun 21, 2019

mhvk reviewed Jun 21, 2019

View reviewed changes

taldcroft mentioned this pull request Jun 21, 2019

Simplify and speed up Table replace_column() #8902

Merged

astrofrog previously requested changes Jun 21, 2019

View reviewed changes

taldcroft added 11 commits June 21, 2019 18:18

Update tests in io.ascii

66a0a09

Update still more io.ascii tests

d90a9dc

Add details of API change and update docs [skip ci]

cc89088

Add has_masked_columns, has_masked_values + better docs + fixes

6fbb05f

Make the IERS_A table be masked

fb75e31

Fix Table.as_array for masked mixin columns

187afe8

Update changelog and docs tweak

b3d912a

Add test of Table.mask property

bffca36

Add warning and test for adding a MaskedColumn with unit to QTable

e588e96

Update Table __eq__ to check for masked cols not masked attr

93b86bf

Address minor review comments

d856ad7

taldcroft force-pushed the table-mask branch from b8eba93 to d856ad7 Compare June 21, 2019 22:20

mhvk approved these changes Jun 21, 2019

View reviewed changes

bsipocz merged commit 6b5015c into astropy:master Jun 22, 2019

taldcroft added a commit to taldcroft/astropy that referenced this pull request Jun 23, 2019

Correctly handle masking in add_row() after astropy#8789

86277b3

taldcroft mentioned this pull request Jun 23, 2019

Correctly handle masking in add_row() after #8789 #8907

Merged

taldcroft added a commit that referenced this pull request Jun 29, 2019

Merge pull request #8907 from taldcroft/table-mask-add-row

af312ae

Correctly handle masking in add_row() after #8789

taldcroft mentioned this pull request Jul 1, 2019

Fix Table operations for masking post #8789 #8942

Closed

dhomeier pushed a commit to dhomeier/astropy that referenced this pull request Jul 2, 2019

Correctly handle masking in add_row() after astropy#8789

67b5c6c

taldcroft added a commit to taldcroft/astropy that referenced this pull request Jul 5, 2019

Add test of fix in 9590265 for regression from astropy#8789 [skip ci]

2486a82

eblur pushed a commit to eblur/astropy that referenced this pull request Jul 15, 2019

Add test of fix in 9590265 for regression from astropy#8789 [skip ci]

6e122e1

taldcroft deleted the table-mask branch October 1, 2019 20:45

eerovaher mentioned this pull request Aug 4, 2021

Update table documentation #12011

Merged

9 tasks

Uh oh!

Change behavior of Table regarding masked columns #8789

Change behavior of Table regarding masked columns #8789

Uh oh!

Conversation

taldcroft commented Jun 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taldcroft commented Jun 3, 2019

Uh oh!

mhvk commented Jun 4, 2019

Uh oh!

taldcroft commented Jun 4, 2019

Uh oh!

mhvk commented Jun 4, 2019

Uh oh!

hamogu commented Jun 9, 2019

Uh oh!

taldcroft commented Jun 21, 2019

Uh oh!

mhvk commented Jun 21, 2019

Uh oh!

codecov bot commented Jun 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft Jun 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Jun 21, 2019

Uh oh!

mhvk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

astrofrog left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

astrofrog commented Jun 21, 2019

Uh oh!

bsipocz commented Jun 21, 2019

Uh oh!

taldcroft commented Jun 21, 2019

Uh oh!

mhvk commented Jun 21, 2019

Uh oh!

mhvk left a comment

Choose a reason for hiding this comment

Uh oh!

mhvk commented Jun 21, 2019

Uh oh!

bsipocz commented Jun 22, 2019

Uh oh!

bsipocz commented Jun 22, 2019

taldcroft commented Jun 3, 2019 •

edited

Loading

codecov bot commented Jun 21, 2019 •

edited

Loading

taldcroft Jun 21, 2019 •

edited

Loading