Skip to content

Conversation

@taldcroft
Copy link
Member

@taldcroft taldcroft commented Jun 3, 2019

Aka allow MaskedColumn and Column in the same table. This is a WIP but would like initial big-picture (API) feedback before going too much further. From the #7496 description (and see also #3665).

Currently if any column in a table is a MaskedColumn, then all other columns are "upgraded" to be masked as well. This happens on initialization and if a MaskedColumn is added or any column in an existing table is converted from Column to MaskedColumn. This is entirely a holdover from the original implementation of table using a numpy structured array.

The implementation of this is relatively straightforward, the main reservation is about stability and breaking code.

The main API changes are:

  • In order to automatically add new columns as MaskedColumn, one must supply masked=True when creating the table. Currently this happens automatically if any of the columns are masked.
  • Adding a MaskedColumn to an unmasked table would not automatically "upgrade" other columns. This is a feature of the new implementation.
  • The Table.masked attribute reflects only the behavior when adding a new column. So if masked is true then non-mixin columns are always converted to MaskedColumn [1], but with masked=False then the column is added as whatever it is.

Previous there was discussion about issuing a warning if column would have been upgraded. I think this would be an annoyance because it will happen in situations that are not actually a problem, but instead just different. My feeling is that the small cross section of code that relied on the old behavior will break, which is not ideal but not as bad as silently doing something wrong (which is where warnings make sense). But discussion would be good!

One thing to consider for this PR is allowing:

t = Table.read(filename, format=format, masked=True)  # MaskedColumns needed

This is a nicer version of the current workaround:

t  = Table.read(filename, format=format)
t = Table(t, masked=True)

Closes #7496

[1] - this is not always true, see code, but in most cases that will be true.

To Do:

  • Change log
  • Docs explaining the change
  • Some additional tests
  • Question: does masked=True round-trip through I/O? Answer: NO, see comment later.

@taldcroft taldcroft added table Work in progress API change PRs and issues that change an existing API, possibly requiring a deprecation period labels Jun 3, 2019
@taldcroft taldcroft added this to the v4.0 milestone Jun 3, 2019
@taldcroft taldcroft requested review from astrofrog and mhvk June 3, 2019 13:25
@taldcroft
Copy link
Member Author

One important thing to note is that this PR is a pre-cursor to other Table optimizations. Right now a lot of table updates (e.g. replacing a column) are slow and/or awkward because they are forced to go through what amounts to full table initialization. This is driven by the requirement to potentially upgrade a table to masked and "upgrade" every other column in the table accordingly.

So with this PR something like replace_column would really be as simple as updating the Table.columns dictionary (#8702). Fast and simple, yay. Part of that would be factoring out much of _init_from_list into a separate method akin to _convert_col_for_table that converts column-like thingy into an acceptable table column object.

@mhvk
Copy link
Contributor

mhvk commented Jun 4, 2019

I'm in favour about the change in general, and in particular like that normal Table and QTable will both be able to have masked columns mixed in.

One question about the API: as we are changing it anyway, would it be an idea to introduce a new MaskedTable subclass? It would behave a bit like QTable, in that it would change some columns automatically to MaskedColumn (instead of to Quantity).

The main advantage would be that the table code can be much simpler. But it might also help in making the masked property something a bit more logical (like "are there any masked columns" or, if we are willing to break API that far, "are there any masked values?") - when I started reviewing this PR I was quite confused seeing tests that had table.masked is False even though there were masked columns with masked elements in it.

In terms of implementation, initially it would need a Table.__new__ method that checks for masked and passes on to MaskedTable if set (perhaps with a deprecation warning).

@taldcroft
Copy link
Member Author

Indeed I thought about MaskedTable, and it is a reasonable option to discuss. This would immediately solve the table reading issue I mentioned. However, at the end I suspect we are better off not going that route. This bifurcation would extend to other Table subclasses (starting with QTable) and perhaps lead to more complexity instead of simplicity.

One important thing to remember is that the masked attribute has never been about the contents of the table, but rather defining how to handle adding columns. So this PR does not change that fundamental meaning, and I would argue that is a good thing. It does change the behavior associated with masked=False.

The meaning of masked=True for a Table has not changed at all. It means always add MaskedColumn except for mixins, and for mixins add a mask attribute (as a FalseArray instance) if it doesn't already exist.

Previously, masked=False meant to not add a MaskedColumn, but if that was requested then convert to masked=True and change all existing columns. Now it means add the column using the "most appropriate" class and do not touch other columns.

Definitely the name masked is not the most helpful, compared to say require_masked_columns, but it isn't clear if changing the attribute name at this point is worth it. In the end what I imagine is a user migration toward never using masked=True.

I do think something like a has_masked_columns attribute would be helpful. The definition of this is slightly ambiguous though in terms of whether that means columns that can be masked or columns that have masked elements. The latter is generally more useful, and for mixins like Time you don't know without looking, but there is a question of whether MaskedColumn counts for has_masked_columns if none of the elements are actually masked.

@mhvk
Copy link
Contributor

mhvk commented Jun 4, 2019

@taldcroft - I see your point about masked - and agree we should change the meaning as little as possible.

This also means the question of whether to use a subclass or not is independent, and at some level an implementation detail: unless the user goes out of their way to check what the actual class of the table instance is, the behaviour of the instance will be as defined by the masked attribute. Which in turn means the decision whether or not to do it should be mostly whether it makes things a lot easier.

I do like the subclass by its analogy with QTable, though, and think that perhaps a more general Table.__new__ with masked and always_apply_units (or something better) and arguments that tell what kind of table behaviour you get.

Or perhaps whether we can push the "agnostic" model a bit further, and also let Quantity just remain itself in a regular Table by default. If so, that would push the question of whether something should be a Quantity or Column to the readers.

Anyway, all that somewhat orthogonal to the current PR, which I think is a good idea regardless of how it is implemented.

@hamogu
Copy link
Member

hamogu commented Jun 9, 2019

I think that's a reasonable API change. In that past, I've found masked columns to be cumbersome in some cases, because they forced the entire table into masked-mode. I don't remember if it was a speed-issue or if some downstream code could not handle the masks, but in any case, having only columns masked that need to be masked is a good idea.

@taldcroft
Copy link
Member Author

About explicitly round-tripping the table masked attribute, I started down this path and got a partly working solution, but realized it was problematic. Doing this requires updating (hacking) the serialization code to add that specific attribute into the serialized data via a special key in the table meta. It is really not nice.

After doing this I also had the realization that existing table files from masked tables would not have this special meta in them anyway, so they will be read as unmasked, while tables written with astropy >= 4.0 would have that there. So it's just a mess, and the only answer is to accept that if you want a table that is read from file to be masked, then that needs to be done as an explicit step outside of the read process.

All this actually makes astropy I/O more consistent and explicit. Before it was using implicit rules to set masked, and now that will be explicitly handled by the user.

@mhvk
Copy link
Contributor

mhvk commented Jun 21, 2019

@taldcroft - makes sense; I agree with moving towards a state where masked just becomes irrelevant.

@codecov
Copy link

codecov bot commented Jun 21, 2019

Codecov Report

Merging #8789 into master will decrease coverage by <.01%.
The diff coverage is 95.12%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8789      +/-   ##
==========================================
- Coverage   86.72%   86.72%   -0.01%     
==========================================
  Files         405      404       -1     
  Lines       59994    59972      -22     
  Branches     1100     1100              
==========================================
- Hits        52032    52012      -20     
+ Misses       7321     7319       -2     
  Partials      641      641
Impacted Files Coverage Δ
astropy/utils/iers/iers.py 92.93% <100%> (+0.02%) ⬆️
astropy/io/ascii/core.py 93.27% <100%> (-0.03%) ⬇️
astropy/table/table.py 93.23% <94.73%> (+0.3%) ⬆️
astropy/utils/data.py 81.15% <0%> (-0.45%) ⬇️
astropy/units/equivalencies.py 96.18% <0%> (-0.02%) ⬇️
astropy/constants/astropyconst20.py 100% <0%> (ø) ⬆️
astropy/constants/iau2015.py 100% <0%> (ø) ⬆️
astropy/io/votable/tree.py 85.98% <0%> (ø) ⬆️
astropy/constants/astropyconst40.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9d4d95...d856ad7. Read the comment docs.

columns = self.TableColumns()

for name in names:
columns[name] = self.ColumnClass(name=name, data=newdata[name])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was hardwired previously to fast-track the logic in _init_from_list assuming that self.ColumnClass is set appropriately. Since we no longer have that guarantee we just dump the list of columns (masked or unmasked) into _init_from_list.

# Read in as a regular Table, including possible masked columns.
# Columns will be filled and converted to Quantity in cls.__init__.
iers_a = Table.read(file, format='cds', readme=readme)
iers_a = Table(iers_a, masked=True, copy=False)
Copy link
Member Author

@taldcroft taldcroft Jun 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good example of what can break in application code and the suggested workaround. It is not that elegant, but at some level I just want to discourage use of masked=True altogether. But for the sake of completeness, options would be:

Table.read(file, format='cds', readme=readme, masked=True)  
# High-impact API change, prevents readers from having `masked` kwarg
# or
iers_a = Table.read(file, format='cds', readme=readme)
iers_a.masked = True  
# Make `masked` mutable and upgrade Columns to MaskedColumn.
# But this violates "don't do radical things to object with a property change"
# or
iers_a.convert_to_masked()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just to make sure I fully understand, the value in setting masked=True is essentially auto-conversion from Column to MaskedColumn later, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The final version of iers_a is forced to have all MaskedColumn, and any new column will get added as MaskedColumn.

@taldcroft
Copy link
Member Author

This is ready for final review.

@mhvk mhvk changed the title WIP: change behavior of Table regarding masked columns Change behavior of Table regarding masked columns Jun 21, 2019
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nearly ready to do: only two comments, of which one is a nitpick. Overall, I feel the code has been simplified, which suggests this is really a change for the better.

One question for the future: all the tests for not hasattr(col, 'mask') are somewhat annoying. One possible suggestion would be to move mask to col.info - for non-masked columns, it could create a FalseArray on the fly. But no reason to hold this up; more something to consider for #6720

table = read_rdb(text.replace('\n', newline), parallel=parallel)
assert_table_equal(table, expected)
assert np.all(table == expected)
# TO DO : deal with this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this done?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but it is now.

class TestTableInit(SetupData):
"""Initializing a table"""

def test_mask_true_if_any_input_masked(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the name of the test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@astrofrog astrofrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small things but overall I'm in favor of this. The auto-upgrading of whole tables has always really bugged me, and even though this may break a few cases out there, I think most users won't be affected.

# Read in as a regular Table, including possible masked columns.
# Columns will be filled and converted to Quantity in cls.__init__.
iers_a = Table.read(file, format='cds', readme=readme)
iers_a = Table(iers_a, masked=True, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just to make sure I fully understand, the value in setting masked=True is essentially auto-conversion from Column to MaskedColumn later, right?

.. Note::

For most applications we now recommend using the default |Table| behavior with
``masked=False`` to allow heterogenous column types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like you recommend explicitly setting masked=False, but the best is to just not specify masked=, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the wording was not ideal. I'm changing to:

   For most applications, even those with masked column data, we now recommend using
   the default |Table| behavior which allows heterogenous column types.  This implies
   creating tables *without* specifying the ``masked`` keyword argument.

.. doctest-skip::

>> dat = Table.read('data.fits')
>> dat['new_column'] = MaskedColumn([1, 2, 3, 4, 5])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what about giving dat['new_column'] = np.ma.array([1, 2, 3, 4, 5]) as an alternative? (just to avoid the MaskedColumn import)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍, but maybe np.ma.MaskedArray instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@astrofrog
Copy link
Member

I think the HTML build is something that's actually fixed on master, so it should go away if you rebase.

@bsipocz
Copy link
Member

bsipocz commented Jun 21, 2019

@taldcroft - while addressing the review, could you also rebase to get rid of the unrelated CI failure?

@taldcroft
Copy link
Member Author

Passing all tests now, comments addressed (I think).

@mhvk
Copy link
Contributor

mhvk commented Jun 21, 2019

At least the memory aspect of FalseArray is easily addressed: it is readonly, so can just be a single element, np.broadcast_to(False, col.shape).

Anyway, for another time!

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good now, so merging. Very nice1

@mhvk
Copy link
Contributor

mhvk commented Jun 21, 2019

@astrofrog - I guess I cannot merge without dismissing your request for changes - do you agree the change in wording is good? If so, just go ahead and merge.

@bsipocz bsipocz dismissed astrofrog’s stale review June 22, 2019 01:14

review has been addressed

@bsipocz
Copy link
Member

bsipocz commented Jun 22, 2019

I'm certain that the review has been adequately addressed, and thus go ahead and merge, so other PRs based on are not being held up.

@bsipocz bsipocz merged commit 6b5015c into astropy:master Jun 22, 2019
@bsipocz
Copy link
Member

bsipocz commented Jun 22, 2019

Thank you @taldcroft!

taldcroft added a commit to taldcroft/astropy that referenced this pull request Jun 23, 2019
taldcroft added a commit that referenced this pull request Jun 29, 2019
Correctly handle masking in add_row() after #8789
dhomeier pushed a commit to dhomeier/astropy that referenced this pull request Jul 2, 2019
taldcroft added a commit to taldcroft/astropy that referenced this pull request Jul 5, 2019
eblur pushed a commit to eblur/astropy that referenced this pull request Jul 15, 2019
@taldcroft taldcroft deleted the table-mask branch October 1, 2019 20:45
eerovaher added a commit to eerovaher/astropy that referenced this pull request Aug 3, 2021
PR astropy#8789 allowed a `Table` to contain both `MaskedColumn` and `Column`
instances, meaning that in many cases there is no longer any need to set
`Table.masked=True`, but not all of the relevant documentation was
updated.
@eerovaher eerovaher mentioned this pull request Aug 4, 2021
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API change PRs and issues that change an existing API, possibly requiring a deprecation period io.ascii io.fits io.misc Ready-for-final-review table whatsnew-needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider allowing MaskedColumn and Column in the same Table

5 participants