-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Change behavior of Table regarding masked columns #8789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
One important thing to note is that this PR is a pre-cursor to other Table optimizations. Right now a lot of table updates (e.g. replacing a column) are slow and/or awkward because they are forced to go through what amounts to full table initialization. This is driven by the requirement to potentially upgrade a table to masked and "upgrade" every other column in the table accordingly. So with this PR something like replace_column would really be as simple as updating the |
|
I'm in favour about the change in general, and in particular like that normal One question about the API: as we are changing it anyway, would it be an idea to introduce a new The main advantage would be that the table code can be much simpler. But it might also help in making the In terms of implementation, initially it would need a |
|
Indeed I thought about One important thing to remember is that the The meaning of Previously, Definitely the name I do think something like a |
|
@taldcroft - I see your point about This also means the question of whether to use a subclass or not is independent, and at some level an implementation detail: unless the user goes out of their way to check what the actual class of the table instance is, the behaviour of the instance will be as defined by the I do like the subclass by its analogy with Or perhaps whether we can push the "agnostic" model a bit further, and also let Anyway, all that somewhat orthogonal to the current PR, which I think is a good idea regardless of how it is implemented. |
|
I think that's a reasonable API change. In that past, I've found masked columns to be cumbersome in some cases, because they forced the entire table into masked-mode. I don't remember if it was a speed-issue or if some downstream code could not handle the masks, but in any case, having only columns masked that need to be masked is a good idea. |
|
About explicitly round-tripping the table After doing this I also had the realization that existing table files from masked tables would not have this special meta in them anyway, so they will be read as unmasked, while tables written with astropy >= 4.0 would have that there. So it's just a mess, and the only answer is to accept that if you want a table that is read from file to be masked, then that needs to be done as an explicit step outside of the read process. All this actually makes astropy I/O more consistent and explicit. Before it was using implicit rules to set |
|
@taldcroft - makes sense; I agree with moving towards a state where |
Codecov Report
@@ Coverage Diff @@
## master #8789 +/- ##
==========================================
- Coverage 86.72% 86.72% -0.01%
==========================================
Files 405 404 -1
Lines 59994 59972 -22
Branches 1100 1100
==========================================
- Hits 52032 52012 -20
+ Misses 7321 7319 -2
Partials 641 641
Continue to review full report at Codecov.
|
| columns = self.TableColumns() | ||
|
|
||
| for name in names: | ||
| columns[name] = self.ColumnClass(name=name, data=newdata[name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was hardwired previously to fast-track the logic in _init_from_list assuming that self.ColumnClass is set appropriately. Since we no longer have that guarantee we just dump the list of columns (masked or unmasked) into _init_from_list.
| # Read in as a regular Table, including possible masked columns. | ||
| # Columns will be filled and converted to Quantity in cls.__init__. | ||
| iers_a = Table.read(file, format='cds', readme=readme) | ||
| iers_a = Table(iers_a, masked=True, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good example of what can break in application code and the suggested workaround. It is not that elegant, but at some level I just want to discourage use of masked=True altogether. But for the sake of completeness, options would be:
Table.read(file, format='cds', readme=readme, masked=True)
# High-impact API change, prevents readers from having `masked` kwarg
# or
iers_a = Table.read(file, format='cds', readme=readme)
iers_a.masked = True
# Make `masked` mutable and upgrade Columns to MaskedColumn.
# But this violates "don't do radical things to object with a property change"
# or
iers_a.convert_to_masked()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just to make sure I fully understand, the value in setting masked=True is essentially auto-conversion from Column to MaskedColumn later, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The final version of iers_a is forced to have all MaskedColumn, and any new column will get added as MaskedColumn.
|
This is ready for final review. |
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nearly ready to do: only two comments, of which one is a nitpick. Overall, I feel the code has been simplified, which suggests this is really a change for the better.
One question for the future: all the tests for not hasattr(col, 'mask') are somewhat annoying. One possible suggestion would be to move mask to col.info - for non-masked columns, it could create a FalseArray on the fly. But no reason to hold this up; more something to consider for #6720
| table = read_rdb(text.replace('\n', newline), parallel=parallel) | ||
| assert_table_equal(table, expected) | ||
| assert np.all(table == expected) | ||
| # TO DO : deal with this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but it is now.
astropy/table/tests/test_masked.py
Outdated
| class TestTableInit(SetupData): | ||
| """Initializing a table""" | ||
|
|
||
| def test_mask_true_if_any_input_masked(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change the name of the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
astrofrog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few small things but overall I'm in favor of this. The auto-upgrading of whole tables has always really bugged me, and even though this may break a few cases out there, I think most users won't be affected.
| # Read in as a regular Table, including possible masked columns. | ||
| # Columns will be filled and converted to Quantity in cls.__init__. | ||
| iers_a = Table.read(file, format='cds', readme=readme) | ||
| iers_a = Table(iers_a, masked=True, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just to make sure I fully understand, the value in setting masked=True is essentially auto-conversion from Column to MaskedColumn later, right?
docs/table/masking.rst
Outdated
| .. Note:: | ||
|
|
||
| For most applications we now recommend using the default |Table| behavior with | ||
| ``masked=False`` to allow heterogenous column types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like you recommend explicitly setting masked=False, but the best is to just not specify masked=, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the wording was not ideal. I'm changing to:
For most applications, even those with masked column data, we now recommend using
the default |Table| behavior which allows heterogenous column types. This implies
creating tables *without* specifying the ``masked`` keyword argument.
docs/table/masking.rst
Outdated
| .. doctest-skip:: | ||
|
|
||
| >> dat = Table.read('data.fits') | ||
| >> dat['new_column'] = MaskedColumn([1, 2, 3, 4, 5]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what about giving dat['new_column'] = np.ma.array([1, 2, 3, 4, 5]) as an alternative? (just to avoid the MaskedColumn import)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍, but maybe np.ma.MaskedArray instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
I think the HTML build is something that's actually fixed on master, so it should go away if you rebase. |
|
@taldcroft - while addressing the review, could you also rebase to get rid of the unrelated CI failure? |
|
Passing all tests now, comments addressed (I think). |
|
At least the memory aspect of Anyway, for another time! |
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good now, so merging. Very nice1
|
@astrofrog - I guess I cannot merge without dismissing your request for changes - do you agree the change in wording is good? If so, just go ahead and merge. |
|
I'm certain that the review has been adequately addressed, and thus go ahead and merge, so other PRs based on are not being held up. |
|
Thank you @taldcroft! |
Correctly handle masking in add_row() after #8789
PR astropy#8789 allowed a `Table` to contain both `MaskedColumn` and `Column` instances, meaning that in many cases there is no longer any need to set `Table.masked=True`, but not all of the relevant documentation was updated.
Aka allow MaskedColumn and Column in the same table. This is a WIP but would like initial big-picture (API) feedback before going too much further. From the #7496 description (and see also #3665).
Currently if any column in a table is a MaskedColumn, then all other columns are "upgraded" to be masked as well. This happens on initialization and if a MaskedColumn is added or any column in an existing table is converted from Column to MaskedColumn. This is entirely a holdover from the original implementation of table using a numpy structured array.
The implementation of this is relatively straightforward, the main reservation is about stability and breaking code.
The main API changes are:
masked=Truewhen creating the table. Currently this happens automatically if any of the columns are masked.Table.maskedattribute reflects only the behavior when adding a new column. So ifmaskedis true then non-mixin columns are always converted toMaskedColumn[1], but withmasked=Falsethen the column is added as whatever it is.Previous there was discussion about issuing a warning if column would have been upgraded. I think this would be an annoyance because it will happen in situations that are not actually a problem, but instead just different. My feeling is that the small cross section of code that relied on the old behavior will break, which is not ideal but not as bad as silently doing something wrong (which is where warnings make sense). But discussion would be good!
One thing to consider for this PR is allowing:
This is a nicer version of the current workaround:
Closes #7496
[1] - this is not always true, see code, but in most cases that will be true.
To Do:
masked=Trueround-trip through I/O? Answer: NO, see comment later.