-
-
Notifications
You must be signed in to change notification settings - Fork 2k
Allow round-tripping ASCII masked tables for most formats #5347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| # representation of masked values (depending on the writer class), but the | ||
| # machinery expects a list. The easiest here is to just pop the value off, | ||
| # i.e. fill_values=None is the same as not providing it at all. | ||
| if 'fill_values' in kwargs and kwargs['fill_values'] is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems overly verbose, isn't this equivalent to kwargs.get('fill_values', None) is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and maybe del kwargs['fill_values'] is more explicit than pop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on the del kwargs.
The kwargs.get() expression is not the same if fill_values is not in the supplied kwargs, which is normally the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. I was thinking about the dict.pop with default value, in that case kwargs.get would've worked - but with del you're approach is the only one that makes sense.
| if 'fill_values' in kwargs: | ||
| writer.data.fill_values = kwargs['fill_values'] | ||
| # Prepend user-specified values to the class default. Both are lists. | ||
| writer.data.fill_values = kwargs['fill_values'] + writer.data.fill_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks some tests because one is a tuple and the other a list. Not sure why (user/test-input?) but maybe this needs a tuple or list call around it (or itertools.chain and a cast afterwards).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the docs fill_values for write() needs to be a list of tuples, but there has been a convenience built in to the reader-side that if there was only one tuple then you don't need the list around it. It looks like that short-cut is built into the tests. Urgh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we wanted to test that the short-cut works.
|
Am I right, that with this change some or most of the files previously written may not be re-read correctly? |
|
@MSeifert04 - This change has no impact on reading previously written files since the reader is not changing. What it will disrupt is the case of an existing code that is writing new files for a masked table and then relying on the output having the So of course we need to have a serious discussion about whether the breakage is going to be widely problematic. The release notes would have some fairly strong notes about this. |
|
@taldcroft Just a guess but the |
|
@MSeifert04 - yes, it's definitely a problem related to |
cf186df to
f6ea050
Compare
|
Rebased off of #5370 which should fix the Windows failures. |
f6ea050 to
4a98837
Compare
|
I feel like this should probably do something with |
|
@embray - I would not discount the idea of adjusting In any case I would prefer to keep that idea separate from this PR. |
astropy/io/ascii/core.py
Outdated
| write_spacer_lines = ['ASCII_TABLE_WRITE_SPACER_LINE'] | ||
| fill_include_names = None | ||
| fill_exclude_names = None | ||
| # Currently, the default matches the numpy default for masked values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the default, this comment line has to go. Note however, that this also states why we chose "--" as default in the first place - it's consistent with numpy. Maybe astropy is mature enough that we don't need that any longer, if we want the "least surprising" behavior, doing what numpy does is not a bad choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point taken about the least surprising behavior, but I think here we are getting into the distinction between the screen representation of numpy vs. a file serialization format. I think it is pretty clear that '--' is no good as a unique serialization marker for missing data, whereas '' has a fair bit of precedent. And since we already took the much bigger leap of making '' indicated masked when reading, let's close the loop now and do the same for writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to this change, I just wanted to make sure it's a conscious decision to move away from the numpy format since that was the reason for having -- in the first place.
astropy/io/ascii/core.py
Outdated
|
|
||
| from .fastbasic import FastBasic | ||
|
|
||
| # A values of None for fill_values imply getting the default string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values -> value; imply -> implies
|
BTW this clashes with #5379 and should probably wait till that is merged. |
4a98837 to
6f676f0
Compare
96a8424 to
513bf00
Compare
|
@hamogu - this is ready for final review. The last 5 commits are new since you last looked. There are no real functional changes, just docs and cleaner code and more complete testing. |
|
Closes #5235. |
| fill_exclude_names = None | ||
| # Currently, the default matches the numpy default for masked values. | ||
| fill_values = [(masked, '--')] | ||
| fill_values = [(masked, '')] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment in here: # Note that this differs from the numpy default. See #5347
|
|
||
| from .fastbasic import FastBasic | ||
|
|
||
| # A value of None for fill_values imply getting the default string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imply -> implies
| # in which case make it a list | ||
| kwargs['fill_values'][1] + '' | ||
| kwargs['fill_values'] = [kwargs['fill_values']] | ||
| writer.data.fill_values = kwargs['fill_values'] + writer.data.fill_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is (potentially) a change in behaviour. Previously, user defined fill values would replace writer defaults, now they are added to writer defaults. I believe that that's not a big issue for our built-in classes, because we only have defaults for None, missing, empty etc. However, I have build classes for my own use that derive from BaseReader and replace "-999" with missing. However, my collaborators sometimes also send me tables where missing value are marked with other numbers. In that case I used to call my own class with fill_value='+999'. In the new implementation, this would not treat both +999 and -999 as missing. For my particular case all data value are between -10 and +10, so I'm fine either way, I just wanted to give an example where is might actually matter.
To summarize: I think this change (very few people build their own reader classes) is OK, but should be mentioned in the change log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hamogu - I'm getting a little confused about whether you talking about reading or writing, since you mentioned getting files from your collaborators and replacing -999 with missing. For writing you would have said "replacing missing with -999".
It's important to note that the user-supplied kwarg will always override the default by virtue of being earlier in the list. I also think there will be no change in behavior for custom writer classes that define a custom fill_value, e.g. to output -999 for missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. You are right, I was getting confused, please disregard my example. However, the point still stands.
If I had a custom writer class with the following replacements:
missing -> -999
nan -> -888
'abc'->'def'
and I call it with fill_values = ('123', '456') then before this PR I would only have the 123->456 replacement. After this PR I will have 123->456 plus the default replacements missin->-999, nan->-888, and 'abc'->'def'. Right?
In other words, before this change, I could switch off the default fill values by passing in essentially an empty list (or non-sense replacements that never happen). Now, I would have ot know the default replacements. If I want to witch them off in the example above, I would have to do (pseudo code) fill_values=[(missing, missing), (nan,nan), ('abc', 'abc')].
Usually, default replacements in a class make sense and you usually want to keep them even if the user passes in a fill_value specifier to replace some other type of values, but it's still a change in behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear: It's a change in behaviour that makes it better, so I'm all for keeping the code as @taldcroft wrote it. I just suggest this to be added to the changelog as a bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hamogu - understood, will do. Thanks for thinking this through.
| return | ||
|
|
||
| # Skip tests for fixed_width or HTML without bs4 | ||
| if ((fmt_name == 'html' and not HAS_BEAUTIFUL_SOUP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pytest offers a short-cut for this case of "skip if import not successful":
bs4 = pytest.importorskip("bs4")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. In the interest of not further clogging CI, I would like to defer this change for now. Anyway it should probably be done as a global change across astropy if we decide this is now the preferred idiom. See #5543.
|
@taldcroft Awesome! Thanks for fixing things up here and in #5427 and adding the round-tripping test. |
This PR allows round-tripping ASCII masked tables for most formats by changing the default
io.asciimasked value output from--to an empty string.This is a relatively significant API change, but backward compatible output is possible by adding
fill_values=[(ascii.masked, '--')]to any existingwritestatements that require the old convention.Overall I think the API break is worth it given the substantial benefit. It turns out that certain things in the existing output
fill_valuesmachinery were not working or inconsistent anyway. For instance the slow writers use the numpy default string for a masked value (np.ma.masked_print_option) while the fast writers always use--.As for the actual table value representing "missing data", the precedent of an empty string from CSV seems compelling and all formats support this already. Otherwise one needs to make up some arbitrary sentinel value since
--is not unique enough. The fact that the readers already interpret an empty value as missing in every format is good and means there is no API change from that perspective.Closes #5337.
The current version does not pass tests for ['fixed_width', 'aastex', 'latex'].
fixed_widthneeds some special handling in the test. Thetexformats should work pending the tex lastline fix (e.g. #5236).Closes #5235.
This PR includes complete round-tripping tests.
cc: @hamogu @astrofrog