Skip to content

Conversation

@taldcroft
Copy link
Member

@taldcroft taldcroft commented Sep 19, 2016

This PR allows round-tripping ASCII masked tables for most formats by changing the default io.ascii masked value output from -- to an empty string.

This is a relatively significant API change, but backward compatible output is possible by adding fill_values=[(ascii.masked, '--')] to any existing write statements that require the old convention.

Overall I think the API break is worth it given the substantial benefit. It turns out that certain things in the existing output fill_values machinery were not working or inconsistent anyway. For instance the slow writers use the numpy default string for a masked value (np.ma.masked_print_option) while the fast writers always use --.

As for the actual table value representing "missing data", the precedent of an empty string from CSV seems compelling and all formats support this already. Otherwise one needs to make up some arbitrary sentinel value since -- is not unique enough. The fact that the readers already interpret an empty value as missing in every format is good and means there is no API change from that perspective.

Closes #5337.

The current version does not pass tests for ['fixed_width', 'aastex', 'latex']. fixed_width needs some special handling in the test. The tex formats should work pending the tex lastline fix (e.g. #5236).

Closes #5235.
This PR includes complete round-tripping tests.

cc: @hamogu @astrofrog

# representation of masked values (depending on the writer class), but the
# machinery expects a list. The easiest here is to just pop the value off,
# i.e. fill_values=None is the same as not providing it at all.
if 'fill_values' in kwargs and kwargs['fill_values'] is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly verbose, isn't this equivalent to kwargs.get('fill_values', None) is None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe del kwargs['fill_values'] is more explicit than pop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the del kwargs.

The kwargs.get() expression is not the same if fill_values is not in the supplied kwargs, which is normally the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. I was thinking about the dict.pop with default value, in that case kwargs.get would've worked - but with del you're approach is the only one that makes sense.

if 'fill_values' in kwargs:
writer.data.fill_values = kwargs['fill_values']
# Prepend user-specified values to the class default. Both are lists.
writer.data.fill_values = kwargs['fill_values'] + writer.data.fill_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks some tests because one is a tuple and the other a list. Not sure why (user/test-input?) but maybe this needs a tuple or list call around it (or itertools.chain and a cast afterwards).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docs fill_values for write() needs to be a list of tuples, but there has been a convenience built in to the reader-side that if there was only one tuple then you don't need the list around it. It looks like that short-cut is built into the tests. Urgh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we wanted to test that the short-cut works.

@MSeifert04
Copy link
Contributor

Am I right, that with this change some or most of the files previously written may not be re-read correctly?

@taldcroft
Copy link
Member Author

@MSeifert04 - This change has no impact on reading previously written files since the reader is not changing. What it will disrupt is the case of an existing code that is writing new files for a masked table and then relying on the output having the -- fill values for masked data. For instance the code might be reading back using the fill_values=[('--', '0')] trick.

So of course we need to have a serious discussion about whether the breakage is going to be widely problematic. The release notes would have some fairly strong notes about this.

@MSeifert04
Copy link
Contributor

@taldcroft Just a guess but the intp (default int-type) for windows is int32 (even for 64bit builds). I assume if you have trouble round-tripping that somewhere dtype=int is used and in another place int64 (or some equivalent). Might be somewhere in the fast-reader code.

@taldcroft
Copy link
Member Author

@MSeifert04 - yes, it's definitely a problem related to int32 being default on Windows. #4686 was my failed attempt to fix this. I was hoping that the fast reader would recognize an int32 overflow and upcast (19670dc) but this didn't fly and in fact made a bunch of other stuff break. In the meantime I'm just going to a slightly weaker test and only require that the dtype kind matches so i4 and i8 will be considered equal. There is no strict requirement of matching at this level for ASCII formats that don't include detailed data type information.

@taldcroft
Copy link
Member Author

Rebased off of #5370 which should fix the Windows failures.

@embray
Copy link
Member

embray commented Oct 4, 2016

I feel like this should probably do something with DEFAULT_ASCII_TNULL in the FITS module while we're at it, but I'm not sure exactly what (except maybe change its value to ''): https://github.com/astropy/astropy/blob/master/astropy/io/fits/column.py#L131

@taldcroft
Copy link
Member Author

@embray - I would not discount the idea of adjusting DEFAULT_ASCII_TNULL, but I'm just a bit more scared of messing around with FITS output since I imagine that this might have a larger impact owing to the wider use of FITS in "production" code vs simple ASCII tables. I have no real evidence for this assertion of course...

In any case I would prefer to keep that idea separate from this PR.

write_spacer_lines = ['ASCII_TABLE_WRITE_SPACER_LINE']
fill_include_names = None
fill_exclude_names = None
# Currently, the default matches the numpy default for masked values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the default, this comment line has to go. Note however, that this also states why we chose "--" as default in the first place - it's consistent with numpy. Maybe astropy is mature enough that we don't need that any longer, if we want the "least surprising" behavior, doing what numpy does is not a bad choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point taken about the least surprising behavior, but I think here we are getting into the distinction between the screen representation of numpy vs. a file serialization format. I think it is pretty clear that '--' is no good as a unique serialization marker for missing data, whereas '' has a fair bit of precedent. And since we already took the much bigger leap of making '' indicated masked when reading, let's close the loop now and do the same for writing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to this change, I just wanted to make sure it's a conscious decision to move away from the numpy format since that was the reason for having -- in the first place.


from .fastbasic import FastBasic

# A values of None for fill_values imply getting the default string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values -> value; imply -> implies

@taldcroft
Copy link
Member Author

BTW this clashes with #5379 and should probably wait till that is merged.

@taldcroft
Copy link
Member Author

@hamogu - this is ready for final review. The last 5 commits are new since you last looked. There are no real functional changes, just docs and cleaner code and more complete testing.

@taldcroft
Copy link
Member Author

Closes #5235.
This PR includes complete round-tripping tests.

fill_exclude_names = None
# Currently, the default matches the numpy default for masked values.
fill_values = [(masked, '--')]
fill_values = [(masked, '')]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment in here: # Note that this differs from the numpy default. See #5347


from .fastbasic import FastBasic

# A value of None for fill_values imply getting the default string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imply -> implies

# in which case make it a list
kwargs['fill_values'][1] + ''
kwargs['fill_values'] = [kwargs['fill_values']]
writer.data.fill_values = kwargs['fill_values'] + writer.data.fill_values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is (potentially) a change in behaviour. Previously, user defined fill values would replace writer defaults, now they are added to writer defaults. I believe that that's not a big issue for our built-in classes, because we only have defaults for None, missing, empty etc. However, I have build classes for my own use that derive from BaseReader and replace "-999" with missing. However, my collaborators sometimes also send me tables where missing value are marked with other numbers. In that case I used to call my own class with fill_value='+999'. In the new implementation, this would not treat both +999 and -999 as missing. For my particular case all data value are between -10 and +10, so I'm fine either way, I just wanted to give an example where is might actually matter.

To summarize: I think this change (very few people build their own reader classes) is OK, but should be mentioned in the change log.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hamogu - I'm getting a little confused about whether you talking about reading or writing, since you mentioned getting files from your collaborators and replacing -999 with missing. For writing you would have said "replacing missing with -999".

It's important to note that the user-supplied kwarg will always override the default by virtue of being earlier in the list. I also think there will be no change in behavior for custom writer classes that define a custom fill_value, e.g. to output -999 for missing values.

Copy link
Member

@hamogu hamogu Dec 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. You are right, I was getting confused, please disregard my example. However, the point still stands.
If I had a custom writer class with the following replacements:
missing -> -999
nan -> -888
'abc'->'def'
and I call it with fill_values = ('123', '456') then before this PR I would only have the 123->456 replacement. After this PR I will have 123->456 plus the default replacements missin->-999, nan->-888, and 'abc'->'def'. Right?

In other words, before this change, I could switch off the default fill values by passing in essentially an empty list (or non-sense replacements that never happen). Now, I would have ot know the default replacements. If I want to witch them off in the example above, I would have to do (pseudo code) fill_values=[(missing, missing), (nan,nan), ('abc', 'abc')].
Usually, default replacements in a class make sense and you usually want to keep them even if the user passes in a fill_value specifier to replace some other type of values, but it's still a change in behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear: It's a change in behaviour that makes it better, so I'm all for keeping the code as @taldcroft wrote it. I just suggest this to be added to the changelog as a bug fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hamogu - understood, will do. Thanks for thinking this through.

return

# Skip tests for fixed_width or HTML without bs4
if ((fmt_name == 'html' and not HAS_BEAUTIFUL_SOUP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pytest offers a short-cut for this case of "skip if import not successful":
bs4 = pytest.importorskip("bs4")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. In the interest of not further clogging CI, I would like to defer this change for now. Anyway it should probably be done as a global change across astropy if we decide this is now the preferred idiom. See #5543.

@taldcroft
Copy link
Member Author

@hamogu - updated the changelog in 3f00763. Look OK?

@taldcroft taldcroft merged commit 41770f5 into astropy:master Dec 2, 2016
@taldcroft taldcroft deleted the ascii-masked branch December 2, 2016 16:58
@wmwv
Copy link
Contributor

wmwv commented Dec 7, 2016

@taldcroft Awesome! Thanks for fixing things up here and in #5427 and adding the round-tripping test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cannot round-trip masked value to ECSV

6 participants