Allow round-tripping ASCII masked tables for most formats #5347

taldcroft · 2016-09-19T21:19:17Z

This PR allows round-tripping ASCII masked tables for most formats by changing the default io.ascii masked value output from -- to an empty string.

This is a relatively significant API change, but backward compatible output is possible by adding fill_values=[(ascii.masked, '--')] to any existing write statements that require the old convention.

Overall I think the API break is worth it given the substantial benefit. It turns out that certain things in the existing output fill_values machinery were not working or inconsistent anyway. For instance the slow writers use the numpy default string for a masked value (np.ma.masked_print_option) while the fast writers always use --.

As for the actual table value representing "missing data", the precedent of an empty string from CSV seems compelling and all formats support this already. Otherwise one needs to make up some arbitrary sentinel value since -- is not unique enough. The fact that the readers already interpret an empty value as missing in every format is good and means there is no API change from that perspective.

Closes #5337.

The current version does not pass tests for ['fixed_width', 'aastex', 'latex']. fixed_width needs some special handling in the test. The tex formats should work pending the tex lastline fix (e.g. #5236).

Closes #5235.
This PR includes complete round-tripping tests.

cc: @hamogu @astrofrog

MSeifert04 · 2016-09-19T22:08:26Z

astropy/io/ascii/core.py

+    # representation of masked values (depending on the writer class), but the
+    # machinery expects a list.  The easiest here is to just pop the value off,
+    # i.e. fill_values=None is the same as not providing it at all.
+    if 'fill_values' in kwargs and kwargs['fill_values'] is None:


This seems overly verbose, isn't this equivalent to kwargs.get('fill_values', None) is None

and maybe del kwargs['fill_values'] is more explicit than pop.

Agreed on the del kwargs.

The kwargs.get() expression is not the same if fill_values is not in the supplied kwargs, which is normally the case.

you're right. I was thinking about the dict.pop with default value, in that case kwargs.get would've worked - but with del you're approach is the only one that makes sense.

MSeifert04 · 2016-09-19T22:24:24Z

astropy/io/ascii/core.py

    if 'fill_values' in kwargs:
-        writer.data.fill_values = kwargs['fill_values']
+        # Prepend user-specified values to the class default.  Both are lists.
+        writer.data.fill_values = kwargs['fill_values'] + writer.data.fill_values


This breaks some tests because one is a tuple and the other a list. Not sure why (user/test-input?) but maybe this needs a tuple or list call around it (or itertools.chain and a cast afterwards).

According to the docs fill_values for write() needs to be a list of tuples, but there has been a convenience built in to the reader-side that if there was only one tuple then you don't need the list around it. It looks like that short-cut is built into the tests. Urgh.

Well, we wanted to test that the short-cut works.

MSeifert04 · 2016-09-19T22:27:48Z

Am I right, that with this change some or most of the files previously written may not be re-read correctly?

taldcroft · 2016-09-20T00:02:44Z

@MSeifert04 - This change has no impact on reading previously written files since the reader is not changing. What it will disrupt is the case of an existing code that is writing new files for a masked table and then relying on the output having the -- fill values for masked data. For instance the code might be reading back using the fill_values=[('--', '0')] trick.

So of course we need to have a serious discussion about whether the breakage is going to be widely problematic. The release notes would have some fairly strong notes about this.

MSeifert04 · 2016-09-21T21:42:52Z

@taldcroft Just a guess but the intp (default int-type) for windows is int32 (even for 64bit builds). I assume if you have trouble round-tripping that somewhere dtype=int is used and in another place int64 (or some equivalent). Might be somewhere in the fast-reader code.

taldcroft · 2016-09-21T21:58:15Z

@MSeifert04 - yes, it's definitely a problem related to int32 being default on Windows. #4686 was my failed attempt to fix this. I was hoping that the fast reader would recognize an int32 overflow and upcast (19670dc) but this didn't fly and in fact made a bunch of other stuff break. In the meantime I'm just going to a slightly weaker test and only require that the dtype kind matches so i4 and i8 will be considered equal. There is no strict requirement of matching at this level for ASCII formats that don't include detailed data type information.

taldcroft · 2016-09-30T21:08:36Z

Rebased off of #5370 which should fix the Windows failures.

embray · 2016-10-04T15:07:52Z

I feel like this should probably do something with DEFAULT_ASCII_TNULL in the FITS module while we're at it, but I'm not sure exactly what (except maybe change its value to ''): https://github.com/astropy/astropy/blob/master/astropy/io/fits/column.py#L131

taldcroft · 2016-10-04T15:55:29Z

@embray - I would not discount the idea of adjusting DEFAULT_ASCII_TNULL, but I'm just a bit more scared of messing around with FITS output since I imagine that this might have a larger impact owing to the wider use of FITS in "production" code vs simple ASCII tables. I have no real evidence for this assertion of course...

In any case I would prefer to keep that idea separate from this PR.

hamogu · 2016-10-14T00:24:46Z

astropy/io/ascii/core.py

    write_spacer_lines = ['ASCII_TABLE_WRITE_SPACER_LINE']
    fill_include_names = None
    fill_exclude_names = None
    # Currently, the default matches the numpy default for masked values.


If you change the default, this comment line has to go. Note however, that this also states why we chose "--" as default in the first place - it's consistent with numpy. Maybe astropy is mature enough that we don't need that any longer, if we want the "least surprising" behavior, doing what numpy does is not a bad choice.

Point taken about the least surprising behavior, but I think here we are getting into the distinction between the screen representation of numpy vs. a file serialization format. I think it is pretty clear that '--' is no good as a unique serialization marker for missing data, whereas '' has a fair bit of precedent. And since we already took the much bigger leap of making '' indicated masked when reading, let's close the loop now and do the same for writing.

I'm not opposed to this change, I just wanted to make sure it's a conscious decision to move away from the numpy format since that was the reason for having -- in the first place.

hamogu · 2016-10-14T00:25:48Z

astropy/io/ascii/core.py


    from .fastbasic import FastBasic

+    # A values of None for fill_values imply getting the default string


values -> value; imply -> implies

taldcroft · 2016-10-15T11:20:22Z

BTW this clashes with #5379 and should probably wait till that is merged.

taldcroft · 2016-12-02T11:27:17Z

@hamogu - this is ready for final review. The last 5 commits are new since you last looked. There are no real functional changes, just docs and cleaner code and more complete testing.

taldcroft · 2016-12-02T13:53:55Z

Closes #5235.
This PR includes complete round-tripping tests.

hamogu · 2016-12-02T14:29:37Z

astropy/io/ascii/core.py

    fill_exclude_names = None
-    # Currently, the default matches the numpy default for masked values.
-    fill_values = [(masked, '--')]
+    fill_values = [(masked, '')]


I would add a comment in here: # Note that this differs from the numpy default. See #5347

hamogu · 2016-12-02T14:30:10Z

astropy/io/ascii/core.py


    from .fastbasic import FastBasic

+    # A value of None for fill_values imply getting the default string


imply -> implies

hamogu · 2016-12-02T14:37:58Z

astropy/io/ascii/core.py

+            # in which case make it a list
+            kwargs['fill_values'][1] + ''
+            kwargs['fill_values'] = [kwargs['fill_values']]
+        writer.data.fill_values = kwargs['fill_values'] + writer.data.fill_values


This is (potentially) a change in behaviour. Previously, user defined fill values would replace writer defaults, now they are added to writer defaults. I believe that that's not a big issue for our built-in classes, because we only have defaults for None, missing, empty etc. However, I have build classes for my own use that derive from BaseReader and replace "-999" with missing. However, my collaborators sometimes also send me tables where missing value are marked with other numbers. In that case I used to call my own class with fill_value='+999'. In the new implementation, this would not treat both +999 and -999 as missing. For my particular case all data value are between -10 and +10, so I'm fine either way, I just wanted to give an example where is might actually matter.

To summarize: I think this change (very few people build their own reader classes) is OK, but should be mentioned in the change log.

@hamogu - I'm getting a little confused about whether you talking about reading or writing, since you mentioned getting files from your collaborators and replacing -999 with missing. For writing you would have said "replacing missing with -999".

It's important to note that the user-supplied kwarg will always override the default by virtue of being earlier in the list. I also think there will be no change in behavior for custom writer classes that define a custom fill_value, e.g. to output -999 for missing values.

Sorry. You are right, I was getting confused, please disregard my example. However, the point still stands.
If I had a custom writer class with the following replacements:
missing -> -999
nan -> -888
'abc'->'def'
and I call it with fill_values = ('123', '456') then before this PR I would only have the 123->456 replacement. After this PR I will have 123->456 plus the default replacements missin->-999, nan->-888, and 'abc'->'def'. Right?

In other words, before this change, I could switch off the default fill values by passing in essentially an empty list (or non-sense replacements that never happen). Now, I would have ot know the default replacements. If I want to witch them off in the example above, I would have to do (pseudo code) fill_values=[(missing, missing), (nan,nan), ('abc', 'abc')].
Usually, default replacements in a class make sense and you usually want to keep them even if the user passes in a fill_value specifier to replace some other type of values, but it's still a change in behaviour.

Just to be clear: It's a change in behaviour that makes it better, so I'm all for keeping the code as @taldcroft wrote it. I just suggest this to be added to the changelog as a bug fix.

@hamogu - understood, will do. Thanks for thinking this through.

hamogu · 2016-12-02T14:40:59Z

astropy/io/ascii/tests/test_write.py

+        return
+
+    # Skip tests for fixed_width or HTML without bs4
+    if ((fmt_name == 'html' and not HAS_BEAUTIFUL_SOUP)


Pytest offers a short-cut for this case of "skip if import not successful":
bs4 = pytest.importorskip("bs4")

Cool. In the interest of not further clogging CI, I would like to defer this change for now. Anyway it should probably be done as a global change across astropy if we decide this is now the preferred idiom. See #5543.

taldcroft · 2016-12-02T16:40:31Z

@hamogu - updated the changelog in 3f00763. Look OK?

wmwv · 2016-12-07T16:07:25Z

@taldcroft Awesome! Thanks for fixing things up here and in #5427 and adding the round-tripping test.

taldcroft added io.ascii Affects-release labels Sep 19, 2016

taldcroft added this to the v1.3.0 milestone Sep 19, 2016

MSeifert04 reviewed Sep 19, 2016

View reviewed changes

taldcroft mentioned this pull request Sep 20, 2016

Fix int overflows in the ASCII fast reader c parser #5319

Merged

taldcroft force-pushed the ascii-masked branch from cf186df to f6ea050 Compare September 30, 2016 21:06

taldcroft force-pushed the ascii-masked branch from f6ea050 to 4a98837 Compare October 3, 2016 00:08

hamogu reviewed Oct 14, 2016

View reviewed changes

taldcroft force-pushed the ascii-masked branch from 4a98837 to 6f676f0 Compare December 2, 2016 01:15

taldcroft added 10 commits December 2, 2016 06:17

Change default io.ascii masked value output from -- to empty string

5598e90

Allow for write fill_values as a single tuple

a073f22

Replace kwargs.pop with del kwargs[..]

691c2b4

Tweak test it passes on Windows

a6c3980

Tweak test again hoping for Windows pass

16fe2a7

Make the HTML writer conform to the empty string default

22a6b1f

Do not test HTML if bs4 not installed

ade0a4b

Update changelog and docs

2799978

Add MaskedColumn import to docs

341f90f

Minor tweaks to code comments [skip ci]

86af4dc

taldcroft added 2 commits December 2, 2016 06:17

Remove set_numpy_masked_string hack for HTML writer after astropy#5379

b8b4afb

Remove skipping aastex, latex tests after astropy#5427

513bf00

taldcroft force-pushed the ascii-masked branch from 96a8424 to 513bf00 Compare December 2, 2016 11:22

taldcroft assigned hamogu Dec 2, 2016

taldcroft added Ready-for-final-review and removed Work in progress labels Dec 2, 2016

hamogu requested changes Dec 2, 2016

View reviewed changes

Add clarification of the API change in changelog [skip ci]

3f00763

taldcroft merged commit 41770f5 into astropy:master Dec 2, 2016

taldcroft deleted the ascii-masked branch December 2, 2016 16:58

taldcroft mentioned this pull request Feb 9, 2017

ECSV support for missing values? #5284

Closed

bsipocz mentioned this pull request Jun 19, 2019

IPAC table read error #4684

Closed


		from .fastbasic import FastBasic

		# A values of None for fill_values imply getting the default string


		from .fastbasic import FastBasic

		# A value of None for fill_values imply getting the default string

Uh oh!

Allow round-tripping ASCII masked tables for most formats #5347

Allow round-tripping ASCII masked tables for most formats #5347

Uh oh!

Conversation

taldcroft commented Sep 19, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MSeifert04 commented Sep 19, 2016

Uh oh!

taldcroft commented Sep 20, 2016

Uh oh!

MSeifert04 commented Sep 21, 2016

Uh oh!

taldcroft commented Sep 21, 2016

Uh oh!

taldcroft commented Sep 30, 2016

Uh oh!

embray commented Oct 4, 2016

Uh oh!

taldcroft commented Oct 4, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Oct 15, 2016

Uh oh!

taldcroft commented Dec 2, 2016

Uh oh!

taldcroft commented Dec 2, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hamogu Dec 2, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taldcroft commented Dec 2, 2016

Uh oh!

wmwv commented Dec 7, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

taldcroft commented Sep 19, 2016 •

edited

Loading

hamogu Dec 2, 2016 •

edited

Loading