Conversation

@astrofrog
Member

It turns out it was easy to modify astropy.io.ascii to support gzip- and bzip2-compressed files.

@taldcroft - what do you think?

Contributor

Should use the with statement here to ensure that the file gets closed.
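For the record, the pattern being suggested looks like this (the file and contents are illustrative, not the actual io.ascii code):

```python
import os
import tempfile

# write a small table to a scratch file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'table.dat')
with open(path, 'w') as fh:
    fh.write('a b c\n1 2 3\n')

# 'with' guarantees the file is closed even if reading raises an exception
with open(path) as fh:
    lines = fh.readlines()

print(fh.closed)  # True: the file was closed on exiting the block
```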

@mdboom
Contributor

mdboom commented Oct 12, 2012

Note that io.vo does some of this already (in utils.xml), but it would be nice to factor this out into one place that, given either a readable file object or a file path, returns a possibly gzipped/bzipped/raw file object.

@astrofrog
Member Author

@taldcroft @mdboom - thanks for the feedback. I agree it would be best to factor this out into a function (that does the better checking you are suggesting) and then have that be used by io.ascii and io.vo. I'll work on it!

@astrofrog
Member Author

@mdboom - so are you suggesting essentially extracting _convert_to_fd_or_read_function from utils.xml, getting rid of the assumption that the extension indicates gzip (always using the magic bytes instead), and also having fallbacks in case one has, for example, an actual ascii file starting with BZh? Or do you need that function to stay the way it is for io.vo?
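To illustrate the magic-bytes approach (a hypothetical helper, not the actual utils.xml code):

```python
# hypothetical helper: decide on compression from the first bytes of the
# stream rather than from the filename extension
def detect_compression(fileobj):
    magic = fileobj.read(4)
    fileobj.seek(0)  # rewind so the caller can still read from the start
    if magic.startswith(b'\x1f\x8b'):  # gzip magic number
        return 'gzip'
    if magic.startswith(b'BZh'):       # bzip2 magic number
        return 'bzip2'
    return None
```

Note this is exactly where the BZh caveat bites: a plain ascii file that happens to start with those bytes would be misdetected, so a robust version would try to decompress and fall back to reading the file as-is.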

@astrofrog
Member Author

@mdboom - so reading over what you said again, I guess I should instead write a function/context manager that returns a file object (not the read function). I'll write it separately from the existing _convert_to_fd_or_read_function, and then we can see if _convert_to_fd_or_read_function can use this new function to simplify things?
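Something along these lines, roughly (get_fileobj is an illustrative name, and for brevity this sketch falls back to the extension rather than the magic bytes):

```python
import bz2
import contextlib
import gzip

@contextlib.contextmanager
def get_fileobj(name_or_obj):
    """Yield a readable file object for a path or an existing file object."""
    if hasattr(name_or_obj, 'read'):
        yield name_or_obj  # already file-like: don't close what we didn't open
        return
    if name_or_obj.endswith('.gz'):
        fh = gzip.open(name_or_obj, 'rb')
    elif name_or_obj.endswith('.bz2'):
        fh = bz2.BZ2File(name_or_obj, 'rb')
    else:
        fh = open(name_or_obj, 'rb')
    try:
        yield fh
    finally:
        fh.close()
```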

@mdboom
Contributor

mdboom commented Oct 12, 2012

Agreed.

@astrofrog
Member Author

@mdboom @taldcroft - I created a context manager that can understand filenames, URLs (or at least a subset), and file objects, and also understands compression. I think this addresses all your comments above, but the main caveat is that I'm still new to context managers, so I'm perfectly happy to keep working on this - just let me know :-) Also feel free to open a PR against my branch with suggested changes.

One issue so far: bz2 doesn't support reading from file objects, so I have to create a temporary file. Any ideas?
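For reference, the temporary-file workaround looks roughly like this (bz2_from_fileobj is a hypothetical name; note that on Python 3.3+ bz2.open and BZ2File accept file objects directly, which removes the need for this):

```python
import bz2
import tempfile

def bz2_from_fileobj(fileobj):
    """Hypothetical workaround: spool the compressed bytes to a named
    temporary file so the bz2 module can open it by name."""
    tmp = tempfile.NamedTemporaryFile(suffix='.bz2', delete=False)
    tmp.write(fileobj.read())
    tmp.close()  # close before re-opening (required on Windows)
    # a real implementation would also arrange to delete tmp.name afterwards
    return bz2.BZ2File(tmp.name, 'rb')
```

An in-memory alternative is io.BytesIO(bz2.decompress(fileobj.read())), at the cost of holding the whole decompressed file in memory.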

@embray
Member

embray commented Oct 12, 2012

Might be worth noting that there are also some relevant routines in https://github.com/astropy/astropy/blob/master/astropy/io/fits/file.py and in https://github.com/astropy/astropy/blob/master/astropy/io/fits/util.py;

I need to take a closer look at what you've done here though and see if there's anything worth adding that isn't already correct. Otherwise I need to see if I can refactor the io.fits end to make use of this code in any way so there isn't this duplication of effort...

@astrofrog
Member Author

@iguananaut - yes, we should try and reduce duplication - in the end what we're trying to do at that level is pretty generic, so it would be great to have a single function/context manager that can be used across packages. Feel free to hack away with what I have right now.

mdboom added a commit to mdboom/astropy that referenced this pull request Oct 16, 2012
This integrates `astropy.io.vo` to use the new function.

It fixes the following issues with the new function:

   - Python 3 compatibility

   - Adds a "binary" flag to determine whether the resulting file object should return bytes

   - Doesn't close file descriptors that were passed in
@mdboom
Contributor

mdboom commented Oct 16, 2012

I've filed a follow-on pull request for this against @astrofrog's branch.

It moves io.votable to use this new function.

It also addresses a couple of issues:

  1. file handles that are passed in should not be closed by this function -- that's the caller's responsibility

  2. Adds some basic Python 3 compatibility (at least gets the tests passing)

  3. Adds a binary flag to force the returned file descriptor to always return bytes, rather than bytes on Python 2 and unicode on Python 3 as per the default open function without a mode argument.

There are a few lingering issues.

This removes any universal newline handling, so it breaks when handed Windows files. We should add some Windows-style files to the test suite to catch this.

While it seems like io.TextIOWrapper would be a great tool to handle this, unfortunately it doesn't seem possible to decouple the newline handling from the unicode decoding. Using it would also change the behavior on Python 2.x, where this currently returns bytes objects (which I think is more appropriate anyway, since the ascii table files don't specify an encoding). That breaks a lot of the tests, so I backed it out. I experimented with telling it to use a "null" codec (whose decode function just returns the bytes object from the file), but then it complains that decode must return a Unicode object.

On Python 3, since the expectation is that the returned file handle provides Unicode strings, I wrap it in a io.TextIOWrapper. This unfortunately does not work for BZ2Files, so I've added a hack in io.ascii to decode the lines after they've been read.

With the above two problems, it seems like the best way out is to write our own IO wrappers, one for Unicode decoding and one for newline handling that could be separated out and won't be so strict about the kinds of objects passed in. It's totally doable, I just lost my patience ;) (end rant).
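For what it's worth, the decoding half of such a wrapper could be tiny (a hypothetical sketch, not what this PR implements):

```python
class DecodedLines:
    """Wrap a binary file object, yielding unicode lines; decoupled from
    any newline handling, unlike io.TextIOWrapper."""

    def __init__(self, fileobj, encoding='utf-8'):
        self._fileobj = fileobj
        self._encoding = encoding

    def __iter__(self):
        # decode each raw bytes line as it is read
        for line in self._fileobj:
            yield line.decode(self._encoding)

    def read(self, size=-1):
        return self._fileobj.read(size).decode(self._encoding)
```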

Lastly, I think we should probably name this function get_readable_fileobj, since it's entirely likely we may grow a get_writable_fileobj in the future.

Member

This function concerns me, as it duplicates the functionality in the astropy.config.get_data_* family of functions, especially the URL handling. Might it make more sense to port this behavior there, and use that in io.ascii?

Member

This may also apply to @mdboom's PR he submitted to @astrofrog - I didn't look...

Member Author

The URL stuff in here was actually just added as a secondary feature, but the compression stuff seems to be different from what's in config.get_data_*. How about having this general function (without cache support) in utils, and then have config use as much of it as possible to avoid duplication, and add the cache stuff? It somehow feels more natural for io.votable and io.ascii to use a file-reading utility from utils than from config...

astrofrog added a commit to astrofrog/astropy that referenced this pull request Oct 16, 2012
@eteq
Member

eteq commented Oct 16, 2012

@astrofrog (referencing the now outdated diff discussion): I see your point, but I think this means the get_data_* functions stuff should be moved to utils (or whatever we're calling it now). And the compression stuff is different from what's there, but I'm saying it would make sense to include the compression stuff in the get_data_* functions instead of putting it here and using it there. I also think it may be that we would want the caching stuff in case people want to directly access URLs from the web via ascii or the like.

This might be easier to handle by merging this PR and later combining them and moving the get_data_* functions out in a separate PR, though.

@astrofrog
Member Author

@eteq - I agree with putting all the functionality in one place, so then we can create astropy.utils.data and have the new stuff currently in this PR in there too. Before I go ahead and attempt to do this, I have a few questions though:

  • astropy.config.data was originally designed for retrieving data for tests and for astropy itself (e.g. leap second table and stuff like that) which means that most of the time, a given URL's content would be pretty stable, and cache=True made sense. However, if this function is to be used in io.ascii, io.votable, and eventually io.fits, I think it makes sense to disable the cache by default, as I think it should be used only if users understand the implications, and have to read the docs to activate the caching. So I propose that get_data_fileobj have cache=False by default.
  • The current functions were designed to get files internal to the astropy package, hence the calls to _find_pkg_data_path. This should probably be taken out of the main functions - i.e. I think io.ascii and other sub-packages should be calling a function that does not make that kind of assumption - so maybe we could have get_internal_data_fileobj or an internal=True/False argument or something like that? What do you think? I think that by default, we should just look at the file location relative to the user's location, not the module's.
  • Where should all the actual cache-managing functions go? astropy.config.data, astropy.config.cache, astropy.utils.data, or astropy.utils.cache?
  • The current functions make assumptions about filenames starting with hash, which should only apply to internal files (i.e. for internal files, we disallow hash/<some_hash> as a valid filename), but we should not make the assumption that the user is not reading valid files from a hash directory.

I guess my main point is that there are some things that are specific to getting files internal to astropy, but if we want to create a generic I/O function, it might be better to keep the internal-specific stuff in a separate layer that can wrap around the generic I/O function (in other words, the 'core' function could be astropy-agnostic apart from the caching, and all the rest would be in wrapping functions). What do you think? How about having

get_data_fileobj
get_internal_data_fileobj -> adds the _find_pkg_data_path call and `hash/` handling
get_filename_fileobj
get_internal_filename_fileobj -> adds the _find_pkg_data_path call and `hash/` handling

?

@mdboom, @taldcroft, @iguananaut - do you have any thoughts about this too?

@mdboom
Contributor

mdboom commented Oct 17, 2012

@eteq: I agree. I see this function in this PR as being the low-level one that handles compression and URLs (and eventually newline handling since it looks like we have to do that ourselves once compression is involved). The existing config/data functions that do caching and module-relative lookup could be updated to use this function internally. (I think, unless I'm missing a detail, which possibly I am).

@eteq
Member

eteq commented Oct 18, 2012

@astrofrog - answering your questions/comments in-order:

  • What I was suggesting earlier is that we should cache data downloaded via io.ascii and io.votable. If I download a data file from online that has a fixed URL, why should I have to download it again just because it's a table?

    It might make sense to have a finite time limit on these, though - that's a feature that is not in the get_data_* functions, but we talked about it previously (and at the coordination meeting), so it's definitely something we should add.

  • The current implementations were designed to be used either for internal or external files - I tried to tune the implementation to have no clear performance difference between the two (at least when it was written, admittedly our packages have grown since then).

    That said, I see your point that users probably expect it to start in the current working directory, and the data in the name does seem to imply that it's not for external files. When I look closer at the get_data_* functions, the only ones that really make sense for external data are get_data_fileobj and get_data_contents. So perhaps just factor the actual file retrieval out of those two, and add get_external_fileobj and get_external_contents? (Or maybe just get_fileobj and get_contents? If these are all moved to astropy.utils.data or something, the meaning is fairly unambiguous.) Those would then not ever search the data directories/server, but would either reference the local filesystem (if not URLs) or a web page (if a URL) - and then we'd want to remove that functionality (arbitrary URLs) from the get_data_* family.

  • I think I like astropy.utils.data. It was in config originally mostly just because it was the first example of where we'd want to use a bunch of configuration items. These could also really use their own narrative documentation page - I'm happy to write that if you like.

  • Good point - if you like my proposal from the second bullet point, just drop all the hash stuff for the external versions.

So I guess I'm mostly in agreement to your proposal, but I don't think we need both get_data_fileobj and get_internal_data_fileobj - the whole point of the data frameworks was that it's transparent whether the data files are internal or external (on the data server). That is, the "data" part indicates that you're asking for either an astropy internal data file, or something on the astropy server.

So to be completely clear, I'm suggesting that instead the current get_data_* functions should remain, but should lose the ability to access arbitrary URLs - they should only be for files that are internal to the package or available on the astropy data server. We would then add two functions corresponding to get_data_fileobj and get_data_contents, but with names that make it clear they are for arbitrary file download/access. They all share the same internals for accessing URLs online, and we also enhance the cache so that one can provide an optional expiration time. (And we move all of it to astropy.utils.data.)

How does that sound? And @astrofrog, do you want to do this all, or should we split the effort somehow?

@astrofrog
Member Author

@eteq - I agree with most of what you said - we should have two core functions get_fileobj and get_contents that get used by e.g. io.ascii, io.votable, etc., and those can support optional caching (more on that below). The get_data_fileobj and get_data_contents implement the astropy-specific stuff and call get_fileobj and get_contents behind the scenes.

Regarding the cache, I think it should be available for users who want it, but should be disabled by default. It's easier to explain to users that they can use the cache if they need to improve performance (which allows us to give them information about timeout, etc. in the docs), than to have users who don't understand why the table they read still has an old version when the online one has changed. I do agree the cache has to be available as an option at the low level. Note that if you want, the get_data_* methods could use the cache, since we control the content, and so we can be more careful about not providing dynamic data at fixed URLs.

We can even control the default caching strategy via the config file, defaulting to False (in my view).

@mdboom @iguananaut - what do you think about the cache stuff?

@eteq - in the meantime, I'll get to work on this - the cache thing is easy to change whichever way we decide.

@eteq
Member

eteq commented Oct 20, 2012

@astrofrog - ok, what you suggest sounds good. I would say yes that the data functions should use the cache by default (that was always the intent), but you're right it makes sense to have the user-level functions' default controlled by a configuration option.

Are you also thinking at the same time you'll implement a timeout for the cache? If so, just let me know if you need any clarification regarding the way I was storing the cache metadata on-disk (using shelve). I remember at the time thinking I was starting to confuse myself with how I was doing it...

@astrofrog
Member Author

@eteq - do you think the cache specific code should stay in config, e.g. under config.cache? Or should it go in utils too? (and if so, maybe I can separate it out into data.cache).

@astrofrog
Member Author

@eteq - I'm currently working on this, and I was thinking, get_data_fileobj sounds a bit too generic, because users want to read data. How about using get_package_fileobj which is more accurate (it's data for the package, whether local or remote)? Is that ok?

@astrofrog
Member Author

(or I could use pkg_data instead of package, since you use that already for one of the functions)

@astrofrog
Member Author

And just another question for @eteq - should get_data_fileobj also behave like a context manager? (if it relies on get_fileobj, that happens by default, but just want to make sure that's ok). I know that this means updating a lot of tests that use get_data_fileobj, but I'm fine with that if we want to go that way.

EDIT: only a few tests use get_data_fileobj, so it's easy to convert the few that treat it like a plain function to context-manager notation.

@eteq
Member

eteq commented Oct 21, 2012

@astrofrog - I don't think it's that important where the cache stuff ends up, as it's probably not generally useful beyond the data infrastructure. So if all other things are equal, probably astropy.utils.data.cache makes the most sense, as then it's close to data.

I like pkg_data - definitely "data" has to be in there to distinguish it from just random downloads of anything, but you're right that data on its own is rather generic.

Hmm... Now that you mention it, if you're feeling up to it, it would make sense for get_data_fileobj to work as a context manager. You could make it work just like open, though, right? That is, it can work as a regular function or as a context manager? Then you don't need to change the existing tests (just add new ones).

(Oh, btw, @astrofrog, be sure to run the tests with the --remotedata or -r or whatever option, as well as using the mark for it in any new tests you write.)

@astrofrog
Member Author

@eteq - ok, I'm making progress on this, but it's not fully functional yet (just pushed what I have so far, but still needs a lot of work).

I have another question: should get_data_fileobj be able to retrieve an arbitrary URL, or is there any point in allowing it to do so, since one could just use get_fileobj for that?

Regarding the context manager thing, I think that once you write it as a context manager, you can't use it as a normal function, but I might be wrong about that. I will check.

@eteq
Member

eteq commented Oct 22, 2012

@astrofrog - I would agree we don't allow get_data_fileobj to do arbitrary URLs - only the internal data files and data.astropy.org (even though we haven't implemented that yet). Then the distinction is more clear on when to use one over the other.

As for the context manager, you definitely can - in fact, now that I look at it, the current get_data_fileobj already supports it. You can look more closely to see how it works, but the basic idea is this: if it's a regular file, it just does return open(...) - that works as a context manager because with open(...) as f and with somefunc(...) as f are the same if all that somefunc does is return open(...).

For URLs, it's a bit more complicated. In Py 2.x (but not 3.x), urlopen(...) yields a file-like object, but it hasn't implemented __enter__ or __exit__, which are the two requirements for a context manager. If you look at the code, I get around this by just attaching such methods manually to the result of urlopen if it's on py 2.x
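The stdlib's general-purpose version of that trick is contextlib.closing, which wraps any object that has a close() method (and on Python 3 the object returned by urlopen supports with natively). Sketched here with an in-memory stand-in for the urlopen result:

```python
from contextlib import closing
import io

# closing() turns any object with a .close() method into a context manager,
# the general-purpose version of attaching __enter__/__exit__ by hand
response = io.BytesIO(b'payload')  # stand-in for a urlopen() result
with closing(response) as fd:
    data = fd.read()

print(response.closed)  # True: closing() called close() on exit
```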

@astrofrog
Member Author

I think I've managed to fix all the tests on Windows. The issue was the usual one that a NamedTemporaryFile cannot be read on win32 unless it is closed first, so I do that and then delete the file manually at the end of the context manager. Looks ok?
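The portable pattern described here looks roughly like this:

```python
import os
import tempfile

# create with delete=False so the file survives being closed...
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'1 2 3\n')
tmp.close()  # ...because win32 refuses to re-open a file that is still open

try:
    with open(tmp.name, 'rb') as fh:
        data = fh.read()
finally:
    os.remove(tmp.name)  # manual cleanup, since delete=False was used
```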

Also, rather than strip the Windows \r, I just used splitlines instead of split('\n') (just for the record, in case we run into this elsewhere in Astropy).
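For the record, the difference:

```python
windows_text = 'a b c\r\n1 2 3\r\n'

# splitlines understands \n, \r\n and \r; split('\n') leaves a stray \r
assert windows_text.splitlines() == ['a b c', '1 2 3']
assert windows_text.split('\n') == ['a b c\r', '1 2 3\r', '']
```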

By the way, to try and fix this I set up a virtual machine with Windows 2008 Server 64-bit on AWS, and to my surprise, could not reproduce these issues! So it seems the NamedTemporaryFile issue is a win32-specific thing.

I've also rebased. Let's see what Travis says!

@astrofrog
Member Author

Travis passes, and the tests pass on MacOS X, Linux, and Windows! :-)

@mdboom and @eteq - I'll wait for your go-ahead before merging.

@mdboom
Contributor

mdboom commented Nov 12, 2012

That Windows fix seems reasonable. Probably that's the better thing to do in any case.

Looks good to me.

@eteq
Member

eteq commented Nov 12, 2012

Looks good to me as well. I'll let you do the honors, @astrofrog :)

@astrofrog
Member Author

Thanks! I made a last minute commit, and will merge tomorrow morning if Travis passes (it should).

astrofrog added a commit that referenced this pull request Nov 13, 2012
Added support for compressed files in astropy.io.ascii
@astrofrog astrofrog merged commit 4af1f9c into astropy:master Nov 13, 2012
@pllim
Member

pllim commented Nov 13, 2012

Hi. I just updated my fork to reflect the latest master, which is a result of this merge, and it broke the conesearch code I inherited from @mdboom .

from astropy.utils.data import get_readable_fileobj
url = 'http://stsdas.stsci.edu/astrolib/vo_databases/conesearch_simple.json'
fd = get_readable_fileobj(url)
import json
tree = json.load(fd)

Error message::

ERROR: AttributeError: 'GeneratorContextManager' object has no attribute 'read' [json]
AttributeError                            Traceback (most recent call last)
<ipython-input-10-187213fb3901> in <module>()
----> 1 tree = json.load(fd)

/usr/stsci/pyssgdev/Python-2.7.3/lib/python2.7/json/__init__.pyc in load(fp, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    272 
    273     """
--> 274     return loads(fp.read(),
    275         encoding=encoding, cls=cls, object_hook=object_hook,
    276         parse_float=parse_float, parse_int=parse_int,

AttributeError: 'GeneratorContextManager' object has no attribute 'read'

Any advice on the fix? I am using get_pkg_data_fileobj for its caching ability, as suggested by @mdboom a while ago. In the example above, I use get_readable_fileobj, so you can run it from the command line and see for yourself.

Thanks,
Pey-Lian

@astrofrog
Member Author

@pllim - you need to now use get_readable_fileobj as a context manager, so in your example:

from astropy.utils.data import get_readable_fileobj
url = 'http://stsdas.stsci.edu/astrolib/vo_databases/conesearch_simple.json'
with get_readable_fileobj(url) as fd:
    import json
    tree = json.load(fd)

Can you confirm whether this works?

@pllim
Member

pllim commented Nov 14, 2012

@astrofrog , thanks for the help, it works indeed. Turns out I need get_readable_fileobj after all, not get_pkg_data_fileobj as I previously stated. There is one more question about Python 3 (I cannot test it because my work machine does not have it installed) -- will something like this still work with this context manager?

Sample code::

from astropy.utils.data import get_readable_fileobj
url = 'http://stsdas.stsci.edu/astrolib/vo_databases/conesearch_simple.json'
with get_readable_fileobj(url, cache=True) as fd:
    import io
    import json
    wrapped_fd = io.TextIOWrapper(fd, 'utf8')
    try:
        tree = json.load(wrapped_fd)
    finally:
        fd.close()

@mdboom
Contributor

mdboom commented Nov 14, 2012

This looks fine, but you don't need to use the TextIOWrapper. The details of how that's done vary between Python 2 and 3, so we've put it inside of get_readable_fileobj. Just pass encoding='utf8' to get_readable_fileobj and then you don't need the wrapped_fd.

Also, because you're using the with statement, you don't need to explicitly close the fd object -- that is the purpose of the with statement and context manager: to handle closing the file properly for you.

@pllim
Member

pllim commented Nov 14, 2012

Thanks, @mdboom ! That cleaned up the affected function considerably. Great work, everyone!

@embray
Member

embray commented Nov 14, 2012

I assume this means I don't have to do anything with this anymore?

@embray
Member

embray commented Nov 14, 2012

Oh yeah, and I think this needs a changelog entry.

@astrofrog
Member Author

@iguananaut - indeed! The only thing you might want to do is see if you want to make use of the function in astropy.io.fits so that users can benefit from the same functionality (auto-downloading URLs, for example). If you don't want to do that now, maybe you could just open an issue for the future (e.g. 0.3).

@astrofrog
Member Author

@iguananaut - good point about the changelog, I'll do it.

@eteq eteq mentioned this pull request Nov 16, 2012
keflavich pushed a commit to keflavich/astropy that referenced this pull request Oct 9, 2013
keflavich pushed a commit to keflavich/astropy that referenced this pull request Oct 9, 2013
Added support for compressed files in astropy.io.ascii
@astrofrog astrofrog deleted the io/ascii-compressed branch July 5, 2016 15:49
jeffjennings pushed a commit to jeffjennings/astropy that referenced this pull request Jul 2, 2025