download_file does not do anything about encodings #735

@eteq

This is prompted by #491, where it was discovered that our --remote-data tests were not working because Travis fetches http://www.google.com from somewhere that serves it in iso-8859-1 encoding (and includes the ü character). astropy.utils.data.download_file had no way of knowing this, so it assumed ascii and failed to decode the page. #734 will solve this particular problem by switching us to a web page that will always have consistent encoding.

However, the underlying problem (?) remains: download_file doesn't look at the HTTP response headers. The "content-type" header carries the character set information for google's page, so when the download is finished and the url request is discarded, that information is lost. To make things even more annoying, there are actually a bunch of different ways a file can specify its encoding: 3 different syntaxes that live in the HTML itself (for regular HTML, HTML5, and XHTML), in addition to the "content-type" header, which some servers use and some don't.
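To make the "content-type" part concrete, here is a minimal sketch of pulling the character set out of a header value using only the standard library. The function name `charset_from_content_type` is made up for illustration and is not part of astropy; it only handles the header, not the in-HTML declarations.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper for illustration; not an astropy function.
    Returns `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value it finds.
    return msg.get_content_charset(default)


charset = charset_from_content_type("text/html; charset=ISO-8859-1")
text = b"M\xfcnchen".decode(charset)  # 0xFC is ü in iso-8859-1
```

With `urllib`, the header value would have to be read from the response object before it is closed, which is exactly the step download_file skips today.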

I can imagine a few solutions: we could add keywords to download_file that one could use to decode the downloaded file (unless 'b' is in the mode). Or we could try to automatically decide when this needs to happen (based on reading "content-type" and perhaps the content itself). Or we can just leave it alone and say it's up to the user to know what the encoding is. (Which I imagine will almost always be true in typical astronomy data cases.)
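A rough sketch of how the first two options could combine, assuming an explicit keyword wins, then a charset taken from the header, then a crude look at the content, then ascii as today. Everything here (`decode_payload`, its `encoding`, `header_charset`, and `fallback` parameters, and the regex) is hypothetical and not existing astropy API; the regex only catches the `<meta ...>` forms and ignores the XHTML `<?xml ... encoding=...?>` declaration.

```python
import re

# Crude pattern covering <meta charset="..."> and
# <meta http-equiv=... content="...; charset=...">.
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)',
                           re.IGNORECASE)


def decode_payload(raw, encoding=None, header_charset=None, fallback="ascii"):
    """Decode downloaded bytes; hypothetical helper, not astropy API.

    Precedence: explicit `encoding`, then `header_charset` (from the
    "content-type" header), then a <meta> tag in the first 1024 bytes,
    then `fallback`.
    """
    if encoding is None:
        encoding = header_charset
    if encoding is None:
        match = _META_CHARSET.search(raw[:1024])
        if match:
            encoding = match.group(1).decode("ascii")
    return raw.decode(encoding or fallback)


page = b'<html><head><meta charset="iso-8859-1"></head><body>\xfc</body></html>'
decoded = decode_payload(page)
```

With no hint at all, the call still fails on non-ascii bytes with a UnicodeDecodeError, which matches the third option of leaving it up to the user.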

P.S. there is now a file at http://www.astropy.org/_static/test_encoding.html that's encoded in iso-8859-1 with ü in it. That can be used to create tests if we actually do anything about this.
