@eteq
Member

@eteq eteq commented Nov 19, 2012

As discussed in #481, turning on the --remote-data option in the tests causes the tests to all fail on travis. This is probably some permissions issue that may or may not be fixable. The error messages can be viewed at https://travis-ci.org/astropy/astropy/builds/3229836

@eteq
Member Author

eteq commented Nov 19, 2012

@mdboom
Contributor

mdboom commented Nov 19, 2012

I think you explicitly need to specify the encoding as utf-8 on the line that tries to decode the Google home page.

@mdboom
Contributor

mdboom commented Nov 19, 2012

Or, alternatively, don't decode and search for bytes rather than string, i.e. b'oogle</title>'
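For illustration, a minimal Python 3 sketch of the two options suggested above: decode with an explicit encoding, or skip decoding and search the raw bytes. The sample bytes and the helper names are made up for the example; they are not the real Google page or anything in the astropy test suite.

```python
# Hypothetical stand-in for the bytes read back from the downloaded file.
page = b"<html><head><title>Google</title></head><body></body></html>"


def contains_title_decoded(raw, encoding="utf-8"):
    """Option 1: decode with an explicit encoding, then search the text."""
    return "oogle</title>" in raw.decode(encoding)


def contains_title_bytes(raw):
    """Option 2: never decode; search for a bytes literal instead."""
    return b"oogle</title>" in raw


print(contains_title_decoded(page))
print(contains_title_bytes(page))
```

The bytes search has the advantage that it cannot raise a decoding error, whatever encoding the server happens to use.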

@eteq
Member Author

eteq commented Nov 19, 2012

oh, tricky... but why would that be different on travis' machines?

@eteq
Member Author

eteq commented Nov 19, 2012

I attached code just so travis will actually run as changes happen on this branch - this is not ready to be merged, though.

@mdboom
Contributor

mdboom commented Nov 19, 2012

The default encoding is platform- and user-specific. I'm not sure of the details of what Travis is using, but one should never depend on it being the same across different systems.
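A quick way to see what a given machine reports is to print the interpreter's encoding settings. This is only a sketch, written for Python 3; `encoding_report` is a hypothetical helper name. Note that on Python 3 `bytes.decode()` with no argument uses UTF-8 regardless of these values, whereas on Python 2 `str.decode()` falls back to `sys.getdefaultencoding()`.

```python
import locale
import sys


def encoding_report():
    """Collect the encoding settings that can differ between machines."""
    return {
        # Codec used for implicit str/bytes conversions (Python 2: often 'ascii').
        "default": sys.getdefaultencoding(),
        # Encoding derived from the user's locale settings.
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding of the terminal or pipe attached to stdout, if any.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_report().items():
    print(name, "=", value)
```

Running this locally and inside a CI job is one way to check whether the two environments actually differ.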

@astrofrog
Member

I think Travis runs on contributed/distributed machines, which would explain this kind of issue (maybe?)

@eteq
Member Author

eteq commented Nov 22, 2012

@mdboom - as you can see here I tried a variety of different approaches, and all were unsuccessful (see the travis builds for the commits above). Or did I misunderstand what you were suggesting?

@astrofrog
Member

@eteq - just out of curiosity, do things work if you include @mdboom's recent PR (#539) which fixes some encoding-related bugs?

@mdboom
Contributor

mdboom commented Dec 11, 2012

@eteq: Sorry I missed your question from a few weeks ago... Let's confirm first that #539 doesn't solve this, and if not, I'll have another look. I think @astrofrog is right that it's probably somehow related.

@eteq
Member Author

eteq commented Dec 12, 2012

Still failing... @mdboom did anything from #539 give any insights here? I had to wipe the other commits to do the rebase, but I basically tried all combinations of 'utf-8' and 'ascii' with encode and decode...

@astrofrog
Member

I managed to reproduce the issue locally! Will see if I can come up with a fix.

@mdboom
Contributor

mdboom commented Dec 12, 2012

I was also able to reproduce locally, and the following fixes it for me:

--- a/astropy/utils/tests/test_data.py
+++ b/astropy/utils/tests/test_data.py
@@ -199,7 +199,7 @@ def test_data_noastropy_fallback(monkeypatch, recwarn):
     #now try with no cache
     fnnocache = data.download_file(TESTURL, cache=False)
     with open(fnnocache, 'rb') as googlepage:
-        assert googlepage.read().decode().find('oogle</title>') > -1
+        assert googlepage.read().decode('utf8').find('oogle</title>') > -1

     #no warnings should be raise in fileobj because cache is unnecessary
     assert len(recwarn.list) == 0

If that isn't working for others, maybe Google is serving the page in a different encoding in different contexts? If that's the case, we probably need to be using a different reference URL (perhaps one we control on astropy.org?)

@astrofrog
Member

Yeah, I'm in Germany so maybe that's why I get the error. I agree we should just use http://www.astropy.org instead.

@astrofrog
Member

By the way, I still have issues even after @mdboom's suggested fix. The error is then:

    def decode(input, errors='strict'):
>       return codecs.utf_8_decode(input, errors, True)
E       UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 7133: invalid start byte

I did a print repr(googlepage.read()) and got: http://pastebin.com/cbqmRiVm

It turns out "I'm feeling lucky" in German doesn't decode to UTF8 ;-)

In [8]: "Auf gut Gl\xfcck!".decode('utf8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/Volumes/Raptor/<ipython-input-8-ea8915abb885> in <module>()
----> 1 "Auf gut Gl\xfcck!".decode('utf8')

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 10: invalid start byte

Anyway, this is probably a good reason to just switch to using the Astropy website.
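The same failure can be reproduced in Python 3 without any network access. The sketch below assumes the page bytes were ISO-8859-1 encoded, which is consistent with the `0xfc` byte in the traceback; `try_decode` is a hypothetical helper.

```python
# "Auf gut Glück!" encoded as ISO-8859-1: the u-umlaut becomes the single byte 0xfc.
raw = "Auf gut Gl\u00fcck!".encode("iso-8859-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        print("decoding as", encoding, "failed:", exc.reason)
        return None


print(repr(raw))
print(try_decode(raw, "utf-8"))       # 0xfc is not a valid UTF-8 start byte
print(try_decode(raw, "iso-8859-1"))  # every byte is valid in ISO-8859-1
```

With ASCII-only content the two encodings produce identical bytes, which would explain why the English page decodes fine either way.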

@astrofrog
Member

Ah now this is interesting - it looks like the issue is that we must already be converting to UTF8 beforehand:

In [13]: "Glück".decode('utf8')
Out[13]: u'Gl\xfcck'

which is what's in my string before we try and decode it. Is this a bug?

@astrofrog
Member

Quick update - it seems that in this case, urllib is returning output that's already in UTF8, hence the issue I'm seeing. We might want to think of a fix for this, because if we ever put a non-ascii character on e.g. the Astropy homepage, things won't work anymore. Just out of curiosity, why are we calling decode at all?

@mdboom
Contributor

mdboom commented Dec 12, 2012

I see. Indeed, it looks like Google is serving iso-8859-1, not utf-8 at least for me... and with only English on the page (for me) the two are equivalent, but when serving German (based on your IP, I assume) the two are not equivalent.

We may not need the decode if we search for bytes rather than a unicode string, i.e.:

b'oogle</title>'

But ideally, we'd do this against a file we include on our website, as Google could change the encoding of their page at any time. (We could also be more robust to that by reading the "Content-Type" header, but as it stands now, download_file is completely encoding agnostic, and it should probably be kept that way).
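As a rough sketch of the "Content-Type" idea (not something `download_file` does), the charset can be pulled out of a header value with the standard library's `email.message.Message`. The function name `charset_from_content_type` and the sample header strings are assumptions for illustration only.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or `default` if absent.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("image/png", default="(none)"))
```

In Python 3 the response returned by `urllib.request.urlopen` exposes the same method via `response.headers.get_content_charset()`, so a test could decode with whatever the server declares rather than guessing.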

Also, we should add a test that downloads a binary file (e.g. a PNG file or something) just to make sure it works. At present it does, but there's enough potential for accidentally introducing a codec in download_file that we should watch out for problems.
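A binary round-trip check along these lines might look like the following sketch. It uses a `file://` URL and a hypothetical `fetch_bytes` helper built on `urllib.request` so that it runs offline; a real test would call `download_file` against a file hosted on the project website instead.

```python
import tempfile
import urllib.request
from pathlib import Path

# The 8-byte PNG signature contains bytes that are invalid in ASCII and UTF-8.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def fetch_bytes(url):
    """Hypothetical stand-in for a download helper: return the raw bytes at url."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def roundtrip_binary(payload):
    """Write payload to a temporary file, fetch it via a file:// URL, return the bytes."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "sample.bin"
        path.write_bytes(payload)
        return fetch_bytes(path.as_uri())


# Cover every possible byte value so any accidental codec would corrupt something.
payload = PNG_SIGNATURE + bytes(range(256))
print(roundtrip_binary(payload) == payload)
```

Comparing the bytes exactly, rather than searching for a substring, is what would catch a codec slipping into the download path.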

@eteq
Member Author

eteq commented Dec 15, 2012

I see your points @astrofrog and @mdboom - the main reason I opted for google is that my general theory is that google is probably one of the most "reachable" sites, with the best up-time of anywhere. I was trying to avoid the possibility of the tests failing because the test site was down rather than because of a problem with the code. That said, I see your points about the locality issues and I can't think of any other consistent way to deal with it.

(@mdboom - I tried the b'oogle</title>' trick earlier but it didn't seem to work... and anyway, isn't that the same as 'oogle</title>' in py 2.x ? I admit string encoding/decoding is something that has often confused me, though...)

So we could add a (small) page along the lines of http://www.astropy.org/test.html? And perhaps a very small binary file (like a zipped short text file or something) as http://ww.astropy.org/test.tgz? Once those are up I could update the test appropriately.

@mdboom
Contributor

mdboom commented Dec 17, 2012

Yes, b'oogle</title>' is the same as 'oogle</title>' in Python 2.x, but not when run through 2to3: the former stays a byte string, whereas the latter gets converted to a unicode string.
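To make the difference concrete, here is a small Python 3 sketch (the sample bytes and the `search` helper are made up for the example). In Python 3 the two literals have different types, and searching bytes for a `str` raises a `TypeError` instead of silently working.

```python
# Hypothetical page content, as raw bytes read from a file opened with 'rb'.
page = b"<html><title>Google</title></html>"


def search(haystack, needle):
    """Return the find() result, or a message if the types cannot be mixed."""
    try:
        return haystack.find(needle)
    except TypeError as exc:
        return "TypeError: " + str(exc)


print(type(b"oogle</title>").__name__)  # bytes literal stays bytes
print(type("oogle</title>").__name__)   # plain literal is text in Python 3
print(search(page, b"oogle</title>") > -1)
print(search(page, "oogle</title>"))
```

So a test that reads the file in binary mode needs the `b''` prefix on the search string once it runs under Python 3.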

Yes -- I'm all for adding those files at the top level of the website.

@eteq
Member Author

eteq commented Feb 6, 2013

I think I got to the bottom of this and have some solutions... but I think it's best approached separately (switching to a different test URL and the underlying problem here with encoding), so I'm going to close this with the intent that it get replaced by #734 and #735
