Get URL info (file size, Content-Type, etc.)
Problem
You have a URL and you want to get some information about it. For instance, you want to figure out the content type (text/html, image/jpeg, etc.) or the file size of the resource without actually downloading it.
Solution
Let’s see an example with an image. Consider the URL http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg.
#!/usr/bin/env python

import urllib

def get_url_info(url):
    d = urllib.urlopen(url)
    return d.info()

url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
print get_url_info(url)
Output:
Date: Mon, 18 Oct 2010 18:58:07 GMT
Server: Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_fastcgi/2.4.6
X-Powered-By: Zope (www.zope.org), Python (www.python.org)
Last-Modified: Thu, 08 Nov 2007 09:56:19 GMT
Content-Length: 103984
Accept-Ranges: bytes
Connection: close
Content-Type: image/jpeg
That is, the size of the image is 103,984 bytes and its content type is indeed image/jpeg.
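The header block above is plain RFC 822-style text, which is why d.info() supports dictionary-style lookups. You can see the same mechanism offline by feeding a captured header block to the stdlib email parser (a sketch using values copied from the output above):

```python
from email import message_from_string

# A few header lines copied from the output above
raw_headers = (
    'Date: Mon, 18 Oct 2010 18:58:07 GMT\n'
    'Content-Length: 103984\n'
    'Content-Type: image/jpeg\n'
)

headers = message_from_string(raw_headers)
# lookups are case-insensitive, just like with d.info()
size = int(headers['Content-Length'])
print(size)                      # 103984
print(headers['content-type'])   # image/jpeg
```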
In the code, d.info() returns an object that behaves like a dictionary, so extracting a specific field is very easy:
#!/usr/bin/env python

import urllib

def get_content_type(url):
    d = urllib.urlopen(url)
    return d.info()['Content-Type']

url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
print get_content_type(url)    # image/jpeg
This post is based on this thread.
Update (20121202)
With requests:
>>> import requests
>>> from pprint import pprint
>>> url = 'http://www.geos.ed.ac.uk/homes/s0094539/remarkable_forest.preview.jpg'
>>> r = requests.head(url)
>>> pprint(r.headers)
{'accept-ranges': 'none',
'connection': 'close',
'content-length': '103984',
'content-type': 'image/jpeg',
'date': 'Sun, 02 Dec 2012 21:05:57 GMT',
'etag': 'ts94515779.19',
'last-modified': 'Thu, 08 Nov 2007 09:56:19 GMT',
'server': 'Apache/2.0.63 (Unix) mod_ssl/2.0.63 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_fastcgi/2.4.6',
'x-powered-by': 'Zope (www.zope.org), Python (www.python.org)'}
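Two things worth noting about r.headers: lookups are case-insensitive (requests stores headers in a CaseInsensitiveDict), and the values are strings, so Content-Length needs an explicit conversion before doing arithmetic with it. A quick offline check, assuming requests is installed, with values copied from the output above:

```python
from requests.structures import CaseInsensitiveDict

# Header values copied from the output above
headers = CaseInsensitiveDict({'Content-Length': '103984',
                               'Content-Type': 'image/jpeg'})

# the same header can be looked up in any case
assert headers['content-type'] == headers['Content-Type']

# header values are strings; convert before arithmetic
size = int(headers['content-length'])
print('%.1f KiB' % (size / 1024.0))    # 101.5 KiB
```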
Check if a URL exists
Problem
You want to check if a URL exists without actually downloading the given file.
Solution
Update (20120124): There was something wrong with my previous solution; it didn’t work correctly. Here is the revised version.
import httplib
import urlparse

def get_server_status_code(url):
    """
    Download just the header of a URL and
    return the server's status code.
    """
    # http://stackoverflow.com/questions/1140661
    host, path = urlparse.urlparse(url)[1:3]    # elems [1] and [2]
    try:
        conn = httplib.HTTPConnection(host)
        conn.request('HEAD', path)
        return conn.getresponse().status
    except StandardError:
        return None

def check_url(url):
    """
    Check if a URL exists without downloading the whole file.
    We only check the URL header.
    """
    # see also http://stackoverflow.com/questions/2924422
    good_codes = [httplib.OK, httplib.FOUND, httplib.MOVED_PERMANENTLY]
    return get_server_status_code(url) in good_codes
Tests:
assert check_url('http://www.google.com') # exists
assert not check_url('http://simile.mit.edu/crowbar/nothing_here.html') # doesn't exist
We only get the header of a given URL and we check the response code of the web server.
Update (20121202)
With requests:
>>> import requests
>>>
>>> url = 'http://hup.hu'
>>> r = requests.head(url)
>>> r.status_code
200    # requests.codes.OK
>>> url = 'http://www.google.com'
>>> r = requests.head(url)
>>> r.status_code
302    # requests.codes.FOUND
>>> url = 'http://simile.mit.edu/crowbar/nothing_here.html'
>>> r = requests.head(url)
>>> r.status_code
404    # requests.codes.NOT_FOUND
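The numeric constants annotated in the session above are available as attributes of requests.codes, so the earlier good_codes list can be written without magic numbers (assuming requests is installed):

```python
import requests

# symbolic names for the status codes seen in the session above
good_codes = (requests.codes.OK,
              requests.codes.FOUND,
              requests.codes.MOVED_PERMANENTLY)
print(good_codes)    # (200, 302, 301)
```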
