Archive
table2csv
Problem
I wanted to extract a table from an HTML page and import it into Excel, so I needed it in a format such as CSV.
Solution
table2csv can do exactly this. Visit the project’s page on GitHub for examples.
Note that I could only make it work under Python 2.7.
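If table2csv is not an option (for example under Python 3), the same idea can be sketched with the standard library alone. This is only an illustration, not how table2csv works internally. The names TableParser and table_to_csv are made up for this example, and the sketch assumes a single well-formed table without nested tables or rowspan/colspan.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell texts of every <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = ("<table><tr><th>name</th><th>age</th></tr>"
          "<tr><td>Jane</td><td>30</td></tr></table>")
print(table_to_csv(sample))
```

The csv module takes care of quoting, so cells that contain commas or quotes should still open correctly in Excel.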
remove tags from HTML
Problem
You have an HTML string and you want to remove all the tags from it.
Solution
Install the package “bleach” via pip. Then:
>>> import bleach
>>> html = "Her <h1>name</h1> was <i>Jane</i>."
>>> cleaned = bleach.clean(html, tags=[], attributes={}, strip=True)
>>> html
'Her <h1>name</h1> was <i>Jane</i>.'
>>> cleaned
'Her name was Jane.'
Tip from here.
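If installing bleach is not possible, a rough equivalent can be sketched with Python 3's standard html.parser module. The names TextOnly and strip_tags are made up for this example. Note the assumption: this only drops the tags and keeps all text, including the contents of script and style elements, so it is not a sanitizer for untrusted HTML the way bleach is.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text outside of tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("Her <h1>name</h1> was <i>Jane</i>."))
```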
get the title of a web page
Problem
You need the title of a web page.
Solution
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)
I found the solution here.
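For comparison, here is a standard-library sketch (Python 3) that does roughly the same without BeautifulSoup. TitleParser and get_title are hypothetical names chosen for this example, and it assumes the page is reasonably well formed. It returns None when no title element is found.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Remember the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def get_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


page = "<html><head><title> Proba </title></head><body><h1>Hi</h1></body></html>"
print(get_title(page))
```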
Jinja2 example for generating a local file using a template
Here I want to show you how to generate an HTML file (a local file) using a template with the Jinja2 template engine.
Python source (proba.py)
#!/usr/bin/env python
import os
from jinja2 import Environment, FileSystemLoader

PATH = os.path.dirname(os.path.abspath(__file__))
TEMPLATE_ENVIRONMENT = Environment(
    autoescape=False,
    loader=FileSystemLoader(os.path.join(PATH, 'templates')),
    trim_blocks=False)


def render_template(template_filename, context):
    return TEMPLATE_ENVIRONMENT.get_template(template_filename).render(context)


def create_index_html():
    fname = "output.html"
    urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
    context = {
        'urls': urls
    }
    #
    with open(fname, 'w') as f:
        html = render_template('index.html', context)
        f.write(html)


def main():
    create_index_html()

########################################

if __name__ == "__main__":
    main()
Jinja2 template (templates/index.html)
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Proba</title>
</head>
<body>
<center>
<h1>Proba</h1>
<p>{{ urls|length }} links</p>
</center>
<ol align="left">
{% set counter = 0 -%}
{% for url in urls -%}
<li><a href="{{ url }}">{{ url }}</a></li>
{% set counter = counter + 1 -%}
{% endfor -%}
</ol>
</body>
</html>
Resulting output
If you execute proba.py, you will get this output:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Proba</title>
</head>
<body>
<center>
<h1>Proba</h1>
<p>3 links</p>
</center>
<ol align="left">
<li><a href="http://example.com/1">http://example.com/1</a></li>
<li><a href="http://example.com/2">http://example.com/2</a></li>
<li><a href="http://example.com/3">http://example.com/3</a></li>
</ol>
</body>
</html>
You can find all these files here (GitHub link).
Prettify HTML with BeautifulSoup
With the Python library BeautifulSoup (BS), you can extract information from HTML pages very easily. However, there is one thing you should keep in mind: HTML pages are often malformed. BS tries to correct the page, which means that BS's internal representation can differ slightly from the original source. Thus, when you want to locate a part of an HTML page, you should work with the internal representation.
The following script takes the URL of an HTML page and prints the page in corrected form, i.e. it shows how BS stores the given page. You can also use it to prettify the source:
#!/usr/bin/env python
# prettify.py
# Usage: prettify <URL>
import sys
import urllib
from BeautifulSoup import BeautifulSoup


class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'


def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)
    text = page.read()
    page.close()
    soup = BeautifulSoup(text)
    return soup.prettify()
# process(url)


def main():
    if len(sys.argv) == 1:
        print "Jabba's HTML Prettifier v0.1"
        print "Usage: %s <URL>" % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    print process(sys.argv[1])
# main()

if __name__ == "__main__":
    main()
You can find the latest version of the script at https://github.com/jabbalaci/Bash-Utils.
Create a temporary file with unique name
Problem
I wanted to download an HTML file with Python, store it in a temporary file, then convert this file to PDF by calling an external program.
Solution #1
#!/usr/bin/env python
import os
import tempfile

temp = tempfile.NamedTemporaryFile(prefix='report_', suffix='.html', dir='/tmp', delete=False)
html_file = temp.name
(dirName, fileName) = os.path.split(html_file)
fileBaseName = os.path.splitext(fileName)[0]
pdf_file = dirName + '/' + fileBaseName + '.pdf'
print(html_file)    # /tmp/report_kWKEp5.html
print(pdf_file)     # /tmp/report_kWKEp5.pdf
# calling of HTML to PDF converter is omitted
See the documentation of tempfile.NamedTemporaryFile here.
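The same steps can be written a bit more compactly. The sketch below is one possible variant, not the original script: the function name make_report_paths is made up, and it uses the system's default temp directory instead of a hard-coded /tmp. Applying os.path.splitext to the full path avoids splitting and re-joining the directory by hand.

```python
import os
import tempfile


def make_report_paths():
    """Create an empty temporary .html file and derive the matching .pdf path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so an external converter could still read it later.
    with tempfile.NamedTemporaryFile(prefix="report_", suffix=".html",
                                     delete=False) as temp:
        html_file = temp.name
    base, _ext = os.path.splitext(html_file)
    return html_file, base + ".pdf"


html_file, pdf_file = make_report_paths()
print(html_file)
print(pdf_file)
os.remove(html_file)  # clean up the demo file
```

Since delete=False is used, the caller is responsible for removing the file once the conversion is done.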
Solution #2 (update 20110303)
I had a problem with the previous solution. It works well from the command line, but when I tried to call the script from crontab, it stopped at the line “tempfile.NamedTemporaryFile”. No exception, nothing… So I had to use a different approach:
from time import time
temp = "report.%.7f.html" % time()
print(temp)    # report.1299188541.3830960.html
The function time() returns the time as a floating point number. Two calls made at nearly the same instant could produce the same name, so this may not be suitable in a multithreaded environment, but that was not an issue for me. This version works fine when called from crontab.
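If name collisions are a concern, one possible alternative to a time-based name is a random UUID from the standard library. This is a sketch under my own assumptions, not something from the original post: the function name unique_report_name is made up, and I have not tested it under crontab.

```python
import uuid


def unique_report_name(prefix="report", suffix=".html"):
    """Build a file name from a random UUID (32 hex characters)."""
    return "%s.%s%s" % (prefix, uuid.uuid4().hex, suffix)


print(unique_report_name())
```

The name is only generated here and no file is created, just like in the time() version.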
Learn more
- tempfile – Create temporary filesystem resources (post by Doug Hellmann with lots of examples)
- Python doc on tempfile
Update (20150712): if you need a temporary file name in the current directory:
>>> import tempfile
>>> tempfile.NamedTemporaryFile(dir='.').name
'/home/jabba/tmpKrBzoY'
Update (20150910): if you need a temporary directory:
import tempfile
import shutil

dirpath = tempfile.mkdtemp()    # the temp dir. is created
# ... do stuff with dirpath
shutil.rmtree(dirpath)
This tip is from here.
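Under Python 3, tempfile.TemporaryDirectory can do the cleanup for you when used as a context manager. A minimal sketch follows; the file name report.html and the variables existed_inside and removed_after are only there for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirpath:
    # The directory exists only inside this block.
    path = os.path.join(dirpath, "report.html")
    with open(path, "w") as f:
        f.write("<p>hello</p>")
    existed_inside = os.path.exists(path)

# On leaving the block, the directory and its contents are removed.
removed_after = not os.path.exists(dirpath)
print(existed_inside, removed_after)
```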
