html | Python Adventures

table2csv

July 25, 2018 Jabba Laci Leave a comment

Problem
I wanted to extract a table from an HTML. I wanted to import it to Excel, thus I wanted it in CSV format for instance.

Solution
table2csv can do exactly this. Visit the project’s page on GitHub for examples.

Note that I could only make it work under Python 2.7.

Categories: python Tags: csv, extract, html, table, table2csv

remove tags from HTML

July 13, 2016 Jabba Laci Leave a comment

Problem
You have an HTML string and you want to remove all the tags from it.

Solution
Install the package “bleach” via pip. Then:

>>> import bleach
>>> html = "Her <h1>name</h1> was <i>Jane</i>."
>>> cleaned = bleach.clean(html, tags=[], attributes={}, styles=[], strip=True)
>>> html
'Her <h1>name</h1> was <i>Jane</i>.'
>>> cleaned
'Her name was Jane.'

Tip from here.

Categories: python Tags: bleach, html, tags

get the title of a web page

September 8, 2015 Jabba Laci Leave a comment

Problem
You need the title of a web page.

Solution

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
print soup.title.string

I found the solution here.

Categories: python Tags: beautifulsoup, html, scrape, title

Jinja2 example for generating a local file using a template

February 25, 2014 Jabba Laci 1 comment

Here I want to show you how to generate an HTML file (a local file) using a template with the Jinja2 template engine.

Python source (proba.py)

#!/usr/bin/env python

import os
from jinja2 import Environment, FileSystemLoader

PATH = os.path.dirname(os.path.abspath(__file__))
TEMPLATE_ENVIRONMENT = Environment(
    autoescape=False,
    loader=FileSystemLoader(os.path.join(PATH, 'templates')),
    trim_blocks=False)


def render_template(template_filename, context):
    return TEMPLATE_ENVIRONMENT.get_template(template_filename).render(context)


def create_index_html():
    fname = "output.html"
    urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
    context = {
        'urls': urls
    }
    #
    with open(fname, 'w') as f:
        html = render_template('index.html', context)
        f.write(html)


def main():
    create_index_html()

########################################

if __name__ == "__main__":
    main()

Jinja2 template (templates/index.html)

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>Proba</title>
</head>
<body>
<center>
    <h1>Proba</h1>
    <p>{{ urls|length }} links</p>
</center>
<ol align="left">
{% set counter = 0 -%}
{% for url in urls -%}
<li><a href="{{ url }}">{{ url }}</a></li>
{% set counter = counter + 1 -%}
{% endfor -%}
</ol>
</body>
</html>

Resulting output
If you execute proba.py, you will get this output:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>Proba</title>
</head>
<body>
<center>
    <h1>Proba</h1>
    <p>3 links</p>
</center>
<ol align="left">
<li><a href="http://example.com/1">http://example.com/1</a></li>
<li><a href="http://example.com/2">http://example.com/2</a></li>
<li><a href="http://example.com/3">http://example.com/3</a></li>
</ol>
</body>
</html>

You can find all these files here (GitHub link).

Categories: python Tags: html, jinja2, template

Prettify HTML with BeautifulSoup

April 3, 2011 Jabba Laci Leave a comment

With the Python library BeautifulSoup (BS), you can extract information from HTML pages very easily. However, there is one thing you should keep in mind: HTML pages are usually malformed. BS tries to correct an HTML page, but it means that BS’s internal representation of the HTML page can be slightly different from the original source. Thus, when you want to localize a part of an HTML page, you should work with the internal representation.

The following script takes an HTML and prints it in a corrected form, i.e. it shows how BS stores the given page. You can also use it to prettify the source:

#!/usr/bin/env python

# prettify.py
# Usage: prettify <URL>

import sys
import urllib
from BeautifulSoup import BeautifulSoup

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

def process(url):
    myopener = MyOpener()
    #page = urllib.urlopen(url)
    page = myopener.open(url)

    text = page.read()
    page.close()

    soup = BeautifulSoup(text)
    return soup.prettify()
# process(url)

def main():
    if len(sys.argv) == 1:
        print "Jabba's HTML Prettifier v0.1"
        print "Usage: %s <URL>" % sys.argv[0]
        sys.exit(-1)
    # else, if at least one parameter was passed
    print process(sys.argv[1])
# main()

if __name__ == "__main__":
    main()

You can find the latest version of the script at https://github.com/jabbalaci/Bash-Utils.

Categories: python Tags: beautifulsoup, beautify, html, prettify

Create a temporary file with unique name

February 19, 2011 Jabba Laci Leave a comment

Problem

I wanted to download an html file with Python, store it in a temporary file, then convert this file to PDF by calling an external program.

Solution #1

#!/usr/bin/env python

import os
import tempfile

temp = tempfile.NamedTemporaryFile(prefix='report_', suffix='.html', dir='/tmp', delete=False)

html_file = temp.name
(dirName, fileName) = os.path.split(html_file)
fileBaseName = os.path.splitext(fileName)[0]
pdf_file = dirName + '/' + fileBaseName + '.pdf'

print html_file   # /tmp/report_kWKEp5.html
print pdf_file    # /tmp/report_kWKEp5.pdf
# calling of HTML to PDF converter is omitted

See the documentation of tempfile.NamedTemporaryFile here.

Solution #2 (update 20110303)

I had a problem with the previous solution. It works well in command-line, but when I tried to call that script in crontab, it stopped at the line “tempfile.NamedTemporaryFile”. No exception, nothing… So I had to use a different approach:

from time import time

temp = "report.%.7f.html" % time()
print temp    # report.1299188541.3830960.html

The function time() returns the time as a floating point number. It may not be suitable in a multithreaded environment, but it was not the case for me. This version works fine when called from crontab.

Learn more

tempfile – Create temporary filesystem resources (post by Doug Hellmann with lots of examples)
Python doc on tempfile

Update (20150712): if you need a temp. file name in the current directory:

>>> import tempfile
>>> tempfile.NamedTemporaryFile(dir='.').name
'/home/jabba/tmpKrBzoY'

Update (20150910): if you need a temp. directory:

import tempfile
import shutil

dirpath = tempfile.mkdtemp()    # the temp dir. is created
# ... do stuff with dirpath
shutil.rmtree(dirpath)

This tip is from here.

Categories: python Tags: cron, crontab, html, pdf, temp, temporary, time

Python Adventures

Archive

table2csv

remove tags from HTML

get the title of a web page

Jinja2 example for generating a local file using a template

Prettify HTML with BeautifulSoup

Create a temporary file with unique name

Blog Stats

Random Post

Recent Posts

Archives

Meta