Cleaning up HTML
The module lxml_html_clean provides a Cleaner class for cleaning up
HTML pages. It supports removing embedded or script content, special tags,
CSS style annotations and much more.
Note: the HTML Cleaner in lxml_html_clean is not considered
appropriate for security sensitive environments.
See e.g. bleach for an alternative.
Say, you have an overburdened web page from a hideous source which contains lots of content that upsets browsers and tries to run unnecessary code on the client side:
>>> html = '''\
... <html>
... <head>
... <script type="text/javascript" src="evil-site"></script>
... <link rel="alternate" type="text/rss" src="evil-rss">
... <style>
... body {background-image: url(javascript:do_evil)};
... div {color: expression(evil)};
... </style>
... </head>
... <body onload="evil_function()">
... <!-- I am interpreted for EVIL! -->
... <a href="javascript:evil_function()">a link</a>
... <a href="#" onclick="evil_function()">another link</a>
... <p onclick="evil_function()">a paragraph</p>
... <div style="display: none">secret EVIL!</div>
... <object> of EVIL! </object>
... <iframe src="evil-site"></iframe>
... <form action="evil-site">
... Password: <input type="password" name="password">
... </form>
... <blink>annoying EVIL!</blink>
... <a href="evil-site">spam spam SPAM!</a>
... <image src="evil!">
... </body>
... </html>'''
To remove the all superfluous content from this unparsed document, use the
clean_html function:
>>> from lxml_html_clean import clean_html
>>> print clean_html(html)
<div><style>/* deleted */</style><body>
<a href="">a link</a>
<a href="#">another link</a>
<p>a paragraph</p>
<div>secret EVIL!</div>
of EVIL!
Password:
annoying EVIL!<a href="evil-site">spam spam SPAM!</a>
<img src="evil!"></body></div>
The Cleaner class supports several keyword arguments to control exactly
which content is removed:
>>> from lxml_html_clean import Cleaner
>>> cleaner = Cleaner(page_structure=False, links=False)
>>> print cleaner.clean_html(html)
<html>
<head>
<link rel="alternate" src="evil-rss" type="text/rss">
<style>/* deleted */</style>
</head>
<body>
<a href="">a link</a>
<a href="#">another link</a>
<p>a paragraph</p>
<div>secret EVIL!</div>
of EVIL!
Password:
annoying EVIL!
<a href="evil-site">spam spam SPAM!</a>
<img src="evil!">
</body>
</html>
>>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
... page_structure=False, safe_attrs_only=False)
>>> print cleaner.clean_html(html)
<html>
<head>
</head>
<body>
<a href="">a link</a>
<a href="#">another link</a>
<p>a paragraph</p>
<div>secret EVIL!</div>
of EVIL!
Password:
annoying EVIL!
<a href="evil-site" rel="nofollow">spam spam SPAM!</a>
<img src="evil!">
</body>
</html>
To control the removal of CSS styles, set the style and/or inline_style
keyword arguments to True when creating a Cleaner instance.
If neither option is enabled, only @import rules are automatically removed
from CSS content.
You can also whitelist some otherwise dangerous content with
Cleaner(host_whitelist=['www.youtube.com']), which would allow
embedded media from YouTube, while still filtering out embedded media
from other sites.
See the docstring of Cleaner for the details of what can be
cleaned.
autolink
In addition to cleaning up malicious HTML, lxml_html_clean
contains functions to do other things to your HTML. This includes
autolinking:
autolink(doc, ...)
autolink_html(html, ...)
This finds anything that looks like a link (e.g.,
http://example.com) in the text of an HTML document, and
turns it into an anchor. It avoids making bad links.
Links in the elements <textarea>, <pre>, <code>,
anything in the head of the document. You can pass in a list of
elements to avoid in avoid_elements=['textarea', ...].
Links to some hosts can be avoided. By default links to
localhost*, example.* and 127.0.0.1 are not
autolinked. Pass in avoid_hosts=[list_of_regexes] to control
this.
Elements with the nolink CSS class are not autolinked. Pass
in avoid_classes=['code', ...] to control this.
The autolink_html() version of the function parses the HTML
string first, and returns a string.
wordwrap
You can also wrap long words in your html:
word_break(doc, max_width=40, ...)
word_break_html(html, ...)
This finds any long words in the text of the document and inserts
​ in the document (which is the Unicode zero-width space).
This avoids the elements <pre>, <textarea>, and <code>.
You can control this with avoid_elements=['textarea', ...].
It also avoids elements with the CSS class nobreak. You can
control this with avoid_classes=['code', ...].
Lastly you can control the character that is inserted with
break_character=u'\u200b'. However, you cannot insert markup,
only text.
word_break_html(html) parses the HTML document and returns a
string.