Cleaning up HTML

The module lxml_html_clean provides a Cleaner class for cleaning up HTML pages. It supports removing embedded or script content, special tags, CSS style annotations and much more.

Note: the HTML Cleaner in lxml_html_clean is not considered appropriate for security-sensitive environments. See e.g. bleach for an alternative.

Say you have an overburdened web page from a hideous source that contains lots of content that upsets browsers and tries to run unnecessary code on the client side:

>>> html = '''\
... <html>
...  <head>
...    <script type="text/javascript" src="evil-site"></script>
...    <link rel="alternate" type="text/rss" src="evil-rss">
...    <style>
...      body {background-image: url(javascript:do_evil)};
...      div {color: expression(evil)};
...    </style>
...  </head>
...  <body onload="evil_function()">
...    <!-- I am interpreted for EVIL! -->
...    <a href="javascript:evil_function()">a link</a>
...    <a href="#" onclick="evil_function()">another link</a>
...    <p onclick="evil_function()">a paragraph</p>
...    <div style="display: none">secret EVIL!</div>
...    <object> of EVIL! </object>
...    <iframe src="evil-site"></iframe>
...    <form action="evil-site">
...      Password: <input type="password" name="password">
...    </form>
...    <blink>annoying EVIL!</blink>
...    <a href="evil-site">spam spam SPAM!</a>
...    <image src="evil!">
...  </body>
... </html>'''

To remove all the superfluous content from this unparsed document, use the clean_html function:

>>> from lxml_html_clean import clean_html
>>> print(clean_html(html))
<div><style>/* deleted */</style><body>

   <a href="">a link</a>
   <a href="#">another link</a>
   <p>a paragraph</p>
   <div>secret EVIL!</div>
    of EVIL!


     Password:
   annoying EVIL!<a href="evil-site">spam spam SPAM!</a>
   <img src="evil!"></body></div>

The Cleaner class supports several keyword arguments to control exactly which content is removed:

>>> from lxml_html_clean import Cleaner

>>> cleaner = Cleaner(page_structure=False, links=False)
>>> print(cleaner.clean_html(html))
<html>
  <head>
    <link rel="alternate" src="evil-rss" type="text/rss">
    <style>/* deleted */</style>
  </head>
  <body>
    <a href="">a link</a>
    <a href="#">another link</a>
    <p>a paragraph</p>
    <div>secret EVIL!</div>
    of EVIL!
    Password:
    annoying EVIL!
    <a href="evil-site">spam spam SPAM!</a>
    <img src="evil!">
  </body>
</html>

>>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
...                   page_structure=False, safe_attrs_only=False)

>>> print(cleaner.clean_html(html))
<html>
  <head>
  </head>
  <body>
    <a href="">a link</a>
    <a href="#">another link</a>
    <p>a paragraph</p>
    <div>secret EVIL!</div>
    of EVIL!
    Password:
    annoying EVIL!
    <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
    <img src="evil!">
  </body>
</html>

To control the removal of CSS styles, set the style and/or inline_style keyword arguments to True when creating a Cleaner instance. If neither option is enabled, only @import rules are automatically removed from CSS content.

You can also whitelist some otherwise dangerous content with Cleaner(host_whitelist=['www.youtube.com']), which would allow embedded media from YouTube, while still filtering out embedded media from other sites.
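
The idea behind a host whitelist can be illustrated with a small standalone sketch. It uses only the standard library and does not call lxml_html_clean; the helper name is_allowed_embed is hypothetical, and the library's real check may differ in its details. The sketch only assumes that an embedded URL is kept when it uses http or https and its host is on the list.

```python
from urllib.parse import urlsplit


def is_allowed_embed(url, host_whitelist):
    """Return True if url is an http(s) URL whose host is on the whitelist.

    Hypothetical helper for illustration; not part of lxml_html_clean.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs are never allowed.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # hostname is lower-cased and excludes any port or user info.
    return parts.hostname in host_whitelist


whitelist = ["www.youtube.com"]
print(is_allowed_embed("https://www.youtube.com/embed/abc123", whitelist))  # True
print(is_allowed_embed("https://evil-site.example/player", whitelist))      # False
print(is_allowed_embed("javascript:evil_function()", whitelist))            # False
```

Matching on the exact host name, rather than on a substring of the URL, is what keeps a URL such as https://evil-site.example/?www.youtube.com from slipping through in this sketch.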

See the docstring of Cleaner for the details of what can be cleaned.

autolink

In addition to cleaning up malicious HTML, lxml_html_clean contains functions to do other things to your HTML. This includes autolinking:

autolink(doc, ...)

autolink_html(html, ...)

This finds anything that looks like a link (e.g., http://example.com) in the text of an HTML document, and turns it into an anchor. It avoids making bad links.

Links are not created inside the elements <textarea>, <pre>, <code>, or anything in the head of the document. You can pass in a list of elements to avoid with avoid_elements=['textarea', ...].

Links to some hosts can be avoided. By default links to localhost*, example.* and 127.0.0.1 are not autolinked. Pass in avoid_hosts=[list_of_regexes] to control this.

Elements with the nolink CSS class are not autolinked. Pass in avoid_classes=['code', ...] to control this.

The autolink_html() version of the function parses the HTML string first, and returns a string.
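
To make the host-avoidance rule concrete, here is a minimal standalone sketch that works on plain text rather than on a parsed document. It does not use lxml_html_clean; the function name autolink_text, the URL pattern, and the three host patterns are assumptions chosen for this illustration, and the library's own defaults may differ.

```python
import re
from html import escape
from urllib.parse import urlsplit

# Assumed patterns for this sketch only.
LINK_RE = re.compile(r"https?://[^\s<>\"']+")
AVOID_HOSTS = [
    re.compile(r"^localhost", re.I),
    re.compile(r"\bexample\.(?:com|org|net)$", re.I),
    re.compile(r"^127\.0\.0\.1$"),
]


def autolink_text(text, avoid_hosts=AVOID_HOSTS):
    """Escape plain text and wrap URLs in anchors, skipping avoided hosts."""
    pieces = []
    pos = 0
    for match in LINK_RE.finditer(text):
        url = match.group(0)
        # Text between two URLs is escaped and copied unchanged.
        pieces.append(escape(text[pos:match.start()]))
        try:
            host = urlsplit(url).hostname or ""
            avoided = any(p.search(host) for p in avoid_hosts)
        except ValueError:
            # Malformed URL: leave it as plain text.
            avoided = True
        safe = escape(url)
        if avoided:
            pieces.append(safe)
        else:
            pieces.append('<a href="%s">%s</a>' % (safe, safe))
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)


print(autolink_text("Visit https://lxml.de/ or http://localhost:8080/docs for details"))
# Visit <a href="https://lxml.de/">https://lxml.de/</a> or http://localhost:8080/docs for details
```

The sketch makes no attempt to trim trailing punctuation from a URL or to skip particular elements or CSS classes; those rules need a parsed document, which is what autolink() operates on.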

wordwrap

You can also wrap long words in your HTML:

word_break(doc, max_width=40, ...)

word_break_html(html, ...)

This finds any long words in the text of the document and inserts &#8203; into them (this is the Unicode zero-width space, U+200B).

This avoids the elements <pre>, <textarea>, and <code>. You can control this with avoid_elements=['textarea', ...].

It also avoids elements with the CSS class nobreak. You can control this with avoid_classes=['code', ...].

Lastly you can control the character that is inserted with break_character=u'\u200b'. However, you cannot insert markup, only text.

word_break_html(html) parses the HTML document and returns a string.
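
The effect of inserting a break character can be seen in a small standalone sketch on plain text. It does not call lxml_html_clean, and the function name break_long_words is hypothetical. It assumes the simplest possible strategy, a break after every max_width characters, which is not necessarily how word_break() picks its break points.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"


def break_long_words(text, max_width=40, break_character=ZERO_WIDTH_SPACE):
    """Insert break_character into every word longer than max_width.

    Hypothetical helper for illustration; breaks at fixed intervals.
    """
    def split_word(match):
        word = match.group(0)
        chunks = [word[i:i + max_width] for i in range(0, len(word), max_width)]
        return break_character.join(chunks)

    # Only runs of non-whitespace longer than max_width are touched.
    long_word = re.compile(r"\S{%d,}" % (max_width + 1))
    return long_word.sub(split_word, text)


sample = "short " + "x" * 25
broken = break_long_words(sample, max_width=10)
# Make the invisible break character visible for printing.
print(broken.replace(ZERO_WIDTH_SPACE, "|"))  # short xxxxxxxxxx|xxxxxxxxxx|xxxxx
```

Because the inserted character is text and not markup, the result can be placed into a document without further escaping, which matches the restriction on break_character described above.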