Read XML painlessly
Problem
I had an XML file (an RSS feed) from which I wanted to extract some data. I tried some XML libraries but I didn’t like any of them. Is there a simple, brain-friendly way to do this? After all, it’s Python, so everything should be simple.
Solution
Yes, there is a simple library for reading XML called “untangle”, developed by Chris Stefanescu. It’s on PyPI, so installation is very easy:
sudo pip install untangle
For some examples, visit the project page.
Use Case
Let’s see a simple, real-world example. From the RSS feed of Planet Python, let’s extract the post titles and their URLs.
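#!/usr/bin/env python
import untangle

# XML = 'examples/planet_python.xml'  # can read a file too
XML = 'http://planet.python.org/rss20.xml'

# parse() returns an object whose attributes mirror the XML elements
o = untangle.parse(XML)
for item in o.rss.channel.item:
    title = item.title.cdata   # text content of <title>
    link = item.link.cdata     # text content of <link>
    if link:
        # print() with a single argument works on both Python 2 and 3
        print(title)
        print('  ' + link)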
It couldn’t be any simpler :)
Limitations
According to Chris, untangle doesn’t support documents with namespaces (yet).
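If your feed does use namespaces, one possible workaround is the standard library's xml.etree.ElementTree, which lets you pass a prefix-to-URI mapping when searching. Here is a minimal sketch. The sample feed, the `NS` mapping and the `authors` helper are made up for illustration; they are not part of untangle.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment that uses the Dublin Core namespace for authors.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <dc:creator>Alice</dc:creator>
    </item>
    <item>
      <title>Second post</title>
      <dc:creator>Bob</dc:creator>
    </item>
  </channel>
</rss>
"""

# Prefix-to-URI mapping used by the find* methods (the prefix name is our choice).
NS = {'dc': 'http://purl.org/dc/elements/1.1/'}


def authors(xml_text):
    """Return the text of every <dc:creator> element, one per <item>."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext('dc:creator', default='', namespaces=NS)
        for item in root.iter('item')
    ]


if __name__ == '__main__':
    for name in authors(SAMPLE):
        print(name)
```

It is a bit more typing than untangle's attribute access, but it needs nothing outside the standard library.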
Alternatives (update 20111031)
Here are some alternatives (thanks reddit).
- Python and XML (overview)
- lxml
- amara [official tutorial]
- xmltodict (converts XML to dict; added on 20141229)
lxml and amara are heavyweight solutions built on C libraries, so you may not be able to use them everywhere. untangle is a lightweight parser that can be a perfect choice for reading a small and simple XML file.
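For comparison, here is a rough sketch of the same title/link extraction using only the standard library's xml.etree.ElementTree, in case installing a third-party package is not an option. The embedded sample feed and the `extract_posts` helper are my own illustration (a made-up, minimal RSS 2.0 document), so the script runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, shaped like a minimal RSS 2.0 document.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
    <item>
      <title>No link here</title>
    </item>
  </channel>
</rss>
"""


def extract_posts(xml_text):
    """Return a list of (title, link) pairs for items that have a link."""
    root = ET.fromstring(xml_text)  # root is the <rss> element
    posts = []
    for item in root.findall('./channel/item'):
        # findtext() returns the default when the child element is missing
        title = item.findtext('title', default='')
        link = item.findtext('link', default='')
        if link:
            posts.append((title.strip(), link.strip()))
    return posts


if __name__ == '__main__':
    for title, link in extract_posts(SAMPLE_RSS):
        print(title)
        print('  ' + link)
```

The path strings are less pleasant to read than o.rss.channel.item, which is exactly the trade-off: untangle gives you nicer access, ElementTree gives you zero dependencies.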
