@MercuryRising
Created November 12, 2012 19:29
Pyquery, lxml, BeautifulSoup comparison
from bs4 import BeautifulSoup as bs
from pyquery import PyQuery as pq
from lxml.html import fromstring
import re
import requests
import time


def Timer():
    # Generator: each next() yields roughly the seconds elapsed since the previous next().
    a = time.time()
    while True:
        c = time.time()
        yield time.time() - a
        a = c


timer = Timer()

url = "http://www.python.org/"
html = requests.get(url).text

num = 100000

print('\n==== Total trials: %s =====' % num)

# Start the clock. Each total below includes one parse of the page plus num lookups.
next(timer)
soup = bs(html, 'lxml')
for x in range(num):
    paragraphs = soup.find_all('p')
t = next(timer)
print('bs4 total time: %.1f' % t)

d = pq(html)
for x in range(num):
    paragraphs = d('p')
t = next(timer)
print('pq total time: %.1f' % t)

tree = fromstring(html)
for x in range(num):
    paragraphs = tree.cssselect('p')
t = next(timer)
print('lxml (cssselect) total time: %.1f' % t)

tree = fromstring(html)
for x in range(num):
    paragraphs = tree.xpath('.//p')
t = next(timer)
print('lxml (xpath) total time: %.1f' % t)

# [p ] is a character class (one "p" or one space), so <p ...> tags with
# attributes and paragraphs that span several lines are not matched.
for x in range(num):
    paragraphs = re.findall('<[p ]>.*?</p>', html)
t = next(timer)
print("regex total time: %.1f (doesn't find all p)\n" % t)
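
To see why the regex run is labelled "doesn't find all p", here is a minimal sketch on a made-up three-paragraph string (the sample markup and the looser pattern are assumptions for illustration, not part of the gist). It compares the benchmark's pattern with one that allows attributes and newlines. Regexes remain a fragile way to read HTML in general, so the looser pattern is only a demonstration.

```python
import re

# Hypothetical sample markup, not taken from python.org.
sample = (
    '<p>plain</p>'
    '<p class="intro">has an attribute</p>'
    '<p>spans\ntwo lines</p>'
)

# The benchmark's pattern: [p ] matches exactly one "p" or one space,
# and "." does not cross newlines by default.
narrow = re.findall('<[p ]>.*?</p>', sample)

# A looser pattern (an assumption, still only a sketch): allow attributes
# after the tag name and let "." match newlines.
loose = re.findall(r'<p\b[^>]*>.*?</p>', sample, flags=re.DOTALL | re.IGNORECASE)

print(len(narrow), len(loose))
```

On this sample the benchmark's pattern picks up only the bare single-line paragraph, while the looser one picks up all three.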
@valgur

valgur commented Dec 30, 2015

Thanks for the Gist! For anyone else curious about the results (used Python 3.5):

==== Total trials: 100000 =====
bs4 total time: 74.1
pq total time: 13.9
lxml (cssselect) total time: 13.6
lxml (xpath) total time: 8.6
regex total time: 17.2 (doesn't find all p)
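
Note that the script starts its timer before each library parses the page, so every total (except the regex one) is one parse plus 100000 lookups. If you want the lookup cost on its own, a helper along these lines may be enough. This is a sketch: time_calls is a hypothetical name, the workload is a stand-in so the block runs without third-party packages, and time.perf_counter is swapped in for time.time because it is monotonic and has higher resolution.

```python
import time


def time_calls(func, repeat):
    """Return the total seconds spent calling func() `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        func()
    return time.perf_counter() - start


# Stand-in workload (an assumption); in the benchmark this would be
# something like lambda: tree.xpath('.//p') on an already parsed tree.
words = "a b c d".split()
elapsed = time_calls(lambda: [w.upper() for w in words], 1000)
print('stand-in total time: %.4f' % elapsed)
```

Parsing once outside the helper and passing only the query as func keeps setup cost out of the measured total.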

@alaakh42

alaakh42 commented Feb 20, 2018

Results using Python 2.7


==== Total trials: 100000 =====
bs4 total time: 38.0
pq total time: 5.2
lxml (cssselect) total time: 5.1
lxml (xpath) total time: 3.0
regex total time: 8.4 (doesn't find all p)

@guptarohit

Results using Python 3.6

==== Total trials: 100000 =====
bs4 total time: 52.6
pq total time: 7.5
lxml (cssselect) total time: 6.8
lxml (xpath) total time: 4.5
regex total time: 11.2 (doesn't find all p)

@p3nj

p3nj commented Aug 20, 2018

Results using Python 3.7

==== Total trials: 100000 =====
bs4 total time: 63.2
pq total time: 8.4
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 5.6
regex total time: 9.6 (doesn't find all p)

@kwuite

kwuite commented Sep 14, 2018

Results using Python 3.6.5

==== Total trials: 100000 =====
bs4 total time: 325.9 (Not sure why this happened, but it's a record in slowness)
pq total time: 8.9
lxml (cssselect) total time: 7.9
lxml (xpath) total time: 3.5
regex total time: 8.5 (doesn't find all p)

@ghid4ds

ghid4ds commented Dec 17, 2018

==== Total trials: 100000 =====
bs4 total time: 93.2
pq total time: 7.4
lxml (cssselect) total time: 7.7
lxml (xpath) total time: 5.5
regex total time: 16.4 (doesn't find all p)

@Fischmax

Fischmax commented Jan 7, 2019

In Python 3.7.1:
==== Total trials: 100000 =====
bs4 total time: 69.6
pq total time: 10.1
lxml (cssselect) total time: 9.6
lxml (xpath) total time: 6.3
regex total time: 13.6 (doesn't find all p)

@guptarohit

Results using Python 3.7.3

==== Total trials: 100000 =====
bs4 total time: 94.1
pq total time: 9.5
lxml (cssselect) total time: 8.6
lxml (xpath) total time: 5.9
regex total time: 12.9 (doesn't find all p)

@andriyor

I tried selectolax, and in this case it is about 2 times faster than lxml (timings below):
https://rushter.com/blog/python-fast-html-parser/

from selectolax.parser import HTMLParser

tree = HTMLParser(html)
for x in range(num):
    paragraphs = tree.css('p')
t = next(timer)
print('selectolax total time: %.1f' % t)

==== Total trials: 100000 =====
bs4 total time: 95.4
pq total time: 10.9
lxml (cssselect) total time: 10.0
lxml (xpath) total time: 6.4
regex total time: 14.4 (doesn't find all p)
selectolax total time: 3.4
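
For a quick check of the "2 times faster" figure, the ratios can be worked out from the totals in this run. A small sketch, with the numbers copied from the output above (the variable names are just for illustration):

```python
# Totals in seconds, copied from the run above.
totals = {
    'lxml (cssselect)': 10.0,
    'lxml (xpath)': 6.4,
    'selectolax': 3.4,
}

base = totals['selectolax']
# How many times longer each run took compared with selectolax.
ratios = {name: round(t / base, 1) for name, t in totals.items()}

for name, r in ratios.items():
    print('%s: %.1fx the selectolax time' % (name, r))
```

So in this run selectolax is a bit under 2 times faster than lxml's xpath and close to 3 times faster than cssselect. These are single-run totals on one machine, so treat the ratios as rough.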

@deedy5

deedy5 commented Apr 24, 2021

python 3.9.2

==== Total trials: 100000 =====
bs4 total time: 31.9
pq total time: 4.9
lxml (cssselect) total time: 4.4
lxml (xpath) total time: 3.1
regex total time: 8.5 (doesn't find all p)

@hokwanhung

Python 3.10.4

==== Total trials: 100000 =====
bs4 total time: 30.1
pq total time: 2.8
lxml (cssselect) total time: 2.6
lxml (xpath) total time: 2.0
regex total time: 6.3 (doesn't find all p)

@xavierskip

Python 3.10.1

==== Total trials: 100000 =====
bs4 total time: 45.9
pq total time: 4.6
lxml (cssselect) total time: 4.3
lxml (xpath) total time: 3.3
regex total time: 8.4 (doesn't find all p)

@p3nj

p3nj commented Apr 30, 2023

Python 3.11.2

==== Total trials: 100000 =====
bs4 total time: 18.1
pq total time: 2.2
lxml (cssselect) total time: 2.2
lxml (xpath) total time: 1.7
regex total time: 5.2 (doesn't find all p)
