
Python Web Scraping Cheat Sheet

This document is a cheat sheet for web scraping with Python, providing examples of how to use BeautifulSoup and Requests to scrape web pages. It demonstrates how to install dependencies, fetch webpages and parse HTML, find elements by id, class, CSS selectors, and regex, extract attributes and text, and navigate element trees to find parent, child, and sibling elements. The goal is to provide a concise how-to guide for common web scraping tasks in Python.

Uploaded by

Euler Pi

02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

>DevByExample_

A Python Web Scraping How-To Guide
Web Scraping with Python/BeautifulSoup/Requests

Install
$ pip install requests beautifulsoup4

BeautifulSoup on Text
from bs4 import BeautifulSoup

text = '''<div><h1>My Header</h1></div>'''

soup = BeautifulSoup(text, 'html.parser')

print(soup.prettify())

<div>
 <h1>
  My Header
 </h1>
</div>

Fetch Webpage and Create Soup


import requests
from bs4 import BeautifulSoup

url = 'https://devbyexample.com/test-scraping'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
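The fetch above assumes the request succeeds. A minimal sketch of a more defensive version follows, assuming you want a timeout and an HTTP status check. The helper name `fetch_soup` and the 10-second default are illustrative choices, not part of the original sheet.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return its parsed soup (hypothetical helper)."""
    # A timeout keeps the call from hanging on an unresponsive server.
    r = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    r.raise_for_status()
    return BeautifulSoup(r.text, 'html.parser')
```

A caller would typically wrap `fetch_soup(url)` in `try` / `except requests.RequestException` to handle connection errors, timeouts, and bad status codes in one place.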

https://www.devbyexample.com/web-scraping-cheat-sheet 1/6
02/08/2022, 06:45 Web Scraping with Python Cheat Sheet

Find By ID
<h1 id="article-title">Hello Everyone</h1>

header = soup.find(id="article-title")

print(header)

<h1 id="article-title">Hello Everyone</h1>

print(header.string)

Hello Everyone
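Note that `find()` returns `None` when no element matches, so reading `.string` on the result would raise an `AttributeError`. The sketch below shows one way to guard against that; the id `no-such-id` is a made-up value used only to trigger the miss.

```python
from bs4 import BeautifulSoup

html = '<h1 id="article-title">Hello Everyone</h1>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns None when nothing matches, so check before using the result.
missing = soup.find(id='no-such-id')  # hypothetical id, absent from the markup
missing_text = missing.string if missing is not None else None

# get_text(strip=True) returns the element's text with surrounding whitespace removed.
header = soup.find(id='article-title')
header_text = header.get_text(strip=True)

print(missing_text)
print(header_text)
```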

Find By Class
<div id="articles">

<div class='article'>...</div>

<div class='article'>...</div>

<div class='article'>...</div>

<div class='article'>...</div>

<div class='end'><button>Next Page</button></div>

</div>

articles = soup.select('.article')

print(articles)

[ <div class="article">...</div>,

<div class="article">...</div>,

<div class="article">...</div>,

<div class="article">...</div>]
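The same lookup can also be written with `find_all` and the `class_` keyword (the trailing underscore avoids Python's reserved word `class`). A small sketch, using shortened sample markup; the extra `featured` class is an assumption added to show that an element with several classes still matches.

```python
from bs4 import BeautifulSoup

html = '''
<div id="articles">
<div class="article">A</div>
<div class="article featured">B</div>
<div class="end"><button>Next Page</button></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS selector form, as used above.
by_selector = soup.select('.article')

# Keyword form; matches any element whose class list contains "article".
by_keyword = soup.find_all(class_='article')

print([d.get_text() for d in by_keyword])
```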


Navigating Elements in Tree


<ul>

<li><a href="https://google.com">Google</a></li>

<li><a href="https://bing.com">Bing</a></li>

<li><a href="https://apple.com">Apple</a></li>

</ul>

# Get First Link

print(soup.a)

<a href="https://google.com">Google</a>

# Get all Link elements on page

print(soup.find_all("a"))

[ <a href="https://google.com">Google</a>,

<a href="https://bing.com">Bing</a>,

<a href="https://apple.com">Apple</a>]

# Print all hrefs on page

for link in soup.find_all("a"):
    print(link['href'])

https://google.com

https://bing.com

https://apple.com
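Indexing with `link['href']` raises a `KeyError` if a tag has no `href` attribute. The sketch below shows two alternatives, assuming sample markup in which one anchor lacks an `href`: `.get()` returns `None` for a missing attribute, and `href=True` restricts the search to tags that have one.

```python
from bs4 import BeautifulSoup

html = '''
<ul>
<li><a href="https://google.com">Google</a></li>
<li><a>No link</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# .get() returns None instead of raising when the attribute is absent.
hrefs = [a.get('href') for a in soup.find_all('a')]

# href=True keeps only <a> tags that actually carry an href attribute.
with_href = [a['href'] for a in soup.find_all('a', href=True)]

print(hrefs)
print(with_href)
```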


Element Attributes
<div id="article-10" class="article">

<h3>Header</h3>

<p>First Paragraph</p>

<p>Second Paragraph</p>

</div>

print(soup.div.name)

div

print(soup.div.contents)

[ '\n',
<h3>Header</h3>,

'\n',
<p>First Paragraph</p>,

'\n',
<p>Second Paragraph</p>,

'\n']

for strings in soup.div.strings:
    print(repr(strings))

'\n'

'Header'

'\n'

'First Paragraph'

'\n'

'Second Paragraph'

'\n'

for strings in soup.div.stripped_strings:
    print(repr(strings))

'Header'

'First Paragraph'

'Second Paragraph'
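Besides `name`, `contents`, and the string iterators, a tag's HTML attributes can be read like dictionary keys, and `get_text()` can join all the text in one call. A brief sketch using the same markup; the `' | '` separator is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

html = '''<div id="article-10" class="article">
<h3>Header</h3>
<p>First Paragraph</p>
<p>Second Paragraph</p>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# Attributes behave like a dict; class is multi-valued, so it comes back as a list.
div_id = div['id']
div_classes = div['class']

# Join every stripped string in the element with a separator.
joined = div.get_text(separator=' | ', strip=True)

print(div_id, div_classes)
print(joined)
```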


Find By Regex
<div>

<head><title>Sample Title</title></head>

<h1>Title Header</h1>

<hr>

<div>A description of something</div>

<h2>Section Header</h2>

<p>...</p>

<h2>Another Header</h2>

<p>...</p>

</div>

import re

headers = soup.find_all(re.compile('^h[1-6]'))

print(headers)

[ <h1>Title Header</h1>,

<h2>Section Header</h2>,

<h2>Another Header</h2>]
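Beautiful Soup applies a compiled pattern with `search()`, so adding `$` makes the tag-name match exact. A pattern can also be matched against text through the `string` argument. The sketch below illustrates both on simplified sample markup (an assumption, not the exact markup above).

```python
import re

from bs4 import BeautifulSoup

html = ('<div><h1>Title Header</h1><h2>Section Header</h2>'
        '<p>Body text</p><h2>Another Header</h2></div>')
soup = BeautifulSoup(html, 'html.parser')

# Anchored at both ends: matches only tag names h1 through h6.
headers = soup.find_all(re.compile(r'^h[1-6]$'))

# string= matches against text nodes and returns the matching strings.
matches = soup.find_all(string=re.compile('Header'))

print([h.name for h in headers])
print([str(m) for m in matches])
```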

Search with CSS Select


<div>

<h3><a href="/sites">Sites</a></h3>

<ul class="site-list">

<li><a href="https://google.com">Google</a></li>

<li><a href="https://bing.com">Bing</a></li>

<li><a href="https://apple.com">Apple</a></li>

</ul>

</div>

print(soup.select('div a'))

[ <a href="/sites">Sites</a>,

<a href="https://google.com">Google</a>,

<a href="https://bing.com">Bing</a>,

<a href="https://apple.com">Apple</a>]

print(soup.select('div > h3 > a'))

[<a href="/sites">Sites</a>]

print(soup.select('li:nth-child(odd)'))

[ <li><a href="https://google.com">Google</a></li>,

<li><a href="https://apple.com">Apple</a></li>]

print(soup.select('a[href*="http"]'))

[ <a href="https://google.com">Google</a>,

<a href="https://bing.com">Bing</a>,

<a href="https://apple.com">Apple</a>]
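When only the first match is needed, `select_one` returns a single element (or `None`) rather than a list. A minimal sketch on the same markup; the selector `ul.missing a` is a made-up selector used only to show the no-match case.

```python
from bs4 import BeautifulSoup

html = '''
<div>
<h3><a href="/sites">Sites</a></h3>
<ul class="site-list">
<li><a href="https://google.com">Google</a></li>
<li><a href="https://bing.com">Bing</a></li>
<li><a href="https://apple.com">Apple</a></li>
</ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# First link inside the list with class "site-list".
first_link = soup.select_one('ul.site-list a')

# No element matches this selector, so the result is None.
no_match = soup.select_one('ul.missing a')

# Collect the link text for every direct li > a pair in the list.
names = [a.get_text() for a in soup.select('ul.site-list > li > a')]

print(first_link['href'])
print(names)
```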


Parent, Children and Siblings


<div>

<ul>

<li><a href="https://google.com">Google</a></li>

<li><a href="https://bing.com">Bing</a></li>

<li><a href="https://apple.com">Apple</a></li>

</ul>

</div>

# Get Parent Name

ul_element = soup.find('ul')

print(ul_element.parent.name)

div

# Print all text in children

for child in ul_element.children:
    print(child.string)

Google

Bing

Apple

# Siblings
first_li_element = soup.find('li')

print(first_li_element)

for sibling in first_li_element.next_siblings:
    print(sibling)

<li><a href="https://google.com">Google</a></li>

<li><a href="https://bing.com">Bing</a></li>

<li><a href="https://apple.com">Apple</a></li>
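When the markup contains line breaks between tags, `children` and `next_siblings` also yield the whitespace text nodes. A small sketch of one way to get only the sibling tags, using `find_next_siblings` on the same markup:

```python
from bs4 import BeautifulSoup

html = '''<div>
<ul>
<li><a href="https://google.com">Google</a></li>
<li><a href="https://bing.com">Bing</a></li>
<li><a href="https://apple.com">Apple</a></li>
</ul>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
first_li = soup.find('li')

# next_siblings yields tags and the '\n' text nodes between them.
raw_siblings = list(first_li.next_siblings)

# find_next_siblings('li') returns only the following <li> tags.
li_siblings = first_li.find_next_siblings('li')

print(len(raw_siblings))
print([li.get_text() for li in li_siblings])
```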
