BeautifulSoup for
Python RPA
11/13/2024 © NexusIQ Solutions 1
BeautifulSoup is a Python library for parsing HTML and XML documents, which makes it easier to extract data when scraping web pages. Its key features are listed below.
Key Features of BeautifulSoup
1. Parsing HTML and XML
• BeautifulSoup supports parsing HTML and XML documents, allowing you to work with various types of markup.
• It can handle poorly formatted HTML, making it robust for scraping real-world web pages.
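As a rough illustration of the robustness described above, the following sketch parses a fragment in which the <p> and <b> tags are never closed. The markup string is invented for this example, and the built-in html.parser is assumed; other parsers may repair broken markup differently.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: <p> and <b> are opened but never closed.
broken_html = "<div><p>Unclosed paragraph<b>bold</div>"

soup = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a usable tree, so the tags can be queried.
bold_text = soup.b.get_text()
paragraph_text = soup.p.get_text()

# Serializing the tree shows the closing tags that were filled in.
repaired = str(soup)
print(repaired)
```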
2. Tree Navigation
• Tag Navigation: Access HTML tags directly by their names:
soup.title # Access the <title> tag
• Attribute Access: Retrieve attributes of HTML tags:
soup.img['src'] # Get the 'src' attribute of an <img> tag
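A minimal sketch of tag and attribute access on a small document. The HTML below is made up for illustration. It also shows two behaviors worth knowing: accessing a tag that does not exist returns None, and .get() can be used to read an attribute that may be missing without raising KeyError.

```python
from bs4 import BeautifulSoup

# Sample document invented for this example.
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <img src="logo.png" alt="Logo">
    <p>No link here</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string      # Text inside the <title> tag
img_src = soup.img["src"]           # Attribute lookup; raises KeyError if absent
img_width = soup.img.get("width")   # .get() returns None for a missing attribute
missing_tag = soup.table            # A tag that is not in the document gives None

print(title_text, img_src, img_width, missing_tag)
```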
3. Search Functions
• find(): Finds the first matching tag:
soup.find('h1') # Find the first <h1> tag
• find_all(): Finds all matching tags:
soup.find_all('a') # Find all <a> tags (links)
• CSS Selectors: Use select() for CSS-style queries:
soup.select('.class-name') # Select elements by class
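The three search styles above can be compared side by side on one small document. This is a sketch with invented markup and class names; note that find() returns None when nothing matches, while find_all() and select() return an empty list.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this example.
html = """
<h1>Main Heading</h1>
<a class="nav" href="/home">Home</a>
<a class="nav" href="/about">About</a>
<a href="/contact">Contact</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.find("h1")          # First matching tag, or None
all_links = soup.find_all("a")      # List of every <a> tag
nav_links = soup.select(".nav")     # CSS selector: elements with class "nav"
first_nav = soup.select_one("a.nav")  # First CSS match, or None
missing = soup.find("h2")           # No <h2> in the document

print(first_h1.get_text(), len(all_links), [a["href"] for a in nav_links])
```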
4. Prettify HTML
• Format the HTML structure for better readability:
print(soup.prettify())
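To make the effect of prettify() concrete, this sketch formats a one-line fragment (invented for the example) so that each tag and string appears on its own indented line.

```python
from bs4 import BeautifulSoup

# A compact, single-line fragment used only for illustration.
compact = "<div><p>Hello</p></div>"
soup = BeautifulSoup(compact, "html.parser")

# prettify() returns a string with one tag or text node per line.
pretty = soup.prettify()
print(pretty)

# Collect the non-empty lines without their indentation for inspection.
pretty_lines = [line.strip() for line in pretty.splitlines() if line.strip()]
```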
5. Modifying the Parse Tree
• Modify or delete elements directly in the parsed tree:
soup.title.string = "New Title" # Change the content of the <title> tag
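A short sketch of the kinds of modification mentioned above: changing text, removing an element with decompose(), and setting an attribute. The document, the id values, and the data-status attribute are all hypothetical.

```python
from bs4 import BeautifulSoup

# Sample document invented for this example.
html = (
    "<html><head><title>Old Title</title></head>"
    "<body><p id='keep'>Keep me</p><p id='ad'>Remove me</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Change the text of the <title> tag.
soup.title.string = "New Title"

# Delete an element (and everything inside it) from the tree.
soup.find("p", id="ad").decompose()

# Add or overwrite an attribute using dictionary-style assignment.
soup.find("p", id="keep")["data-status"] = "reviewed"

print(soup)
```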
6. Handle Encodings
• BeautifulSoup detects the character encoding of an incoming document and converts it to Unicode, so pages in a wide variety of encodings can usually be processed without manual decoding.
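The sketch below passes raw bytes rather than a decoded string, which is when encoding detection comes into play. The sample bytes are invented; original_encoding reports what was detected, and from_encoding is the constructor argument for overriding detection when the encoding is already known.

```python
from bs4 import BeautifulSoup

# UTF-8 bytes with the encoding declared in a <meta> tag.
utf8_bytes = (
    "<html><head><meta charset='utf-8'></head>"
    "<body><p>Café</p></body></html>"
).encode("utf-8")

soup = BeautifulSoup(utf8_bytes, "html.parser")
detected = soup.original_encoding   # Encoding BeautifulSoup settled on
utf8_text = soup.p.get_text()

# Bytes in a legacy encoding with no declaration: state the encoding explicitly.
latin1_bytes = "<p>Café</p>".encode("iso-8859-1")
latin1_soup = BeautifulSoup(latin1_bytes, "html.parser", from_encoding="iso-8859-1")
latin1_text = latin1_soup.p.get_text()

print(detected, utf8_text, latin1_text)
```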
7. Extract Text
• Retrieve only the text content of HTML elements:
print(soup.get_text()) # Extract all text
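By default get_text() joins strings exactly as they appear, which can run words together or keep stray whitespace. A sketch with invented markup, showing the separator and strip arguments as one way to get cleaner output:

```python
from bs4 import BeautifulSoup

# Sample markup invented for this example.
html = (
    "<div><h1>Title</h1><p>First <b>bold</b> paragraph.</p>"
    "<p>  Second paragraph.  </p></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Default: strings are concatenated with no separator and no trimming.
raw_text = soup.get_text()

# Trim each string and join the pieces with a single space.
clean_text = soup.get_text(separator=" ", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(soup.stripped_strings)

print(repr(raw_text))
print(repr(clean_text))
```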
8. Flexible Parsers
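• BeautifulSoup supports multiple parsers. Name the parser explicitly so that results are consistent across machines:
• html.parser: Built into Python's standard library, so no additional installation is needed.
• lxml: Fast and robust, requires additional installation.
• html5lib: Very lenient, parses pages the way a web browser does and produces valid HTML5, but is slower and requires additional installation.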
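One possible way to cope with parsers that may or may not be installed is to try them in order of preference. The helper name parse_with_fallback and the preference order are assumptions made for this sketch, not part of BeautifulSoup; FeatureNotFound is the exception the library raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_fallback(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (parser_name, soup) for the first usable parser."""
    for name in preferred:
        try:
            return name, BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("No requested parser is available")

parser_name, soup = parse_with_fallback("<p>Hello</p>")
print("Parsed with:", parser_name)
print(soup.p.get_text())
```

Because html.parser ships with Python, keeping it last in the list means the helper should always find at least one working parser.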
9. Supports Complex Queries
• Use tag combinations, attributes, and filters for complex queries:
soup.find('div', {'class': 'example-class'}) # Find <div> with a specific class
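A sketch of several filter styles that find() and find_all() accept: an attribute dictionary, the class_ keyword, href=True to require that an attribute exists, a regular expression, and a function. The markup and URLs are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup invented for this example.
html = """
<div class="example-class featured" id="main">
  <a href="https://example.com/a">A</a>
</div>
<div class="other">
  <a href="/relative">B</a>
  <a>No href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute dictionary; matches even when the tag has additional classes.
by_dict = soup.find("div", {"class": "example-class"})

# Equivalent lookup with the class_ keyword argument.
by_keyword = soup.find("div", class_="example-class")

# Only <a> tags that actually have an href attribute.
links_with_href = soup.find_all("a", href=True)

# Only links whose href starts with https://
absolute_links = soup.find_all("a", href=re.compile(r"^https://"))

# A function filter: <a> tags without any href.
links_without_href = soup.find_all(
    lambda tag: tag.name == "a" and not tag.has_attr("href")
)

print(by_dict["id"], len(links_with_href), len(absolute_links))
```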
10. Works with Various Document Formats
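• Parse both HTML documents and XML files with the same API. XML parsing requires the lxml package and is selected by passing 'xml' as the parser name, for example BeautifulSoup(markup, 'xml').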
11. Integration with Other Libraries
• Combine BeautifulSoup with libraries such as requests for fetching pages over HTTP, or selenium for rendering JavaScript-heavy websites before parsing.
Advantages of BeautifulSoup
• Ease of Use: Intuitive syntax and features for beginners.
• Error Handling: Can parse malformed or poorly written HTML.
• Flexibility: Works with multiple parsers, enabling compatibility with diverse requirements.
• Integration: Works well with libraries like requests, pandas, and selenium.
Practical Example
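import requests
from bs4 import BeautifulSoup

# Fetch a webpage (the timeout prevents the request from hanging indefinitely)
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # Stop early on HTTP errors such as 404 or 500
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the title (soup.title is None if the page has no <title> tag)
if soup.title is not None:
    print("Page Title:", soup.title.text)

# Extract all links, skipping <a> tags that have no href attribute
for link in soup.find_all('a', href=True):
    print("Link:", link['href'])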
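For automation scripts it can help to separate parsing from fetching, so the extraction logic can be tested without network access. The sketch below is one way to do that; the function name extract_title_and_links and the sample HTML are hypothetical, and urljoin from the standard library is used to turn relative links into absolute URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_title_and_links(html, base_url):
    """Hypothetical helper: return the page title and absolute link URLs."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the page has no <title> tag.
    title = soup.title.get_text(strip=True) if soup.title else None
    # href=True skips <a> tags that have no href attribute.
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links

# Sample page invented for this example; in practice this would be response.text.
sample_html = """
<html><head><title> Example Domain </title></head>
<body>
  <a href="/docs">Docs</a>
  <a href="https://www.iana.org/domains/example">More information</a>
  <a name="anchor-only">No href</a>
</body></html>
"""

title, links = extract_title_and_links(sample_html, "https://example.com")
print("Page Title:", title)
for url in links:
    print("Link:", url)
```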