Web crawling is a widely used technique for collecting data from websites. It works by visiting web pages, following links and gathering useful information such as text, images or tables. Python has various libraries and frameworks that support web crawling. In this article we will look at web crawling using Python.
1. Web Crawling with Requests
The first step in web crawling is fetching the content of a webpage. For this we use the requests module in Python, which lets us send an HTTP request to a website and retrieve its HTML content.
- requests.get(URL): Sends a GET request to the specified URL.
- response.status_code: Checks whether the request was successful; status code 200 means success.
- response.text: Contains the HTML content of the webpage.
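As an optional hardening step (not used in the main example below), you can pass a timeout and raise an exception on HTTP errors; the timeout value and User-Agent string here are illustrative assumptions, a minimal sketch rather than a required setup:

Python
import requests

URL = "https://www.geeksforgeeks.org/"

# Hypothetical settings: a 10-second timeout and a descriptive User-Agent
resp = requests.get(URL, timeout=10, headers={"User-Agent": "my-crawler/1.0"})
resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
print("Status Code:", resp.status_code)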
Python
import requests
URL = "https://www.geeksforgeeks.org/"
resp = requests.get(URL)
print("Status Code:", resp.status_code)
print("\nResponse Content:")
print(resp.text)
Output:
Web Crawling with Requests

2. Web Crawling in JSON Format

Sometimes websites provide data in JSON format, which we need to convert into a Python dictionary. In this example a GET request is made to the Open Notify ISS location API using the requests library. If the request is successful, indicated by a status code of 200, the ISS's current location data is fetched and printed. Otherwise an error message with the status code is displayed.
- response.json(): Converts the JSON response into a Python dictionary.
- You can now access specific fields such as data['iss_position']['latitude'].
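As a quick sketch of that field access (assuming the same Open Notify endpoint and a successful response):

Python
import requests

response = requests.get("http://api.open-notify.org/iss-now.json")
data = response.json()  # parse the JSON body into a dict

# 'iss_position' holds 'latitude' and 'longitude' as strings
print("Latitude:", data['iss_position']['latitude'])
print("Longitude:", data['iss_position']['longitude'])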
Python
import requests
URL = "http://api.open-notify.org/iss-now.json"
response = requests.get(URL)
if response.status_code == 200:
    data = response.json()
    print("ISS Location Data:")
    print(data)
else:
    print(
        f"Error: Failed to retrieve data. Status code: {response.status_code}")
Output:
Web crawling in JSON format

3. Web Scraping Images with Python
You can also use web crawling to download images from websites. In this example a GET request is used to fetch an image from a given URL. If the request is successful, the image data is saved to a local file named "gfg_logo.png". Otherwise a failure message is displayed.
- response.content: Contains the binary content of the image.
- open(output_filename, "wb"): Opens a file in binary write mode to save the image.
Python
import requests
image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
output_filename = "gfg_logo.png"
response = requests.get(image_url)
if response.status_code == 200:
    with open(output_filename, "wb") as file:
        file.write(response.content)
    print(f"Image downloaded successfully as {output_filename}")
else:
    print("Failed to download the image.")
Output:
Image downloaded successfully as gfg_logo.png
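For large files it is common to stream the download instead of loading the whole response into memory. A minimal sketch using requests' stream=True (the chunk size is an arbitrary assumption):

Python
import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"

# stream=True defers downloading the body until we iterate over it
with requests.get(image_url, stream=True) as response:
    response.raise_for_status()
    with open("gfg_logo.png", "wb") as file:
        # write the image in 8 KB chunks instead of all at once
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)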
4. Crawling Elements Using XPath
We use Python to get the current temperature of Noida from a weather website. First we send a request to the website, then we use XPath to find the temperature element on the webpage. If it is found we print it, otherwise we show an error message.
Python
from lxml import etree
import requests
weather_url = "https://weather.com/en-IN/weather/today/l/60f76bec229c75a05ac18013521f7bfb52c75869637f3449105e9cb79738d492"
response = requests.get(weather_url)
if response.status_code == 200:
    dom = etree.HTML(response.text)
    elements = dom.xpath(
        "//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
    if elements:
        temperature = elements[0].text
        print(f"The current temperature is: {temperature}")
    else:
        print("Temperature element not found.")
else:
    print("Failed to fetch the webpage.")
Output:
The current temperature is: 31
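Many sites serve different markup to, or outright block, clients using the default requests User-Agent, so an XPath query that works in a browser can come back empty. A common workaround is to send a browser-like User-Agent header; the header string below is just an illustrative example, not a required value:

Python
import requests
from lxml import etree

weather_url = "https://weather.com/en-IN/weather/today/l/60f76bec229c75a05ac18013521f7bfb52c75869637f3449105e9cb79738d492"

# Browser-like User-Agent (illustrative value)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(weather_url, headers=headers)
dom = etree.HTML(response.text)
print(dom.xpath("//span[@data-testid='TemperatureValue']/text()")[:1])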
5. Reading Tables on the Web Using Pandas
We can read tables on the web using Pandas and web crawling. Pandas extracts tables from a specified URL using its read_html function. If tables are successfully extracted from the webpage, they are printed one by one with a separator. If no tables are found, a message indicating this is displayed.
Python
import pandas as pd
url = "https://www.geeksforgeeks.org/html/html-tables/"
try:
    extracted_tables = pd.read_html(url)
    for idx, table in enumerate(extracted_tables, 1):
        print(f"Table {idx}:")
        print(table)
        print("-" * 50)
except ValueError:
    # read_html raises ValueError when the page contains no tables
    print("No tables found on the webpage.")
Output:
Reading tables on the web using Pandas
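Each extracted table is an ordinary DataFrame, so you can save it for later analysis. A minimal sketch (the output filename is an arbitrary assumption):

Python
import pandas as pd

url = "https://www.geeksforgeeks.org/html/html-tables/"
tables = pd.read_html(url)

# Save the first extracted table to a CSV file (illustrative filename)
tables[0].to_csv("first_table.csv", index=False)
print(tables[0].head())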
6. Crawl a Web Page and Get the Most Frequent Words

We can also find the most frequent words on a page: we crawl it with requests and then use BeautifulSoup to read the content. We focus on a specific part of the page, a div with class 'entry-content', and extract all the words from it. After cleaning the words by removing symbols and non-alphabetic characters, we count how often each word appears. Finally we show the 10 most common words from that content.
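Before the full crawler below, the cleaning-and-counting step can be sketched in isolation. This version uses a regular expression to keep only alphabetic tokens, an alternative to the symbol-stripping loop in the full program:

Python
import re
from collections import Counter

text = "Python is great, and Python is readable; python wins!"

# Lowercase, then keep runs of letters only (drops punctuation and digits)
words = re.findall(r"[a-z]+", text.lower())

print(Counter(words).most_common(3))
# [('python', 3), ('is', 2), ('great', 1)]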
Python
import requests
from bs4 import BeautifulSoup
from collections import Counter
def start(url):
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    wordlist = []
    for each_text in soup.findAll('div', {'class': 'entry-content'}):
        content = each_text.text
        words = content.lower().split()
        for each_word in words:
            wordlist.append(each_word)
    clean_wordlist(wordlist)

def clean_wordlist(wordlist):
    clean_list = []
    symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/.,"
    for word in wordlist:
        # strip punctuation symbols from the word
        for symbol in symbols:
            word = word.replace(symbol, '')
        if len(word) > 0:
            clean_list.append(word)
    # count the words only after the whole list has been cleaned
    create_dictionary(clean_list)

def create_dictionary(clean_list):
    word_count = Counter(clean_list)
    top = word_count.most_common(10)
    print("Top 10 most frequent words:")
    for word, count in top:
        print(f'{word}: {count}')

if __name__ == "__main__":
    # GeeksforGeeks article page to crawl (example URL from this article)
    url = "https://www.geeksforgeeks.org/html/html-tables/"
    start(url)
Output:
Crawl a web page

Web crawling with Python provides an efficient way to collect and analyze data from the web. It is essential for various applications such as data mining, market research and content aggregation. With proper handling of ethical guidelines, web crawling becomes an important tool for gathering data and insights from the web.
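One such guideline is respecting a site's robots.txt. The standard library's urllib.robotparser can check whether a URL may be crawled before you fetch it; a minimal sketch (the user agent string is an illustrative assumption):

Python
from urllib.robotparser import RobotFileParser

robot_parser = RobotFileParser()
robot_parser.set_url("https://www.geeksforgeeks.org/robots.txt")
robot_parser.read()  # fetch and parse the robots.txt file

# Ask whether our (hypothetical) crawler may fetch the homepage
allowed = robot_parser.can_fetch("my-crawler/1.0", "https://www.geeksforgeeks.org/")
print("Allowed to crawl:", allowed)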