ISRA UNIVERSITY
Faculty of Engineering, Science & Technology
Department of Computer Science
Course Code: __________    Course Name: Python Programming    Credit Hours: 3(2+3)
LAB TASK # 08
Student’s Name:______________ Student’s ID: ______________
Date:_______________________ Teacher : _________________
Objective: Learn to create a web scraping project using the Scrapy library. You will set
up a Scrapy project, create a spider, scrape data from a website, and save it in different
formats.
Scrapy is a high-level framework used to scrape data from highly complex websites. With it,
bypassing CAPTCHAs is possible using predefined functions or external libraries.
You can write a simple Scrapy crawler to scrape web data by defining it as a Python class.
However, it's not particularly user-friendly compared to other Python scraping libraries.
Although the learning curve for this library is steep, you can do a lot with it, and it's highly
efficient at crawling tasks.
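As a minimal sketch of that class-based definition (the spider name and URL here are
illustrative placeholders only, not part of this lab):

import scrapy

class MinimalSpider(scrapy.Spider):
    name = "minimal"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield one item: the page title, extracted with a CSS selector
        yield {"title": response.css("title::text").get()}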
Pros:
General framework for scraping purposes.
Strong encoding support.
It doesn't require BeautifulSoup.
Cons:
Steep learning curve.
Scrapy can't scrape dynamic (JavaScript-rendered) webpages on its own.
The installation steps differ across operating systems.
Step 1: Install Scrapy and Other Dependencies
1. Open a new Colab notebook.
2. Run the following cell to install Scrapy and other necessary libraries.
!pip install scrapy
!pip install twisted
3. Since Google Colab uses asyncio, which conflicts with Scrapy's Twisted reactor,
we need to set up a compatible reactor. Add this code in a cell at the start:
import sys

if 'google.colab' in sys.modules:
    # This will fix the asyncio compatibility issue
    import nest_asyncio
    nest_asyncio.apply()

# Install the asyncio-compatible Twisted reactor before Scrapy imports the default one
from twisted.internet import asyncioreactor
asyncioreactor.install()
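Optionally, you can confirm that the asyncio-compatible reactor is the one in use (a quick
sanity check, not part of the original steps):

from twisted.internet import reactor
# If install() succeeded, this prints AsyncioSelectorReactor
print(type(reactor).__name__)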
Step 2: Set Up Scrapy Project Files in Colab
In Colab, we can't create a full Scrapy project structure as we would on a local machine.
Instead, we'll write a single, self-contained spider script as a simpler setup.
1. Create a new file called quotes_spider.py in the current directory with the following
code. This spider scrapes quotes, authors, and tags from quotes.toscrape.com.
%%writefile quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Each quote on the page sits in a div with class "quote"
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "Next" pagination link until there are no more pages
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
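Before running the full crawl, you can optionally sanity-check the CSS selectors on a
single page with Scrapy's Selector class (an extra step, not part of the original handout):

import requests
from scrapy import Selector

html = requests.get("https://quotes.toscrape.com").text
sel = Selector(text=html)
# Should print the text of the first quote on the page
print(sel.css("div.quote span.text::text").get())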
Step 3: Run the Scrapy Spider in Colab
Since Colab does not have direct access to the terminal, we will use IPython to execute
shell commands in the notebook.
1. Run the following cell to execute the spider and save the output to a JSON file
(quotes.json); a note on other output formats follows these steps:
!scrapy runspider quotes_spider.py -o quotes.json
2. This command should run the spider and output the scraped data to quotes.json.
3. To check whether the data has been scraped successfully, you can load and display the
contents of quotes.json:
import json

with open("quotes.json", "r") as f:
    quotes_data = json.load(f)

# Display the first few quotes
quotes_data[:5]
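Scrapy's -o flag infers the feed format from the file extension, so the same spider can
export to other formats as well; for example:

!scrapy runspider quotes_spider.py -o quotes.csv
!scrapy runspider quotes_spider.py -o quotes.jl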
Step 4: Display Data in a DataFrame for Easy Viewing
If you want to work with the data in a more structured format, you can load the JSON
data into a Pandas DataFrame:
import pandas as pd

quotes_df = pd.DataFrame(quotes_data)
quotes_df.head()
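If you also want the tabular version on disk (an optional extra, not in the original
steps), Pandas can write it out directly; the file name here is arbitrary:

# Save the DataFrame as CSV; index=False drops the row index column
quotes_df.to_csv("quotes_table.csv", index=False)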
TASK:
1. Books to Scrape
URL: http://books.toscrape.com/
Data Available: Book titles, prices, availability, ratings, and categories.
Description: This site has a collection of books with structured categories, making it a good
source for scraping information related to products in a catalog-style format.
Example Elements to Scrape (a starter spider skeleton follows this list):
Book titles: article.product_pod h3 a::attr(title)
Price: p.price_color::text
Availability: p.instock.availability::text
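As a starting point for the task, here is a skeleton spider that plugs in the selectors
listed above; treat the field names and the pagination selector as assumptions to verify
against the live site:

%%writefile books_spider.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                # Availability text is padded with whitespace, so join and strip it
                'availability': ''.join(
                    book.css('p.instock.availability::text').getall()
                ).strip(),
            }
        # Pagination is assumed to work like the "next" link on quotes.toscrape.com
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)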