
ISRA UNIVERSITY

Faculty of Engineering, Science & Technology


Department of Computer Science

Course Code: ______________ Course Name: Python Programming Credit Hours: 3(2+3)

LAB TASK # 08

Student’s Name:______________ Student’s ID: ______________

Date:_______________________ Teacher: _________________

Objective: Learn to create a web scraping project using the Scrapy library. You will set
up a Scrapy project, create a spider, scrape data from a website, and save it in different
formats.

Scrapy is a high-level framework used to scrape data from highly complex websites. With it, working around CAPTCHAs using predefined functions or external libraries is possible.
You write a Scrapy crawler by defining a Python class (a spider) that describes which pages to fetch and how to parse them. However, it's not particularly user-friendly compared to other Python scraping libraries.
Although the learning curve for this library is steep, you can do a lot with it, and it's highly efficient at crawling tasks.
Pros:
 A general-purpose framework for scraping tasks.
 Strong encoding support.
 It doesn't require BeautifulSoup; Scrapy ships with its own selectors.
Cons:
 Steep learning curve.
 Scrapy can't scrape dynamic (JavaScript-rendered) webpages on its own.
 Different websites often require different setup and configuration steps.
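
As noted above, a Scrapy crawler is defined as a Python class. Here is a minimal sketch (the spider name and target URL are placeholder choices; a fuller spider is built in Step 2):

import scrapy

class MinimalSpider(scrapy.Spider):
    # Every spider needs a unique name and at least one start URL
    name = "minimal"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # parse() receives each downloaded page; yield one dict per scraped item
        yield {'page_title': response.css('title::text').get()}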

Step 1: Install Scrapy and Other Dependencies

1. Open a new Colab notebook.


2. Run the following cell to install Scrapy and other necessary libraries.

!pip install scrapy

!pip install twisted

!pip install nest_asyncio

3. Since Google Colab uses asyncio, which conflicts with Scrapy's Twisted reactor,
we need to set up a compatible reactor. Add this code in a cell at the start:

import sys

if 'google.colab' in sys.modules:

    # This will fix the asyncio compatibility issue
    import nest_asyncio
    nest_asyncio.apply()

from twisted.internet import asyncioreactor

asyncioreactor.install()

Step 2: Set Up Scrapy Project Files in Colab

In Colab, we can’t create a full Scrapy project structure as we would on a local machine. Instead, we’ll create a single standalone spider script as a simpler setup.

1. Create a new file called quotes_spider.py in the current directory with the following
code. This spider scrapes quotes, authors, and tags from quotes.toscrape.com.

%%writefile quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits inside a div with class "quote"
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" pagination link until there are no more pages
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Step 3: Run the Scrapy Spider in Colab

Since Colab does not have direct access to the terminal, we will use IPython to execute
shell commands in the notebook.

1. Run the following cell to execute the spider and save the output to a JSON file
(quotes.json):

!scrapy runspider quotes_spider.py -o quotes.json

2. This command should run the spider and output data to quotes.json.
3. To check if data has been scraped successfully, you can load and display the
contents of quotes.json:

import json

# Load the scraped results back into Python
with open("quotes.json", "r") as f:
    quotes_data = json.load(f)

# Display the first few quotes
quotes_data[:5]

Step 4: Display Data in a DataFrame for Easy Viewing

If you want to work with the data in a more structured format, you can load the JSON
data into a Pandas DataFrame:

import pandas as pd

quotes_df = pd.DataFrame(quotes_data)
quotes_df.head()
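
The objective also mentions saving the data in different formats. As a sketch, the DataFrame can be exported to CSV (the filename quotes.csv is an arbitrary choice); equivalently, Scrapy's -o flag infers the output format from the file extension, so -o quotes.csv in Step 3 would also work:

# Write the scraped quotes out as CSV; index=False drops the row numbers
quotes_df.to_csv('quotes.csv', index=False)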

TASK:

1. Books to Scrape

 URL: http://books.toscrape.com/
 Data Available: Book titles, prices, availability, ratings, and categories.
 Description: This site has a collection of books with structured categories, making it a good
source for scraping information related to products in a catalog-style format.

Example Elements to Scrape:

 Book titles: article.product_pod h3 a::attr(title)


 Price: p.price_color::text
 Availability: p.instock.availability::text
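
As a starting point for the task, here is a minimal spider sketch using the selectors above (the spider name "books", the output field names, and the availability text clean-up are illustrative choices; ratings and categories are left for you to add):

%%writefile books_spider.py
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on a catalogue page is an <article class="product_pod">
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                # Join the availability text nodes and strip surrounding whitespace
                'availability': ''.join(
                    book.css('p.instock.availability::text').getall()).strip(),
            }

        # Follow the pagination link; response.follow resolves relative URLs
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)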

