Financial Data Science (FIN42110)
Dr. Richard McGee
Web Scraping
Introduction
Definitions
Web Scraping
Using tools to gather data you can see on a webpage.
A wide range of web scraping
techniques and tools exists, ranging
in complexity from simple copy/paste
up to automation tools, HTML
parsing, APIs and programming.
Definitions
HTTP
HyperText Transfer Protocol
• HTTP is the foundation of data
communication for the World Wide
Web, where hypertext documents
include hyperlinks to other resources
that the user can easily access, for
example by a mouse click or by
tapping the screen in a web browser.
The protocol defines aspects of authentication, requests, status
codes, persistent connections, client/server request/response,
etc.
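To make the request/response idea concrete, the sketch below formats the raw text an HTTP/1.1 client actually sends for a simple GET, using only the standard library. The host and path are placeholders for illustration.

```python
# Build the raw text of a minimal HTTP/1.1 GET request by hand,
# to show the structure the protocol defines.

def build_get_request(host: str, path: str = "/") -> str:
    """Return the raw text an HTTP client sends for a simple GET."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, version
        f"Host: {host}",          # required header in HTTP/1.1
        "Connection: close",      # ask the server not to keep the connection open
        "",                       # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines)

request_text = build_get_request("example.com", "/index.html")
print(request_text)
```

A real client (such as the `requests` package used later in these slides) builds and sends exactly this kind of text, then parses the server's response line, headers and body.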
Definitions
HTML
HyperText Markup Language
• HyperText Markup Language
(HTML) is the set of markup
symbols or codes inserted into a file
intended for display on the Internet.
The markup tells web browsers how
to display a web page’s words and
images.
Each individual piece of markup code (which falls between
"<" and ">" characters) is referred to as an element, though
many people also refer to it as a tag.
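To see those elements programmatically, a small sketch using the standard library's `html.parser` module collects every tag name it encounters in a short document (the document itself is made up for illustration):

```python
from html.parser import HTMLParser

# Collect the names of all elements (tags) encountered in a document.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

html_doc = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
collector = TagCollector()
collector.feed(html_doc)
print(collector.tags)  # ['html', 'body', 'h1', 'p', 'b']
```

The same event-driven idea (a callback per start tag, end tag and text run) underlies the HTML parsers used by BeautifulSoup later in these slides.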
Definitions
XML
Extensible Markup Language
• Extensible Markup Language (XML)
is a markup language and file format
for storing, transmitting, and
reconstructing arbitrary data.
It defines a set of rules for encoding documents in a format that
is both human-readable and machine-readable.
XML is about encoding data, HTML is about display.
XML Example
• We will check out the Books.xml example file on
Brightspace.
• With a new data set it can be useful to use an XML viewer
to view the hierarchy:
• https://www.xmlgrid.net/
XML parsing
from lxml.etree import fromstring

with open('Books.xml', 'r') as file:
    xml = file.read()

root = fromstring(xml)
for book in root.xpath("/catalog/book"):
    print(book.xpath("title")[0].text)
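If Books.xml is not to hand, the same idea can be tried with the standard library's `xml.etree.ElementTree` on an inline document; the catalog/book/title shape below is assumed to mirror the Brightspace file:

```python
import xml.etree.ElementTree as ET

# A tiny inline document with the catalog/book/title shape
# assumed for the Books.xml example.
xml_text = """
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>
"""

root = ET.fromstring(xml_text)            # root is the <catalog> element
titles = [book.find("title").text         # child lookup instead of XPath
          for book in root.findall("book")]
print(titles)  # ["XML Developer's Guide", 'Midnight Rain']
```

`ElementTree` supports only a limited XPath subset; `lxml`, as used above, implements full XPath 1.0.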
Definitions
JSON
JavaScript Object Notation
• JSON is a lightweight computer
data interchange format. It is a
text-based, human-readable format
for representing simple data
structures and associative arrays
(called objects) in serialization and
serves as an alternative to XML.
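A minimal round trip with the standard library's `json` module shows what "text-based" means in practice (the document below is made up for illustration):

```python
import json

# A JSON document (text) and its round trip through Python objects.
text = '{"ticker": "NFLX", "prices": [410.5, 412.0], "active": true}'

data = json.loads(text)          # parse text -> dict / list / str / float / bool
print(data["ticker"], data["prices"][0])

encoded = json.dumps(data, sort_keys=True)  # serialize back to text
print(encoded)
```

JSON objects map directly onto Python dicts and JSON arrays onto lists, which is one reason many web APIs now prefer it over XML.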
Definitions
API
Application Programming Interface
• An application programming
interface (API) is a connection
between computers or between
computer programs. It is a type of
software interface, offering a service
to other pieces of software.
Definitions
SOAP
Simple Object Access Protocol
• SOAP is a messaging protocol for
exchanging structured (typically
XML) information between systems,
commonly used to implement an API.
Definitions
• Parsing
• The act of analyzing strings and symbols to extract
only the data you need.
• Crawling
• Moving across or through a website to gather data
from more than one URL or page.
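Crawling can be sketched as a breadth-first traversal of pages, visiting each once. The toy "site" below is an in-memory dict standing in for real fetches; in a genuine crawler each list of links would come from downloading and parsing the page's HTML:

```python
from collections import deque

# A toy "website": page -> links it contains. In a real crawler these
# would come from fetching each URL and parsing its HTML.
site = {
    "/home":      ["/about", "/data"],
    "/about":     ["/home"],
    "/data":      ["/data/2023", "/about"],
    "/data/2023": [],
}

def crawl(start: str) -> list:
    """Visit every page reachable from `start`, breadth-first, once each."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in site.get(page, []):
            if link not in seen:      # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/home"))  # ['/home', '/about', '/data', '/data/2023']
```

The `seen` set is what keeps a crawler from looping forever on pages that link back to each other.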
HTML Structure: div
<html>
<head>
<style>
.myDiv {
border: 5px outset red;
background-color: lightblue;
text-align: center;
}
</style>
</head>
<body>
<div class="myDiv">
<h2>This is a heading in a div element</h2>
<p>This is some text in a div element.</p>
</div>
</body>
</html>
• division/section; used as a container for HTML elements.
• https://www.w3schools.com/Tags/tag_div.asp
HTML Structure: table/tr/td
<table>
<tr>
<td>Cell A</td>
<td>Cell B</td>
</tr>
<tr>
<td>Cell C</td>
<td>Cell D</td>
</tr>
</table>
• one <table> and one or more <tr>, <th>, and <td> elements
• https://www.w3schools.com/Tags/tag_table.asp
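A sketch of how the table above can be turned into rows of cells using only the standard library's `html.parser`, the same tr/td structure the scraping examples later rely on:

```python
from html.parser import HTMLParser

# Parse a table into a list of rows, one list of cell strings per <tr>.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:               # keep only text inside <td> cells
            self._row.append(data.strip())

html_table = ("<table><tr><td>Cell A</td><td>Cell B</td></tr>"
              "<tr><td>Cell C</td><td>Cell D</td></tr></table>")
parser = TableParser()
parser.feed(html_table)
print(parser.rows)  # [['Cell A', 'Cell B'], ['Cell C', 'Cell D']]
```

BeautifulSoup's `find_all('tr')` / `find_all('td')`, used in the CryptoPunk example later, performs the same extraction with far less code.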
Robots.txt
• Instructs web robots (typically search engine robots) how
to crawl pages on the website.
• Example: https://www.buzzfeed.com/robots.txt
• Accessing at too high a frequency will get you blocked!
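Python's standard library can interpret robots.txt rules for you. The sketch below parses a small made-up robots.txt from text; in practice `RobotFileParser` can also fetch a site's `/robots.txt` itself via `set_url()` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# A small, made-up robots.txt, parsed directly from text.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch a given URL.
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

Checking `can_fetch` before each request, and spacing requests out, is the polite way to avoid being blocked.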
Useful Python Packages
• pip install beautifulsoup4
• pip install requests
• pip install html5lib
• pip install yfinance
• pip install mplfinance
• pip install twython
• pip install selenium
• install chrome browser
• and chrome driver matching browser version
Crypto Punks
Example Project: Crypto Punk Pricing
https://www.larvalabs.com/cryptopunks
Step 1: Specify what you are looking for. In this case:
• a database of 10,000 crypto punks
• their key features
• their trade history and prices
• looking to explain prices with features.
Example Project: Crypto Punk Pricing
Examine the web page source:
Example Project: Crypto Punk Pricing
Step 2: Design your database structure
• for this project I will use a simple SQLite DB
• https://sqlitebrowser.org/dl/
• I will create two tables - a punk attribute table and a trade
table.
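A sketch of what Step 2 might look like with the standard library's `sqlite3` module. The column names below are my own illustrative choices, not the actual course schema; the in-memory database would be a filename in practice:

```python
import sqlite3

# In-memory DB for illustration; pass a filename instead to persist.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per punk with its attributes (hypothetical columns).
cur.execute("""
    CREATE TABLE punks (
        punk_id     INTEGER PRIMARY KEY,
        punk_type   TEXT,
        accessories TEXT
    )
""")

# One row per trade (hypothetical columns).
cur.execute("""
    CREATE TABLE trades (
        trade_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        punk_id    INTEGER REFERENCES punks(punk_id),
        trade_date TEXT,
        amount_eth REAL
    )
""")

# Insert one example row into each table.
cur.execute("INSERT INTO punks VALUES (1, 'Male', 'Smile,Mohawk')")
cur.execute("INSERT INTO trades (punk_id, trade_date, amount_eth) "
            "VALUES (1, '2021-03-01', 60.0)")
conn.commit()

print(cur.execute("SELECT punk_id, amount_eth FROM trades").fetchall())  # [(1, 60.0)]
```

Keeping attributes and trades in separate tables, joined on `punk_id`, lets the later regression of prices on features be a simple SQL join.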
Example Project: Crypto Punk Pricing
Step 3: Examine the web site structure (view page source in
browser)
• CryptoPunks is nicely structured with one page per punk,
numbered 0–9,999
• e.g. punk 1 is at
https://www.larvalabs.com/cryptopunks/details/1
Example Project: Crypto Punk Pricing
• Example: Print trade dates and amounts for one punk.
import requests
from bs4 import BeautifulSoup

# Crypto Punk
#~~~~~~~~~~~~
BaseStr = "https://www.larvalabs.com/cryptopunks/details/"
PunkNo = '1'
page = requests.get(BaseStr + PunkNo)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', attrs={'class': 'table'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if cols:
        cols = [ele.text.strip() for ele in cols]
        print(cols[4] + ' : ' + cols[3])
Yahoo Finance API
Web scraping from Yahoo Finance
import yfinance as yf
import mplfinance as mpf
import numpy as np
ticker_name = 'NFLX'
yticker = yf.Ticker(ticker_name)
nflx = yticker.history(period="1y") # max, 1y, 3mo
....
# Compute log returns
nflx['Return'] = np.log(nflx['Close']/nflx['Close'].shift(1))
https://pypi.org/project/yfinance/
https://pypi.org/project/mplfinance/
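The log-return line above can be checked by hand with the standard library on a short, made-up price series. The formula is the same as the pandas expression: the return at time t is log(P_t / P_{t-1}), undefined for the first observation (None here, where pandas would give NaN):

```python
import math

# Made-up closing prices standing in for nflx['Close'].
closes = [100.0, 105.0, 102.9]

# log(P_t / P_{t-1}) for each day after the first; the first day has
# no prior close, so its return is undefined.
returns = [None] + [math.log(closes[i] / closes[i - 1])
                    for i in range(1, len(closes))]

print(returns[0], round(returns[1], 6), round(returns[2], 6))
```

Log returns are preferred here because they add across periods: the two daily returns sum to log(102.9/100), the return over the whole window.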
MPL output example
More Scraping
Web Scraping Example: House of Representatives
https://www.house.gov/representatives
Web Scraping Example
def main():
    from bs4 import BeautifulSoup
    import requests

    url = "https://www.house.gov/representatives"
    text = requests.get(url).text
    soup = BeautifulSoup(text, "html5lib")

    all_urls = [a['href']
                for a in soup('a')
                if a.has_attr('href')]
    print(len(all_urls))
Example from Data Science from Scratch, Joel Grus
Web Scraping Example
import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]
print(len(good_urls))
good_urls = list(set(good_urls))
Example from Data Science from Scratch, Joel Grus.
For regex see, e.g.: https://www.w3schools.com/python/python_regex.asp
Web Scraping Example
from bs4 import BeautifulSoup
import requests
def paragraph_mentions(text: str, keyword: str) -> bool:
"""
Returns True if a <p> inside the text mentions {keyword}
"""
soup = BeautifulSoup(text, 'html5lib')
paragraphs = [p.get_text() for p in soup('p')]
return any(keyword.lower() in paragraph.lower()
for paragraph in paragraphs)
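A quick check of `paragraph_mentions` on inline HTML. The function is restated here with the stdlib `"html.parser"` backend so the snippet needs only beautifulsoup4 installed (from the package list above); the HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

def paragraph_mentions(text: str, keyword: str) -> bool:
    """Returns True if a <p> inside the text mentions {keyword}."""
    soup = BeautifulSoup(text, "html.parser")
    paragraphs = [p.get_text() for p in soup("p")]
    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

html = "<body><h1>Data elsewhere</h1><p>Budget and policy.</p></body>"
print(paragraph_mentions(html, "policy"))  # True
print(paragraph_mentions(html, "data"))    # False - 'data' is in <h1>, not in a <p>
```

The second result shows why the function restricts itself to `<p>` elements: a keyword in a heading or menu would not count as a paragraph mention.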
Example from Data Science from Scratch, Joel Grus
Web Scraping Example
import random
from typing import Dict, Set
good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")
press_releases: Dict[str, Set[str]] = {}
for house_url in good_urls:
html = requests.get(house_url).text
soup = BeautifulSoup(html, 'html5lib')
pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
print(f"{house_url}: {pr_links}")
press_releases[house_url] = pr_links
for house_url, pr_links in press_releases.items():
for pr_link in pr_links:
url = f"{house_url}/{pr_link}"
text = requests.get(url).text
if paragraph_mentions(text, 'data'):
print(f"{house_url}")
break # done with this house_url
Example from Data Science from Scratch, Joel Grus