Financial Data Science (FIN42110)
Dr. Richard McGee
Web Scraping
Introduction
Definitions
Web Scraping
Using tools to gather data you can see on a webpage.
A wide range of web scraping
techniques and tools exists, ranging
in complexity from simple copy/paste
up to automation tools, HTML
parsing, APIs and programming.
Definitions
HTTP
HyperText Transfer Protocol
• HTTP is the foundation of data
communication for the World Wide
Web, where hypertext documents
include hyperlinks to other resources
that the user can easily access, for
example by a mouse click or by
tapping the screen in a web browser.
The protocol defines aspects of authentication, requests, status
codes, persistent connections, client/server request/response,
etc.
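To make the request/response idea concrete, the sketch below formats the raw text an HTTP/1.1 client actually sends for a simple GET, using only the standard library. The host and path are placeholders for illustration.

```python
# Build the raw text of a minimal HTTP/1.1 GET request by hand,
# to show the structure the protocol defines.

def build_get_request(host: str, path: str = "/") -> str:
    """Return the raw text an HTTP client sends for a simple GET."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, version
        f"Host: {host}",          # required header in HTTP/1.1
        "Connection: close",      # ask the server not to keep the connection open
        "",                       # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines)

request_text = build_get_request("example.com", "/index.html")
print(request_text)
```

A real client (such as the `requests` package used later in these slides) builds and sends exactly this kind of text, then parses the server's response line, headers and body.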
Definitions
HTML
HyperText Markup Language
• HyperText Markup Language
(HTML) is the set of markup
symbols or codes inserted into a file
intended for display on the Internet.
The markup tells web browsers how
to display a web page’s words and
images.
Each individual piece of markup code (which falls between
"<" and ">" characters) is referred to as an element, though
many people also refer to it as a tag.
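To see those elements programmatically, a small sketch using the standard library's `html.parser` module collects every tag name it encounters in a short document (the document itself is made up for illustration):

```python
from html.parser import HTMLParser

# Collect the names of all elements (tags) encountered in a document.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

html_doc = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
collector = TagCollector()
collector.feed(html_doc)
print(collector.tags)  # ['html', 'body', 'h1', 'p', 'b']
```

The same event-driven idea (a callback per start tag, end tag and text run) underlies the HTML parsers used by BeautifulSoup later in these slides.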
Definitions
XML
Extensible Markup Language
• Extensible Markup Language (XML)
is a markup language and file format
for storing, transmitting, and
reconstructing arbitrary data.
It defines a set of rules for encoding documents in a format that
is both human-readable and machine-readable.
XML is about encoding data, HTML is about display.
XML Example
• We will check out the Books.xml example file on
Brightspace.
• With a new data set it can be useful to use an XML viewer
to view the hierarchy:
• https://www.xmlgrid.net/
XML parsing
from lxml.etree import fromstring

with open('Books.xml', 'r') as file:
    xml = file.read()

root = fromstring(xml)
for book in root.xpath("/catalog/book"):
    print(book.xpath("title")[0].text)
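If Books.xml is not to hand, the same idea can be tried with the standard library's `xml.etree.ElementTree` on an inline document; the catalog/book/title shape below is assumed to mirror the Brightspace file:

```python
import xml.etree.ElementTree as ET

# A tiny inline document with the catalog/book/title shape
# assumed for the Books.xml example.
xml_text = """
<catalog>
  <book id="bk101"><title>XML Developer's Guide</title></book>
  <book id="bk102"><title>Midnight Rain</title></book>
</catalog>
"""

root = ET.fromstring(xml_text)            # root is the <catalog> element
titles = [book.find("title").text         # child lookup instead of XPath
          for book in root.findall("book")]
print(titles)  # ["XML Developer's Guide", 'Midnight Rain']
```

`ElementTree` supports only a limited XPath subset; `lxml`, as used above, implements full XPath 1.0.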
Definitions
JSON
JavaScript Object Notation
• JSON is a lightweight computer
data interchange format. It is a
text-based, human-readable format
for representing simple data
structures and associative arrays
(called objects) in serialization and
serves as an alternative to XML.
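A minimal round trip with the standard library's `json` module shows what "text-based" means in practice (the document below is made up for illustration):

```python
import json

# A JSON document (text) and its round trip through Python objects.
text = '{"ticker": "NFLX", "prices": [410.5, 412.0], "active": true}'

data = json.loads(text)          # parse text -> dict / list / str / float / bool
print(data["ticker"], data["prices"][0])

encoded = json.dumps(data, sort_keys=True)  # serialize back to text
print(encoded)
```

JSON objects map directly onto Python dicts and JSON arrays onto lists, which is one reason many web APIs now prefer it over XML.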
Definitions
API
Application Programming Interface
• An application programming
interface (API) is a connection
between computers or between
computer programs. It is a type of
software interface, offering a service
to other pieces of software.
Definitions
SOAP
Simple Object Access Protocol
• SOAP is a messaging protocol for
exchanging structured (typically
XML) information between systems,
commonly used to implement an API.
Definitions
• Parsing
• The act of analyzing strings and symbols to extract
only the data you need.
• Crawling
• Moving across or through a website to gather data
from more than one URL or page.
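Crawling can be sketched as a breadth-first traversal of pages, visiting each once. The toy "site" below is an in-memory dict standing in for real fetches; in a genuine crawler each list of links would come from downloading and parsing the page's HTML:

```python
from collections import deque

# A toy "website": page -> links it contains. In a real crawler these
# would come from fetching each URL and parsing its HTML.
site = {
    "/home":      ["/about", "/data"],
    "/about":     ["/home"],
    "/data":      ["/data/2023", "/about"],
    "/data/2023": [],
}

def crawl(start: str) -> list:
    """Visit every page reachable from `start`, breadth-first, once each."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in site.get(page, []):
            if link not in seen:      # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/home"))  # ['/home', '/about', '/data', '/data/2023']
```

The `seen` set is what keeps a crawler from looping forever on pages that link back to each other.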
HTML Structure: div
<html>
<head>
<style>
.myDiv {
border: 5px outset red;
background-color: lightblue;
text-align: center;
}
</style>
</head>
<body>
<div class="myDiv">
<h2>This is a heading in a div element</h2>
<p>This is some text in a div element.</p>
</div>
</body>
</html>
• division/section; used as a container for HTML elements.
• https://www.w3schools.com/Tags/tag_div.asp
HTML Structure: table/tr/td
<table>
<tr>
<td>Cell A</td>
<td>Cell B</td>
</tr>
<tr>
<td>Cell C</td>
<td>Cell D</td>
</tr>
</table>
• one <table> and one or more <tr>, <th>, and <td> elements
• https://www.w3schools.com/Tags/tag_table.asp
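A sketch of how the table above can be turned into rows of cells using only the standard library's `html.parser`, the same tr/td structure the scraping examples later rely on:

```python
from html.parser import HTMLParser

# Parse a table into a list of rows, one list of cell strings per <tr>.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:               # keep only text inside <td> cells
            self._row.append(data.strip())

html_table = ("<table><tr><td>Cell A</td><td>Cell B</td></tr>"
              "<tr><td>Cell C</td><td>Cell D</td></tr></table>")
parser = TableParser()
parser.feed(html_table)
print(parser.rows)  # [['Cell A', 'Cell B'], ['Cell C', 'Cell D']]
```

BeautifulSoup's `find_all('tr')` / `find_all('td')`, used in the CryptoPunk example later, performs the same extraction with far less code.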
Robots.txt
• Instructs web robots (typically search engine robots) how
to crawl pages on the website.
• Example: https://www.buzzfeed.com/robots.txt
• Accessing at too high a frequency will get you blocked!
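Python's standard library can interpret robots.txt rules for you. The sketch below parses a small made-up robots.txt from text; in practice `RobotFileParser` can also fetch a site's `/robots.txt` itself via `set_url()` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# A small, made-up robots.txt, parsed directly from text.
robots_txt = """
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch a given URL.
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

Checking `can_fetch` before each request, and spacing requests out, is the polite way to avoid being blocked.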
Useful Python Packages
• pip install beautifulsoup4
• pip install requests
• pip install html5lib
• pip install yfinance
• pip install mplfinance
• pip install twython
• pip install selenium
• install chrome browser
• and chrome driver matching browser version
Crypto Punks
Example Project: Crypto Punk Pricing
https://www.larvalabs.com/cryptopunks
Step 1: Specify what you are looking for. In this case:
• a database of 10,000 crypto punks
• their key features
• their trade history and prices
• looking to explain prices with features.
Example Project: Crypto Punk Pricing
Examine the web page source:
Example Project: Crypto Punk Pricing
Step 2: Design your database structure
• for this project I will use a simple SQLite DB
• https://sqlitebrowser.org/dl/
• I will create two tables - a punk attribute table and a trade
table.
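A sketch of what Step 2 might look like with the standard library's `sqlite3` module. The column names below are my own illustrative choices, not the actual course schema; the in-memory database would be a filename in practice:

```python
import sqlite3

# In-memory DB for illustration; pass a filename instead to persist.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per punk with its attributes (hypothetical columns).
cur.execute("""
    CREATE TABLE punks (
        punk_id     INTEGER PRIMARY KEY,
        punk_type   TEXT,
        accessories TEXT
    )
""")

# One row per trade (hypothetical columns).
cur.execute("""
    CREATE TABLE trades (
        trade_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        punk_id    INTEGER REFERENCES punks(punk_id),
        trade_date TEXT,
        amount_eth REAL
    )
""")

# Insert one example row into each table.
cur.execute("INSERT INTO punks VALUES (1, 'Male', 'Smile,Mohawk')")
cur.execute("INSERT INTO trades (punk_id, trade_date, amount_eth) "
            "VALUES (1, '2021-03-01', 60.0)")
conn.commit()

print(cur.execute("SELECT punk_id, amount_eth FROM trades").fetchall())  # [(1, 60.0)]
```

Keeping attributes and trades in separate tables, joined on `punk_id`, lets the later regression of prices on features be a simple SQL join.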
Example Project: Crypto Punk Pricing
Step 3: Examine the web site structure (view page source in
browser)
• CryptoPunks is nicely structured with one page per punk,
numbered 0–9,999
• e.g. punk 1 is at
https://www.larvalabs.com/cryptopunks/details/1
Example Project: Crypto Punk Pricing
• Example: Print trade dates and amounts for one punk.
import requests
from bs4 import BeautifulSoup

# Crypto Punk
#~~~~~~~~~~~~
BaseStr = "https://www.larvalabs.com/cryptopunks/details/"
PunkNo = '1'
page = requests.get(BaseStr + PunkNo)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', attrs={'class': 'table'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if cols:
        cols = [ele.text.strip() for ele in cols]
        print(cols[4] + ' : ' + cols[3])
Yahoo Finance API
Web scraping from Yahoo Finance
import yfinance as yf
import mplfinance as mpf
import numpy as np
ticker_name = 'NFLX'
yticker = yf.Ticker(ticker_name)
nflx = yticker.history(period="1y") # max, 1y, 3mo
....
# Compute log returns
nflx['Return'] = np.log(nflx['Close']/nflx['Close'].shift(1))
https://pypi.org/project/yfinance/
https://pypi.org/project/mplfinance/
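The log-return line above can be checked by hand with the standard library on a short, made-up price series. The formula is the same as the pandas expression: the return at time t is log(P_t / P_{t-1}), undefined for the first observation (None here, where pandas would give NaN):

```python
import math

# Made-up closing prices standing in for nflx['Close'].
closes = [100.0, 105.0, 102.9]

# log(P_t / P_{t-1}) for each day after the first; the first day has
# no prior close, so its return is undefined.
returns = [None] + [math.log(closes[i] / closes[i - 1])
                    for i in range(1, len(closes))]

print(returns[0], round(returns[1], 6), round(returns[2], 6))
```

Log returns are preferred here because they add across periods: the two daily returns sum to log(102.9/100), the return over the whole window.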
MPL output example
More Scraping
Web Scraping Example: House of Representatives
https://www.house.gov/representatives
Web Scraping Example
def main():
    from bs4 import BeautifulSoup
    import requests

    url = "https://www.house.gov/representatives"
    text = requests.get(url).text
    soup = BeautifulSoup(text, "html5lib")

    all_urls = [a['href']
                for a in soup('a')
                if a.has_attr('href')]
    print(len(all_urls))
Example from Data Science from Scratch, Joel Grus
Web Scraping Example
import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]
print(len(good_urls))
good_urls = list(set(good_urls))
Example from Data Science from Scratch, Joel Grus.
For regex see, e.g.: https://www.w3schools.com/python/python_regex.asp
Web Scraping Example
from bs4 import BeautifulSoup
import requests
def paragraph_mentions(text: str, keyword: str) -> bool:
"""
Returns True if a <p> inside the text mentions {keyword}
"""
soup = BeautifulSoup(text, 'html5lib')
paragraphs = [p.get_text() for p in soup('p')]
return any(keyword.lower() in paragraph.lower()
for paragraph in paragraphs)
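A quick check of `paragraph_mentions` on inline HTML. The function is restated here with the stdlib `"html.parser"` backend so the snippet needs only beautifulsoup4 installed (from the package list above); the HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

def paragraph_mentions(text: str, keyword: str) -> bool:
    """Returns True if a <p> inside the text mentions {keyword}."""
    soup = BeautifulSoup(text, "html.parser")
    paragraphs = [p.get_text() for p in soup("p")]
    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

html = "<body><h1>Data elsewhere</h1><p>Budget and policy.</p></body>"
print(paragraph_mentions(html, "policy"))  # True
print(paragraph_mentions(html, "data"))    # False - 'data' is in <h1>, not in a <p>
```

The second result shows why the function restricts itself to `<p>` elements: a keyword in a heading or menu would not count as a paragraph mention.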
Example from Data Science from Scratch, Joel Grus
Web Scraping Example
import random
from typing import Dict, Set
good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")
press_releases: Dict[str, Set[str]] = {}
for house_url in good_urls:
html = requests.get(house_url).text
soup = BeautifulSoup(html, 'html5lib')
pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
print(f"{house_url}: {pr_links}")
press_releases[house_url] = pr_links
for house_url, pr_links in press_releases.items():
for pr_link in pr_links:
url = f"{house_url}/{pr_link}"
text = requests.get(url).text
if paragraph_mentions(text, 'data'):
print(f"{house_url}")
break # done with this house_url
Example from Data Science from Scratch, Joel Grus