CSCE 590 Web Scraping
Topics
Overview – Web scraping
Introduction to Python 3.5
Readings:
Chapter 1
January 10, 2017
Course Information
Contact Information:
• Instructor: Manton Matthews
• Office: Swearingen 3A53
• Email: mm <at> sc
• Office phone: 777-3285
• Course location: Sumwalt 305
• Course Time: TR 8:30-9:45
• Office Hours: TR 11:30-1:00PM, others by appointment
Textbook:
Required: Web Scraping with Python – Collecting
Data from the Modern Web by Ryan Mitchell,
O’Reilly, 2015.
CSCE 590 Web Scraping Spring 2017
Resources - Websites or online texts
Python 3.5 documentation - https://docs.python.org/3.5/
Tutorial (Required)
https://docs.python.org/3.5/tutorial/index.html
The Standard Library
https://docs.python.org/3.5/library/index.html
Python – Fluent Python – Clear, Concise and Effective
Programming by Luciano Ramalho, O’Reilly 2015.
Natural Language Toolkit – for Python 3.x
http://www.nltk.org/book/ and for Python 2.x
http://www.nltk.org/book_1ed/
Scrapy -
https://doc.scrapy.org/en/latest/intro/tutorial.html
Dive into Python 3 -
https://cloud.github.com/downloads/diveintomark/diveintopython3/dive-into-python3.pdf
Why Scrape?
“Cheapest flight to Boston”
Google knows what is on content pages but
not the results of queries about specific flights.
A scraper can query all the popular sites with your
preferences and optimize the results the way you
would like.
Views of Web Applications
2.1 100,000 Feet: Client-Server Architecture
2.2 50,000 Feet: Communication— HTTP and URIs
2.3 10,000 Feet: Representation— HTML and CSS
2.4 5,000 Feet: 3-Tier Architecture & Horizontal
Scaling
2.5 1,000 Feet: Model-View-Controller Architecture
Fox, Armando; Patterson, David. Engineering
Software as a Service: An Agile Approach Using
Cloud Computing (Kindle Locations 1481-1485).
Strawberry Canyon LLC. Kindle Edition.
2.1 100,000 Feet View: Client-Server
Architecture
HTTP & HTTP/2 // between browsers and servers
TCP protocol
IP protocol
Ethernet protocol
2.2 50,000 Feet: Communication—
HTTP and URIs
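A minimal sketch of how a URI breaks into the parts an HTTP client cares about, using only the standard library. The URL below is hypothetical and exists only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical flight-search URL, used only for illustration.
url = "https://www.example.com:8443/flights/search?from=CAE&to=BOS#results"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8443
print(parts.path)      # /flights/search
print(parts.fragment)  # results

# The query string maps each parameter name to a list of values.
query = parse_qs(parts.query)
print(query["from"])   # ['CAE']
print(query["to"])     # ['BOS']
```

A scraper typically varies the query parameters and leaves the rest of the URI fixed.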
2.3 10,000 Feet: Representation—
HTML and CSS
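A minimal sketch of reading structure out of HTML, assuming a tiny made-up page held in a string. It uses the standard library's html.parser so that it runs without extra installs; the LinkCollector class and the page content are illustrative, not part of the course materials.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Made-up page: the class attribute is what a CSS selector would match on.
page = """<html><head><title>Demo</title></head>
<body><h1 class="headline">Flights</h1>
<a href="/boston">Boston</a> <a href="/denver">Denver</a></body></html>"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/boston', '/denver']
```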
2.5 1,000 Feet: Model-View-
Controller Architecture
Introduction to Python
Python 2.7 vs Python 3.5
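A short sketch of three differences a newcomer from Python 2 is likely to notice first. The code is Python 3; the comments describe Python 2 behavior for comparison.

```python
print("hello")   # print is a function in Python 3 (a statement in Python 2)

print(7 / 2)     # 3.5 in Python 3; Python 2 gives 3 for two ints
print(7 // 2)    # 3 in both versions (floor division)

text = "café"                # str is Unicode text in Python 3
data = text.encode("utf-8")  # bytes is a separate type
print(len(text), len(data))  # 4 5
```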
Installing
Python References
Python 3.5 documentation -
https://docs.python.org/3.5/
Tutorial (Required)
https://docs.python.org/3.5/tutorial/index.html
The Standard Library
https://docs.python.org/3.5/library/index.html
Dive into Python 3 -
https://cloud.github.com/downloads/diveintomark/diveintopython3/dive-into-python3.pdf
Python – Fluent Python – Clear, Concise and
Effective Programming by Luciano Ramalho,
O’Reilly 2015.
Python interpreter: Expressions
50 - 5*6
(50 - 5*6) / 4
8 / 5 # division always returns a floating point number
17 / 3 # classic division returns a float
17 // 3 # floor division discards the fractional part
17 % 3 # the % operator returns the remainder of the division
5 ** 2 # 5 squared
2 ** 7 # 2 to the power of 7
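The slide lists the expressions without their values. The same expressions as a script, with the value each one produces in Python 3; the divmod() call at the end is an addition that ties // and % together.

```python
print(50 - 5*6)        # 20
print((50 - 5*6) / 4)  # 5.0
print(8 / 5)           # 1.6
print(17 / 3)          # 5.666666666666667
print(17 // 3)         # 5
print(17 % 3)          # 2
print(5 ** 2)          # 25
print(2 ** 7)          # 128

# divmod() returns the floor quotient and the remainder together
quotient, remainder = divmod(17, 3)
print(quotient * 3 + remainder)  # 17
```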
Variables and typing
width = 20
height = 5 * 9
width * height
print(width * height)
# 3.1.2 Strings
'spam eggs' # single quotes
'doesn\'t' # use \' to escape the single quote...
"doesn't" # ...or use double quotes instead
'"Yes," he said.'
"\"Yes,\" he said."
'"Isn\'t," she said.'
print('"Isn\'t," she said.')
s = 'First line.\nSecond line.' # \n means newline
s # without print(), \n is included in the output
print(s) # with print(), \n produces a new line
print('C:\some\name') # here \n means newline!
print(r'C:\some\name') # note the r before the quote
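A small sketch comparing the two ways to write literal backslashes: doubling each one, or using a raw string. Both spellings produce the same string.

```python
escaped = 'C:\\some\\name'  # each \\ is one literal backslash
raw = r'C:\some\name'       # raw string: backslashes are kept as typed

print(escaped == raw)  # True
print(len(raw))        # 12
print(raw)             # C:\some\name
```

Raw strings come up again with regular expressions, where patterns contain many backslashes.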
concatenation
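# 3 times 'un', followed by 'ium'
3 * 'un' + 'ium'
# two adjacent string literals are concatenated automatically
'Py' 'thon'
prefix = 'Py'
prefix 'thon' # SyntaxError: can't concatenate a variable and a string literal
prefix + 'thon' # use + to concatenate a variable and a literal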
word = 'Python'
word[0] # character in position 0
word[5] # character in position 5
word[-1] # last character
word[-2] # second-last character
word[-6]
Slices
word[0:2] # characters from position 0 (included) to 2 (excluded)
word[2:5] # characters from position 2 (included) to 5 (excluded)
word[:2] + word[2:]
word[:4] + word[4:]
word[:2] # characters from the beginning to position 2 (excluded)
word[4:] # characters from position 4 (included) to the end
word[-2:] # characters from the second-last (included) to the end
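The two expressions word[:2] + word[2:] and word[:4] + word[4:] both rebuild the original word. A short check that this holds for any index, including out-of-range ones; the step slices at the end are an addition not shown on the slide.

```python
word = 'Python'

# s[:i] + s[i:] is always equal to s, whatever i is
for i in range(-8, 9):
    assert word[:i] + word[i:] == word

# A third slice value is the step
print(word[::2])   # Pto
print(word[::-1])  # nohtyP
```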
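word[42] # IndexError: the word only has 6 characters
word[4:42] # out-of-range slice indexes are handled gracefully: 'on'
word[42:] # ''
word[0] = 'J' # TypeError: strings are immutable
word[2:] = 'py' # TypeError: strings are immutable
'J' + word[1:] # build a new string instead: 'Jython'
word[:2] + 'py' # 'Pypy'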
s = 'supercalifragilisticexpialidocious'
len(s)
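Some lines on this slide raise exceptions on purpose. A sketch that catches them so the script keeps running, and then builds a new string instead of modifying the old one:

```python
word = 'Python'

try:
    word[42]                  # index past the end of the string
except IndexError as err:
    print('IndexError:', err)

try:
    word[0] = 'J'             # strings cannot be changed in place
except TypeError as err:
    print('TypeError:', err)

# Slices past the end are fine; they just come back shorter or empty
print(repr(word[4:42]))  # 'on'
print(repr(word[42:]))   # ''

new_word = 'J' + word[1:]
print(new_word)          # Jython
```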
squares = [1, 4, 9, 16, 25]
squares
squares[0] # indexing returns the item
squares[-1]
squares[-3:] # slicing returns a new list
squares + [36, 49, 64, 81, 100]
cubes = [1, 8, 27, 65, 125] # something's wrong here
4 ** 3 # the cube of 4 is 64, not 65!
cubes[3] = 64 # replace the wrong value
cubes
cubes.append(216) # add the cube of 6
cubes.append(7 ** 3) # and the cube of 7
cubes
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
letters
# replace some values
letters[2:5] = ['C', 'D', 'E']
letters
# now remove them
letters[2:5] = []
letters
# clear the list by replacing all the elements with an empty list
letters[:] = []
letters
letters = ['a', 'b', 'c', 'd']
len(letters)
Nesting
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]
x
x[0]
x[0][1]
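One point worth seeing in code: the outer list stores references to the inner lists, not copies. A short sketch that repeats the slide's definitions and then changes one of the inner lists:

```python
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]

print(x[0][1])    # b

a.append('d')     # change the inner list through its own name
print(x[0])       # ['a', 'b', 'c', 'd']
print(x[0] is a)  # True: x[0] and a are the same object
```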
# 3.2. First Steps Towards Programming
# Fibonacci series:
# the sum of two elements defines the next
a, b = 0, 1
while b < 10:
    print(b)
    a, b = b, a+b
a, b = 0, 1
while b < 1000:
    print(b, end=',')
    a, b = b, a+b
if Statements
x = int(input("Please enter an integer: "))
if x < 0:
    x = 0
    print('Negative changed to zero')
elif x == 0:
    print('Zero')
elif x == 1:
    print('Single')
else:
    print('More')
# Measure some strings:
words = ['cat', 'window', 'defenestrate']
for w in words:
    print(w, len(w))
for w in words[:]: # Loop over a slice copy of the entire list.
    if len(w) > 6:
        words.insert(0, w)
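The second loop iterates over a copy because it inserts into the list while looping. Looping over words itself would keep finding 'defenestrate' and never finish. A sketch showing the state of the list afterwards:

```python
words = ['cat', 'window', 'defenestrate']

# words[:] is a copy, so inserting into words does not affect the iteration
for w in words[:]:
    if len(w) > 6:
        words.insert(0, w)

print(words)  # ['defenestrate', 'cat', 'window', 'defenestrate']
```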
# 4.3. The range() Function
for i in range(5):
    print(i)
range(5, 10) # 5 through 9
range(0, 10, 3) # 0, 3, 6, 9
range(-10, -100, -30) # -10, -40, -70
a = ['Mary', 'had', 'a', 'little', 'lamb']
for i in range(len(a)):
    print(i, a[i])
print(range(10)) # prints range(0, 10): a range is a lazy sequence, not a list
list(range(5)) # [0, 1, 2, 3, 4]
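Looping over range(len(a)) works, but enumerate() is usually the more convenient way to get the index and the item together. A sketch of the same loop:

```python
a = ['Mary', 'had', 'a', 'little', 'lamb']

# enumerate() yields (index, item) pairs
for i, word in enumerate(a):
    print(i, word)

pairs = list(enumerate(a))
print(pairs[0])   # (0, 'Mary')
print(pairs[-1])  # (4, 'lamb')
```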
Break and continue
for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, 'equals', x, '*', n//x)
            break
    else:
        # loop fell through without finding a factor
        print(n, 'is a prime number')
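The else clause belongs to the inner for loop, not to the if: it runs only when the loop finishes without hitting break. A sketch of the same idea wrapped in a function; the name primes_below is made up for illustration.

```python
def primes_below(limit):
    """Return the primes smaller than limit, using for/else as on the slide."""
    primes = []
    for n in range(2, limit):
        for x in range(2, n):
            if n % x == 0:
                break            # found a factor, so n is not prime
        else:
            primes.append(n)     # no break happened: n is prime
    return primes


print(primes_below(10))  # [2, 3, 5, 7]
```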
Pass
while True:
    pass # Busy-wait for keyboard interrupt (Ctrl+C)
class MyEmptyClass:
    pass
def initlog(*args):
    pass # Remember to implement this!
# 4.6. Defining Functions
def fib(n): # write Fibonacci series up to n
    """Print a Fibonacci series up to n."""
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
    print()
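fib() prints its values and returns None. A variant that returns the series as a list, following the fib2 example in the Python tutorial, is often more useful because the caller can keep working with the result:

```python
def fib2(n):
    """Return a list containing the Fibonacci series up to n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a+b
    return result


series = fib2(100)
print(series)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```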