590 Lec 01 Overview

CSCE 590 Web Scraping is a course that covers web scraping techniques using Python 3.5, including an overview of web applications and client-server architecture. The course is taught by Manton Matthews and includes required readings from 'Web Scraping with Python' by Ryan Mitchell. Students will learn about Python programming fundamentals, data collection, and relevant libraries for web scraping.

CSCE 590 Web Scraping

Topics

Overview – Web scraping


Introduction to Python 3.5

Readings:

Chapter 1

January 10, 2017


Course Information
Contact Information:
• Instructor: Manton Matthews
• Office: Swearingen 3A53
• Email: mm <at> sc
• Office phone: 777-3285
• Course location: Sumwalt 305
• Course Time: TR 8:30-9:45
• Office Hours: TR 11:30-1:00PM, others by appointment
Textbook:
Required: Web Scraping with Python – Collecting
Data from the Modern Web by Ryan Mitchell,
O’Reilly, 2015.

–2–
CSCE 590 Web Scraping Spring 2017
Resources - Websites or online texts
Python 3.5 documentation - https://docs.python.org/3.5/
• Tutorial (Required)
https://docs.python.org/3.5/tutorial/index.html
• The Standard Library
https://docs.python.org/3.5/library/index.html
Python - Fluent Python: Clear, Concise, and Effective
Programming by Luciano Ramalho, O’Reilly, 2015.
Natural Language Toolkit - for Python 3.x
http://www.nltk.org/book/ and for Python 2.x
http://www.nltk.org/book_1ed/
Scrapy -
https://doc.scrapy.org/en/latest/intro/tutorial.html
Dive into Python 3 -
https://cloud.github.com/downloads/diveintomark/diveintopython3/dive-into-python3.pdf
Why Scrape?
“Cheapest flight to Boston”
Google knows what is on content pages, but
not the results of queries about specific flights.
A scraper can query all the popular sites with the user's
preferences and optimize the results the way you
would like.

Views of Web Applications
2.1 100,000 Feet: Client-Server Architecture
2.2 50,000 Feet: Communication— HTTP and URIs
2.3 10,000 Feet: Representation— HTML and CSS
2.4 5,000 Feet: 3-Tier Architecture & Horizontal
Scaling
2.5 1,000 Feet: Model-View-Controller Architecture

Fox, Armando; Patterson, David. Engineering
Software as a Service: An Agile Approach Using
Cloud Computing (Kindle Locations 1481-1485).
Strawberry Canyon LLC. Kindle Edition.

2.1 100,000 Feet View: Client-Server
Architecture

HTTP & HTTP/2 (for browsers and servers)
TCP protocol
IP protocol
Ethernet protocol

2.2 50,000 Feet: Communication—
HTTP and URIs
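
A minimal sketch of the URI side of this view, using only the standard library's urllib.parse. The URI below is made up for illustration and does not come from the course material.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URI, chosen only for illustration.
uri = "http://www.example.com:8080/flights/search?from=CAE&to=BOS#results"

parts = urlsplit(uri)
print(parts.scheme)    # protocol used to talk to the server
print(parts.hostname)  # which server to contact
print(parts.port)      # TCP port on that server
print(parts.path)      # resource requested from the server
print(parts.query)     # raw query string
print(parts.fragment)  # handled by the client, not sent to the server

query = parse_qs(parts.query)  # query string parsed into a dict of lists
print(query)
```

A scraper typically builds or rewrites URIs like this one before sending each HTTP request.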

2.3 10,000 Feet: Representation—
HTML and CSS
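
A rough sketch of how a scraper reads an HTML representation, assuming only the standard library's html.parser. The page fragment and the LinkCollector class are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser

# Made-up page fragment, for illustration only.
PAGE = """
<html>
  <head><title>Flights</title></head>
  <body>
    <a class="fare" href="/fare/1">Boston $120</a>
    <a class="fare" href="/fare/2">Boston $95</a>
  </body>
</html>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```

The class attribute on each tag is what a CSS selector such as a.fare would match; here the parser simply ignores it.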

2.5 1,000 Feet: Model-View-
Controller Architecture

Introduction to Python
Python 2.7 vs Python 3.5

Installing

Python References
Python 3.5 documentation -
https://docs.python.org/3.5/
• Tutorial (Required)
https://docs.python.org/3.5/tutorial/index.html
• The Standard Library
https://docs.python.org/3.5/library/index.html
Dive into Python 3 -
https://cloud.github.com/downloads/diveintomark/diveintopython3/dive-into-python3.pdf
Python - Fluent Python: Clear, Concise, and
Effective Programming by Luciano Ramalho,
O’Reilly, 2015.

Python interpreter: Expressions
50 - 5*6
(50 - 5*6) / 4
8 / 5 # division always returns a floating point number
17 / 3 # classic division returns a float
17 // 3 # floor division discards the fractional part
17 % 3 # the % operator returns the remainder of the division
5 ** 2 # 5 squared
2 ** 7 # 2 to the power of 7
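
A short sketch of how floor division and remainder fit together. The variable names are chosen here for illustration.

```python
# Floor division and remainder always satisfy: a == (a // b) * b + a % b
a, b = 17, 3
quotient, remainder = divmod(a, b)   # same as (a // b, a % b)
print(quotient, remainder)
print(quotient * b + remainder == a)

# With a negative operand, // rounds toward negative infinity,
# so the remainder takes the sign of the divisor.
print(-17 // 3, -17 % 3)

# / returns a float even when the result is a whole number
print(type(20 / 4), 20 / 4)
```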

Variables and typing

width = 20
height = 5 * 9
width * height
print(width * height)

# 3.1.2 Strings
'spam eggs' # single quotes
'doesn\'t' # use \' to escape the single quote...
"doesn't" # ...or use double quotes instead
'"Yes," he said.'
"\"Yes,\" he said."
'"Isn\'t," she said.'
print('"Isn\'t," she said.')
s = 'First line.\nSecond line.' # \n means newline
s # without print(), \n is included in the output

print(s) # with print(), \n produces a new line

print('C:\some\name') # here \n means newline!

print(r'C:\some\name') # note the r before the quote
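
A small sketch comparing an escaped string with a raw string, and the interpreter echo with print(). The variable names are chosen here for illustration.

```python
plain = 'C:\\some\\name'   # backslashes escaped by hand
raw = r'C:\some\name'      # raw string: backslashes are kept literally
print(plain == raw)
print(len(raw))

s = 'First line.\nSecond line.'
print(repr(s))   # repr() shows the escape sequence, as the interpreter echo does
print(s)         # print() shows the actual line break
print(len(s.splitlines()))
```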

Concatenation
3 * 'un' + 'ium' # 3 times 'un', followed by 'ium'

'Py' 'thon' # adjacent string literals are concatenated automatically

prefix = 'Py'
prefix 'thon' # SyntaxError: can't concatenate a variable and a string literal
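
A minimal sketch of the working alternative: use + when a variable is involved. The names word, element and text are chosen here for illustration.

```python
prefix = 'Py'

# Adjacent literals are joined by the parser; variables need +.
word = prefix + 'thon'
print(word)

# Repetition binds tighter than +, so this is ('un' * 3) + 'ium'.
element = 3 * 'un' + 'ium'
print(element)

# Adjacent literals are handy for splitting a long string across lines.
text = ('Put several strings within parentheses '
        'to have them joined together.')
print(text)
```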

word = 'Python'
word[0] # character in position 0

word[5] # character in position 5

word[-1] # last character

word[-2] # second-last character

word[-6]

Slices
word[0:2] # characters from position 0 (included) to 2 (excluded)

word[2:5] # characters from position 2 (included) to 5 (excluded)

word[:2] + word[2:]

word[:4] + word[4:]

word[:2] # characters from the beginning to position 2 (excluded)

word[4:] # characters from position 4 (included) to the end

word[-2:] # characters from the second-last (included) to the end

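word[42] # IndexError: the word only has 6 characters
word[4:42] # out-of-range slice bounds are handled gracefully: 'on'
word[42:] # ''
word[0] = 'J' # TypeError: strings are immutable
word[2:] = 'py' # TypeError: strings are immutable
'J' + word[1:] # create a new string instead: 'Jython'
word[:2] + 'py' # 'Pypy'

s = 'supercalifragilisticexpialidocious'
len(s) # 34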
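
A sketch that runs the failing cases safely by catching the exceptions. The name new_word is chosen here for illustration.

```python
word = 'Python'

# Indexing past the end raises IndexError...
try:
    word[42]
except IndexError as err:
    print('IndexError:', err)

# ...but out-of-range slice bounds are clipped silently.
print(repr(word[4:42]), repr(word[42:]))

# Strings are immutable, so item assignment raises TypeError.
try:
    word[0] = 'J'
except TypeError as err:
    print('TypeError:', err)

# Build a new string instead; the original is unchanged.
new_word = 'J' + word[1:]
print(new_word, word)
```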
squares = [1, 4, 9, 16, 25]
squares
squares[0] # indexing returns the item
squares[-1]
squares[-3:] # slicing returns a new list
squares + [36, 49, 64, 81, 100]

cubes = [1, 8, 27, 65, 125] # something's wrong here
4 ** 3 # the cube of 4 is 64, not 65!
cubes[3] = 64 # replace the wrong value
cubes
cubes.append(216) # add the cube of 6
cubes.append(7 ** 3) # and the cube of 7
cubes
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
letters
# replace some values
letters[2:5] = ['C', 'D', 'E']
letters

# now remove them
letters[2:5] = []
letters

# clear the list by replacing all the elements with an empty list
letters[:] = []
letters

letters = ['a', 'b', 'c', 'd']
len(letters)
Nesting
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]
x

x[0]

x[0][1]
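
A sketch of one consequence of nesting: x holds references to a and n, not copies. The use of copy.deepcopy and the name y are additions for illustration.

```python
import copy

a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]

# x[0] is the same object as a, not a copy of it.
print(x[0] is a)

a.append('d')          # a change made through a...
print(x[0])            # ...is visible through x

y = copy.deepcopy(x)   # an independent copy of the nested structure
a.append('e')
print(x[0])
print(y[0])
```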

# 3.2. First Steps Towards Programming
# Fibonacci series:
# the sum of two elements defines the next
a, b = 0, 1
while b < 10:
    print(b)
    a, b = b, a+b

a, b = 0, 1
while b < 1000:
    print(b, end=',')
    a, b = b, a+b

if Statements
x = int(input("Please enter an integer: "))

if x < 0:
    x = 0
    print('Negative changed to zero')
elif x == 0:
    print('Zero')
elif x == 1:
    print('Single')
else:
    print('More')

# Measure some strings:
words = ['cat', 'window', 'defenestrate']
for w in words:
    print(w, len(w))

for w in words[:]: # Loop over a slice copy of the entire list.
    if len(w) > 6:
        words.insert(0, w)
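
A sketch of why the slice copy matters: the loop runs over a snapshot, so inserting into words cannot extend the iteration. Looping over words itself would keep re-inserting 'defenestrate' and never finish. The list-comprehension alternative and the names originals and long_first are additions for illustration.

```python
words = ['cat', 'window', 'defenestrate']

# Iterate over a copy so that inserting into words
# does not change the sequence being looped over.
for w in words[:]:
    if len(w) > 6:
        words.insert(0, w)
print(words)

# Alternative that avoids mutation: build a new list.
originals = ['cat', 'window', 'defenestrate']
long_first = [w for w in originals if len(w) > 6] + originals
print(long_first)
```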

# 4.3. The range() Function

for i in range(5):
    print(i)

range(5, 10) # 5 through 9
range(0, 10, 3) # 0, 3, 6, 9
range(-10, -100, -30) # -10, -40, -70
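
A quick way to check these ranges is to expand them with list(). The large range r at the end is an added illustration.

```python
print(list(range(5)))
print(list(range(5, 10)))
print(list(range(0, 10, 3)))
print(list(range(-10, -100, -30)))

# A range stores only start, stop and step, so it supports
# len() and membership tests without building a list.
r = range(0, 1000000, 3)
print(len(r), 999999 in r, 1000000 in r)
```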

a = ['Mary', 'had', 'a', 'little', 'lamb']
for i in range(len(a)):
    print(i, a[i])

print(range(10))

list(range(5))

Break and continue
for n in range(2, 10):
    for x in range(2, n):
        if n % x == 0:
            print(n, 'equals', x, '*', n//x)
            break
    else:
        # loop fell through without finding a factor
        print(n, 'is a prime number')
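
The same for...else pattern, sketched as a function that returns its result instead of printing. The name primes_below is hypothetical.

```python
def primes_below(limit):
    """Return the primes smaller than limit, using a for...else loop."""
    primes = []
    for n in range(2, limit):
        for x in range(2, n):
            if n % x == 0:
                break          # found a factor, so n is not prime
        else:
            # runs only when the inner loop finished without break
            primes.append(n)
    return primes

print(primes_below(10))
```

Note that the else lines up with the inner for, not with the if.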

Pass
while True:
    pass # Busy-wait for keyboard interrupt (Ctrl+C)

class MyEmptyClass:
    pass

def initlog(*args):
    pass # Remember to implement this!

# 4.6. Defining Functions
def fib(n): # write Fibonacci series up to n
    """Print a Fibonacci series up to n."""
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
    print()
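
A sketch of a variant that returns the series as a list instead of printing it, which is usually more useful when other code needs the values. The names fib2 and f100 are chosen here for illustration.

```python
def fib2(n):
    """Return a list containing the Fibonacci series up to n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)
        a, b = b, a + b
    return result

f100 = fib2(100)
print(f100)
```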
