MODULE 2.1
Web Automation
Azrieli School of Continuing
Studies of the Technion
Learning Objectives
• You will be able to (programmatically) open specific webpages in your web
browser
• You will be able to (programmatically) send HTTP requests with the requests
library and to process the response
• You will be able to parse HTML pages with the beautiful soup (bs4) library
• You will be able to (programmatically) control your web browser - acting as
human as possible - with the selenium library
Opening pages in your
Web Browser
The HTTP Protocol
▪ The Hypertext Transfer Protocol (HTTP) is the set of rules designed to enable
browsers to retrieve web documents from servers over the internet.
o The dominant Application Layer Protocol on the Internet.
o Invented for the Web - to retrieve HTML, images, documents, etc.
o Extended to carry data in addition to documents - RSS, Web Services, etc.
o Basic Concept: Make a Connection - Request a document - Retrieve the Document
- Close the Connection.
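The four steps of the basic concept can be sketched with Python's standard library. This is only an illustration, not part of the original slides: it starts a throwaway local server (an assumption, so that the example needs no internet access; the HelloHandler and fetch_once names are made up) and then connects, requests a document, retrieves it, and closes the connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one tiny HTML document."""

    def do_GET(self):
        body = b"<html><body>Hello, World!</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_once(host, port, path="/"):
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.connect()                  # 1. make a connection
        conn.request("GET", path)       # 2. request a document
        response = conn.getresponse()   # 3. retrieve the document
        return response.status, response.read()
    finally:
        conn.close()                    # 4. close the connection


# Start the local server on a free port (port 0 lets the OS choose one).
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, body = fetch_once("127.0.0.1", server.server_port)
print(status, body.decode())

server.shutdown()
server.server_close()
```

A browser performs the same cycle for every page, image, and script it loads, usually reusing connections where it can.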
The HTTP Protocol
Characteristic | Description
Protocol type | Application layer protocol
Purpose | Used for transferring hypertext (text, images, videos, etc.)
Request / Response | Client sends a request to a server, and the server responds with a response
Methods | GET, POST, PUT, DELETE, HEAD, OPTIONS, CONNECT, TRACE, PATCH
URI (Uniform Resource Identifier) | Identifies the resource (e.g., URL for web addresses)
Headers | Used to convey additional information in the request or response
Status codes | Three-digit codes in the response indicating the outcome of the request (e.g., 200 OK, 404 Not Found)
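Two rows of the table, the URI and the status codes, can be explored with the standard library alone. The sketch below is an illustration rather than part of the original slides; it splits a URL into its parts and looks up the reason phrase for two status codes.

```python
from http import HTTPStatus
from urllib.parse import urlsplit

url = "https://en.wikipedia.org/wiki/%22Hello,_World!%22_program"
parts = urlsplit(url)
print(parts.scheme)   # https
print(parts.netloc)   # en.wikipedia.org
print(parts.path)     # /wiki/%22Hello,_World!%22_program

for code in (200, 404):
    status = HTTPStatus(code)
    print(status.value, status.phrase)   # 200 OK, then 404 Not Found
```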
The webbrowser library
▪ The webbrowser library in Python is a very basic built-in module that provides a
high-level interface for launching web browsers.
o It allows you to open URLs in web browsers programmatically from
within a Python script.
o Provides functions to open URLs in the default web browser, a specific web
browser, or a new web browser window/tab. It can also be used to display
a local HTML file or to open a search URL for a query. It only launches pages:
it cannot read page content or retrieve the current URL from a web browser.
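As a small illustration of opening a search for a query, the sketch below builds a search URL and hands it to webbrowser.open_new_tab. The search site (DuckDuckGo) and the function names build_search_url and open_search are assumptions chosen for this example; the opener parameter exists only so the function can be tried without launching a browser.

```python
import webbrowser
from urllib.parse import urlencode


def build_search_url(query, base="https://duckduckgo.com/"):
    """Return a search URL for the query (the search site is an arbitrary choice)."""
    return base + "?" + urlencode({"q": query})


def open_search(query, opener=webbrowser.open_new_tab):
    """Open a search for the query in a new tab of the default browser."""
    url = build_search_url(query)
    opener(url)
    return url
```

Calling open_search("python webbrowser module") would open the results page in a new tab of the default browser.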
Example…
>>> import webbrowser
>>> websites.ls
{'lms': 'https://lms-iai.cyberpro-israel.org/login/index.php', 'cywaria':
'https://azcybersecuritycenter.cywaria.net/#/login', 'mail':
'https://srv.cert.az:2096/'}
>>> for i in websites.ls.values():
...     webbrowser.open(i)
Can you think of other use cases for
this very basic webbrowser library?
Discussion
Downloading web pages with
the requests library
The requests library
▪ The requests library is a popular Python library for making HTTP requests and
handling responses.
▪ It provides a simple and convenient way to interact with web APIs, send HTTP
requests such as GET, POST, PUT, DELETE, and more, and handle the responses
in a flexible and efficient manner.
▪ The library may need to be installed with pip first (pip install requests)
▪ The requests module was written because Python’s urllib2 module was
considered too complicated to use.
Example…
>>> import requests
>>> url = 'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
>>> result = requests.get(url)
>>> type(result)
<class 'requests.models.Response'>
>>> dir(result)
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__',
'__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__',
'__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close',
'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history',
'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links',
'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
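A few of the attributes listed above are the ones used most often when processing a response. The sketch below is an illustration, not part of the original slides: it serves a tiny page from a throwaway local server (an assumption, so the example runs without internet access; DemoHandler is a made-up name) and then inspects the Response object.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local page, so the sketch runs without internet access."""

    def do_GET(self):
        body = b"<html><head><title>Demo</title></head><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

result = requests.get(url, timeout=5)   # a timeout avoids waiting forever
result.raise_for_status()               # raises requests.HTTPError on 4xx/5xx

print(result.status_code)               # numeric status code
print(result.ok)                        # True when the status code is below 400
print(result.headers["Content-Type"])   # header lookup is case-insensitive
print(result.text[:40])                 # body decoded to str

server.shutdown()
server.server_close()
```

Against a real site, only the url assignment changes; the timeout and the raise_for_status() call are worth keeping.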
Introduction to web scraping
What is Web Scraping?
▪ Web scraping is when a program or script pretends to be a browser and retrieves
web pages, looks at those web pages, extracts information, and then looks at more
web pages.
▪ Search engines scrape web pages - we call this “spidering the web” or “web
crawling”.
Use Cases of Scraping
▪ Pull intelligence data - particularly social data - who links to whom?
▪ Get your own data back out of some system that has no “export capability”
▪ Monitor sites for new information
▪ Spider the web to make a database for a search engine
▪ And many more…
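The "monitor sites for new information" use case can be sketched by storing a fingerprint of a page and comparing it with a later download. This is a minimal illustration under stated assumptions: the two page strings are stand-ins for real downloads, and the fingerprint and has_changed names are made up for this example.

```python
import hashlib


def fingerprint(page_text):
    """Return a SHA-256 digest of the page text."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, page_text):
    """Compare the stored fingerprint with the freshly downloaded page."""
    return fingerprint(page_text) != previous_fingerprint


# Stand-in strings; in practice these would come from two downloads of the same URL.
old_page = "<html><body><p>No news today.</p></body></html>"
new_page = "<html><body><p>New advisory published.</p></body></html>"

stored = fingerprint(old_page)
print(has_changed(stored, old_page))   # False
print(has_changed(stored, new_page))   # True
```

Real pages often contain timestamps or rotating ads, so hashing only the extracted part of interest tends to give fewer false alarms than hashing the whole page.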
What can be a point of concern
about scraping the web?
Discussion
Legal Considerations
▪ It's important to note that web scraping for cybersecurity purposes should
always be conducted in compliance with applicable laws and regulations, and
with proper authorization from website owners or data owners.
▪ Ethical considerations, data privacy, and legal requirements should be carefully
considered and followed when conducting web scraping activities for
cybersecurity purposes.
▪ Republishing copyrighted information is not allowed.
▪ Violating terms of service is not allowed.
http://www.facebook.com/terms.php
Web scraping with bs4
The HTML Format
▪ Hypertext Markup Language (HTML) is the format that web pages are written
in.
▪ Learning resources:
o https://developer.mozilla.org/en-US/learn/html/
o https://htmldog.com/guides/html/beginner/
o https://www.codecademy.com/learn/learn-html
Why is it not a good idea to parse
HTML with the re library?
Discussion
The Beautiful Soup Library
▪ Beautiful Soup (bs4) is a popular Python library used for web scraping, data
extraction, and HTML parsing tasks. It provides a convenient way to navigate
and search through the HTML or XML structure of a web page, and extract the
data you need for further processing or analysis.
▪ It is a much better option than parsing HTML pages with regular expressions
▪ The library may need to be installed with pip first (pip install beautifulsoup4)
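One way to see why regular expressions are a poor fit is nested markup. The sketch below uses a made-up HTML snippet: the regular expression stops at the first closing tag, while Beautiful Soup builds a tree and returns the whole element.

```python
import re

import bs4

html = '<div class="outer"><div class="inner">first</div> second</div>'

# A non-greedy regex stops at the first closing tag, so it cuts the outer
# <div> short and leaks markup into the result.
regex_match = re.search(r'<div class="outer">(.*?)</div>', html)
print(regex_match.group(1))    # <div class="inner">first

# Beautiful Soup builds a tree, so nesting is handled correctly.
soup = bs4.BeautifulSoup(html, "html.parser")
outer = soup.select_one("div.outer")
print(outer.get_text())        # first second
```

Switching the regex to a greedy match fixes this snippet but breaks as soon as a second outer element appears on the page, which is the general problem with pattern matching on a nested format.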
Example…
>>> import requests, bs4
>>> url='https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
>>> result=requests.get(url)
>>> soup=bs4.BeautifulSoup(result.text, 'html.parser')
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> dir(soup)
… snip …
>>> h2_list = soup.select('h2')
>>> str(h2_list[0])
'<h2 class="vector-pinnable-header-label">Contents</h2>'
>>> h2_list[1].getText()
'History[edit]'
Selector Statements in select()
Selector statement | Description
tagname | Selects all occurrences of the specified HTML tag, e.g., select('p') selects all <p> tags.
.class | Selects all occurrences of the specified CSS class, e.g., select('.container') selects all elements with class="container".
#id | Selects the element with the specified HTML id attribute, e.g., select('#header') selects the element with id="header".
tagname.class | Selects all occurrences of the specified HTML tag with the specified CSS class, e.g., select('div.container') selects all <div> tags with class="container".
tagname#id | Selects the element with the specified HTML tag and id attribute, e.g., select('input#username') selects the <input> tag with id="username".
tagname[attr=value] | Selects all occurrences of the specified HTML tag with the specified attribute and value, e.g., select('a[href="https://example.com"]') selects all <a> tags with href="https://example.com".
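Each selector form in the table can be tried against a small document. The HTML below is invented for this illustration; note that select() always returns a list, even for an id selector.

```python
import bs4

html = """
<div id="header" class="container">
  <p>Intro</p>
  <form>
    <input id="username" name="user">
  </form>
  <a href="https://example.com">Example</a>
  <a href="https://example.org">Other</a>
</div>
<div class="footer"><p>Bye</p></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")                            # tagname
containers = soup.select(".container")                   # .class
header = soup.select("#header")                          # #id
div_containers = soup.select("div.container")            # tagname.class
username = soup.select("input#username")                 # tagname#id
links = soup.select('a[href="https://example.com"]')     # tagname[attr=value]

print(len(paragraphs), len(containers), header[0].name)
print(len(div_containers), username[0]["name"], links[0].get_text())
```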
Web automation with
selenium
Web Automation With selenium
▪ The selenium module lets Python directly control the browser by
programmatically clicking links and filling in login information, almost as
though there were a human user interacting with the page.
▪ You can interact with web pages in a much more advanced way than with
requests and bs4; but because it launches a real web browser,
o It is a bit slower and hard to run in the background if, say, you just need to
download some files from the web.
o It can circumvent preventive measures from websites
o It can pass for “human” more easily (e.g., its user-agent)
Controlling The Browser
kali@kali:~$ pip install --user selenium
kali@kali:~$ python3
>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> url = 'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
>>> browser.get(url)
Finding Elements
Method Name | Description
find_element(By.ID, "id") | Finds and returns the first element with the specified id attribute.
find_element(By.NAME, "name") | Finds and returns the first element with the specified name attribute.
find_element(By.CLASS_NAME, "class_name") | Finds and returns the first element with the specified class attribute.
find_element(By.TAG_NAME, "tag_name") | Finds and returns the first element with the specified HTML tag name.
find_element(By.CSS_SELECTOR, "css_selector") | Finds and returns the first element that matches the specified CSS selector.
find_element(By.XPATH, "xpath") | Finds and returns the first element that matches the specified XPath expression.
find_element(By.LINK_TEXT, "link_text") | Finds and returns the first anchor (<a>) element whose visible text matches the specified link_text.
find_element(By.PARTIAL_LINK_TEXT, "partial_link_text") | Finds and returns the first anchor (<a>) element whose visible text partially matches the specified partial_link_text.
find_elements(By.*, "value") | Similar to the above methods, but returns a list of all matching elements instead of just the first one.
The WebElement Object
▪ The finding methods return a WebElement object with the following
properties and methods
Attribute or method | Description
tag_name | The tag name, such as 'a' for an <a> element
get_attribute(name) | The value of the element’s attribute with the given name, such as 'href'
text | The text within the element, such as 'hello' in <span>hello</span>
clear() | For text field or text area elements, clears the text typed into it
is_displayed() | Returns True if the element is visible; otherwise returns False
is_enabled() | For input elements, returns True if the element is enabled; otherwise returns False
is_selected() | For checkbox or radio button elements, returns True if the element is selected; otherwise returns False
location | A dictionary with keys 'x' and 'y' for the position of the element in the page
click() | Clicks on the element.
send_keys() | Sends keystrokes to the element.
30-minute exercise…
▪ Create a script that takes in your username for the LMS from the command line
(use sys.argv)
▪ Then it opens up the LMS in the selenium browser and autofills your username
▪ Then it prompts for your password and autofills your password
▪ Then it ‘clicks’ the login button to log in
▪ Note: never hardcode your password in scripts!
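One possible starting point for the input handling only (the Selenium steps are left to the exercise): the sketch below reads the username from the argument list and asks for the password with getpass, so that it is neither echoed nor stored in the script. The function name read_credentials and the ask_password parameter are made up for this illustration.

```python
import getpass
import sys


def read_credentials(argv, ask_password=getpass.getpass):
    """Return (username, password); the username comes from the command line."""
    if len(argv) != 2:
        raise SystemExit(f"usage: {argv[0]} USERNAME")
    username = argv[1]
    # getpass does not echo the typed password to the terminal.
    password = ask_password(f"Password for {username}: ")
    return username, password


# In the script itself this would be called as:
# username, password = read_credentials(sys.argv)
```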
Other Useful Options
▪ The selenium module can simulate clicks on various browser buttons as well
through the following methods:
o browser.back() - Clicks the Back button.
o browser.forward() - Clicks the Forward button.
o browser.refresh() - Clicks the Refresh/Reload button.
o browser.quit() - Closes the browser (all of its windows) and ends the session.
▪ With the selenium.webdriver.common.keys module you can also send special key
presses (such as Enter or End) to the browser
Project 2 – CVE Reports
Learning Objectives
• You will be able to (programmatically) open specific webpages in your web
browser
• You will be able to (programmatically) send HTTP requests with the requests
library and to process the response
• You will be able to parse HTML pages with the beautiful soup (bs4) library
• You will be able to (programmatically) control your web browser - acting as
human as possible - with the selenium library