Module 2.1 - Web Automation

Module 2.1 focuses on web automation, teaching how to programmatically open web pages, send HTTP requests, parse HTML, and control web browsers using libraries like requests, Beautiful Soup, and Selenium. It covers the HTTP protocol, web scraping techniques, and legal considerations for scraping data. The module emphasizes the importance of ethical practices and compliance with laws while engaging in web scraping activities.


MODULE 2.1
Web Automation

Azrieli School of Continuing Studies of the Technion
Learning Objectives
• You will be able to (programmatically) open specific webpages in your web
browser
• You will be able to (programmatically) send HTTP requests with the requests
library and to process the response
• You will be able to parse HTML pages with the beautiful soup (bs4) library
• You will be able to (programmatically) control your web browser - acting as
human as possible - with the selenium library
Opening pages in your
Web Browser

The HTTP Protocol
▪ The Hypertext Transfer Protocol is the set of rules designed to enable
browsers to retrieve web documents from servers over the internet.
o The dominant Application Layer protocol on the Internet.

o Invented for the Web - to retrieve HTML, images, documents, etc.

o Extended to retrieve data in addition to documents - RSS, Web Services, etc.

o Basic concept: Make a Connection - Request a document - Retrieve the Document
- Close the Connection.

The HTTP Protocol
Characteristic | Description
Protocol type | Application layer protocol
Purpose | Used for transferring hypertext (text, images, videos, etc.)
Request/Response | Client sends a request to a server, and the server responds with a response
Methods | GET, POST, PUT, DELETE, HEAD, OPTIONS, CONNECT, TRACE, PATCH
URI (Uniform Resource Identifier) | Identifies the resource (e.g., URL for web addresses)
Headers | Used to convey additional information in the request or response
Status codes | Three-digit codes in the response indicating the outcome of the request (e.g., 200 OK, 404 Not Found)
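The status-code classes in the table can be explored with Python's built-in http module, which knows the standard reason phrase for every registered code. A minimal sketch (the describe_status helper is our own illustration, not part of the stdlib):

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a code's standard reason phrase and its class (2xx, 4xx, ...)."""
    category = {1: "informational", 2: "success", 3: "redirection",
                4: "client error", 5: "server error"}.get(code // 100, "unknown")
    try:
        reason = HTTPStatus(code).phrase  # e.g. "OK", "Not Found"
    except ValueError:
        reason = "Unknown"
    return f"{code} {reason} ({category})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
print(describe_status(503))  # 503 Service Unavailable (server error)
```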

The webbrowser library
▪ The webbrowser library in Python is a very basic built-in module that provides a
high-level interface for working with web browsers.

o It allows you to open URLs in web browsers, control the behavior of web
browsers, and perform basic web-related tasks programmatically from
within a Python script.

o Provides functions to open URLs in the default web browser, a specific web
browser, or a new web browser window/tab. It can also be used to display
HTML content, search for a query in a web browser, and retrieve the current
URL from a web browser.

Example…
>>> import webbrowser
>>> websites = {'lms': 'https://lms-iai.cyberpro-israel.org/login/index.php',
...             'cywaria': 'https://azcybersecuritycenter.cywaria.net/#/login',
...             'mail': 'https://srv.cert.az:2096/'}
>>> for url in websites.values():
...     webbrowser.open(url)

Can you think of other use cases for
this very basic webbrowser library?

Discussion

Downloading web pages with
the requests library

The requests library
▪ The requests library is a popular Python library for making HTTP requests and
handling responses.

▪ It provides a simple and convenient way to interact with web APIs, send HTTP
requests such as GET, POST, PUT, DELETE, and more, and handle the responses
in a flexible and efficient manner.

▪ The library may need to be installed with pip first.

▪ The requests module was written because Python's urllib2 module is
considered too complicated to use.

Example…
>>> import requests
>>> url = 'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
>>> result = requests.get(url)
>>> type(result)
<class 'requests.models.Response'>
>>> dir(result)
['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__',
'__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__',
'__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close',
'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history',
'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links',
'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']

Introduction to web scraping

What is Web Scraping?
▪ When a program or script pretends to be a browser and retrieves web pages,
looks at those web pages, extracts information, and then looks at more web
pages.

▪ Search engines scrape web pages - we call this "spidering the web" or "web
crawling".

Use Cases of Scraping
▪ Pull intelligence data - particularly social data - who links to who?

▪ Get your own data back out of some system that has no "export capability"

▪ Monitor sites for new information

▪ Spider the web to make a database for a search engine

▪ And many more…

What can be a point of concern
about scraping the web?

Discussion

Legal Considerations
▪ It's important to note that web scraping for cybersecurity purposes should
always be conducted in compliance with applicable laws and regulations, and
with proper authorization from website owners or data owners.

▪ Ethical considerations, data privacy, and legal requirements should be carefully
considered and followed when conducting web scraping activities for
cybersecurity purposes.

▪ Republishing copyrighted information is not allowed.

▪ Violating terms of service is not allowed.

http://www.facebook.com/terms.php
Web scraping with bs4

The HTML Format
▪ Hypertext Markup Language (HTML) is the format that web pages are written
in.

▪ Learning resources:
o https://developer.mozilla.org/en-US/learn/html/

o https://htmldog.com/guides/html/beginner/

o https://www.codecademy.com/learn/learn-html

Why is it not a good idea to parse
html with the re library?

Discussion

The Beautiful Soup Library
▪ Beautiful Soup (bs4) is a popular Python library used for web scraping, data
extraction, and HTML parsing tasks. It provides a convenient way to navigate
and search through the HTML or XML structure of a web page, and extract the
data you need for further processing or analysis.

▪ It is a much better option than parsing HTML pages with regular expressions.

▪ The library may need to be installed with pip first.

Example…
>>> import requests, bs4
>>> url='https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
>>> result=requests.get(url)
>>> soup=bs4.BeautifulSoup(result.text, 'html.parser')
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> dir(soup)
… snip …
>>> h2_list = soup.select('h2')
>>> str(h2_list[0])
'<h2 class="vector-pinnable-header-label">Contents</h2>'
>>> h2_list[1].getText()
'History[edit]'

Selector Statements In select()
Selector statement | Description
tagname | Selects all occurrences of the specified HTML tag, e.g., select('p') selects all <p> tags.
.class | Selects all occurrences of the specified CSS class, e.g., select('.container') selects all elements with class="container".
#id | Selects the element with the specified HTML id attribute, e.g., select('#header') selects the element with id="header".
tagname.class | Selects all occurrences of the specified HTML tag with the specified CSS class, e.g., select('div.container') selects all <div> tags with class="container".
tagname#id | Selects the element with the specified HTML tag and id attribute, e.g., select('input#username') selects the <input> tag with id="username".
tagname[attr=value] | Selects all occurrences of the specified HTML tag with the specified attribute and value, e.g., select('a[href="https://example.com"]') selects all <a> tags with href="https://example.com"
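Each selector form in the table can be exercised on a small inline page, with no network access. A sketch (the HTML snippet and its ids/classes are made up for illustration):

```python
import bs4

# A tiny inline page containing one example of each selector target.
html = """
<div class="container" id="header">
  <p>first</p>
  <p class="note">second</p>
  <a href="https://example.com">link</a>
  <input id="username" name="user">
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(soup.select('p')))                      # 2       - by tag
print(soup.select('.note')[0].getText())          # second  - by class
print(soup.select('#header')[0].name)             # div     - by id
print(soup.select('p.note')[0].getText())         # second  - tag + class
print(soup.select('input#username')[0]['name'])   # user    - tag + id
print(len(soup.select('a[href="https://example.com"]')))  # 1 - by attribute value
```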

Web automation with
selenium

Web Automation With selenium
▪ The selenium module lets Python directly control the browser by
programmatically clicking links and filling in login information, almost as
though there were a human user interacting with the page.

▪ You can interact with web pages in a much more advanced way than with
requests and bs4; but because it launches a real web browser:
o It is a bit slower and hard to run in the background if, say, you just need to
download some files from the web.

o It can circumvent preventive measures from websites.

o It can pass for "human" more easily (e.g., it presents a real browser's
user-agent).

Controlling The Browser
kali@kali:~$ pip install --user selenium
kali@kali:~$ python3

>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> browser = webdriver.Firefox()
>>> type(browser)
<class 'selenium.webdriver.firefox.webdriver.WebDriver'>
>>> url = 'https://en.wikipedia.org/wiki/%22Hello,_World!%22_program'
>>> browser.get(url)

Finding Elements
Method Name | Description
find_element(By.ID, "id") | Finds and returns the first element with the specified id attribute.
find_element(By.NAME, "name") | Finds and returns the first element with the specified name attribute.
find_element(By.CLASS_NAME, "class_name") | Finds and returns the first element with the specified class attribute.
find_element(By.TAG_NAME, "tag_name") | Finds and returns the first element with the specified HTML tag name.
find_element(By.CSS_SELECTOR, "css_selector") | Finds and returns the first element that matches the specified CSS selector.
find_element(By.XPATH, "xpath") | Finds and returns the first element that matches the specified XPath expression.
find_element(By.LINK_TEXT, "link_text") | Finds and returns the first anchor (<a>) element whose visible text matches the specified link_text.
find_element(By.PARTIAL_LINK_TEXT, "partial_link_text") | Finds and returns the first anchor (<a>) element whose visible text partially matches the specified partial_link_text.
find_elements(By.*, "value") | Similar to the above methods, but returns a list of all matching elements instead of just the first one.

The WebElement Object
▪ The finding methods return a WebElement object with the following
properties and methods
Attribute or method Description
tag_name The tag name, such as 'a' for an <a> element

get_attribute(name) The value for the element’s name attribute

text The text within the element, such as 'hello' in <span>hello </span>

clear() For text field or text area elements, clears the text typed into it

is_displayed() Returns True if the element is visible; otherwise returns False

is_enabled() For input elements, returns True if the element is enabled; otherwise returns False

is_selected() For checkbox or radio button elements, returns True if the element is selected; otherwise
returns False
location A dictionary with keys 'x' and 'y' for the position of the element in the page
click() Clicks on the element.
send_keys() Sends keystrokes to the element.

30 minutes exercise…
▪ Create a script that takes in your username for the LMS from the command line
(use sys.argv)

▪ Then it opens up the LMS in the selenium browser and autofills your username

▪ Then it prompts for your password and autofills your password

▪ Then it 'clicks' the login button to log in

▪ Note: never hardcode your password in scripts!

Other Useful Options
▪ The selenium module can simulate clicks on various browser buttons as well
through the following methods:

o browser.back() - Clicks the Back button.

o browser.forward() - Clicks the Forward button.

o browser.refresh() - Clicks the Refresh/Reload button.

o browser.quit() - Clicks the Close Window button.

▪ With the selenium.webdriver.common.keys module you can also send key
presses to the browser

Project 2 – CVE Reports

Learning Objectives
• You will be able to (programmatically) open specific webpages in your web
browser
• You will be able to (programmatically) send HTTP requests with the requests
library and to process the response
• You will be able to parse HTML pages with the beautiful soup (bs4) library
• You will be able to (programmatically) control your web browser - acting as
human as possible - with the selenium library
