16/02/2024, 13:48 regular ex complete notes - Jupyter Notebook
Regular Expressions
Regular expressions are used to match strings of text such as particular characters,
words, or patterns of characters.
A regular expression (re) is a sequence of characters that forms a search pattern.
Regular expressions are a powerful language for matching text patterns.
A regular expression is also known as "regex" or "regexp".
We can match and extract any string pattern from text with the help of re.
Regex is very useful for searching through large texts, emails and documents.
Applications of regex
1. Form validation
Email validation, password validation, phone number validation, and validation of many
other form fields.
2. Bank account details
Every bank has an IFSC code for its different branches that starts with a code identifying
the bank. A credit card number typically consists of 16 digits, and the first few digits
indicate whether the card is Mastercard, Visa or RuPay. Regex is used in all of these cases.
3. Data mining
Data often arrives in unstructured text form and needs to be converted into a structured,
numeric form. Regex plays an important role in analyzing the data and finding patterns in it.
4. Social media platforms
Platforms such as Google, Facebook and Twitter provide several search techniques that
are different from, and more efficient than, a normal search.
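As a minimal sketch of the form-validation and bank-detail use cases above, the checks below use simplified formats that are assumptions made for illustration: an IFSC code is taken to be four capital letters, a zero and six alphanumeric characters, and a Visa card number is taken to be 16 digits starting with 4. The function names are hypothetical, and real validation rules are stricter.

```python
import re

# Assumed, simplified formats (for illustration only).
IFSC_PATTERN = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")
VISA_PATTERN = re.compile(r"4[0-9]{15}")

def is_valid_ifsc(code):
    """Return True if the whole string fits the assumed IFSC layout."""
    return IFSC_PATTERN.fullmatch(code) is not None

def looks_like_visa(number):
    """Return True if the string is 16 digits and starts with 4."""
    return VISA_PATTERN.fullmatch(number) is not None

print(is_valid_ifsc("ABCD0123456"))      # True
print(is_valid_ifsc("ABC0123456"))       # False
print(looks_like_visa("4" + "1" * 15))   # True
```

fullmatch() is used so that extra characters before or after the value cause the check to fail.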
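Match object
start(): gives the position where the match starts
end(): gives the position where the match ends
span(): returns a tuple of the start and end positions of the match
group(): returns the part of the string that matched
string: attribute holding the string that was passed to the function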
Regular expression methods/functions:
findall(), search(), match(), fullmatch(), split()
1. search()
Returns a match object if there is a match. re.search scans the string for the
regular expression pattern and returns the first occurrence; it returns None if nothing matches.
localhost:8888/notebooks/anaconda3/Python/regular ex complete notes.ipynb# 1/13
In [1]: import re
In [2]: string = "Python is high level language"
x = re.search("high",string)
print(x)
if x:
    print("Found")
else:
    print("Not found")
<re.Match object; span=(10, 14), match='high'>
Found
In [ ]: string = "Python is high level language"
x = re.search("high",string)
print(x.start())
print(x.end())
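The cell above has no recorded output. As a small sketch of what the match object exposes for the same search, note that string is an attribute rather than a method, and that group() returns the matched text:

```python
import re

string = "Python is high level language"
m = re.search("high", string)

print(m.start())   # 10
print(m.end())     # 14
print(m.span())    # (10, 14)
print(m.group())   # high
print(m.string)    # Python is high level language
```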
2. split()
Splits the string at each match of the given pattern and returns a list of the pieces.
In [2]: import re
string = "Python is high level language"
x = re.split(r'\s',string)
print(x)
['Python', 'is', 'high', 'level', 'language']
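A short sketch of two further uses of split(): splitting on more than one kind of delimiter, and limiting the number of splits with maxsplit. The record string is a made-up input for illustration.

```python
import re

record = "Python, is;high  level"   # hypothetical input

# Split on a comma, a semicolon or a whitespace character, plus any whitespace after it.
parts = re.split(r"[,;\s]\s*", record)
print(parts)       # ['Python', 'is', 'high', 'level']

# maxsplit limits the number of splits; the remainder stays in the last item.
first_two = re.split(r"\s", "Python is high level language", maxsplit=2)
print(first_two)   # ['Python', 'is', 'high level language']
```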
3. findall()
Returns a list containing all matches.
In [3]: import re
string = "Python is high level language"
x = re.findall('i',string)
print(x)
['i', 'i']
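findall() returns only the matched text. When the positions are needed as well, re.finditer() yields a match object for each occurrence, as in this small sketch:

```python
import re

string = "Python is high level language"

# One match object per occurrence, so start() is available for each.
positions = [m.start() for m in re.finditer("i", string)]
print(positions)   # [7, 11]

# With no match, findall returns an empty list rather than None.
print(re.findall("z", string))   # []
```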
4. match()
Checks for the given pattern only at the beginning of the string. If it matches,
we get a match object; otherwise we get None.
In [4]: import re
string = "Python is high level language"
x = re.match('Python',string)
print(x)
if x:
    print("Found")
else:
    print("Not match")
<re.Match object; span=(0, 6), match='Python'>
Found
5. fullmatch()
Matches the pattern against the whole string. If the complete string matches, this
function returns a match object; otherwise it returns None.
In [5]: import re
string = "Python is high level language"
string1 = "Python"
x = re.fullmatch('Python',string)
y = re.fullmatch('Python',string1)
if x:
    print("Found")
else:
    print("Not match")
if y:
    print("Found")
else:
    print("Not match")
Not match
Found
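To put search(), match() and fullmatch() side by side, the sketch below runs the same three patterns through each function and records whether a match object came back:

```python
import re

string = "Python is high level language"

results = {}
for pattern in ("Python", "high", "Python is high level language"):
    results[pattern] = (
        re.search(pattern, string) is not None,      # anywhere in the string
        re.match(pattern, string) is not None,       # only at the beginning
        re.fullmatch(pattern, string) is not None,   # the entire string
    )
    print(pattern, results[pattern])
```

Only the last pattern succeeds with all three functions, "Python" fails with fullmatch() alone, and "high" succeeds only with search().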
RegEx: Metacharacters
Metacharacters are characters that have a special meaning in a regular expression.
They are part of the regular expression syntax and stand for patterns or formats
rather than for themselves.
^ (Caret) - checks if the string starts with a particular string or character
In [6]: import re
string = "Python is high level language"
x = re.findall("^Python",string)
if x:
    print("Yes, Starts with python")
else:
    print("String not start with python")
Yes, Starts with python
$ (Dollar) - checks if the string ends with a particular string or character
In [7]: import re
string = "Python is high level language"
x = re.findall("e$",string)
if x:
    print("Yes, String ends with 'e'")
else:
    print("String not ends with 'e'")
Yes, String ends with 'e'
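By default ^ and $ anchor to the start and end of the whole string. With the re.MULTILINE flag they also anchor at the start and end of every line. A brief sketch, using a made-up two-line input:

```python
import re

text = "Python is high\nlevel language"   # hypothetical two-line input

print(re.findall(r"^\w+", text))                 # ['Python']
print(re.findall(r"^\w+", text, re.MULTILINE))   # ['Python', 'level']
print(re.findall(r"\w+$", text))                 # ['language']
print(re.findall(r"\w+$", text, re.MULTILINE))   # ['high', 'language']
```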
| (Or) - matches either the pattern on its left or the pattern on its right
In [1]: import re
string = "Python is high level language"
x = re.findall("Python|language",string)
print(x)
if x:
    print("Yes , contains Python | language")
else:
    print("No match")
['Python', 'language']
Yes , contains Python | language
. (Dot) - matches any single character except the newline character (\n)
In [8]: string = "Python is high level language"
x = re.search("a.g",string)
print(x)
<re.Match object; span=(22, 25), match='ang'>
\ (Backslash) - used to remove the special meaning of a metacharacter
In [10]: import re
string = "Python is high level language."
x = re.search(".",string)
y = re.search(r"\.",string)
print(x)
print(x.start())
print(y.start())
<re.Match object; span=(0, 1), match='P'>
0
29
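When the text to search for comes from a variable, the backslashes do not have to be added by hand: re.escape() escapes every metacharacter in a string. A small sketch with a made-up input:

```python
import re

text = "Version 1.5 differs from version 105"   # hypothetical input
literal = "1.5"

# Unescaped, the dot matches any character, so "105" also matches.
print(re.findall(literal, text))              # ['1.5', '105']

# re.escape adds the backslashes needed to treat the text literally.
print(re.escape(literal))                     # 1\.5
print(re.findall(re.escape(literal), text))   # ['1.5']
```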
* (Star) - matches zero or more occurrences of the preceding character
In [11]: import re
string = "Python is high level language."
x = re.findall("le*",string)
print(x)
['le', 'l', 'l']
+ (Plus) - matches one or more occurrences of the preceding character
In [12]: import re
string = "Python is high level language."
x = re.findall("le+",string)
print(x)
['le']
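Both * and + are greedy: they take as many characters as they can. Adding ? after them makes them lazy, so they take as few as possible. A small sketch on a made-up input:

```python
import re

html = "<b>high</b> <i>level</i>"   # hypothetical input

# Greedy: .+ runs to the last '>' in the string.
print(re.findall(r"<.+>", html))    # ['<b>high</b> <i>level</i>']

# Lazy: .+? stops at the first '>' that completes a match.
print(re.findall(r"<.+?>", html))   # ['<b>', '</b>', '<i>', '</i>']
```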
[ ] (brackets) - represent a character class consisting of a set of characters
In [13]: import re
string = "Python is high level language."
x = re.findall("[A-Z]",string)
print(x)
y = re.findall("[^A-Z]",string)
print(y)
['P']
['y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'h', 'i', 'g', 'h',
' ', 'l', 'e', 'v', 'e', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a',
'g', 'e', '.']
{ } (Curly brackets) - {m} matches exactly m occurrences of the preceding character; {m,n} matches between m and n occurrences
In [14]: string = "Python is high level language."
x = re.findall("is{0,10}",string)
print(x)
['is', 'i']
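The different forms of the curly-bracket quantifier can be compared on a date and time string. This is a sketch only; \b (word boundary) is used to keep a two-digit pattern from matching inside the four-digit year.

```python
import re

stamp = "16/02/2024, 13:48"

print(re.findall(r"\b\d{2}\b", stamp))   # ['16', '02', '13', '48']
print(re.findall(r"\b\d{4}\b", stamp))   # ['2024']
print(re.findall(r"\d{3,}", stamp))      # ['2024']
print(re.findall(r"\d{1,2}", stamp))     # ['16', '02', '20', '24', '13', '48']
```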
( ) (Parentheses) - used to group sub-patterns
In [15]: import re
string = "Python is high level language."
x = re.findall("(is)",string)
print(x)
y = re.findall("(high|level)",string)
print(y)
['is']
['high', 'level']
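Groups also capture the text they match, so parts of a match can be read separately. A short sketch with a date and time string:

```python
import re

stamp = "16/02/2024, 13:48"

m = re.search(r"(\d{2})/(\d{2})/(\d{4})", stamp)
print(m.group(0))   # 16/02/2024  (the whole match)
print(m.group(3))   # 2024        (the third group)
print(m.groups())   # ('16', '02', '2024')

# With more than one group, findall returns a list of tuples.
print(re.findall(r"(\d{2}):(\d{2})", stamp))   # [('13', '48')]
```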
\A - Matches only at the beginning of the string
In [9]: import re
string = "Python is high level language."
x = re.findall(r"\APython is",string)
print(x)
['Python is']
\b - Matches at a word boundary, that is, at the beginning or end of a word
In [17]: import re
string = "Python is high level language"
p1 = r'\b' + 'Python' + r'\b'
re.findall(p1,string)
Out[17]: ['Python']
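The effect of \b is easier to see when the pattern also occurs inside longer words. A small sketch on a made-up sentence:

```python
import re

sentence = "This is his island"   # hypothetical input

print(re.findall(r"is", sentence))       # ['is', 'is', 'is', 'is']
print(re.findall(r"\bis\b", sentence))   # ['is']
print(re.findall(r"\bis", sentence))     # ['is', 'is']
```

The last pattern matches the word "is" and the start of "island", because only the left side is required to be a word boundary.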
\d - Matches any decimal digit [0-9]
In [18]: import re
string = "Python is high level language.3"
x = re.findall(r"\d",string)
print(x)
['3']
\D - Matches any non-digit character [^0-9]
In [19]: import re
string = "Python is high level language."
x = re.findall(r"\D",string)
print(x)
['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'h', 'i', 'g',
'h', ' ', 'l', 'e', 'v', 'e', 'l', ' ', 'l', 'a', 'n', 'g', 'u',
'a', 'g', 'e', '.']
\s - Matches any whitespace character
In [20]: import re
string = "Python is high level language."
x = re.findall(r"\s",string)
print(x)
[' ', ' ', ' ', ' ']
\S - Matches any non-whitespace character
In [21]: import re
string = "Python is high level language."
x = re.findall(r"\S",string)
print(x)
['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 'h', 'i', 'g', 'h', 'l',
'e', 'v', 'e', 'l', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '.']
\w - Matches any alphanumeric character or underscore [a-zA-Z0-9_]
In [22]: import re
string = "Python is high level language."
x = re.findall(r"\w",string)
print(x)
['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 'h', 'i', 'g', 'h', 'l',
'e', 'v', 'e', 'l', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
\W - Matches any character that is not alphanumeric or an underscore
In [23]: import re
string = "Python is high level language."
x = re.findall(r"\W",string)
print(x)
[' ', ' ', ' ', ' ', '.']
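The special sequences are usually combined with + to pick out whole numbers or words instead of single characters. A brief sketch on made-up inputs; note that \w also matches the underscore.

```python
import re

text = "user_1 paid 250 rupees"   # hypothetical input

print(re.findall(r"\d+", text))   # ['1', '250']
print(re.findall(r"\w+", text))   # ['user_1', 'paid', '250', 'rupees']
print(re.findall(r"\S+", text))   # ['user_1', 'paid', '250', 'rupees']

# Splitting on \W+ leaves an empty string after the final full stop.
print(re.split(r"\W+", "high-level language."))   # ['high', 'level', 'language', '']
```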
\Z - Matches only at the end of the string
In [2]: import re
string = "Python is high level language."
x = re.findall(r"language\.\Z",string)
print(x)
['language.']
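\Z and $ differ when the string ends with a newline: $ also matches just before a trailing newline, while \Z matches only at the very end. A minimal sketch:

```python
import re

line = "Python is high level language\n"   # note the trailing newline

print(re.search(r"language$", line) is not None)    # True
print(re.search(r"language\Z", line) is not None)   # False
```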
SETS:
[abc] - matches any one of the specified characters
In [4]: string = "Python is high level language."
#Check if the string has any a, l, or n characters:
x = re.findall("[aln]",string)
print(x)
if x:
    print("Yes, there is at least one match!")
else:
    print("No match")
['n', 'l', 'l', 'l', 'a', 'n', 'a']
Yes, there is at least one match!
[a-z] - matches any character in the specified range (here, any lowercase letter
from a to z)
In [5]: string = "Python is high level language."
x = re.findall("[e-h]",string)
print(x)
['h', 'h', 'g', 'h', 'e', 'e', 'g', 'g', 'e']
[A-Z] - any character from uppercase A to uppercase Z
In [6]: string = "Python is high level LAnguage."
x = re.findall("[A-Z]",string)
print(x)
['P', 'L', 'A']
[A-z] - any character from uppercase A to lowercase z; this range also includes the characters [ \ ] ^ _ ` that lie between Z and a in ASCII
In [8]: string = "Python is high level LAnguage."
x = re.findall("[A-z]",string)
print(x)
['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 'h', 'i', 'g', 'h', 'l',
'e', 'v', 'e', 'l', 'L', 'A', 'n', 'g', 'u', 'a', 'g', 'e']
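Because the range A-z follows character codes, it also covers the six punctuation characters between Z and a. The sketch below shows the difference from [A-Za-z] on a made-up input:

```python
import re

text = "a_b^C[1]"   # hypothetical input

print(re.findall(r"[A-z]", text))      # ['a', '_', 'b', '^', 'C', '[', ']']
print(re.findall(r"[A-Za-z]", text))   # ['a', 'b', 'C']
```

To match letters only, [A-Za-z] is the safer choice.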
[^abc] - matches any character except the specified ones
In [9]: string = "Python is high level LAnguage."
x = re.findall("[^a-z]",string)
print(x)
['P', ' ', ' ', ' ', ' ', 'L', 'A', '.']
[1-9] - any digit in the specified range (here, 1 to 9)
In [10]: string = "Python is high level LAnguage12."
x = re.findall("[1-9]",string)
print(x)
['1', '2']
[0-9][1-9] - matches two consecutive digits where the second digit is not 0 (use [0-9][0-9] for every two-digit number from 00 to 99)
In [17]: string = "Python is high level 35 LAnguage 12 ."
x = re.findall("[0-9][1-9]",string)
print(x)
['35', '12']
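The pattern [0-9][1-9] skips numbers such as 10 and 20, because the second set does not contain 0. A small sketch comparing it with [0-9][0-9] on a made-up input:

```python
import re

text = "10 35 07 20"   # hypothetical input

print(re.findall(r"[0-9][1-9]", text))   # ['35', '07']
print(re.findall(r"[0-9][0-9]", text))   # ['10', '35', '07', '20']
print(re.findall(r"\d{2}", text))        # ['10', '35', '07', '20']
```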
What is Web Scraping?
Web scraping refers to the extraction of data from a website. This information is collected
and then exported into a format that is more useful for the user.
Most web scrapers output data to a CSV file or an Excel spreadsheet.
How Do You Scrape Data From A Website?
When you run the code for web scraping, a request is sent to the URL that you have
specified. In response to the request, the server sends the data and allows you to
read the HTML or XML page. The code then parses the HTML or XML page, finds the
data and extracts it. To extract data using web scraping with Python, you need to follow
these basic steps:
1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
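The parse, extract and store steps can be sketched as follows. This is an illustration under stated assumptions, not a full scraper: the HTML snippet, the class name "staff" and the names in it are made up, the download step is left out so the example runs offline, and the standard library's html.parser and csv modules are used as stand-ins for a dedicated parsing or data library.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content. A real scraper would download this from the URL
# (for example with urllib.request.urlopen) instead of hard-coding it.
HTML = """
<html><body>
  <ul>
    <li class="staff">Asha</li>
    <li class="staff">Ravi</li>
    <li class="other">Not staff</li>
  </ul>
</body></html>
"""

class StaffParser(HTMLParser):
    """Collect the text of every <li class="staff"> element."""

    def __init__(self):
        super().__init__()
        self.in_staff = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "li" and ("class", "staff") in attrs:
            self.in_staff = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_staff = False

    def handle_data(self, data):
        if self.in_staff and data.strip():
            self.names.append(data.strip())

# Parse the page and extract the data.
parser = StaffParser()
parser.feed(HTML)
print(parser.names)   # ['Asha', 'Ravi']

# Store the data in the required format (CSV text held in memory here).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name"])
for name in parser.names:
    writer.writerow([name])
csv_text = buffer.getvalue()
print(csv_text)
```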
Libraries used for Web Scraping
Python has various applications, and there are different libraries for different purposes.
In our further demonstration, we will be using the following libraries:
BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents.
It creates parse trees that make it easy to extract the data.
Pandas: Pandas is a library used for data manipulation and analysis.
It is used to store the extracted data in the desired format.
In [36]: import re
text="https://rguktrkv.ac.in/Departments.php?view=CS&staff=TS"
re.findall( r'[A-Za-z0-9]+[.-_]*[A-Za-z0-9]+@[A-Za-z0-9]+(\.[A-Z/a-z]{2,}+)',text)
---------------------------------------------------------------------------
error Traceback (most recent call last)
/tmp/ipykernel_10487/441205905.py in <module>
1 import re
2 text="https://rguktrkv.ac.in/Departments.php?view=CS&staff=TS"
----> 3 re.findall( r'[A-Za-z0-9]+[.-_]*[A-Za-z0-9]+@[A-Za-z0-9]+(\.[A-Z/a-z]{2,}+)',text)

~/anaconda3/lib/python3.9/re.py in findall(pattern, string, flags)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)
242
243 def finditer(pattern, string, flags=0):

~/anaconda3/lib/python3.9/re.py in _compile(pattern, flags)
302 if not sre_compile.isstring(pattern):
303 raise TypeError("first argument must be string or compiled pattern")
--> 304 p = sre_compile.compile(pattern, flags)
305 if not (flags & DEBUG):
306 if len(_cache) >= _MAXCACHE:

~/anaconda3/lib/python3.9/sre_compile.py in compile(p, flags)
786 if isstring(p):
787 pattern = p
--> 788 p = sre_parse.parse(p, flags)
789 else:
790 pattern = None

~/anaconda3/lib/python3.9/sre_parse.py in parse(str, flags, state)
953
954 try:
--> 955 p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
956 except Verbose:
957 # the VERBOSE flag was switched on inside the pattern. to be

~/anaconda3/lib/python3.9/sre_parse.py in _parse_sub(source, state, verbose, nested)
442 start = source.tell()
443 while True:
--> 444 itemsappend(_parse(source, state, verbose, nested + 1,
445 not nested and not items))
446 if not sourcematch("|"):

~/anaconda3/lib/python3.9/sre_parse.py in _parse(source, state, verbose, nested, first)
839 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
840 not (del_flags & SRE_FLAG_VERBOSE))
--> 841 p = _parse_sub(source, state, sub_verbose, nested + 1)
842 if not source.match(")"):
843 raise source.error("missing ), unterminated subpattern",

~/anaconda3/lib/python3.9/sre_parse.py in _parse_sub(source, state, verbose, nested)
442 start = source.tell()
443 while True:
--> 444 itemsappend(_parse(source, state, verbose, nested + 1,
445 not nested and not items))
446 if not sourcematch("|"):

~/anaconda3/lib/python3.9/sre_parse.py in _parse(source, state, verbose, nested, first)
670 source.tell() - here + len(this))
671 if item[0][0] in _REPEATCODES:
--> 672 raise source.error("multiple repeat",
673 source.tell() - here + len(this))
674 if item[0][0] is SUBPATTERN:

error: multiple repeat at position 59
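The "multiple repeat" error comes from {2,}+ in the pattern: the + directly after {2,} stacks a second quantifier on the first, which Python 3.9 rejects (from Python 3.11 the same syntax is read as a possessive quantifier instead). Two further points are worth noting: [.-_] is a range running from '.' to '_' rather than the three characters themselves, and the URL in text contains no '@', so no e-mail address could be found in it in any case. The sketch below is one possible correction, assuming the aim was to extract e-mail addresses; the sample text and address are made up for illustration.

```python
import re

# Hypothetical sample text; the URL in the cell above contains no '@',
# so even a valid e-mail pattern would return an empty list for it.
text = "Write to first.last@example.ac.in or see https://rguktrkv.ac.in/Departments.php"

# (?:...) is a non-capturing group, so findall returns whole matches.
# The hyphen is placed last in [._-] so that it is literal, not a range.
email = re.compile(r"[A-Za-z0-9]+[._-]*[A-Za-z0-9]+@[A-Za-z0-9]+(?:\.[A-Za-z]{2,})+")
found = email.findall(text)
print(found)   # ['first.last@example.ac.in']
```

The repetition is moved outside the group, (?:\.[A-Za-z]{2,})+, so that a domain with several dot-separated parts is matched as a whole.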
In [42]: import re
text="https://rguktrkv.ac.in/Institute.php?view=Director"
re.compile(r'[0-9]{5}+[.-_]+[0-9]{6}',text)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_10487/1963751458.py in <module>
1 import re
2 text="https://rguktrkv.ac.in/Institute.php?view=Director"
----> 3 re.compile(r'[0-9]{5}+[.-_]+[0-9]{6}',text)

~/anaconda3/lib/python3.9/re.py in compile(pattern, flags)
250 def compile(pattern, flags=0):
251 "Compile a regular expression pattern, returning a Pattern object."
--> 252 return _compile(pattern, flags)
253
254 def purge():

~/anaconda3/lib/python3.9/re.py in _compile(pattern, flags)
302 if not sre_compile.isstring(pattern):
303 raise TypeError("first argument must be string or compiled pattern")
--> 304 p = sre_compile.compile(pattern, flags)
305 if not (flags & DEBUG):
306 if len(_cache) >= _MAXCACHE:

~/anaconda3/lib/python3.9/sre_compile.py in compile(p, flags)
786 if isstring(p):
787 pattern = p
--> 788 p = sre_parse.parse(p, flags)
789 else:
790 pattern = None

~/anaconda3/lib/python3.9/sre_parse.py in parse(str, flags, state)
953
954 try:
--> 955 p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
956 except Verbose:
957 # the VERBOSE flag was switched on inside the pattern. to be

TypeError: unsupported operand type(s) for &: 'str' and 'int'
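The TypeError arises because the second parameter of re.compile() is flags, not the text to search, so the string in text was treated as a flag value. The text belongs in a call on the compiled pattern instead. The sketch below shows that usage, assuming the aim was to find five digits, a separator and six digits; the sample text is made up, and the + after {5} is dropped because it would raise "multiple repeat" on Python 3.9.

```python
import re

# Hypothetical sample text containing a 5-digit and a 6-digit number joined by a separator.
text = "Office: 12345-678901"

# compile() takes only the pattern (and optional flags); the text goes to the method call.
number = re.compile(r"[0-9]{5}[._-]+[0-9]{6}")
found = number.findall(text)
print(found)   # ['12345-678901']
```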