Regular Ex Complete Notes - Jupyter Notebook

The document provides comprehensive notes on regular expressions (regex), detailing their purpose in matching text patterns and various applications such as form validation, data mining, and social media searches. It covers regex methods, metacharacters, and practical examples using Python's 're' module for searching, splitting, and matching strings. Additionally, it introduces web scraping, explaining the process of extracting data from websites using libraries like BeautifulSoup and Pandas.

Uploaded by

xiaochuuyan

16/02/2024, 13:48 regular ex complete notes - Jupyter Notebook

Regular Expressions
Regular expressions are used to match strings of text such as particular characters,
words, or patterns of characters.

A regular expression (re) is a sequence of characters that forms a search pattern.

Regular expressions are a powerful language for matching text patterns.

A regular expression is also known as a "regex" or "regexp".

We can match and extract any string pattern from text with the help of re.

Regex is very useful for searching through large texts, emails, and documents.

Applications of regex

1. Form validation

Email validation, password validation, phone number validation, and many other fields
of a form.

2. Bank account details

Every bank has an IFSC code for each of its branches, and the code starts with letters
that identify the bank. A credit card number consists of 16 digits, and the first few
digits indicate whether the card is Mastercard, Visa, or RuPay. Regex is used in all of
these cases.

3. Data mining

Data often arrives in unstructured text form and has to be converted to numbers before
analysis. Regex plays an important role in analyzing such data and finding patterns in it.

4. Social media platforms

Platforms such as Google, Facebook, and Twitter provide several search techniques that
are different from, and more efficient than, a normal search.
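
As a rough illustration of the form-validation and bank-detail use cases, the sketch below checks whole field values with fullmatch(). The three patterns are simplified assumptions made for this example (a 10-digit mobile number, an IFSC-like code of four letters, a zero and six alphanumerics, and a basic email shape); they are not official formats, and the helper name is_valid is hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not official formats).
PHONE = re.compile(r"[6-9][0-9]{9}")        # 10 digits, first digit 6-9
IFSC = re.compile(r"[A-Z]{4}0[A-Z0-9]{6}")  # 4 letters, a zero, 6 alphanumerics
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+")


def is_valid(pattern, value):
    """Return True only if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None


print(is_valid(PHONE, "9876543210"))        # a well-formed number
print(is_valid(PHONE, "98765"))             # too short
print(is_valid(IFSC, "ABCD0123456"))        # made-up code with the right shape
print(is_valid(EMAIL, "user@example.com"))  # made-up address
```

fullmatch() is used rather than search() so that extra characters before or after the value make the check fail.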

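Match object

start(): returns the position where the match begins
end(): returns the position just past the end of the match
span(): returns a tuple of the start and end positions of the match
string: attribute that holds the string passed to the matching function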
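
A short sketch of these match-object members, reusing the sample sentence from the rest of these notes. group() is a further standard method of the match object, added here for illustration; it returns the text that matched.

```python
import re

string = "Python is high level language"
m = re.search("high", string)

print(m.start())  # index where the match begins
print(m.end())    # index just past the last matched character
print(m.span())   # (start, end) as a tuple
print(m.group())  # the text that matched
print(m.string)   # the original string passed to search()
```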

Regular expression methods/functions:

findall(), search(), match(), fullmatch(), split()

1. search()

re.search() scans the whole string for the regular expression pattern and returns a
match object for the first occurrence. If there is no match, it returns None.

localhost:8888/notebooks/anaconda3/Python/regular ex complete notes.ipynb# 1/13



In [1]: import re

In [2]: string = "Python is high level language"


x = re.search("high",string)
print(x)
if x:
    print("Found")
else:
    print("Not found")
<re.Match object; span=(10, 14), match='high'>
Found

In [ ]: string = "Python is high level language"


x = re.search("high",string)
print(x.start())
print(x.end())

2. split() splits the string at each match of the given pattern and returns a list of
the pieces.

In [2]: import re
string = "Python is high level language"
x = re.split(r'\s',string)
print(x)

['Python', 'is', 'high', 'level', 'language']

3. findall() returns a list containing all matches

In [3]: import re
string = "Python is high level language"
x = re.findall('i',string)
print(x)

['i', 'i']

4. match() checks for the pattern only at the beginning of the string. If it matches
there, we get a match object; otherwise we get None.

In [4]: import re
string = "Python is high level language"
x = re.match('Python',string)
print(x)
if x:
    print("Found")
else:
    print("Not match")

<re.Match object; span=(0, 6), match='Python'>


Found
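
The difference between match() and search() only shows up when the pattern occurs later in the string. A small sketch with the same sample sentence:

```python
import re

string = "Python is high level language"

at_start = re.match("high", string)   # looks only at position 0
anywhere = re.search("high", string)  # scans the whole string

print(at_start)  # None, because the string does not begin with "high"
print(anywhere)  # a match object for the first occurrence
```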

5. fullmatch() matches the pattern against the entire string. If the complete string
matches, the function returns a match object; otherwise it returns None.


In [5]: import re
string = "Python is high level language"
string1 = "Python"
x = re.fullmatch('Python',string)
y = re.fullmatch('Python',string1)
if x:
    print("Found")
else:
    print("Not match")
if y:
    print("Found")
else:
    print("Not match")

Not match
Found

RegEx : Metacharacters
Meta characters are special characters that have a special meaning:

Metacharacters are part of regular expression and are the special characters

that symbolize regex patterns or formats.

^ (Caret) - checks if the string starts with a particular string or character

In [6]: import re
string = "Python is high level language"
x = re.findall("^Python",string)
if x:
    print("Yes, Starts with python")
else:
    print("String does not start with python")

Yes, Starts with python

$ (Dollar) - checks if the string ends with a particular string or character

In [7]: import re
string = "Python is high level language"
x = re.findall("e$",string)
if x:
    print("Yes, String ends with 'e'")
else:
    print("String does not end with 'e'")

Yes, String ends with 'e'

| (Or) - check either/or condition


In [1]: import re
string = "Python is high level language"
x = re.findall("Python|language",string)
print(x)
if x:
    print("Yes , contains Python | language")
else:
    print("No match")

['Python', 'language']
Yes , contains Python | language

. (Dot) - matches any single character except the newline character (\n)

In [8]: string = "Python is high level language"


x = re.search("a.g",string)
print(x)

<re.Match object; span=(22, 25), match='ang'>
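
To see the newline exception in action, the sketch below uses a made-up string that contains "\n". re.DOTALL is a standard flag of the re module, introduced here for illustration; it lets the dot match newlines as well.

```python
import re

text = "a\ng and a-g"  # made-up sample containing a newline

default = re.findall(r"a.g", text)            # dot does not match "\n"
dotall = re.findall(r"a.g", text, re.DOTALL)  # with DOTALL it does

print(default)
print(dotall)
```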

\ (Backslash) - removes the special meaning of a metacharacter so that it matches literally

In [10]: import re
string = "Python is high level language."
x = re.search(".",string)
y = re.search(r"\.",string)
print(x)
print(x.start())
print(y.start())

<re.Match object; span=(0, 1), match='P'>


0
29

* (Star) - matches zero or more occurrences of the preceding character

In [11]: import re
string = "Python is high level language."
x = re.findall("le*",string)
print(x)

['le', 'l', 'l']

+ (Plus) - matches one or more occurrences of the preceding character

In [12]: import re
string = "Python is high level language."
x = re.findall("le+",string)
print(x)

['le']

[ ] (brackets) - represent a character class consisting of a set of characters


In [13]: import re
string = "Python is high level language."
x = re.findall("[A-Z]",string)
print(x)
y = re.findall("[^A-Z]",string)
print(y)

['P']
['y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'h', 'i', 'g', 'h',
' ', 'l', 'e', 'v', 'e', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a',
'g', 'e', '.']

{} (Curly brackets) - {m} matches exactly m occurrences of the preceding character; {m,n} matches from m to n occurrences

In [14]: string = "Python is high level language."


x = re.findall("is{0,10}",string)
print(x)

['is', 'i']
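
The cell above uses the range form {0,10}. The sketch below contrasts the exact, ranged and open-ended forms on a made-up string of digit groups; the \b anchors keep each match to a whole group.

```python
import re

text = "1 22 333 4444"  # made-up sample

exact = re.findall(r"\b\d{3}\b", text)      # exactly three digits
ranged = re.findall(r"\b\d{2,3}\b", text)   # two or three digits
at_least = re.findall(r"\b\d{2,}\b", text)  # two or more digits

print(exact)
print(ranged)
print(at_least)
```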

( ) (Parentheses) - used to group sub-patterns

In [15]: import re
string = "Python is high level language."
x = re.findall("(is)",string)
print(x)
y = re.findall("(high|level)",string)
print(y)

['is']
['high', 'level']
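
Groups are also useful with search(), where each parenthesised part can be read back separately. A minimal sketch with the sample sentence; the pattern itself is only an illustration.

```python
import re

string = "Python is high level language."
m = re.search(r"(\w+) level (\w+)", string)

print(m.group(0))  # the whole match
print(m.group(1))  # text captured by the first pair of parentheses
print(m.group(2))  # text captured by the second pair
print(m.groups())  # all captured groups as a tuple
```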

\A - Matches only at the beginning of the string

In [9]: import re
string = "Python is high level language."
x = re.findall(r"\APython is",string)
print(x)

['Python is']

\b - Matches at a word boundary, i.e. at the beginning or end of a word

In [17]: import re
string = "Python is high level language"
p1 = r'\bPython\b'
re.findall(p1,string)
Out[17]: ['Python']
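
The effect of \b is easier to see when the pattern also occurs inside longer words. A small sketch on a made-up sentence:

```python
import re

text = "this is his island"  # made-up sample

anywhere = re.findall(r"is", text)           # every occurrence of "is"
whole_word = re.findall(r"\bis\b", text)     # only the standalone word
word_start = re.findall(r"\bis\w*", text)    # words that begin with "is"

print(len(anywhere))
print(whole_word)
print(word_start)
```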

\d - Matches any decimal digit [0-9]


In [18]: import re
string = "Python is high level language.3"
x = re.findall(r"\d",string)
print(x)

['3']

\D - Matches any non-digit character [^0-9]

In [19]: import re
string = "Python is high level language."
x = re.findall(r"\D",string)
print(x)

['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 'h', 'i', 'g',
'h', ' ', 'l', 'e', 'v', 'e', 'l', ' ', 'l', 'a', 'n', 'g', 'u',
'a', 'g', 'e', '.']

\s - Matches any whitespace character

In [20]: import re
string = "Python is high level language."
x = re.findall(r"\s",string)
print(x)

[' ', ' ', ' ', ' ']

\S - Matches any non-whitespace character

In [21]: import re
string = "Python is high level language."
x = re.findall(r"\S",string)
print(x)

['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 'h', 'i', 'g', 'h', 'l',
'e', 'v', 'e', 'l', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', '.']

\w - Matches any alphanumeric character or the underscore

In [22]: import re
string = "Python is high level language."
x = re.findall(r"\w",string)
print(x)

['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 'h', 'i', 'g', 'h', 'l',
'e', 'v', 'e', 'l', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']

\W - Matches any character that is not alphanumeric or an underscore

In [23]: import re
string = "Python is high level language."
x = re.findall(r"\W",string)
print(x)

[' ', ' ', ' ', ' ', '.']


\Z - Matches only at the end of the string

In [2]: import re
string = "Python is high level language."
x = re.findall(r"language\.\Z",string)
print(x)
['language.']

SETS:
[abc] - matches any one of the specified characters.

In [4]: string = "Python is high level language."


#Check if the string has any a, l, or n characters:
x = re.findall("[aln]",string)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['n', 'l', 'l', 'l', 'a', 'n', 'a']


Yes, there is at least one match!

[a-z] - matches any character in the specified range (here, any lowercase letter
from a to z)

In [5]: string = "Python is high level language."


x = re.findall("[e-h]",string)
print(x)

['h', 'h', 'g', 'h', 'e', 'e', 'g', 'g', 'e']

[A-Z]- any character from UPPER case A to uppercase Z

In [6]: string = "Python is high level LAnguage."


x = re.findall("[A-Z]",string)
print(x)

['P', 'L', 'A']

[A-z] - matches any character from uppercase A to lowercase z in ASCII order; besides the letters, this range also includes [ \ ] ^ _ and `

In [8]: string = "Python is high level LAnguage."


x = re.findall("[A-z]",string)
print(x)
['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 'h', 'i', 'g', 'h', 'l',
'e', 'v', 'e', 'l', 'L', 'A', 'n', 'g', 'u', 'a', 'g', 'e']
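
Because the ASCII codes between "Z" and "a" belong to punctuation, [A-z] accepts more than letters. The sketch below, on a made-up string, compares it with [A-Za-z], which is the usual way to say "letters only".

```python
import re

sample = "a_b^c"  # made-up sample with two punctuation characters

wide = re.findall(r"[A-z]", sample)             # also picks up "_" and "^"
letters_only = re.findall(r"[A-Za-z]", sample)  # letters and nothing else

print(wide)
print(letters_only)
```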


[^abc] - matches any character except the specified ones

In [9]: string = "Python is high level LAnguage."


x = re.findall("[^a-z]",string)
print(x)

['P', ' ', ' ', ' ', ' ', 'L', 'A', '.']

[1-9] - matches any digit in the specified range

In [10]: string = "Python is high level LAnguage12."


x = re.findall("[1-9]",string)
print(x)
['1', '2']

[0-9][1-9] - matches a two-digit sequence whose second digit is not 0 (01-99, skipping 10, 20, ...); use [0-9][0-9] for 00-99

In [17]: string = "Python is high level 35 LAnguage 12 ."


x = re.findall("[0-9][1-9]",string)
print(x)

['35', '12']
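
Since the second set is [1-9], numbers that end in 0 are skipped. A quick sketch on a made-up list of two-digit numbers shows the difference from [0-9][0-9]:

```python
import re

text = "10 25 40 07 99"  # made-up sample

second_digit_nonzero = re.findall(r"[0-9][1-9]", text)
any_two_digits = re.findall(r"[0-9][0-9]", text)

print(second_digit_nonzero)
print(any_two_digits)
```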


What is Web Scraping?


Web scraping refers to the extraction of data from a website. This information is collected
and then exported into a format that is more useful for the user.

Most web scrapers will output data to a CSV or Excel spreadsheet.

How Do You Scrape Data From A Website?

When you run the code for web scraping, a request is sent to the URL that you have
mentioned. In response, the server sends the data and allows you to read the HTML or
XML page. The code then parses the HTML or XML page, finds the data, and extracts it.
To extract data using web scraping with Python, you need to follow these basic steps:

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
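
The parse, extract and store steps can be sketched with the standard library alone. This is only an illustration under several assumptions: html.parser and csv stand in for third-party libraries such as BeautifulSoup and Pandas so that the sketch runs without extra installs, the HTML is a made-up inline string instead of a fetched page, and the class name StaffParser and the "staff" CSS class are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for the HTML a server would return. A real scraper would first
# send a request to the chosen URL and read the response body.
PAGE = """
<html><body>
  <ul>
    <li class="staff">Alice</li>
    <li class="staff">Bob</li>
    <li class="other">ignored</li>
  </ul>
</body></html>
"""


class StaffParser(HTMLParser):
    """Collect the text of every <li class="staff"> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "staff") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.names.append(data.strip())


# Parse the page and extract the data.
parser = StaffParser()
parser.feed(PAGE)

# Store the data in CSV form (in memory here; a file in practice).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name"])
writer.writerows([name] for name in parser.names)
print(buffer.getvalue())
```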

Libraries used for Web Scraping

Python has various applications, and there are different libraries for different purposes.
In our further demonstration, we will be using the following libraries:


BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents.
It creates parse trees that are helpful for extracting the data easily.

Pandas: Pandas is a library used for data manipulation and analysis.
It is used to extract the data and store it in the desired format.


In [36]: import re
text="https://rguktrkv.ac.in/Departments.php?view=CS&staff=TS"
re.findall( r'[A-Za-z0-9]+[.-_]*[A-Za-z0-9]+@[A-Za-z0-9]+(\.[A-Z/a-z]{2,}+)',text)


---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
/tmp/ipykernel_10487/441205905.py in <module>
      1 import re
      2 text="https://rguktrkv.ac.in/Departments.php?view=CS&staff=TS"
----> 3 re.findall( r'[A-Za-z0-9]+[.-_]*[A-Za-z0-9]+@[A-Za-z0-9]+(\.[A-Z/a-z]{2,}+)',text)

~/anaconda3/lib/python3.9/re.py in findall(pattern, string, flags)
    239
    240     Empty matches are included in the result."""
--> 241     return _compile(pattern, flags).findall(string)
    242
    243 def finditer(pattern, string, flags=0):

~/anaconda3/lib/python3.9/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

~/anaconda3/lib/python3.9/sre_compile.py in compile(p, flags)
    786     if isstring(p):
    787         pattern = p
--> 788         p = sre_parse.parse(p, flags)
    789     else:
    790         pattern = None

~/anaconda3/lib/python3.9/sre_parse.py in parse(str, flags, state)
    953
    954     try:
--> 955         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    956     except Verbose:
    957         # the VERBOSE flag was switched on inside the pattern.  to be

~/anaconda3/lib/python3.9/sre_parse.py in _parse_sub(source, state, verbose, nested)
    442     start = source.tell()
    443     while True:
--> 444         itemsappend(_parse(source, state, verbose, nested + 1,
    445                            not nested and not items))
    446         if not sourcematch("|"):

~/anaconda3/lib/python3.9/sre_parse.py in _parse(source, state, verbose, nested, first)
    839                 sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
    840                                not (del_flags & SRE_FLAG_VERBOSE))
--> 841                 p = _parse_sub(source, state, sub_verbose, nested + 1)
    842                 if not source.match(")"):
    843                     raise source.error("missing ), unterminated subpattern",

~/anaconda3/lib/python3.9/sre_parse.py in _parse_sub(source, state, verbose, nested)
    442     start = source.tell()
    443     while True:
--> 444         itemsappend(_parse(source, state, verbose, nested + 1,
    445                            not nested and not items))
    446         if not sourcematch("|"):

~/anaconda3/lib/python3.9/sre_parse.py in _parse(source, state, verbose, nested, first)
    670                                        source.tell() - here + len(this))
    671                 if item[0][0] in _REPEATCODES:
--> 672                     raise source.error("multiple repeat",
    673                                        source.tell() - here + len(this))
    674                 if item[0][0] is SUBPATTERN:

error: multiple repeat at position 59
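
The "multiple repeat" error points at position 59 of the pattern, which is the "+" that directly follows "{2,}": on Python 3.9 a quantifier cannot be followed by a second one. (On Python 3.11 and later, "{2,}+" is read as a possessive quantifier, so the same pattern compiles there.) The pattern also has two quieter problems: "[.-_]" is a range from "." to "_" that covers digits, uppercase letters and "@", and the capturing group makes findall() return only the last domain part. The sketch below is one possible corrected version, not the author's; the sample text and addresses are made up, since the URL in the cell contains no email address.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
text = "Contact: first.last@example.ac.in or admin_01@example.com"

# Changes from the failing pattern (all of them assumptions about the intent):
#   {2,}+              -> {2,}     one quantifier per item
#   [.-_]              -> [._-]    hyphen last, so it is literal, not a range
#   [.-_]*[A-Za-z0-9]+ -> (?:[._-][A-Za-z0-9]+)*   separators between word parts
#   (\. ...)           -> (?:\. ...)+   non-capturing and repeatable, so
#                                       findall() returns whole addresses
pattern = r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*@[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+"

emails = re.findall(pattern, text)
print(emails)
```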


In [42]: import re
text="https://rguktrkv.ac.in/Institute.php?view=Director"
re.compile(r'[0-9]{5}+[.-_]+[0-9]{6}',text)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_10487/1963751458.py in <module>
      1 import re
      2 text="https://rguktrkv.ac.in/Institute.php?view=Director"
----> 3 re.compile(r'[0-9]{5}+[.-_]+[0-9]{6}',text)

~/anaconda3/lib/python3.9/re.py in compile(pattern, flags)
    250 def compile(pattern, flags=0):
    251     "Compile a regular expression pattern, returning a Pattern object."
--> 252     return _compile(pattern, flags)
    253
    254 def purge():

~/anaconda3/lib/python3.9/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

~/anaconda3/lib/python3.9/sre_compile.py in compile(p, flags)
    786     if isstring(p):
    787         pattern = p
--> 788         p = sre_parse.parse(p, flags)
    789     else:
    790         pattern = None

~/anaconda3/lib/python3.9/sre_parse.py in parse(str, flags, state)
    953
    954     try:
--> 955         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    956     except Verbose:
    957         # the VERBOSE flag was switched on inside the pattern.  to be

TypeError: unsupported operand type(s) for &: 'str' and 'int'
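
This TypeError arises because the second argument of re.compile() is flags, not the text to search; the string was treated as a flag value. The usual approach is to compile the pattern first and then call a method such as findall() on the compiled object. Once that is fixed, "{5}+" would raise the same "multiple repeat" error on Python 3.9, so the extra "+" is dropped as well. The sketch below assumes the intent was five digits, one or more separators, then six digits; the sample text and number are made up.

```python
import re

# Hypothetical sample text; the number is made up for illustration.
text = "Office phone: 01234-567890, fax: n/a"

# compile() takes only the pattern (and optional flags).
# [._-] lists the separators explicitly; the original [.-_] is a range.
pattern = re.compile(r"[0-9]{5}[._-]+[0-9]{6}")

numbers = pattern.findall(text)  # the text goes to the method, not to compile()
print(numbers)
```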
