Data Engineering With Python

Data Engineering with python.

Uploaded by

a66185154
Copyright
© © All Rights Reserved

Data Sets:

A data set is a collection of data organized in a tabular format, where
each column represents a specific variable and each row represents a record.
Datasets group a large number of data points into one table.
They are used in almost all industries today, most often in fields like
machine learning, business, and government, because they provide the raw
material for analysis, insights, and decision making.
For example, Kaggle and GitHub host many public datasets.
Datasets can be stored in multiple formats. The most common ones are
CSV, Excel, and JSON, with zip archives used for large datasets such as
image datasets.

DATA ANALYSIS SEQUENCE

The different types of analysis available in data science are as
follows:
1. Descriptive Analysis
2. Exploratory Analysis
3. Inferential Analysis
4. Predictive Analysis
5. Causal Analysis
6. Mechanistic Analysis

Descriptive Analysis: The data set is described by reporting its aggregate
measures, often in a visual form. No matter what you do next, you have to
at least describe the data.

Descriptive analysis is a fundamental aspect of data analysis that
focuses on summarizing and interpreting the characteristics of a
dataset. It involves analyzing historical and present data to describe
what has happened in various contexts, such as business or research.
Key points include:
• Purpose: To provide a clear overview of data by employing statistical
techniques to describe its main features.
• Methods: It uses various statistical measures, visualizations, and
techniques to summarize data and identify trends and relationships.
• Outcome: Unlike predictive analytics, it does not forecast future
outcomes but helps in understanding past events, patterns, and trends.
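
As a small illustration of the aggregate measures mentioned above, the sketch below summarizes a made-up list of numbers with Python's standard statistics module. The variable names and the sample values are assumptions chosen for the example, not data from any real source.

```python
import statistics

# Hypothetical sample: daily order counts for one week (made-up numbers).
daily_orders = [12, 15, 11, 20, 15, 9, 16]

# Aggregate measures that describe the sample without predicting anything.
summary = {
    "count": len(daily_orders),
    "min": min(daily_orders),
    "max": max(daily_orders),
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
    "mode": statistics.mode(daily_orders),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In practice the same summary is often produced with a data-frame library and shown as a chart, but the idea is the same: report what the data looks like before doing anything else.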

Exploratory Analysis: Trying to find new relationships between
existing variables. It is an important step in data science, as it
involves visualizing data to understand its main features, find patterns,
and discover how different parts of the data are connected.
Inferential Analysis: If you have a small data sample and would like to
describe a bigger population, statistics-based inferential analysis is
right for you. It is a branch of statistics that allows us to make
generalizations and predictions about a population based on
sample data.
Predictive Analysis: A predictive analyst learns from the past to
predict the future.
Causal Analysis: Causal analysis identifies variables that affect each
other.
Mechanistic Analysis: Mechanistic data analysis explores exactly how
one variable affects another variable.

DATA ACQUISITION PIPELINE


Data acquisition is all about obtaining the artifacts that contain the input
data from a variety of sources, extracting the data from the artifacts, and
converting it into representations suitable for further processing, as shown
in the following figure.
1. The three main sources of data are the Internet (namely, the World
Wide Web), databases, and local files (possibly previously
downloaded by hand or using additional software).
2. Some of the local files may have been produced by other Python
programs and contain serialized or “pickled” data.
3. The input data may be unstructured plain text in a natural language
(such as English or Chinese).
4. It may also be structured data, including:
- Tabular data in comma-separated values (CSV) files
- Tabular data from databases
- Tagged data in HyperText Markup Language (HTML) or, in general, in
eXtensible Markup Language (XML)
- Tagged data in JavaScript Object Notation (JSON)
5. Pipeline automation naturally leads to reproducible code: a set of
Python scripts that anyone can execute to convert the original raw
data into the final results as described in the report, ideally without
any additional human interaction.
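
The extraction step for a few of the formats listed above can be sketched with the standard library alone. The inputs below are made-up strings held in memory so that the example is self-contained; in a real pipeline they would come from files, databases, or the Web.

```python
import csv
import io
import json
import pickle

# Hypothetical raw inputs (illustrative values only).
csv_text = "city,population\nLagos,15000000\nLima,10000000\n"
json_text = '{"source": "survey", "year": 2020}'

# Tabular data: csv.DictReader yields one dict per row, keyed by the header.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Tagged data: json.loads converts JSON text into Python objects.
metadata = json.loads(json_text)

# Serialized ("pickled") data, as another Python program might have saved it.
# Only unpickle data from sources you trust.
pickled = pickle.dumps(rows)
restored = pickle.loads(pickled)

print(rows[0]["city"], metadata["year"], restored == rows)
```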

REPORT STRUCTURE
The project report is what we (data scientists) submit to the data
sponsor (the customer). The report typically includes the following:
• Abstract (a brief and accessible description of the project)
• Introduction
• Methods that were used for data acquisition and processing
• Results that were obtained (do not include intermediate and
insignificant results in this section; rather, put them into an appendix)
• Conclusion
• Appendix

Files:
1. A file is the common storage unit in a computer, and all programs
and data are “written” into a file and “read” from a file.
2. A file extension, sometimes called a file suffix or a filename
extension, is the character or group of characters after the period that
makes up an entire file name.
3. For example, the file assignments.docx ends in docx, a file extension
that is associated with Microsoft Word on your computer.
4. Any file’s extensions can be renamed, but that will not convert the file
to another format or change anything about the file other than this portion
of its name.
5. Some file extensions are classified as executable, meaning that when
clicked, the files do not just open for viewing or playing; they actually do
something by themselves, like install a program, start a process, or run a
script.
6. All the data on your hard drive consists of files and directories. The
fundamental difference between the two is that files store data, while
directories store files and other directories.
7. Files can range from a few bytes to several gigabytes. Directories are
used to organize files on your computer.
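
The point about extensions can be checked in Python. The sketch below uses pathlib.PurePath, which works on names only and never touches the disk; the file name is the one from the example above.

```python
from pathlib import PurePath

name = PurePath("assignments.docx")
print(name.suffix)   # the extension, including the period
print(name.stem)     # the base name without the extension

# Changing the suffix changes only the name; the bytes stored in a real
# file would stay exactly as they were.
renamed = name.with_suffix(".txt")
print(renamed.name)
```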

Types of Files:
1. Python supports two types of files – text files and binary files.
2. While both binary and text files contain data stored as a series of bits
(binary values of 1s and 0s), the bits in text files represent characters, while
the bits in binary files represent custom data.
3. Binary files typically contain a sequence of bytes or ordered
groupings of eight bits.
4. Binary file formats may include multiple types of data in the same file,
such as image, video, and audio data.
5. This data can be interpreted by supporting programs but will show up
as garbled text in a text editor.

6. When an image file is opened in a text editor, the binary data is
converted to unrecognizable text.
7. However, you may notice that some of the text is readable. This is
because the JPG format includes small sections for storing textual data.
8. Text files are more restrictive than binary files since they can only
contain textual data.
9. However, unlike binary files, they are less likely to become corrupted.
10. An End-of-File (EOF) marker is placed after the final character, which
signals the end of the file.
11. Text files include a character encoding scheme that determines how
the characters are interpreted and what characters can be displayed.
12. Common text editors include Microsoft Notepad and WordPad, which
are bundled with Windows, and Apple TextEdit, which is included with Mac
OS X.
Common extensions for binary file formats:
• Images: jpg, png, gif, bmp, tiff, psd,...
• Videos: mp4, mkv, avi, mov, mpg, vob,...
• Audio: mp3, aac, wav, flac, ogg, mka, wma,...
• Documents: pdf, doc, xls, ppt, docx, odt,...
• Archive: zip, rar, 7z, tar, iso,...
• Database: mdb, accde, frm, sqlite,...
• Executable: exe, dll, so, class,...
Common extensions for text file formats:
• Web standards: html, xml, css, svg, json,...
• Source code: c, cpp, h, cs, js, py, java, rb, pl, php, sh,...
• Documents: txt, tex, markdown, asciidoc, rtf, ps,...
• Configuration: ini, cfg, rc, reg,...
• Tabular data: csv, tsv,...

The following fundamental rules enable applications to create and process
valid names for files and directories in both Windows and Linux operating
systems, unless explicitly specified:
• Use a period to separate the base file name from the extension in the file
name.
• In Windows use a backslash (\) and in Linux use a forward slash (/) to
separate the components of a path. The backslash (or forward slash)
separates one directory name from another directory name in a path, and it
also divides the file name from the path leading to it. Backslash (\) and
forward slash (/) are reserved characters, and you cannot use them in the
name of the actual file or directory.
• Do not assume case sensitivity. File and Directory names in Windows are
not case sensitive while in Linux it is case sensitive. For example, the
directory names ORANGE, Orange, and orange are the same in Windows
but are different in Linux Operating System.
• In Windows, volume designators (drive letters) are case-insensitive. For
example, "D:\" and "d:\" refer to the same drive.
• The reserved characters that should not be used in naming files and
directories are < (less than), > (greater than), : (colon), " (double quote), /
(forward slash), \ (backslash), | (vertical bar or pipe), ? (question mark) and
* (asterisk). Windows rejects all of these; of these, Linux itself rejects
only the forward slash, but avoiding the full set keeps names portable.
• In the Windows operating system, reserved words like CON, PRN, AUX, NUL,
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1,
LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9 should not be used
to name files and directories.
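
The naming rules above can be turned into a small check. The function below is a hypothetical helper written for this illustration (it is not part of any library), and it covers only the reserved characters and reserved words listed here. Treating a reserved word followed by an extension (such as NUL.txt) as invalid is an extra assumption that follows common Windows guidance.

```python
# Reserved characters and words, as listed in the rules above.
RESERVED_CHARS = set('<>:"/\\|?*')
RESERVED_NAMES = {"CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{n}" for prefix in ("COM", "LPT") for n in range(1, 10)
}


def is_valid_windows_name(name: str) -> bool:
    """Return True if name avoids the reserved characters and words."""
    if not name or any(ch in RESERVED_CHARS for ch in name):
        return False
    # Reserved words are compared without regard to case or extension.
    base = name.split(".")[0].upper()
    return base not in RESERVED_NAMES


for candidate in ["bison.txt", "con", "a?b.txt", "LPT9.log", "COM10"]:
    print(candidate, is_valid_windows_name(candidate))
```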
File Paths:
A path is said to be a fully qualified path if it points to the file location, which
always contains the root and the complete directory list.
Examples of a fully qualified path are given below.
• "C:\langur.txt" refers to a file named "langur.txt" under root directory C:\.
• "C:\fauna\bison.txt" refers to a file named "bison.txt" in a subdirectory
fauna under root directory C:\.
A path is also said to be a relative path if it contains “double-dots”; that is,
two consecutive periods together as one of the directory components in a
path or “single-dot”; that is, one period as one of the directory components
in a path.
Examples of the relative path are given below:
• "..\langur.txt" specifies a file named "langur.txt" located in the parent of the
current directory fauna.
• ".\bison.txt" specifies a file named "bison.txt" located in a current directory
named fauna.
• "..\..\langur.txt" specifies a file that is two directories above the current
directory india
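
The difference between fully qualified and relative paths can be tried out with the standard ntpath module, which applies Windows path rules on any operating system. The current directory C:\fauna is an assumption taken from the examples above.

```python
import ntpath  # Windows path rules, usable on any platform

# A fully qualified path contains the root; a relative path does not.
is_abs_full = ntpath.isabs(r"C:\fauna\bison.txt")
is_abs_rel = ntpath.isabs(r"..\langur.txt")
print(is_abs_full, is_abs_rel)

# Resolve relative paths against an assumed current directory.
current_dir = r"C:\fauna"
parent_file = ntpath.normpath(ntpath.join(current_dir, r"..\langur.txt"))
same_dir_file = ntpath.normpath(ntpath.join(current_dir, r".\bison.txt"))
print(parent_file)
print(same_dir_file)
```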

Creating and reading text data:


Text files:
Files are not very useful unless you can access the information they contain. A file must
be opened before it can be read from or written to, using Python's built-in open()
function. When a file is opened using the open() function, it returns a file object, called a
file handler, that provides methods for accessing the file.

The open() function is commonly used with two arguments. The first argument is a string
containing the name of the file to be opened, which can be absolute or relative to the
current working directory. The second argument is a string describing the mode in which
the file is used, such as 'r' for reading or 'w' for writing; if it is omitted, the file is opened
for reading.
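
A minimal sketch of open() with both arguments follows. It works in a temporary directory so that no existing files are touched; the file name moon.txt and its content are placeholders for the example.

```python
import os
import tempfile

# Work in a temporary directory so no real files are overwritten.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "moon.txt")

file_handler = open(path, "w")    # "w" creates the file for writing
file_handler.write("The Moon orbits the Earth.\n")
file_handler.close()

file_handler = open(path, "r")    # "r" opens the existing file for reading
content = file_handler.read()
file_handler.close()

print(content, end="")
```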

File close() method:


Open files consume system resources, and, depending on the file mode, other programs
may not be able to access them. It is important to close the file once the processing
is completed. After the file handler object is closed, you cannot read from or write to
the file. Any attempt to use the file handler object after it has been closed results in an error.
The syntax for the close() method is,
file_handler.close()

For example,
>>> file_handler = open("moon.txt", "r")
>>> file_handler.close()
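
The error mentioned above can be observed directly. In this sketch, which again uses a temporary file as a stand-in, writing to a closed file handler raises ValueError.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "moon.txt")

file_handler = open(path, "w")
file_handler.write("hello\n")
file_handler.close()
print(file_handler.closed)    # the handler reports that it is closed

try:
    file_handler.write("more")    # not allowed after close()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```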

Consider a program whose read_file() function prints the contents of a file named
egypt.txt. Its output is:
The civilization coalesced around 3150 BC with the political unification of Upper and
Lower Egypt under the first pharaoh.

Ancient Egypt reached its pinnacle during the New Kingdom, after which it entered a
period of slow decline.

In the read_file() function definition, you open the file egypt.txt and assign the file object
to file_handler. By default, the file is opened in read-only mode, as no mode is specified
explicitly. Use a for loop to iterate over file_handler and print the lines. Once the
file processing operation is over, close the file_handler. In the output, notice a blank line
between each line of the file. At the end of each line, a newline character (\n) is present,
which is invisible and indicates the end of the line. The print() function by default always
appends a newline character. This means that if you print data that already ends in a
newline, you get two newlines, resulting in a blank line between the lines. To overcome
this problem, pass an end argument to the print() function and set it to an empty string
(with no spaces). The end argument should always be a string. The value of the end
argument is printed after the thing you want to print. By default, the end argument
contains a newline ("\n"), but it can be changed to something else, like an empty string.
This means that instead of the usual behavior of placing a newline character after the
end of the line, the print() function now prints an empty string after each line. So,
changing the line to print(each_line, end="") removes the blank lines between the lines
in the output.
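
The listing that this explanation refers to is not reproduced here. The following is a minimal sketch consistent with the description, not the original program: the body of read_file() is reconstructed from the text, and the sample file is a two-line stand-in created in a temporary directory.

```python
import os
import tempfile


def read_file(path):
    file_handler = open(path)         # no mode given, so read-only text mode
    for each_line in file_handler:
        # Each line already ends with "\n", so suppress print's own newline.
        print(each_line, end="")
    file_handler.close()


# Create a small stand-in for egypt.txt (placeholder content).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "egypt.txt")
file_handler = open(path, "w")
file_handler.write("First line.\nSecond line.\n")
file_handler.close()

read_file(path)
```

Replacing print(each_line, end="") with print(each_line) brings back the blank lines described above.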
Use of with statements to open and close files:
Instead of using try-except-finally blocks to handle file opening and closing operations,
a much cleaner way of doing this in Python is using the with statement. When you use a
with statement, you do not have to close the file handler object yourself; it is closed
automatically when the block ends.
The syntax of the with statement for the file I/O is,
with open(file, mode) as file_handler:
    statement(s)

In the syntax, the words with and as are keywords; the with keyword is followed by
the open() function, and the line ends with a colon. The as keyword acts like an alias and
is used to assign the object returned by the open() function to a new variable, file_handler.
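
A short sketch of the with statement in use is given below. The file name notes.txt and its content are placeholders; the point to notice is that the file handler is already closed once the block ends.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# The file is closed automatically when each with block ends.
with open(path, "w") as file_handler:
    file_handler.write("line one\nline two\n")

with open(path) as file_handler:
    lines = [line.rstrip("\n") for line in file_handler]

print(lines)
print(file_handler.closed)
```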

File methods to read and write data:


When you use the open() function a file object is created. Here is the list of methods that can
be called on this object.
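
The method table itself is not reproduced here. As a hedged illustration, the sketch below exercises a few methods that Python file objects are known to provide (write(), writelines(), readline(), readlines(), seek(), and read()); it is not a complete list.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "greek.txt")

with open(path, "w") as f:
    f.write("alpha\n")                      # write a single string
    f.writelines(["beta\n", "gamma\n"])     # write a sequence of strings

with open(path) as f:
    first = f.readline()      # one line, including its newline
    rest = f.readlines()      # the remaining lines as a list
    f.seek(0)                 # move back to the start of the file
    everything = f.read()     # the whole file as one string

print(repr(first))
print(rest)
print(repr(everything))
```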
Reading and writing Binary files:
We can usually tell whether a file is binary or text based on its file extension. This is because
by convention the extension reflects the file format, and it is ultimately the file format that
dictates whether the file data is binary or text.
The string 'b' appended to the mode opens the file in binary mode, and the data is then
read and written in the form of bytes objects. This mode should be used for all files that
don't contain text. Files opened in binary mode (including 'b' in the mode argument)
return contents as bytes objects without any decoding.
Let's understand bytes in detail. Consider the code below.
>>> print(b'Hello')
b'Hello'
>>> type(b'Hello')
<class 'bytes'>
>>> for i in b'Hello':
...     print(i)
...
72
101
108
108
111
>>> bytes(3)
b'\x00\x00\x00'
>>> bytes([70])
b'F'
>>> bytes([72, 101, 108, 108, 111])
b'Hello'
>>> print(b'\x61')
b'a'
>>> bytes('Hi', 'utf-8')
b'Hi'
b'Hello' is a bytes literal. Bytes literals are always prefixed with 'b' or 'B', and they
produce an instance of the bytes type instead of the str type. Python makes a clear
distinction between the str and bytes types.
The syntax for the bytes() constructor is,
bytes(source[, encoding])
where source is used to create the bytes object. It can be an integer, an iterable of
integers, or a string (in which case the encoding must also be given), as the examples
above show. bytes() returns a new bytes object. While bytes literals and representations
are based on ASCII text, bytes objects actually behave like immutable sequences of integers,
with each value in the sequence ranging from 0 to 255.
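
To connect bytes objects with binary mode, the sketch below writes a few bytes to a file opened with 'wb' and reads them back with 'rb'. The file name data.bin and the byte values are arbitrary choices for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")

# Seven arbitrary byte values; the first five spell "Hello" in ASCII.
payload = bytes([72, 101, 108, 108, 111, 0, 255])

with open(path, "wb") as f:      # 'b' selects binary mode
    written = f.write(payload)   # write() returns the number of bytes written

with open(path, "rb") as f:
    data = f.read()              # a bytes object, with no decoding applied

print(written)
print(data[:5])
print(list(data))
```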

Python os and os.path Modules:


The Python os module provides a portable way of using operating system dependent
functionality. For accessing the filesystem, use the os module. If you want to manipulate
paths, use the os.path module.
It looks like os should be a package with a submodule path, but, in reality, os is a normal
module that does magic with sys.modules to inject os.path.
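
A few commonly used os and os.path calls are sketched below. Everything happens inside a temporary directory, and the directory and file names (fauna, bison.txt) are borrowed from the earlier path examples.

```python
import os
import tempfile

base = tempfile.mkdtemp()

# os.path builds and inspects paths; os touches the filesystem itself.
target = os.path.join(base, "fauna", "bison.txt")
os.makedirs(os.path.dirname(target))

with open(target, "w") as f:
    f.write("bison\n")

exists = os.path.exists(target)
file_name = os.path.basename(target)
extension = os.path.splitext(target)[1]
listing = os.listdir(os.path.dirname(target))

print(exists, file_name, extension, listing)
```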
Processing HTML files:

HTML is the standard markup language for creating Web pages.

What is HTML?
● HTML stands for Hyper Text Markup Language

● HTML is the standard markup language for creating Web pages

● HTML describes the structure of a Web page

● HTML consists of a series of elements

● HTML elements tell the browser how to display the content


● HTML elements label pieces of content such as "this is a heading", "this is a
paragraph", "this is a link", etc.

A Simple HTML Document


Example

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>


<p>My first paragraph.</p>

</body>
</html>

Example Explained
● The <!DOCTYPE html> declaration defines that this document is an HTML5
document
● The <html> element is the root element of an HTML page

● The <head> element contains meta information about the HTML page

● The <title> element specifies a title for the HTML page (which is shown in
the browser's title bar or in the page's tab)
● The <body> element defines the document's body, and is a container for all
the visible contents, such as headings, paragraphs, images, hyperlinks,
tables, lists, etc.
● The <h1> element defines a large heading

● The <p> element defines a paragraph
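
Since an HTML page is a series of elements, a program can walk through them. The sketch below uses the standard html.parser module to collect the tag names and the text of the sample page above; the class name ElementCollector is a hypothetical name chosen for this illustration.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect start-tag names and non-empty text while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:                 # skip whitespace between tags
            self.texts.append(text)


page = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

collector = ElementCollector()
collector.feed(page)
collector.close()

print(collector.tags)
print(collector.texts)
```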

Python is one of the most versatile programming languages. It
emphasizes code readability with extensive use of white space. It comes
with the support of a vast collection of libraries which serve various
purposes, making our programming experience smoother and more enjoyable.
Python programs are used for:

● Connecting with databases and performing backend development.
● Making web applications.
● Writing effective system scripts.
● And especially in data science and artificial intelligence.
With this said, let us see how we can use Python programs to generate
HTML files as output. This is very effective for programs that
automatically create hyperlinks and graphic entities.

Creating an HTML file in python


We will be storing HTML tags in a multi-line Python string and saving the
contents to a new file. This file will be saved with a .html extension rather
than a .txt extension.
Note: We would be omitting the standard <!DOCTYPE HTML>
declaration!

# open/create a new html file in write mode
f = open('GFG.html', 'w')

# the html code which will go in the file GFG.html
html_template = """<html>
<head>
<title>Title</title>
</head>
<body>
<h2>Welcome To GFG</h2>

<p>Default code has been loaded into the Editor.</p>

</body>
</html>
"""

# write the code into the file
f.write(html_template)

# close the file
f.close()

The above program will create an HTML file:

Viewing the HTML source file


In order to display the HTML file as Python output, we will use
the codecs library. Its codecs.open() function opens files with a given
encoding, which it takes as the encoding parameter. Note that in Python 3
the built-in open() function also accepts an encoding parameter (for
example, open('GFG.html', encoding='utf-8')), so codecs.open() is mainly
seen in older code. Without an explicit encoding, files that are not ASCII
but UTF-8 may be displayed incorrectly on some platforms.

# import module
import codecs

# open/create a new html file in write mode
f = open('GFG.html', 'w', encoding='utf-8')

# the html code which will go in the file GFG.html
html_template = """
<html>
<head></head>
<body>
<p>Hello World! </p>
</body>
</html>
"""

# write the code into the file
f.write(html_template)

# close the file
f.close()

# viewing html files
# the call below creates a codecs.StreamReaderWriter object
file = codecs.open("GFG.html", 'r', "utf-8")

# use the .read method to view the html code from our object
print(file.read())

# close the reader as well
file.close()

Output:

Viewing the HTML web file


In Python, webbrowser module provides a high-level interface which
allows displaying Web-based documents to users.
The webbrowser module can be used to launch a browser in a platform-
independent manner as shown below:

# import module
import webbrowser

# open html file
webbrowser.open('GFG.html')

Output:
True
Processing XML files:

What is XML?
XML stands for "Extensible Markup Language". It is a markup language for storing and
exchanging data that has a specific structure, in a form that both people and programs
can read.
A file written in XML is known as an XML document. An XML document has a tree-like
structure that is straightforward and supports hierarchy. Let's understand some important
properties of XML.
o XML documents have sections known as elements, each enclosed between a start tag
(such as <title>) and an end tag (such as </title>). The characters between the start
and end tags are the element's content. The content can consist of markup, including
other elements, called "child elements". The top-level element is known as the root,
and it contains all other elements.
o Start tags and empty-element tags can contain name-value pairs known as attributes.

Below is the sample structure of the XML file.

XML
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
</catalog>
As we can see in the above XML sample file -
o <catalog> is the single root element, which contains all the other elements, such as
<book> and <title>.
o The child elements (sub-elements) are nested inside <catalog>.
o Each <book> element has one attribute, id, and several child elements, such as
<author> and <title>.

Note - The child elements can contain their own child elements, also known as the
"sub-child" element.
Now, let's move to the ElementTree library.

What is ElementTree?
The XML tree structure allows us to modify, navigate, and remove elements in a
simple manner. Python comes with the ElementTree library, which provides several functions
to read and manipulate XML. It is used to parse XML (read information from a file and split
it into pieces). Below is the table representation of the XML data structure.

Property        Description
Tag             A string that identifies the kind of data the element stores.
Attributes      A number of attributes, stored as a dictionary.
Text String     A text string holding the information that needs to be displayed.
Tail String     An optional text string that follows the element's end tag.
Child Elements  A number of child elements, stored as a sequence.


To use the ElementTree module, we need to import it into our program as below.

import xml.etree.ElementTree as ET
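
The properties in the table map onto attributes of an Element object. The sketch below parses a tiny inline snippet (a cut-down book element with some made-up tail text, used here only for illustration) and prints each property.

```python
import xml.etree.ElementTree as ET

# A tiny illustrative snippet; " tail text" is made up to show the tail.
snippet = """<book id="bk101"><title>XML Developer's Guide</title> tail text</book>"""

book = ET.fromstring(snippet)
title = book[0]          # child elements are accessed like a sequence

print(book.tag)          # tag: the element name
print(book.attrib)       # attributes: a dictionary
print(title.text)        # text string inside the element
print(repr(title.tail))  # tail string after the element's end tag
print(len(book))         # number of child elements
```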

Parsing XML Data


The primary objective of this tutorial is to read and understand the file using Python. The
sample XML file contains details of many books, but the data is not guaranteed to be
consistent: anybody can enter data into the file in their own way.
Let's see the following example, which assumes that the sample above is saved as book.xml.
Example -

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print(root)
Output:
<Element 'catalog' at 0x000001FAD52C44A0>
We have initialized the tree in the above code and printed the XML root object. Now, we
can print each part of the tree to understand the tree structure easily.
As discussed earlier, every part of the tree contains a tag that determines the element.
Elements may contain attributes that play a significant role in validating values entered for
that tag. Let's print the root tag of the XML.

print(root.tag)
Output:
catalog
At the top level, this XML is rooted in the catalog tag. Let's see
the root's attributes.

print("Attributes are:", root.attrib)
Output:
Attributes are: {}
As we can see, there are no attributes in the root.
Parsing Using For Loop
We can iterate over the sub-elements or children in the root using the for loop. Let's
understand the following example.
Example -

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

for ch in root:
    print(ch.tag, ch.attrib)
Output:
book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
book {'id': 'bk104'}
book {'id': 'bk105'}
book {'id': 'bk106'}
book {'id': 'bk107'}
book {'id': 'bk108'}
book {'id': 'bk109'}
As we can see, all the book elements are children of the root catalog element. The id
attribute identifies each book element, and every book has a different id.
It is often helpful to get information about the elements in the entire tree. For this we use
the root.iter() method in a for loop, which visits every element in the tree, not just the
direct children. The example below collects only the tags, so it does not show the
attributes or the level of each element in the tree.
Example -

import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

print("Iterating root using for loop:")
tags = [elem.tag for elem in root.iter()]
print(tags)
Output:
Iterating root using for loop:
['catalog', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description',
'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book',
'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author',
'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title',
'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre',
'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price',
'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price',
'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price',
'publish_date', 'description']
Since ElementTree is a powerful library, we can print the whole document using
the ET.tostring() function. We pass the root to this function along with the encoding, and
then decode the resulting bytes back into a string. Here the encoding used is 'utf8'.
Let's understand the following code snippet.
Example -

print(ET.tostring(root, encoding='utf8').decode('utf8'))
Output:
<?xml version='1.0' encoding='utf8'?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
</catalog>
The root.iter() method finds elements of interest. It iterates over every element
under the root, at any depth, whose tag matches the specified name. The following
code prints the attributes of each book element.
Example -

for book in root.iter('book'):
    print(book.attrib)
Output:
{'id': 'bk101'}
{'id': 'bk102'}
{'id': 'bk103'}
{'id': 'bk104'}
{'id': 'bk105'}
{'id': 'bk106'}
{'id': 'bk107'}
{'id': 'bk108'}
{'id': 'bk109'}
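
The same loop can also read child elements such as the title and price. The following is a minimal sketch, assuming a shortened two-book catalog held in a string; the names catalog_xml and books are illustrative and not part of the notes above.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical catalog with the same structure as the file above.
catalog_xml = """
<catalog>
  <book id="bk106">
    <author>Randall, Cynthia</author>
    <title>Lover Birds</title>
    <price>4.95</price>
  </book>
  <book id="bk109">
    <author>Kress, Peter</author>
    <title>Paradox Lost</title>
    <price>6.95</price>
  </book>
</catalog>
"""

root = ET.fromstring(catalog_xml)

# Collect (id, title, price) for every <book> element under the root.
books = []
for book in root.iter("book"):
    book_id = book.attrib["id"]
    title = book.find("title").text
    price = float(book.find("price").text)
    books.append((book_id, title, price))
    print(book_id, title, price)
```

Here find() returns the first direct child with the given tag, and .text gives its text content.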

Regular Expressions:
Regular expressions, also called REs, regexes, or regex patterns,
provide a powerful way to search and manipulate strings. Regular
expressions are essentially a tiny, highly specialized programming language
embedded inside Python and made available through the re module.

Regular expressions use a sequence of characters and symbols to define a
pattern of text. Such a pattern is used to locate a chunk of text in a string
by matching the pattern against the characters in the string.
Regular expressions are useful for finding phone numbers, email
addresses, dates, and any other data that has a consistent format.
Using Special Characters:
A regular expression pattern is composed of simple characters, such as
abc, or a combination of simple and special characters, such as ab*c.
Simple patterns are constructed of characters for which you want to find a
text match.
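
As a small illustration of a pattern that mixes simple and special characters, the sketch below tests ab*c against a few strings. The list of candidate strings is made up for the example.

```python
import re

# "ab*c" means: an "a", then zero or more "b" characters, then a "c".
pattern = "ab*c"
candidates = ["ac", "abc", "abbbc", "abd"]

# re.fullmatch succeeds only if the whole string fits the pattern.
results = {text: re.fullmatch(pattern, text) is not None for text in candidates}
print(results)
```

Only "abd" fails, because the pattern requires a "c" after the optional run of "b" characters.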

A RegEx, or Regular Expression, is a sequence of characters that forms a
search pattern.

RegEx can be used to check if a string contains the specified search pattern.

RegEx Module
Python has a built-in package called re, which can be used to work with Regular
Expressions.

Import the re module:

import re

RegEx in Python
When you have imported the re module, you can start using regular expressions:

Example

Search the string to see if it starts with "The" and ends with "Spain":

import re

txt = "The rain in Spain"


x = re.search("^The.*Spain$", txt)
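
The search above stores its result in x but does not inspect it. A minimal sketch of how the result might be checked, reusing the same string and pattern:

```python
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# re.search returns a Match object on success and None otherwise.
if x:
    print("Match found:", x.group())
else:
    print("No match")
```

Because a Match object is truthy and None is falsy, the result can be used directly in an if statement.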
RegEx Functions
The re module offers a set of functions that allows us to search a string for a match:

Function    Description
findall     Returns a list containing all matches
search      Returns a Match object if there is a match anywhere in the string
split       Returns a list where the string has been split at each match
sub         Replaces one or many matches with a string
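
The four functions can be compared side by side on one string. This is a small sketch; the variable names are chosen only for illustration.

```python
import re

txt = "The rain in Spain"

all_ai = re.findall("ai", txt)        # every occurrence of "ai"
first_space = re.search(r"\s", txt)   # Match object for the first whitespace
words = re.split(r"\s", txt)          # split at each whitespace character
dashed = re.sub(r"\s", "-", txt)      # replace each whitespace with "-"

print(all_ai)
print(first_space.start())
print(words)
print(dashed)
```

Note that findall, split, and sub return plain lists or strings, while search returns a Match object (or None) that must be queried with methods such as start() or group().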


Metacharacters
Metacharacters are characters with a special meaning:

Character   Description                                        Example
[]          A set of characters                                "[a-m]"
\           Signals a special sequence (can also be used       "\d"
            to escape special characters)
.           Any character (except newline character)           "he..o"
^           Starts with                                        "^hello"
$           Ends with                                          "planet$"
*           Zero or more occurrences                           "he.*o"
+           One or more occurrences                            "he.+o"
?           Zero or one occurrences                            "he.?o"
{}          Exactly the specified number of occurrences        "he.{2}o"
|           Either or                                          "falls|stays"
()          Capture and group
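
A few of the metacharacters from this section can be tried on a sample string. The sketch below assumes the test string "hello planet", which is made up to fit the example patterns.

```python
import re

txt = "hello planet"

# Each entry maps a pattern to the list of matches that findall returns.
checks = {
    "[a-m]": re.findall("[a-m]", txt),      # any single letter from a to m
    "he..o": re.findall("he..o", txt),      # "he", any two characters, "o"
    "^hello": re.findall("^hello", txt),    # string starts with "hello"
    "planet$": re.findall("planet$", txt),  # string ends with "planet"
    "he.{2}o": re.findall("he.{2}o", txt),  # "he", exactly two characters, "o"
}

for pattern, matches in checks.items():
    print(pattern, "->", matches)
```

An empty list from findall means the pattern did not match anywhere in the string.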

Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and
has a special meaning:

Character   Description                                                  Example
\A          Returns a match if the specified characters are at the       "\AThe"
            beginning of the string
\b          Returns a match where the specified characters are at the    r"\bain"
            beginning or at the end of a word                            r"ain\b"
\B          Returns a match where the specified characters are present,  r"\Bain"
            but NOT at the beginning (or at the end) of a word           r"ain\B"
\d          Returns a match where the string contains digits (numbers    "\d"
            from 0-9)
\D          Returns a match where the string DOES NOT contain digits     "\D"
\s          Returns a match where the string contains a white space      "\s"
            character
\S          Returns a match where the string DOES NOT contain a white    "\S"
            space character
\w          Returns a match where the string contains any word           "\w"
            characters (characters from a to Z, digits from 0-9, and
            the underscore _ character)
\W          Returns a match where the string DOES NOT contain any word   "\W"
            characters
\Z          Returns a match if the specified characters are at the end   "Spain\Z"
            of the string

(The "r" at the beginning of r"\bain" makes sure that the string is treated as a "raw string".)
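
Some of these special sequences can be checked with a short script. The sketch below assumes a made-up sample sentence and uses raw strings throughout so that the backslashes reach the re module unchanged.

```python
import re

txt = "The rain in Spain costs 25 euros"

digits = re.findall(r"\d", txt)            # every single digit
word_start = re.findall(r"\bain", txt)     # "ain" at the start of a word
word_end = re.findall(r"ain\b", txt)       # "ain" at the end of a word
starts_the = re.findall(r"\AThe", txt)     # "The" at the very beginning
ends_euros = re.findall(r"euros\Z", txt)   # "euros" at the very end

print(digits, word_start, word_end, starts_the, ends_euros)
```

No word in the sample begins with "ain", so word_start is empty, while "rain" and "Spain" both end with it.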

ex1:
import webbrowser

# HTML content to write to the file
html_template = """
<html>
<head>
<title>
html in python
</title>
python
</head>
<body>
<h1>hello</h1>
<h2>hai</h2>
</body>
</html>
"""

# Write the HTML page to disk
f = open("D:\\prog1.html", "w")
f.write(html_template)
f.close()

# Open the HTML file in the default browser
webbrowser.open("D:\\prog1.html")
Processing HTML in Python

from bs4 import BeautifulSoup

with open("D:\\prog1.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# Print the name of every tag in the document
for child in soup.descendants:
    if child.name:
        print(child.name)

output:
html
head
title
body
h1
h2

ex2:

from bs4 import BeautifulSoup

with open("D:\\prog1.html", "r") as f:
    contents = f.read()

# Print the whole <html> element, including its children
soup = BeautifulSoup(contents, "html.parser")
print(soup.html)

output:
<html>
<head>
<title>
html in python
</title>
python
</head>
<body>
<h1>hello</h1>
<h2>hai</h2>
</body>
</html>
ex:
from bs4 import BeautifulSoup

with open("D:\\prog1.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

# Print the text of every <h1> tag, then every <h2> tag
for tags in soup.find_all("h1"):
    print(tags.text)
for tags in soup.find_all("h2"):
    print(tags.text)

output:
hello
hai

ex:
from bs4 import BeautifulSoup

with open("D:\\prog1.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

for tags in soup.find_all("html"):  # all the text under the tag is displayed
    print(tags.text)

output:
html in python

python

hello
hai

Files & Working with Text Data

Ex:1
f=open("alekhya.txt","w")
print("file name=",f.name)
print("file open mode=",f.mode)
print("file is closed",f.closed)
print("encoding Algorithm is=",f.encoding)
f.close()

Output:
file name= alekhya.txt
file open mode= w
file is closed False
encoding Algorithm is= cp1252

Ex: 2

f=open("alekhya.txt","w")
f.write("hello students\n")
f.write("welcome to class\n")
f.write("Data engineering with python\n")
f.close()

Ex 3:
f=open("alekhya.txt","r")
rd=f.read()
print(rd)
f.close()

OUTPUT:
hello students
welcome to class
Data engineering with python

ex 4:

f=open("alekhya.txt","r")
print(f.readline())
print(f.readline())
f.close()

Output:
hello students

welcome to class
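
The examples above open and close files by hand. A common alternative is the with statement, which closes the file automatically. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name sample.txt so that no real file is touched.

```python
import os
import tempfile

lines_to_write = [
    "hello students\n",
    "welcome to class\n",
    "Data engineering with python\n",
]

# Work in a temporary directory; "sample.txt" is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # The file is closed automatically when the with block ends.
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines_to_write)

    # Iterating over the file object yields one line at a time.
    with open(path, "r", encoding="utf-8") as f:
        stripped = [line.rstrip("\n") for line in f]

print(stripped)
```

Stripping the trailing newline avoids the blank lines that appear when print() is applied to the result of readline().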

Working with Binary File


f=open("D:\\border.jfif","rb")
data=f.read()
print(data)
f.close()
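
Reading in "rb" mode returns a bytes object rather than a string. The sketch below illustrates this with a small made-up payload written to a temporary file; the file name sample.bin and the byte values are assumptions for the example, not the contents of the image above.

```python
import os
import tempfile

# Made-up sample bytes used only for this illustration.
payload = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.bin")  # hypothetical file name

    # "wb" writes raw bytes; "rb" reads them back unchanged.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()

print(type(data), len(data))
print(data[:2].hex())  # first two bytes shown as hexadecimal
```

Printing a slice in hexadecimal is usually easier to read than printing the whole bytes object of a large file.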
