Data Engineering With Python
REPORT STRUCTURE
The project report is what we (data scientists) submit to the data
sponsor (the customer). The report typically includes the following:
• Abstract (a brief and accessible description of the project)
• Introduction
• Methods that were used for data acquisition and processing
• Results that were obtained (do not include intermediate and
insignificant results in this section; rather, put them into an appendix)
• Conclusion
• Appendix
Files:
1. A file is the common storage unit in a computer; all programs
and data are “written” to a file and “read” from a file.
2. A file extension, sometimes called a file suffix or a filename
extension, is the character or group of characters after the last period
in a file name.
3. For example, the file assignments.docx ends in docx, a file extension
that is associated with Microsoft Word on your computer.
4. Any file’s extension can be renamed, but that will not convert the file
to another format or change anything about the file other than this portion
of its name.
5. Some file extensions are classified as executable, meaning that when
clicked, the files do not just open for viewing or playing; they actually do
something by themselves, like install a program, start a process, or run a
script.
6. All the data on your hard drive consists of files and directories. The
fundamental difference between the two is that files store data, while
directories store files and other directories.
7. Directories are used to organize files on your computer. Files, on the
other hand, hold the data itself and can range from a few bytes to several
gigabytes.
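The points about extensions and directories can be checked with a short sketch. It assumes the standard pathlib and tempfile modules; the file name notes.txt is a made-up example and is not part of these notes.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)                    # a directory: it stores files
    original = folder / "notes.txt"       # hypothetical file name
    original.write_text("plain text\n", encoding="utf-8")

    print(original.suffix)                # the extension, including the period

    # Renaming changes only the name; the stored content stays the same.
    renamed = original.rename(folder / "notes.docx")
    same_content = renamed.read_text(encoding="utf-8") == "plain text\n"

    is_file = renamed.is_file()           # True for files
    is_dir = folder.is_dir()              # True for directories
    suffix_after = renamed.suffix

print(same_content, is_file, is_dir, suffix_after)
```

The renamed file still holds plain text, so a word processor would not treat it as a real .docx document.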
Types of Files:
1. Python supports two types of files – text files and binary files.
2. While both binary and text files contain data stored as a series of bits
(binary values of 1s and 0s), the bits in text files represent characters, while
the bits in binary files represent custom data.
3. Binary files typically contain a sequence of bytes, that is, ordered
groupings of eight bits.
4. Binary file formats may include multiple types of data in the same file,
such as image, video, and audio data.
5. This data can be interpreted by supporting programs but will show up
as garbled text in a text editor.
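The difference between the two file types can be seen by reading the same file in text mode and in binary mode. This is a minimal sketch that assumes a temporary file created only for the demonstration.

```python
import os
import tempfile

# Create a temporary file to work with (an assumption for this demo).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Write two characters in text mode.
with open(path, "w", encoding="utf-8") as f:
    f.write("Hi")

# Text mode decodes the stored bytes into characters (a str).
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode returns the raw bytes without decoding them.
with open(path, "rb") as f:
    as_bytes = f.read()

os.remove(path)

print(repr(as_text))     # 'Hi'
print(as_bytes)          # b'Hi'
print(list(as_bytes))    # [72, 105]
```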
The open() function returns a file handler object for the named file. The open() function is
commonly used with two arguments: the first is a string containing the name of the file to be
opened, which can be absolute or relative to the current working directory, and the second is a
string giving the mode in which the file is opened (for example, "r" for reading).
For example,
>>> file_handler = open("moon.txt","r")
>>> file_handler.close()
The civilization coalesced around 3150 BC with the political unification of Upper and
Lower Egypt under the first pharaoh.
Ancient Egypt reached its pinnacle during the New Kingdom, after which it entered a
period of slow decline.
In the read_file() function definition, you open the file egypt.txt and assign the file object
to file_handler. By default, the file is opened in read-only mode because no mode is specified
explicitly. Use a for loop to iterate over file_handler and print the lines. Once the
file processing operation is over, close file_handler. In the output, notice a blank line
between each line of the file. At the end of each line of the file there is an invisible newline
character (\n) that marks the end of the line. The print() function by
default also appends a newline character. This means that if you print data that
already ends in a newline, you get two newlines, resulting in a blank line between the
lines. To overcome this problem, pass an end argument to the print() function and
set it to an empty string (with no spaces). The end argument should always be a
string. The value of the end argument is printed after the thing you want to print. By default,
the end argument contains a newline ("\n"), but it can be changed to something else, like
an empty string. This means that instead of the usual behavior of placing a newline character
after the end of the line, the print() function now prints an
empty string after each line. So, changing the line to print(each_line, end="") removes the
blank lines between the lines in the output.
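The behavior described above can be sketched as follows. The function body is a reconstruction from the description, not the original listing, and a small temporary file stands in for egypt.txt so that the example runs on its own.

```python
import os
import tempfile


def read_file(path):
    """Print every line of the file without adding an extra newline."""
    file_handler = open(path)        # no mode given, so read-only text mode
    for each_line in file_handler:
        # each_line already ends with "\n", so print() must not add another
        print(each_line, end="")
    file_handler.close()


# A stand-in for egypt.txt (hypothetical content).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as sample:
    sample.write("first line\nsecond line\n")

read_file(sample_path)
os.remove(sample_path)
```

Replacing end="" with the default print(each_line) brings the blank lines back.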
9.2.3 Use of with Statements to Open and Close Files
Instead of using try-except-finally blocks to handle file opening and closing operations,
a much cleaner way of doing this in Python is using the with statement. With a
with statement, you do not have to close the file handler
object yourself; it is closed automatically when the block ends.
The syntax of the with statement for file I/O is,
with open(file, mode) as file_handler:
    statement(s)
In the syntax, the words with and as are keywords. The with keyword is followed by
the open() function, and the line ends with a colon. The as keyword acts like an alias and is used to
assign the object returned by the open() function to a new variable file_handler.
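A minimal sketch of the with statement in use is given below. The file name moon.txt is reused from the earlier open() example, and the temporary directory and the sentence written to the file are assumptions made so the example is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "moon.txt")

    # The file is closed automatically when each with block ends.
    with open(path, "w") as file_handler:
        file_handler.write("The Moon orbits the Earth.\n")

    with open(path, "r") as file_handler:
        contents = file_handler.read()

    # No explicit close() call was needed.
    closed_after_block = file_handler.closed

print(contents, end="")
print(closed_after_block)
```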
produce an instance of the bytes type instead of the str type. Python makes a clear
distinction between str and bytes types.
The syntax for the bytes() constructor is,
bytes(source[, encoding])
where the source is used to create a bytes object. It can be, for example, an integer or a
string; a string source also requires the encoding argument.
The bytes() constructor returns a new bytes object. While bytes literals and representations
are based on ASCII text, bytes objects actually behave like immutable sequences of integers,
with each value in the sequence ranging from 0 to 255.
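The following sketch illustrates these properties of bytes objects. The sample values are arbitrary choices for the demonstration.

```python
# A string source needs an encoding.
from_string = bytes("abc", "utf-8")

# An integer source n creates n zero-valued bytes.
from_int = bytes(3)

print(from_string)          # b'abc'
print(list(from_string))    # [97, 98, 99]
print(from_int)             # b'\x00\x00\x00'

# Indexing returns an integer between 0 and 255.
first_value = from_string[0]

# bytes objects are immutable, so item assignment raises TypeError.
try:
    from_string[0] = 65
    mutable = True
except TypeError:
    mutable = False

print(first_value, mutable)
```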
What is HTML?
● HTML stands for Hyper Text Markup Language
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
</body>
</html>
Example Explained
● The <!DOCTYPE html> declaration defines that this document is an HTML5
document
● The <html> element is the root element of an HTML page
● The <head> element contains meta information about the HTML page
● The <title> element specifies a title for the HTML page (which is shown in
the browser's title bar or in the page's tab)
● The <body> element defines the document's body, and is a container for all
the visible contents, such as headings, paragraphs, images, hyperlinks,
tables, lists, etc.
● The <h1> element defines a large heading
# to open/create a new html file in the write mode
f = open('GFG.html', 'w')
</body>
</html>
"""
# import module
import codecs
Output:
# import module
import webbrowser
Output:
True
Processing XML files:
What is XML?
XML stands for "Extensible Markup Language". It is a markup language for describing
data that has a specific structure, so that programs can store, exchange, and interpret
that data.
A document written in XML is known as an XML document. XML produces a tree-like
structure that is straightforward and supports hierarchy. Let's understand some important
properties of XML.
o XML documents have sections known as elements, each enclosed between a start tag
and an end tag written with < and >. The characters between the start and end tags are the
element's content. The content can consist of markup, including other elements,
the "child elements". The top-level element is known as the root, and it contains all other
elements.
o A start tag or empty-element tag can contain name-value pairs known as attributes.
XML
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
</catalog>
As we can see in the above XML sample file -
o The <catalog> element is the single root element that contains all the other elements,
such as <book> or <title>.
o The child elements or sub-elements are inside <catalog>, and we can see that they
are nested.
o Each <book> element has an id attribute and contains multiple child elements such as
author, title, etc.
Note - The child elements can contain their own child elements, also known as
"sub-child" elements.
Now, let's move to the ElementTree library.
What is ElementTree?
The XML tree structure allows us to modify, navigate, and remove elements in a
simple manner. Python comes with the ElementTree library, which provides several functions
to read and manipulate XML. It is used to parse XML (read information from a file and split
it into pieces). Below is the table representation of the XML data structure.
Property Description
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()
print(root)
Output:
<Element 'catalog' at 0x000001FAD52C44A0>
We have initialized the tree in the above code and printed the XML root object. Now, we
can print each part of the tree to understand the tree structure easily.
As discussed earlier, every part of the tree contains a tag that determines the element.
Elements may contain attributes that play a significant role in validating values entered for
that tag. Let's print the root tag of the XML.
print(root.tag)
Output:
catalog
If we observe the XML file at the top level, this XML is rooted in the catalog tag. Let's see
the root's attributes.
print("Attributes are:",root.attrib)
Output:
Attributes are: {}
As we can see, there are no attributes in the root.
Parsing Using For Loop
We can iterate over the sub-elements or children in the root using the for loop. Let's
understand the following example.
Example -
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

# Each direct child of the root is one book element
for ch in root:
    print(ch.tag, ch.attrib)
Output:
book {'id': 'bk101'}
book {'id': 'bk102'}
book {'id': 'bk103'}
book {'id': 'bk104'}
book {'id': 'bk105'}
book {'id': 'bk106'}
book {'id': 'bk107'}
book {'id': 'bk108'}
book {'id': 'bk109'}
As we can see, all book elements are children of the root catalog element. The id attribute
identifies each book element, and every book has a different id.
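To make the difference between attributes and child elements concrete, here is a small sketch. It assumes a two-book excerpt of the catalog passed to ET.fromstring() instead of the book.xml file, so that it runs on its own; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A shortened excerpt of the catalog (assumption: only two books).
xml_text = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
</catalog>"""

root = ET.fromstring(xml_text)

# get() reads an attribute; find() returns the first matching child element.
books = {book.get("id"): book.find("title").text for book in root}
print(books)
```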
It is quite helpful to get information about the elements in the entire tree. Now we use
the root.iter() method in a for loop, which visits every element in the tree, starting with
the root itself. However, a list of tags alone doesn't show the attributes or the level in the tree.
Example -
import xml.etree.ElementTree as ET
tree = ET.parse('book.xml')
root = tree.getroot()

print("Iterating root using for loop:")
# root.iter() walks the whole tree, including the root element itself
tags = [elem.tag for elem in root.iter()]
print(tags)
Output:
Iterating root using for loop:
['catalog', 'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description',
'book', 'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book',
'author', 'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author',
'title', 'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title',
'genre', 'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre',
'price', 'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price',
'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price',
'publish_date', 'description', 'book', 'author', 'title', 'genre', 'price',
'publish_date', 'description']
Since ElementTree is a powerful library, we can print the whole document using
the .tostring() function. We pass the root to this function together with an encoding, and
then decode the resulting bytes back into a string. For this XML, we use 'utf8'.
Let's understand the following code snippet.
Example -
print(ET.tostring(root, encoding='utf8').decode('utf8'))
Output:
<?xml version='1.0' encoding='utf8'?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
</catalog>
The root.iter() method also helps us find elements of particular interest. When a tag name
is passed, this method yields all the subelements under the root that match the specified
tag. Let's see the following code.
Example -
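The original listing for this example is not available, so the following is only a sketch of the idea. It assumes a shortened two-book excerpt of the catalog and the tag name title as the element of interest.

```python
import xml.etree.ElementTree as ET

# A shortened excerpt of the catalog (assumption: only two books).
xml_text = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
</catalog>"""

root = ET.fromstring(xml_text)

# Passing a tag name restricts the iteration to matching elements.
titles = [elem.text for elem in root.iter("title")]
print(titles)
```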
Regular Expressions:
Regular expressions, also called REs, regexes, or regex patterns,
provide a powerful way to search and manipulate strings. Regular
expressions are essentially a tiny, highly specialized programming language
embedded inside Python and made available through the re module.
RegEx can be used to check if a string contains the specified search pattern.
RegEx Module
Python has a built-in package called re, which can be used to work with Regular
Expressions.
import re
RegEx in Python
When you have imported the re module, you can start using regular expressions:
Search the string to see if it starts with "The" and ends with "Spain":
import re
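A sketch of such a search is shown here. The sample sentence and the pattern are assumptions chosen to fit the description; ^ anchors the match at the start of the string and $ at the end.

```python
import re

txt = "The rain in Spain"    # hypothetical sample string

# ^The   : the string must start with "The"
# .*     : any characters in between
# Spain$ : the string must end with "Spain"
match = re.search(r"^The.*Spain$", txt)

found = match is not None
print(found)
```

re.search() returns a match object when the pattern is found and None otherwise, which is why the result is compared with None.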
Function Description
split Returns a list where the string has been split at each match
Metacharacters
Character   Description   Example
|           Either or     "falls|stays"
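The split function and the | metacharacter can be tried out with a short sketch. The sample strings are invented for the demonstration.

```python
import re

text = "The rain falls and the snow stays"    # hypothetical sample

# "|" matches either the pattern on its left or the one on its right.
either = re.findall(r"falls|stays", text)

# split() returns a list where the string has been split at each match;
# here the pattern is a single space.
parts = re.split(r" ", "The rain in Spain")

print(either)    # ['falls', 'stays']
print(parts)     # ['The', 'rain', 'in', 'Spain']
```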
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and
has a special meaning:
Character   Description                                                  Example
\D          Returns a match for any character that is NOT a digit        "\D"
\S          Returns a match for any character that is NOT a white        "\S"
            space character
\W          Returns a match for any character that is NOT a word         "\W"
            character
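The three special sequences listed here can be compared on one sample string. This is a minimal sketch; the string "Room 42 B" is an arbitrary choice.

```python
import re

sample = "Room 42 B"    # hypothetical sample string

non_digits = re.findall(r"\D", sample)    # every character that is not a digit
non_spaces = re.findall(r"\S", sample)    # every character that is not white space
non_word = re.findall(r"\W", sample)      # every character that is not a word character

print(non_digits)    # ['R', 'o', 'o', 'm', ' ', ' ', 'B']
print(non_spaces)    # ['R', 'o', 'o', 'm', '4', '2', 'B']
print(non_word)      # [' ', ' ']
```

The patterns are written as raw strings (r"...") so that Python passes the backslash to the re module unchanged.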
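import webbrowser

# Windows path of the HTML file to create
path = "D:\\prog1.html"
html_template = """
<html>
<head>
<title>
html in python
</title>
python
</head>
<body>
<h1>hello</h1>
<h2>hai</h2>
</body>
</html>
"""
# Write the HTML text to the file; the with statement closes it automatically
with open(path, "w") as f:
    f.write(html_template)
# Open the saved file in the default web browser
webbrowser.open(path)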
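Processing HTML in Python
from bs4 import BeautifulSoup

with open("d://prog1.html", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "html.parser")
# Walk through every node in the document and print the tag names
for child in soup.descendants:
    if child.name:
        print(child.name)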
output:
html
head
title
body
h1
h2
ex2:
output:
<html>
<head>
<title>
</title>
</head>
<body>
<h1>hello</h1>
<h2>hai</h2>
</body>
</html>
ex:
from bs4 import BeautifulSoup
# The backslash is doubled so that it is read as a literal backslash
with open("d:\\prog1.html", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "html.parser")
for tags in soup.find_all("h1"):
    print(tags.text)
for tags in soup.find_all("h2"):
    print(tags.text)
output:
hello
hai
ex:
from bs4 import BeautifulSoup
with open("d:\\prog1.html", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, "html.parser")
# All the text under the html tag will be displayed
for tags in soup.find_all("html"):
    print(tags.text)
output:
html in python
python
hello
hai
Ex:1
f=open("alekhya.txt","w")
print("file name=",f.name)
print("file open mode=",f.mode)
print("file is closed",f.closed)
# The default encoding depends on the platform (cp1252 is common on Windows)
print("encoding Algorithm is=",f.encoding)
f.close()
Output:
file name= alekhya.txt
file open mode= w
file is closed False
encoding Algorithm is= cp1252
Ex: 2
f=open("alekhya.txt","w")
f.write("hello students\n")
f.write("welcome to class\n")
f.write("Data engineering with python\n")
f.close()
Ex 3:
f=open("alekhya.txt","r")
rd=f.read()
print(rd)
f.close()
OUTPUT:
hello students
welcome to class
Data engineering with python
ex 4:
f=open("alekhya.txt","r")
# readline() keeps the trailing "\n", so end="" avoids blank lines in the output
print(f.readline(), end="")
print(f.readline(), end="")
f.close()
Output:
hello students
welcome to class