File handling
Course:- Foundation of Data Analytics
Outline of the topic
What is file handling?
Why is file handling important?
Different file formats
Python libraries for file handling
File operations
What is file handling?
File handling in data science refers to the process of reading, writing, and
manipulating data files as part of data analysis, data preprocessing, and data
visualization tasks.
Data scientists often work with various file formats and data sources to
access, clean, transform, and analyze data.
File handling is a fundamental aspect of data science because it allows data
scientists to load data into their analysis environments, store the results of
their work, and exchange data with others.
In data analytics, file handling refers to the processes and techniques used to
manage and work with data files as part of the data analysis workflow.
Importance of file handling
File handling is of paramount importance in computer programming and data management for the following
reasons:
Data Storage and Persistence: File handling allows you to store data beyond the runtime of a program.
Data can be saved to files and retrieved later, which is essential for preserving information and sharing it
across sessions or with other users.
Data Input and Output: File handling facilitates reading data from external sources (input) and writing
data to external destinations (output). This is vital for loading datasets, saving results, and interacting with
various data formats and systems.
Data Manipulation and Processing: Files are often used as the means to access, preprocess, and
transform data. Data can be cleaned, structured, and prepared for analysis or further processing.
Data Sharing and Communication: Files serve as a common medium for sharing data between
programs, systems, and users. They enable interoperability and collaboration, allowing data to be
exchanged between different applications and platforms.
Data Backup and Recovery: Files enable data backup and recovery procedures. Regularly saving data to
files ensures that important information is not lost due to system failures or errors.
File formats in data science
A file format defines how a specific type of information is stored, and it tells the computer how to
display or process the content. Common file formats include CSV, XLSX, ZIP, and TXT.
If you see your future as a data scientist, you must understand the different file formats. Data
science is all about data and its processing, and working with data becomes complicated if you do
not understand the format it is stored in.
Different types of file formats:
CSV: CSV stands for comma-separated values. As the name suggests, a CSV file uses commas to
separate values. Each line of the file is a data record, and each record consists of one or more
fields separated by commas.
import pandas as pd
df = pd.read_csv("file_path/file_name.csv")
print(df)
Sample records from a CSV file (fields that contain commas or quotes are enclosed in double quotes):
1,"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25,38.94,35,Nunavut,Storage & Organization,0.8
2,"1.7 Cubic Foot Compact ""Cube"" Office Refrigerators",Barry French,293,457.81,208.16,68.02,Nunavut,Appliances,0.58
3,"Cardinal Slant-D® Ring Binder, Heavy Gauge Vinyl",Barry French,293,46.71,8.69,2.99,Nunavut,Binders and Binder Accessories,0.39
4,R380,Clay Rozendal,483,1198.97,195.99,3.99,Nunavut,Telephones and Communication,0.58
5,Holmes HEPA Air Purifier,Carlos Soltero,515,30.94,21.78,5.94,Nunavut,Appliances,0.5
6,G.E. Longer-Life Indoor Recessed Floodlight Bulbs,Carlos Soltero,515,4.43,6.64,4.95,Nunavut,Office Furnishings,0.37
7,"Angle-D Binders with Locking Rings, Label Holders",Carl Jackson,613,-54.04,7.3,7.72,Nunavut,Binders and Binder Accessories,0.38
8,"SAFCO Mobile Desk Side File, Wire Frame",Carl Jackson,613,127.7,42.76,6.22,Nunavut,Storage & Organization,
9,"SAFCO Commercial Wire Shelving, Black",Monica Federle,643,-695.26,138.14,35,Nunavut,Storage & Organization,
10,Xerox 198,Dorothy Badders,678,-226.36,4.98,8.33,Nunavut,Paper,0.38
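The record layout described above can be checked with Python's built-in csv module. This is only a minimal sketch: the two-row sample is a shortened, hypothetical extract that reuses values from the records above, and it is read from an in-memory string rather than from a real file.

```python
import csv
import io

# Hypothetical shortened sample (first five fields of two records).
sample = (
    '1,"Eldon Base for stackable storage shelf, platinum",Muhammed MacIntyre,3,-213.25\n'
    '4,R380,Clay Rozendal,483,1198.97\n'
)

# csv.reader splits each line on commas but keeps quoted fields intact,
# so the comma inside the product name does not create an extra field.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(len(row), row[1])
```

Both records come back with five fields, which shows why a field containing a comma has to be quoted.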
XLSX: An XLSX file is a Microsoft Excel Open XML Format spreadsheet file. It can store any type
of data, but it is mainly used to store financial data and to create mathematical models.
import pandas as pd
# read_excel loads the first sheet of the workbook into a DataFrame
df = pd.read_excel(r'C:\Users\Shashi\Desktop\city.xlsx')
print(df)
ZIP: ZIP files are used as data containers. They store one or more files in compressed form and
are widely used on the internet. After you download a ZIP file, you normally need to unpack its
contents in order to use them. pandas can read a ZIP archive directly when it contains a single data file.
import pandas as pd
df = pd.read_csv(r'C:\Users\Shashi\Desktop\occupancy.zip')
print(df)
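To make the idea of a ZIP file as a compressed container concrete, here is a small sketch using the standard zipfile module. The archive, the member name occupancy.csv, and its contents are all hypothetical and are created in a temporary folder just for the demonstration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "occupancy.zip")

    # Pack one small CSV member (made-up data) into a compressed archive.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occupancy.csv", "room,people\nA101,12\nB202,7\n")

    # List the members and read one back without unpacking it to disk.
    with zipfile.ZipFile(zip_path) as archive:
        member_names = archive.namelist()
        text = archive.read("occupancy.csv").decode("utf-8")

print(member_names)
print(text)
```

Use archive.extractall() instead of archive.read() when the files should be unpacked to a folder.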
TXT: TXT files store information as plain text with no special formatting such as fonts or font styles. They can be opened by any
text editor and by most other software programs.
import pandas as pd
df = pd.read_csv(r'C:\Users\Shashi\Desktop\occupancy.txt')
print(df)
JSON: JSON stands for JavaScript Object Notation. It is a standard text-based format for representing structured data,
based on JavaScript object syntax.
import pandas as pd
# The sample file holds a single flat object, so it is read as a Series
data = pd.read_json(r'C:\Users\Shashi\Desktop\fruit.json', typ='series')
print(data)
Format of JSON file:
{
"fruit": "Apple",
"size": "Large",
"color": "Red"
}
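The same JSON object can also be handled with Python's built-in json module, without pandas. The sketch below parses the sample from a string (an assumption made to keep it self-contained, rather than reading fruit.json from disk), changes one value, and converts it back to JSON text.

```python
import json

# The sample object from above, held in a string instead of a file.
raw = '{"fruit": "Apple", "size": "Large", "color": "Red"}'

# json.loads turns JSON text into a Python dictionary.
record = json.loads(raw)
print(record["fruit"])

# json.dumps turns the dictionary back into JSON text.
record["color"] = "Green"
encoded = json.dumps(record, indent=2)
print(encoded)
```

For files, json.load(file) and json.dump(record, file) work the same way on an open file object.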
Python libraries for data analytics
In data science, file handling plays a critical role in data preprocessing, analysis, visualization, and reporting. Here are
some Python libraries and modules commonly used for file handling in data science:
Pandas: Pandas is one of the most popular data manipulation libraries in Python. It provides data structures such as
DataFrame and Series, making it easy to read, write, and manipulate structured data from various file formats, including CSV,
Excel, SQL databases, and more.
import pandas as pd
df = pd.read_csv('data.csv')  # returns a DataFrame
NumPy: While NumPy primarily focuses on numerical operations, it includes functions for efficiently reading and
writing binary files, especially for numerical data.
Matplotlib and Seaborn: Matplotlib and Seaborn are popular libraries for data visualization in Python. They
are used to create various types of plots and charts from data loaded from files.
HTML: HTML stands for HyperText Markup Language and is used for creating web pages. HTML tables
can be read in Python with the pandas read_html() function, which returns a list of DataFrames, one per table found.
import pandas as pd
tables = pd.read_html('https://www.programiz.com/python-programming/examples/hello-world.html')
print(tables)
File operations
List of File Operations in Python
Method: Description
open(filename, mode): Opens a file and returns a file object. The filename argument is a string that
specifies the name of the file to open, and the mode argument specifies the mode in which to open the
file (e.g. 'r' for read mode, 'w' for write mode).
close(): Closes the file object. Any further read or write operation on the file object will raise a ValueError. It is good
practice to always close files after they have been opened.
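One common way to make sure a file is always closed is the with statement, which calls close() automatically when the block ends. The sketch below is a minimal illustration; the file name example.txt and the temporary folder are assumptions made so the example can run anywhere.

```python
import os
import tempfile

closed_error = None

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # The with statement closes the file when the block ends,
    # even if an exception is raised inside it.
    with open(path, "w") as file:
        file.write("first line\n")

    print(file.closed)

    # Writing to a closed file object raises ValueError.
    try:
        file.write("more")
    except ValueError as error:
        closed_error = str(error)
        print("ValueError:", closed_error)
```

The explicit open() and close() pair works too, but the file stays open if an error occurs before close() is reached.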
read(size=-1): Reads and returns a string from the file. If size is specified, at most size characters are
read. If size is omitted or negative, the entire file is read.
Read mode: Opens the file for reading (default mode). If the file doesn’t exist, a FileNotFoundError is raised.
file = open('example.txt', 'r')
Write mode: Opens the file for writing. If the file exists, it will be truncated. If the file doesn’t exist,
it will be created.
file = open('example.txt', 'w')
Append mode: Opens the file for writing, but appends new data to the end of the file instead of
overwriting existing data. If the file doesn’t exist, it will be created.
file = open('example.txt', 'a')
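The difference between the three modes can be seen by running them one after another on the same file. This is a small sketch under the assumption that example.txt lives in a fresh temporary folder, so it does not exist at the start.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Read mode on a file that does not exist raises FileNotFoundError.
    try:
        open(path, "r")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

    # Write mode creates the file, then truncates it on the second open.
    with open(path, "w") as file:
        file.write("old\n")
    with open(path, "w") as file:
        file.write("new\n")

    # Append mode keeps the existing content and adds to the end.
    with open(path, "a") as file:
        file.write("appended\n")

    with open(path, "r") as file:
        content = file.read()

print(missing_raised)
print(content)
```

The line "old" is gone because the second write-mode open truncated the file, while "appended" was added after "new".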
Calling the .read() method on the file variable returns the content of the file. We can also specify the
number of characters we want to read at once.
file = open("example.txt", "r")
print(file.read(8))  # size: the maximum number of characters to read from the file
print(file.readline())  # readline() returns one line from the file
print(file.readline(5))  # returns at most 5 characters of the next line
file.close()
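Each read call continues from where the previous one stopped. The sketch below shows this with a hypothetical three-line file whose contents are made up for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Create a small file with known (made-up) content.
    with open(path, "w") as file:
        file.write("Hello world\nSecond line\nThird line\n")

    with open(path, "r") as file:
        first_chars = file.read(8)      # first 8 characters
        rest_of_line = file.readline()  # remainder of the first line
        partial = file.readline(5)      # first 5 characters of line two
        remainder = file.readline()     # rest of line two

print(repr(first_chars))   # 'Hello wo'
print(repr(rest_of_line))  # 'rld\n'
print(repr(partial))       # 'Secon'
print(repr(remainder))     # 'd line\n'
```

readline() after read(8) does not return the whole first line, only the part that has not been read yet.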
Thank you
Q and A