
Blueprints of
Mastering Data Science
A Guide to Python's Key Libraries

Author
Khushbu N. Sama
Assistant Professor
Noble University, Junagadh
2025
Preface
In today’s digital age, data has become the most valuable resource, shaping decisions
across business, healthcare, education, governance, and beyond. The ability to
collect, process, and interpret data effectively defines the success of individuals and
organizations alike. This is where Data Science emerges—not just as a discipline, but
as a way of thinking, combining statistical reasoning, computational efficiency, and
domain expertise to unlock actionable insights.

Python, with its simplicity and versatility, has rapidly established itself as the
language of choice for data science. From data manipulation with NumPy and
Pandas, to visualization with Matplotlib and Seaborn, to machine learning with
Scikit-learn, Python provides an integrated ecosystem that empowers learners and
professionals to translate concepts into solutions.

This book, Mastering Data Science: A Guide to Python's Key Libraries, is crafted to offer readers a
clear and practical roadmap for their data science journey. Rather than presenting
Python as a collection of disjointed tools, it introduces Python as a cohesive platform
for solving real-world problems—covering data preprocessing, exploratory analysis,
model building, and deployment.

The guiding principle of this work lies in its focus on mastery through practice. Each
chapter builds upon the last, combining theoretical clarity with hands-on exercises,
case studies, and visual explanations. The goal is not only to teach how to use Python
for data science, but also to develop the why—the reasoning that leads to
meaningful, reliable, and scalable insights.

This book is intended for students beginning their exploration of data science, as well
as professionals aiming to deepen their skills with structured, best-practice
approaches. By blending foundational knowledge with advanced applications, it
equips readers with both confidence and competence to tackle data-driven
challenges in diverse domains.

In a rapidly changing technological landscape, mastering data science with Python is
more than acquiring a skill—it is cultivating a mindset of curiosity, adaptability, and
innovation. It is my hope that this guide will serve as both a foundation and an
inspiration, empowering you to navigate the world of data with clarity, creativity, and
confidence.
MASTERING DATA SCIENCE
A GUIDE TO PYTHON'S KEY LIBRARIES

Table of Contents

Chapter 1 Data Science Fundamentals with Python

Introduction to Data Science



Data Science Process and Skills

Data Science Applications and Challenges

Introduction to Jupyter Notebook

Chapter 2 Data Collection & Data Manipulation

Installing and Using NumPy and Pandas



Reading and Writing CSV Files with Pandas

Data Filtering and Cleaning with Pandas

Web Data Collection using BeautifulSoup

Mini Project: Web Scraping Real-World Data

Chapter 3 Data Visualization


Parts of a figure in data visualization

Types of inputs to plotting functions

Creating helper functions for plotting

Color-mapped data visualization

Introduction to Pyplot

Formatting the style of plots

Plotting with keyword strings

Controlling line properties in plots

Adding and working with text in plots

Case study: Bio-signal plotting using Matplotlib/Pandas library

Chapter 4 Exploratory Data Analysis

Exploratory Data Analysis (EDA) techniques



Tools for data exploration

Identifying and removing outliers using box plots

ETL process logging

Chapter 5 Web Framework: Django


Installation of Django

Creating a virtual environment for Django


Creating your first sample Django project


Django project MVT architecture and MVC pattern


Performing CRUD operations in Django using functions


Basics of using the Django template system


Need for templates in Django


Configuring templates in Django


Template loading in Django
Data Science Fundamentals with Python

Introduction

Data science is the process of using data to find solutions or to predict outcomes
for a problem statement.
Data science can be defined as a blend of mathematics, business acumen, tools,
algorithms, and machine learning techniques, all of which help us in finding out
the hidden insights or patterns from raw data which can be of major use in the
formation of big business decisions. It is used in many industries these days,
ranging from entertainment to education.
Data science is an interdisciplinary field that utilizes scientific methods,
processes, algorithms, and systems to extract knowledge and insights from
structured, semi-structured, and unstructured data. Python, due to its simplicity,
extensive libraries, and strong community support, has emerged as a leading
programming language for data science.

Why Data Science?


Organizations in a variety of industries are depending more and more on data to
inform their decisions as a result of the recent explosion in digital data. Data science
is revolutionizing how companies, governments, and researchers solve problems,
from forecasting consumer behavior to identifying fraud.

Here's an overview of the fundamentals of data science with Python:

1. Python programming basics

• Understanding fundamental Python components: This includes grasping concepts like data types (integers, strings, floating-point numbers), loops, and conditional statements.

• Essential Python data structures: Familiarize yourself with dictionaries, sets, lists, and tuples, understanding when to apply each.

2. Essential Python libraries for data science

• NumPy: The cornerstone for numerical operations in Python, enabling


efficient handling of large, multi-dimensional arrays and matrices, crucial for
numerical computing tasks like linear algebra.
• Pandas: A powerful library for data manipulation and analysis, providing data
structures like DataFrames and Series to handle structured data effectively.
It streamlines tasks like data cleaning, transformation, and aggregation.
• Matplotlib: A versatile library for creating static, interactive, and animated
visualizations, offering fine-grained control over plot customization.
• Seaborn: Built on Matplotlib, Seaborn simplifies the creation of aesthetically
pleasing and informative statistical graphics with high-level functions and
built-in themes.
• Scikit-learn: A comprehensive library providing simple and efficient tools for
data mining and analysis, covering various machine learning algorithms,
including classification, regression, clustering, and dimensionality reduction.

3. Key data science concepts

• Data exploration: Examining datasets to understand their structure, main


features, and potential relationships, including summarizing data with
statistics and visualizing it with charts and graphs.

• Data cleaning: The process of preparing raw data for analysis by handling
missing values, correcting errors, and removing duplicates, ensuring accurate
and reliable results.

• Data visualization: Transforming data into graphical formats to facilitate the


recognition of patterns, trends, and correlations, enhancing insights and
communication.

• Statistics: Providing the mathematical foundation for data analysis, including


basic methods like mean, median, mode, standard deviation, correlation
coefficients, hypothesis testing, and probability distributions.

• Machine learning: Developing algorithms and models that learn from data
and make predictions or decisions without explicit programming.

• Supervised learning: Algorithms learn from labeled data to predict outputs


for new, unseen data.
• Unsupervised learning: Algorithms learn from unlabeled data to find patterns
or structures within the data.

4. Learning and development

• Hands-on experience: Engage with real-world projects and datasets to solidify


your understanding and develop practical skills.
• Structured learning: Consider courses, specializations, or certifications that
cover the fundamentals and applications of data science with Python.
• Community and resources: Leverage the vast Python data science community
for support, guidance, and access to learning resources.
• Continuous learning: Data science is a constantly evolving field. Keep
exploring new libraries, techniques, and advancements to stay ahead.

Data Science Process, Skills, Applications, and Challenges

Here's an overview of the key aspects of the data science process, the skills
required for data scientists, the applications of data science, and the associated
challenges

1. The data science process

The data science process is a structured approach to solving data-related problems and typically involves the following steps:
• Problem definition: Clearly defining the problem or question the data science
project aims to address.
• Data collection: Gathering relevant data from various sources such as
databases, APIs, or web scraping.
• Data cleaning and preparation: NumPy and Pandas are common tools used
for tasks like handling missing values, correcting inconsistencies, and
transforming data into a usable format.
• Exploratory data analysis (EDA): Examining and summarizing the data to
uncover initial patterns, relationships, and outliers.
• Data modeling: Building predictive or descriptive models using statistical
techniques and machine learning algorithms.
• Model evaluation and validation: Assessing the performance and reliability of
the model using appropriate metrics.
• Deployment and monitoring: Implementing the model into a production
environment and monitoring its performance over time.

2. Data scientist skills

Data scientists need a blend of technical and non-technical skills to succeed in


their roles:

Technical skills

• Programming languages: Proficiency in languages like Python, R, and SQL for


data manipulation, analysis, and building models.
• Statistics and mathematics: A strong foundation in probability, statistics, linear
algebra, and calculus for data analysis, model building, and algorithm
understanding.
• Machine learning: Knowledge of various machine learning algorithms, their
applications, and proficiency in using libraries like Scikit-learn, TensorFlow, or
PyTorch.
• Data wrangling: Expertise in cleaning, transforming, and organizing raw data
for analysis.
• Data visualization: Ability to create clear and compelling visualizations using
tools like Matplotlib, Seaborn, Tableau, or Power BI.
• Big data technologies: Familiarity with tools and frameworks for handling and
processing large datasets, such as Hadoop and Spark.
• Cloud computing: Knowledge of cloud platforms like AWS, Azure, or Google
Cloud for storing data and deploying models.

Soft skills

• Communication and storytelling: Effectively presenting complex findings to


both technical and non-technical stakeholders, often through data storytelling.
• Problem-solving: The ability to approach problems logically, define them
clearly, and develop innovative solutions.
• Business acumen: Understanding the business context, identifying key
performance indicators, and aligning data science solutions with
organizational goals.
• Curiosity and continuous learning: A desire to explore data, ask questions, and
stay updated with the latest trends and technologies in the field.
• Data ethics: Awareness of ethical considerations like data privacy, bias, and
fairness, ensuring responsible and transparent data practices.

3. Applications of data science

Data science applications are transforming various industries, including:

• Healthcare: Early disease detection, personalized treatment plans, and


optimized hospital operations.
• Finance: Fraud detection, risk management, credit scoring, and market trend
prediction.

• Retail and E-commerce: Customer segmentation, personalized
recommendations, inventory optimization, and sales forecasting.
• Manufacturing and Logistics: Predictive maintenance, supply chain
optimization, and demand forecasting.
• Marketing and Advertising: Targeted campaigns, customer analytics, and
advertising spend optimization.

4. Challenges in data science

Despite its potential, data science faces various challenges:

• Data quality and accessibility: Dealing with incomplete, inconsistent, or


inaccurate data and integrating data from diverse sources.
• Talent and skill gap: Finding and retaining skilled professionals with expertise
in both technical and soft skills.
• Integration with existing systems: Integrating data science solutions with
existing organizational infrastructure and workflows.
• Privacy and ethical concerns: Ensuring data privacy, mitigating bias in
algorithms, and maintaining transparency and accountability in data practices.
• Cost and resource management: Balancing investment in technology,
infrastructure, and personnel with the potential returns from data science
initiatives.
By addressing these challenges and continuously developing relevant skills, data scientists can unlock the full potential of data and drive innovation across industries.

Why Python for Data Science?

Python is a widely adopted programming language in data science due to its


simplicity, extensive libraries, and large community support. Its readability and
ease of learning make it accessible for beginners, while its powerful libraries like
NumPy, Pandas, Scikit-learn, TensorFlow, and PyTorch provide robust tools for data
manipulation, analysis, machine learning, and deep learning. Python's versatility
allows it to be used across various stages of the data science workflow, from data
collection and cleaning to model building, evaluation, and deployment.

Data Science Life Cycle


Data Science Lifecycle revolves around the use of machine learning and different
analytical strategies to produce insights and predictions from information in order
to acquire a commercial enterprise objective.

The data science lifecycle is a structured, iterative process that guides data science
projects from their initial conception to the deployment and ongoing maintenance
of a solution. It ensures that data-driven insights are effectively leveraged to solve
business problems and drive value for organizations.

1. Problem Statement: Clearly defining the business objective or question the
project aims to solve, establishing the foundation and success metrics.

This is the foundation of any data science project. It defines what problem you are
solving and why it matters — either from a business, academic, or technical
perspective.

Examples:

• Predict student dropout based on attendance and grades


• Forecast sales for next quarter
• Detect spam emails

Tools/Skills:

• Domain understanding
• Discussions with stakeholders
• Framing the problem into a machine learning task: classification,
regression, clustering, etc.

2. Data Collection: Gathering raw data from various sources. This could include
structured data (CSV, databases), unstructured data (text, images), or real-time
data (APIs, web scraping). The goal is to obtain relevant, sufficient, and accessible
data for analysis.

Example in Python:
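The original example is not reproduced here; the following is a minimal sketch of reading a CSV file and pulling JSON from a web API, where the file name students.csv and the API URL are hypothetical.

import pandas as pd
import requests

# Load structured data from a CSV file (hypothetical file name)
students = pd.read_csv("students.csv")
print(students.head())

# Collect data from a web API (hypothetical endpoint)
response = requests.get("https://api.example.com/attendance", timeout=10)
records = response.json()
print(len(records), "records collected")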

3. Data Cleaning: Refining raw data by identifying and addressing errors,
inconsistencies, or missing values to ensure its quality and suitability for analysis
and modeling. Raw data is often messy, so you need to clean and prepare it for
modeling:

• Handle missing values: fill, drop, or impute
• Fix data types: e.g., converting dates
• Remove duplicates
• Normalize and scale numerical features
• Encode categorical variables

Example:
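A minimal sketch of these cleaning steps with pandas; the file and column names (students.csv, grade, joined, department) are assumptions for illustration.

import pandas as pd

df = pd.read_csv("students.csv")                      # hypothetical raw dataset
df = df.drop_duplicates()                             # remove duplicates
df["grade"] = df["grade"].fillna(df["grade"].mean())  # impute missing values
df["joined"] = pd.to_datetime(df["joined"])           # fix data types (convert dates)
df["grade_scaled"] = (df["grade"] - df["grade"].min()) / (
    df["grade"].max() - df["grade"].min())            # scale to the 0-1 range
df = pd.get_dummies(df, columns=["department"])       # encode categorical variables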

4. Data Modelling: Selecting, building, training, and evaluating machine learning
or statistical models using the cleaned data to derive insights or make
predictions.

5. Optimization & Deployment: After modeling, the solution must be optimized
for speed, accuracy, and resource usage, then deployed for real-world use. This
means preparing the validated model for real-world application, integrating it into
existing systems, and implementing monitoring for ongoing performance and
maintenance. For example, the trained model can be deployed using a Flask API
on a web server so users can input house details and get predicted prices.

Example:
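A hedged sketch of the modeling and deployment stages using scikit-learn and Flask, assuming a hypothetical houses.csv with area and price columns.

import pandas as pd
from flask import Flask, request, jsonify
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Train a simple model on a hypothetical dataset of house areas and prices
data = pd.read_csv("houses.csv")
X, y = data[["area"]], data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))

# Expose the model through a minimal Flask endpoint
app = Flask(__name__)

@app.route("/predict")
def predict():
    area = float(request.args.get("area", 0))
    price = model.predict(pd.DataFrame([[area]], columns=["area"]))[0]
    return jsonify({"predicted_price": float(price)})

if __name__ == "__main__":
    app.run()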

Introduction to Jupyter Notebook

Jupyter Notebook is an open-source web application that allows you to create
and share documents containing live code, equations, visualizations, and
narrative text. It is a popular tool among data scientists, researchers, and
educators for interactive data exploration, analysis, and prototyping. The name
"Jupyter" is derived from the three core programming languages it originally
supported: Julia, Python, and R. Since it now supports more than 40 programming
languages, it is a flexible option for a range of computational jobs. Because the
notebook interface is web-based, users interact with it through their web browsers.

The Jupyter Notebook is made up of the three components listed below.

1. The notebook web application

It is an interactive web application that allows you to write and run code.

2. Kernels

The independent processes launched by the notebook web application are


known as kernels, and they are used to execute user code in the specified
language and return results to the notebook web application.

3. Notebook documents

All content viewable in the notebook web application, including calculation


inputs and outputs, text, mathematical equations, graphs, and photos, is
represented in the notebook document.

Installation:

Jupyter Notebook can be installed using pip, Python's package installer:
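For example, from a terminal (this installs the classic notebook package; pip install jupyter also works):

pip install notebook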

Basic Usage:

To launch Jupyter Notebook, open a terminal or command prompt and run:
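For example:

jupyter notebook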

This will open a new tab in your web browser, displaying the Jupyter Notebook
interface. You can then navigate to a directory and create a new notebook or
open an existing one.

Markdown Cells:

Jupyter Notebook supports Markdown for creating formatted text, headings,


lists, and links within the notebook. This allows for clear documentation and
explanations alongside your code. To create a Markdown cell, change the cell
type from "Code" to "Markdown" using the dropdown menu in the toolbar.

Code Cells:

Code cells are where you write and execute your programming code. The default
kernel for Jupyter Notebook runs Python code. You can write your Python code
in a code cell and execute it by pressing Shift + Enter or clicking the "Run" button
in the toolbar. The output of the code will be displayed directly below the cell.

Kernels: A kernel is a computational engine that executes the code in your


notebook. Jupyter supports a wide variety of kernels for different programming
languages, including Python, R, Julia, and many more.

Raw NBConvert Cell

A Raw NBConvert cell is a special type of cell that is not meant to be rendered
or executed within the Jupyter Notebook interface itself. Instead, it's used to

provide raw output that will be processed by nbconvert, the tool used to convert
Jupyter Notebooks into other formats like HTML, PDF, or LaTeX.

Heading Cell: A separate heading cell type is no longer supported by the Jupyter
Notebook. If you choose the heading option from the drop-down menu, a message
pops up suggesting that you write headings in Markdown cells instead.

Essential Keyboard Shortcuts

Jupyter Notebook has many useful keyboard shortcuts that will speed up your
workflow. You can view all of them by going to Help > Keyboard Shortcuts in the
menu. Here are a few to get you started:

• Enter: Enter "Edit" mode to type in a cell.


• Esc: Enter "Command" mode to manipulate cells (e.g., move, delete, copy).
• A (in Command mode): Insert a new cell Above the current cell.
• B (in Command mode): Insert a new cell Below the current cell.
• M (in Command mode): Change the cell type to Markdown.
• Y (in Command mode): Change the cell type to code.
• D, D (in Command mode, press D twice): Delete the selected cell.
• Shift + Enter: Run the current cell and select the cell below it.
• Ctrl + Enter: Run the current cell and stay in the same cell.

|Getting Started with Jupyter Notebook

• Homepage: Browse to the folder in which you would like to create your first notebook, click the "New" drop-down button in the top-right and select "Python 3 (ipykernel)":

Your first Jupyter Notebook will open in a new tab — each notebook uses its own
tab because you can open multiple notebooks simultaneously.

If you switch back to the dashboard, you will see the new file Untitled.ipynb, and
you should see some green text that tells you your notebook is running.

Each .ipynb file is one notebook, so each time you create a new notebook, a new
.ipynb file will be created.

Naming

You will notice that at the top of the page is the word Untitled. This is the title
for the page and the name of your Notebook. Since that isn’t a very descriptive
name, let’s change it!

Just move your mouse over the word Untitled and click on the text. You should
now see an in-browser dialog titled Rename Notebook. Let’s rename this one
to Hello Jupyter:

Running Cells

A Notebook’s cell defaults to using code whenever you first create one, and that
cell uses the kernel that you chose when you started your Notebook.

In this case, you started yours with Python 3 as your kernel, so that means you
can write Python code in your code cells. Since your initial Notebook has only
one empty cell in it, the Notebook can’t really do anything.

Thus, to verify that everything is working as it should, you can add some Python
code to the cell and try running its contents.

Let’s try adding the following code to that cell:
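The exact snippet from the book is not shown here; any small Python statement works, for instance:

print("Hello Jupyter!")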

Running a cell means that you will execute the cell's contents. To execute a cell,
you can just select the cell and click the Run button that is in the row of buttons
along the top (towards the middle). If you prefer using your keyboard, you can
just press Shift + Enter.

If you have multiple cells in your Notebook, and you run the cells in order, you
can share your variables and imports across cells. This makes it easy to separate
out your code into logical chunks without needing to reimport libraries or
recreate variables or functions in every cell.

The Jupyter Notebook interface has a standard menu bar at the top, which
provides access to all the functions for managing, editing, and running your
notebook.

Here is a breakdown of the main menus and their primary functions:

1. File Menu

This menu is for managing the notebook document itself.

• New Notebook: Creates a new notebook in a new browser tab.


• Open: Opens a new window with the file browser to open an existing notebook.
• Make a Copy: Creates a duplicate of the current notebook in the same directory.
• Save and Checkpoint: Saves the current state of your notebook and creates a
"checkpoint," a version you can revert to.
• Revert to Checkpoint: Allows you to restore the notebook to a previously saved
checkpoint.
• Rename: Changes the name of the notebook file.
• Download as: Exports the notebook into various formats, such as .py (Python
script), .html, .pdf, .md (Markdown), and .ipynb (another notebook file).
• Close and Halt: Closes the notebook and shuts down its associated kernel.

2. Edit Menu

This menu contains functions for manipulating the content of the cells.

• Cut, Copy, Paste Cells: Standard actions for moving and duplicating cells.
• Delete Cells: Removes the selected cell(s).
• Split Cell: Splits a cell into two at the cursor's position.
• Merge Cell Above/Below: Combines the selected cell with the one above or
below it.

3. View Menu

This menu controls how the notebook and its cells are displayed.

• Toggle Header/Toolbar: Hides or shows the notebook's header and main


toolbar.
• Toggle Line Numbers: Displays or hides line numbers in the code cells.
• Cell Toolbar: Provides options to show a specialized toolbar for each cell, such
as for rendering a slide show.

4. Insert Menu

This is for adding new cells to your notebook.

• Insert Cell Above: Creates a new empty cell above the currently selected cell.
• Insert Cell Below: Creates a new empty cell below the currently selected cell.

5. Cell Menu

This menu offers commands for running and manipulating cells.

• Run Cells: Executes the selected cell(s).


• Run All: Executes every cell in the entire notebook from top to bottom.
• Run All Above: Executes all cells located above the current one.
• Run All Below: Executes all cells from the current one to the end of the
notebook.
• Cell Type: A submenu to change a cell's type to Code, Markdown, or Raw
NBConvert.
• Current Outputs: Options to clear or toggle the visibility of the output from the
current cell.
• All Output: Options to clear or toggle the output of all cells in the notebook.

6. Kernel Menu

The kernel is the computational engine that runs your code. This menu lets you
manage it.

• Interrupt: Stops the execution of the current cell. This is useful if a cell is taking
too long to run or is stuck in an infinite loop.
• Restart: Restarts the kernel. This clears all variables and memory, giving you a
fresh start without closing the notebook.
• Restart & Clear Output: Restarts the kernel and also clears all the output from
every cell in the notebook.
• Restart & Run All: Restarts the kernel and then runs every cell in the notebook
from top to bottom.
• Shutdown: Stops the kernel completely, freeing up resources. The notebook
itself remains open, but you can't run any code.
• Change Kernel: Allows you to switch to a different kernel if you have others
installed (e.g., R, Julia, or another Python environment).

7. Widgets Menu

This menu is for managing interactive widgets, which are special objects that can
be used to create interactive controls in a notebook.

• Save Notebook Widget State: Saves the current state of all widgets.
• Clear Widget State: Resets all widgets to their default state.

8. Help Menu

This menu provides quick access to documentation and information.


• User Interface Tour: A brief, interactive tour of the notebook interface.
• Keyboard Shortcuts: Displays a pop-up window with all available shortcuts.

2
Data Collection & Data Manipulation

Introduction

Data collection is the process of gathering raw data from various sources for
further processing and analysis. In Python, data can be collected manually or
automatically using libraries and tools.

Common data sources and tools in Python

• Files: CSV, Excel, JSON, and text files are readily handled by Python, especially
with the Pandas library.

• Databases: Libraries like sqlite3, psycopg2, and SQLAlchemy allow connecting to
and querying databases (e.g., SQLite, PostgreSQL) to retrieve structured data.

• APIs (Application Programming Interfaces): APIs offer a programmatic way to


access data from various online services (e.g., social media, financial data,
weather data). The requests library is a popular choice for making HTTP requests
to APIs and parsing the JSON or XML responses.
• Web Scraping: Libraries like Beautiful Soup and Scrapy can be used to extract
data from websites that lack dedicated APIs.

Best practices for data collection

• Understand the data requirements and identify appropriate sources.


• Be mindful of data privacy and security regulations.
• Document the data source, method of collection, and any transformations
applied.
• Handle errors gracefully using try-except blocks (a short sketch follows this list).
• Respect API rate limits and terms of service.
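As a small illustration of the last two practices, here is a hedged sketch of calling a web API defensively; the URL is hypothetical.

import requests

url = "https://api.example.com/data"   # hypothetical API endpoint
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()        # raise an error for 4xx/5xx responses
    data = response.json()
except requests.exceptions.RequestException as err:
    print("Request failed:", err)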

Data manipulation

Data manipulation involves organizing, restructuring, and transforming collected


data to make it suitable for analysis and extracting insights.

Python libraries for data manipulation

Pandas: Pandas is the cornerstone of data manipulation in Python. It provides


intuitive data structures like DataFrames (similar to spreadsheets) and Series
(single columns).

NumPy: NumPy is foundational for numerical computations and provides


efficient multi-dimensional array objects. Pandas is built upon NumPy, leveraging
its speed and functionality.

Reading and Writing Data: Pandas simplifies importing data from various formats
(CSV, Excel) and exporting processed data.

Exploring Data: Functions like head(), tail(), info(), and describe() allow for initial
data inspection and summarizing key characteristics.

Handling Missing Values: Techniques like dropna() (dropping rows/columns with


missing values) and fillna() (filling missing values with a specified value or
strategy) are used to address incomplete data.

Data Selection and Filtering: Data can be selected and filtered based on labels
(.loc), integer positions (.iloc), or boolean conditions.

Data Transformation: Applying functions to columns, changing data types, and


renaming columns are common transformations.

Data Aggregation and Grouping: The groupby() function allows grouping data
based on certain criteria and applying aggregate functions
(e.g., mean, sum, count, min, max) to the groups.

Merging and Joining DataFrames: merge() and join() are used to combine data
from multiple DataFrames based on shared columns or indices.

Concatenation: concat() allows vertically or horizontally stacking DataFrames.

Sorting Data: sort_values() arranges data in ascending or descending order based


on one or more columns.
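A short sketch that strings several of these operations together; the file name sales.csv and the columns amount and region are assumptions for illustration.

import pandas as pd

df = pd.read_csv("sales.csv")                        # hypothetical dataset
print(df.head())                                     # initial inspection
print(df.describe())                                 # summary statistics

df = df.dropna(subset=["amount"])                    # drop rows with missing amounts
high_value = df[df["amount"] > 1000]                 # boolean filtering
by_region = df.groupby("region")["amount"].sum()     # grouping and aggregation
df = df.sort_values("amount", ascending=False)       # sorting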

Introduction to NumPy Arrays
The most important object defined in NumPy is an N-dimensional array type
called ndarray.

It describes a collection of items of the same type, and items in the collection can
be accessed using a zero-based index. The main data structure in the NumPy
library is the NumPy array, which is an extremely fast and memory-efficient data
structure. The NumPy array is much faster than the common Python list and
provides vectorized matrix operations. In this chapter, you will see the different
data types that you can store in a NumPy array, the different ways to create
NumPy arrays, how you can access items in a NumPy array, and how to add or
remove items from a NumPy array.

NumPy Data Types


The NumPy library supports all the default Python data types in addition to some
of its intrinsic data types. This means that the default Python data types, e.g.,
strings, integers, floats, Booleans, and complex data types, can be stored in
NumPy arrays. You can check the data type in a NumPy array using the
dtype property. You will see the different ways of creating NumPy arrays in detail
in the next section. Here, we will show you the array() function and then print the
type of the NumPy array using the dtype property. Here is an example:
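A minimal version of the script described; the concrete values are assumed, and on some platforms the default integer type is int64 rather than int32.

import numpy as np

my_array = np.array([10, 20, 30, 40, 50, 60])
print(my_array)
print(my_array.dtype)      # e.g., int32 (4 bytes) on Windows builds, int64 on many others
print(my_array.itemsize)   # size of each item in bytes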

The script above defines a NumPy array with six integers. Next, the array type is
displayed via the dtype attribute. Finally, the size of each item in the array (in
bytes) is displayed via the itemsize attribute. The output prints the array
and the type of the items in the array, i.e., int32 (integer type), followed by the
size of each item in the array, which is 4 bytes (32 bits).

The Python NumPy library supports the following data types including the default
Python types.

• i – integer

• b – boolean

• u – unsigned integer

• f – float

• c – complex float

• m – timedelta

• M – datetime

• O – object

• S – string

• U – Unicode string

• V – fixed chunk of memory for other type ( void )

Let's see another example of how NumPy stores text. The following script creates
a NumPy array with three text items and displays the data type and size of each
item.

Script 2:
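A sketch of Script 2 with assumed items, chosen so that the longest string has six characters (and consistent with later examples that use "Green" and "Yellow").

import numpy as np

my_array = np.array(["Red", "Green", "Orange"])   # assumed items; the longest has 6 characters
print(my_array)
print(my_array.dtype)      # <U6 – Unicode string of up to 6 characters
print(my_array.itemsize)   # size of each item in bytes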

The output below shows that NumPy stores text in the form of Unicode string
data type denoted by U. Here, the digit 6 represents the item with the most
number of characters.

Though the NumPy array is intelligent enough to guess the data type of items
stored in it, this is not always the case. For instance, in the following script, you
store some dates in a NumPy array. Since the dates are stored in the form of texts
(enclosed in double quotations), by default, the NumPy array treats the dates as
text. Hence, if you print the data type of the items stored, you will see that it will
be a Unicode string (U10).

You can convert data types in the NumPy array to other data types via the astype()
method. But first, you need to specify the target data type in the astype() method.
For instance, the following script converts the array you created in the previous
script to the datetime data type. You can see that “M” is passed as a parameter
value to the astype() function. “M” stands for the datetime data type as
aforementioned.

In addition to converting arrays from one type to another, you can also specify
the data type for a NumPy array at the time of definition via the dtype parameter.
For instance, in the following script, you specify "M" as the value of the dtype
parameter, which tells the Python interpreter that the items must be stored as
datetime values.
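A combined sketch of both approaches with assumed dates; the exact behavior of the generic "M" code can vary slightly across NumPy versions.

import numpy as np

dates = np.array(["2020-07-27", "2020-07-28", "2020-07-29"])
print(dates.dtype)                 # <U10 – stored as text by default

dates_dt = dates.astype("M")       # convert the text to the datetime data type
print(dates_dt.dtype)              # datetime64[D]

dates2 = np.array(["2020-07-27", "2020-07-28"], dtype="M")   # datetime specified at definition
print(dates2.dtype)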

Creating NumPy Arrays

Depending on the type of data you need inside your NumPy array, different
methods can be used to create a NumPy array.

Using Array Method

To create a NumPy array, you can pass a list to the array() method of the NumPy
module, as shown below:
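For instance, assuming a small list of integers:

import numpy as np

my_array = np.array([10, 20, 30, 40, 50])
print(my_array)
print(type(my_array))   # <class 'numpy.ndarray'>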

You can also create a multi-dimensional NumPy array. To do so, you need to
create a list of lists where each internal list corresponds to a row in a two-dimensional
array. Here is an example of how to create a two-dimensional array
using the array() method.
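A sketch with an assumed 3 x 3 list of lists:

import numpy as np

my_2d_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(my_2d_array)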

With the arange() method, you can create a NumPy array that contains a range
of integers. The first parameter to the arange() method is the lower bound, and
the second parameter is the upper bound. The lower bound is included in the
array. However, the upper bound is not included. The following script creates a
NumPy array with integers 5 to 10.

You can also specify the step as a third parameter in the arange() function. A step
defines the distance between two consecutive points in the array. The following
script creates a NumPy array from 5 to 11 with a step size of 2.
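A sketch of both calls; note that the upper bound is excluded, so 12 is used to include 11.

import numpy as np

print(np.arange(5, 11))      # [ 5  6  7  8  9 10] – upper bound excluded
print(np.arange(5, 12, 2))   # [ 5  7  9 11] – step of 2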

The ones() method can be used to create a NumPy array of all ones. Here is an
example.

You can create a two-dimensional array of all ones by passing the number of rows
and columns as a shape tuple, e.g., (rows, columns), to the ones() method, as shown
below:
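A sketch covering the one-dimensional and two-dimensional cases (sizes are illustrative):

import numpy as np

print(np.ones(5))        # one-dimensional array of five ones
print(np.ones((3, 4)))   # 3 x 4 two-dimensional array of ones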

Using Zeros Method

The zeros() method can be used to create a NumPy array of all zeros. Here is an
example

You can create a two-dimensional array of all zeros by passing the number of rows
and columns as a shape tuple to the zeros() method, as shown
below:
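Similarly, for zeros (sizes are illustrative):

import numpy as np

print(np.zeros(5))        # one-dimensional array of five zeros
print(np.zeros((3, 4)))   # 3 x 4 two-dimensional array of zeros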

Using Eye Method

The eye() method is used to create an identity matrix in the form of a two
dimensional NumPy array. An identity matrix contains 1s along the diagonal, while
the rest of the elements are 0 in the array.
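For instance, with an assumed size of 4:

import numpy as np

print(np.eye(4))   # 4 x 4 identity matrix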

Using Random Method
The random.rand() function from the NumPy module can be used to create a
NumPy array with uniform distribution.

The random.randn() function from the NumPy module can be used to create a
NumPy array with normal distribution, as shown in the following example.

Finally, the random.randint() function from the NumPy module can be used to
create a NumPy array with random integers within a certain range. The first
parameter to the randint() function specifies the lower bound, the second
parameter specifies the upper bound, and the last parameter specifies the
number of random integers to generate within the range. The following
example generates five random integers between 5 and 50.
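A sketch of the three functions with assumed sizes:

import numpy as np

print(np.random.rand(5))             # 5 samples from a uniform distribution over [0, 1)
print(np.random.randn(5))            # 5 samples from a standard normal distribution
print(np.random.randint(5, 50, 5))   # 5 random integers between 5 (inclusive) and 50 (exclusive)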

Printing NumPy Arrays:

Depending on the dimensions, there are various ways to display the NumPy
arrays. The simplest way to print a NumPy array is to pass the array to the print
method, as you have already seen in the previous section. An example is given
below:
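For instance, with an assumed array of six integers:

import numpy as np

my_array = np.array([10, 12, 14, 16, 20, 25])
print(my_array)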

You can also use loops to display items in a NumPy array. It is a good idea to know
the dimensions of a NumPy array before printing the array on the console. To see
the dimensions of a NumPy array, you can use the ndim attribute, which prints
the number of dimensions for a NumPy array. To see the shape of your NumPy
array, you can use the shape attribute.
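For the assumed six-item array:

import numpy as np

my_array = np.array([10, 12, 14, 16, 20, 25])
print(my_array.ndim)    # 1
print(my_array.shape)   # (6,)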

The script shows that our array is one-dimensional. The shape is (6,), which means
our array is a vector with 6 items.

To print items in a one-dimensional NumPy array, you can use a single foreach
loop, as shown below:
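For example:

import numpy as np

my_array = np.array([10, 12, 14, 16, 20, 25])
for item in my_array:
    print(item)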

Now, let’s see another example of how you can use the foreach loop to print items in
a two-dimensional NumPy array. The following script creates a two-dimensional
NumPy array with four rows and five columns. The array contains random integers
between 1 and 10. The array is then printed on the console
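A sketch of the described script (values will differ on each run since no seed is set):

import numpy as np

my_array = np.random.randint(1, 11, 20).reshape(4, 5)   # random integers between 1 and 10
print(my_array)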

In the output below, you can see your newly created array.

Let’s now try to see the number of dimensions and shape of our NumPy array.
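For the same kind of 4 x 5 array:

import numpy as np

my_array = np.random.randint(1, 11, 20).reshape(4, 5)
print(my_array.ndim)    # 2
print(my_array.shape)   # (4, 5)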

The output below shows that our array has two dimensions and the shape of the
array is (4,5), which refers to four rows and five columns.

To traverse through items in a two-dimensional NumPy array, you need two


foreach loops: one for each row and the other for each column in the row. Let’s
first use one for loop to print items in our two-dimensional NumPy array.
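For example:

import numpy as np

my_array = np.random.randint(1, 11, 20).reshape(4, 5)
for row in my_array:
    print(row)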

The output shows all the rows from our two-dimensional NumPy array.

To traverse through all the items in the two-dimensional array, you can use
nested foreach loops as follows:
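For example:

import numpy as np

my_array = np.random.randint(1, 11, 20).reshape(4, 5)
for row in my_array:
    for item in row:
        print(item)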

In the next section, you will see how to add, remove, and sort elements in a
NumPy array.

Adding Items in a NumPy Array

To add items to a NumPy array, you can use the append() method from the
NumPy module. You need to pass the original array and the item that you
want to append to the array to the append() method. The append() method
returns a new array that contains the newly added item appended to the end of the
original array. The following script adds a text item "Yellow" to an existing array
with three items.
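A sketch assuming the three original items are "Red", "Green", and "Orange":

import numpy as np

my_array = np.array(["Red", "Green", "Orange"])
extended = np.append(my_array, "Yellow")
print(extended)   # ['Red' 'Green' 'Orange' 'Yellow']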

In addition to adding one item at a time, you can also append an array of items to
an existing array. The method remains similar to appending a single item. You just
have to pass the existing array and the new array to the append() method, which
returns a concatenated array where items from the new array are appended at
the end of the original array.
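For example, appending two assumed items:

import numpy as np

my_array = np.array(["Red", "Green", "Orange"])
new_items = np.array(["Yellow", "Purple"])
print(np.append(my_array, new_items))   # new items appended at the end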

To add items in a two-dimensional NumPy array, you have to specify whether you
want to add the new item as a row or as a column. To do so, you can take the help
of the axis attribute of the append method.

Let’s first create a 3 x 3 array of all zeros.

The output shows a 3 x 3 array of all zeros.

To add a new row to the above 3 x 3 array, you need to pass the original array,
the new array in the form of a row vector, and the axis attribute to the append()
method. To add the new array as a row, you need to set 0 as the value
for the axis attribute.

Here is an example script.
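A sketch with an assumed row vector [1, 2, 3]:

import numpy as np

zeros_array = np.zeros((3, 3))
new_row = [[1, 2, 3]]                             # new array in the form of a row vector
print(np.append(zeros_array, new_row, axis=0))    # appended as a fourth row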

In the output below, you can see that a new row has been appended to our
original 3 x 3 array of all zeros.

To append a new array as a column in the existing 2-D array, you need to set the
value of the axis attribute to 1.
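For example, with an assumed column vector:

import numpy as np

zeros_array = np.zeros((3, 3))
new_col = [[1], [2], [3]]                          # new array in the form of a column vector
print(np.append(zeros_array, new_col, axis=1))     # appended as a fourth column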

Removing Items from a NumPy Array

To delete an item from an array, you may use the delete() method. You need to
pass the existing array and the index of the item to be deleted to the delete()
method. The following script deletes an item at index 1 (second item) from the
my_array array.
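A sketch consistent with the output described below:

import numpy as np

my_array = np.array(["Red", "Green", "Orange"])
print(np.delete(my_array, 1))   # ['Red' 'Orange'] – "Green" removed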

The output shows that the item at index 1, i.e., “Green,” is deleted.

If you want to delete multiple items from an array, you can pass the item indexes
in the form of a list to the delete() method. For example, the following script
deletes the items at index 1 and 2 from the NumPy array named my_array.
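For example:

import numpy as np

my_array = np.array(["Red", "Green", "Orange"])
print(np.delete(my_array, [1, 2]))   # ['Red']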

You can delete a row or column from a 2-D array using the delete() method.
However, just as you did with the append() method for adding items, you need to
specify whether you want to delete a row or column using the axis attribute. The
following script creates an integer array with four rows and five columns. Next,
the delete() method is used to delete the row at index 1 (second row). Notice here
that to delete a row, the value of the axis attribute is set to 0.
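A sketch of the described script:

import numpy as np

my_array = np.random.randint(1, 11, 20).reshape(4, 5)
print(np.delete(my_array, 1, axis=0))   # second row removed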

The output shows that the second row is deleted from the input 2-D array.

Finally, to delete a column, you can set the value of the axis attribute to 1, as
shown below:
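For example:

import numpy as np

my_array = np.random.randint(1, 11, 20).reshape(4, 5)
print(np.delete(my_array, 1, axis=1))   # second column removed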

Aggregations
The min() function will return the minimum value from the ndarray. There are two
ways in which we can use the min function; examples of both ways are given below.

import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Min way1 = ',a.min())
print('Min way2 = ',np.min(a))

OutPut:
Min way1 = 1
Min way2 = 1
The max() function will return the maximum value from the ndarray. There are two
ways in which we can use the max function; examples of both ways are given below.

import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Max way1 = ',a.max())
print('Max way2 = ',np.max(a))

Output:
Max way1 = 11
Max way2 = 11

NumPy supports many aggregation functions, such as min, max, argmin, argmax, sum,
mean, std, etc.

import numpy as np
l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
a = np.array(l)
print('Min = ',a.min())
print('ArgMin = ',a.argmin())
print('Max = ',a.max())
print('ArgMax = ',a.argmax())
print('Sum = ',a.sum())
print('Mean = ',a.mean())
print('Std = ',a.std())

Output :

Min = 1
ArgMin = 3
Max = 11
ArgMax = 8
Sum = 122
Mean = 5.304347826086956
Std = 3.042235771223635

Using axis argument with aggregate functions.

When we apply an aggregate function to a multidimensional ndarray without an
axis argument, it is applied to all elements across every dimension (axis).

import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum = ',array2d.sum())

OutPut :

sum = 45

If we want to get the sum of rows or columns, we can use the axis argument with the
aggregate functions.

import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum (cols)= ',array2d.sum(axis=0)) #Vertical
print('sum (rows)= ',array2d.sum(axis=1)) #Horizontal

OutPut:

sum (cols) = [12 15 18]


sum (rows) = [6 15 24]

There are two ways in which you can access an element of a multi-dimensional array;
an example of both methods is given below.

import numpy as np
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print('double = ',arr[2][1]) # double bracket notation
print('single = ',arr[2,1]) # single bracket notation

OutPut:
double = h
single = h

Both methods are valid and give exactly the same answer, but single bracket
notation is recommended: with double bracket notation, NumPy first creates a
temporary sub-array of the third row and then fetches the second column from it.

Single bracket notation is also easier to read and write while programming.

Slicing ndarray
Slicing in python means taking elements from one given index to another given
index.

Similar to Python lists, we can use the same syntax, array[start:end:step], to slice an ndarray.

 Default start is 0

 Default end is length of the array

 Default step is 1

import numpy as np
arr = np.array(['a','b','c','d','e','f','g','h'])
print(arr[2:5])
print(arr[:5])
print(arr[5:])
print(arr[2:7:2])
print(arr[::-1])

OutPut:

['c' 'd' 'e']


['a' 'b' 'c' 'd' 'e']
['f' 'g' 'h']
['c' 'e' 'g']
['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']

Example: consider a 5 x 5 array a whose rows are labelled R-0 to R-4 and whose columns are labelled C-0 to C-4. As an exercise, work out what each of the following expressions selects:

a[2][3]
a[2,3]
a[2]
a[0:2]
a[0:2:2]
a[::-1]
a[1:3,1:3]
a[3:,:3]
a[:,::-1]

Slicing a multi-dimensional array works the same as slicing a single-dimensional array, with the help of the single bracket notation we learned earlier. Let's see an example.

arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print(arr[0:2 , 0:2]) # first two rows and cols
print(arr[::-1])      # reversed rows
print(arr[: , ::-1])  # reversed cols
print(arr[::-1,::-1]) # complete reverse

OutPut:
[['a' 'b']
['d' 'e']]
[['g' 'h' 'i']
['d' 'e' 'f']
['a' 'b' 'c']]
[['c' 'b' 'a']
['f' 'e' 'd']
['i' 'h' 'g']]
[['i' 'h' 'g']
['f' 'e' 'd']
['c' 'b' 'a']]

Warning : Array Slicing is mutable !

When we slice an array and apply an operation to the slice, it also changes the
original array, because slicing does not create a copy of the array.

import numpy as np
arr = np.array([1,2,3,4,5])
arrsliced = arr[0:3]
arrsliced[:] = 2 # Broadcasting
print('Original Array = ', arr)
print('Sliced Array = ',arrsliced)

OutPut:
Original Array = [2 2 2 4 5]
Sliced Array = [2 2 2]
NumPy Arithmetic Operations:

import numpy as np
arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])
arradd1 = arr1 + 2 # addition of matrix with scalar
arradd2 = arr1 + arr2 # addition of two matrices
print('Addition Scalar = ', arradd1)
print('Addition Matrix = ', arradd2)
arrsub1 = arr1 - 2 # subtraction of matrix with scalar
arrsub2 = arr1 - arr2 # subtraction of two matrices
print('Subtraction Scalar = ', arrsub1)
print('Subtraction Matrix = ', arrsub2)
arrdiv1 = arr1 / 2 # division of matrix with scalar
arrdiv2 = arr1 / arr2 # division of two matrices
print('Division Scalar = ', arrdiv1)
print('Division Matrix = ', arrdiv2)

OUTPUT :
Addition Scalar = [[3 4 5]
[3 4 5]
[3 4 5]]
Addition Matrix = [[5 7 9]
[5 7 9]
[5 7 9]]
Subtraction Scalar = [[-1 0 1]
[-1 0 1]
[-1 0 1]]
Subtraction Matrix = [[-3 -3 -3]
[-3 -3 -3]
[-3 -3 -3]]
Division Scalar = [[0.5 1. 1.5]
[0.5 1. 1.5]
[0.5 1. 1.5]]
Division Matrix = [[0.25 0.4 0.5 ]
[0.25 0.4 0.5 ]
[0.25 0.4 0.5 ]]

import numpy as np
arrmul1 = arr1 * 2 # multiply matrix with scalar
arrmul2 = arr1 * arr2 # multiply two matrices (element-wise)
print('Multiply Scalar = ', arrmul1)
# Note: this is element-wise multiplication, not matrix multiplication
print('Multiply Matrix = ', arrmul2)
# In order to do matrix multiplication
arrmatmul = np.matmul(arr1,arr2)
print('Matrix Multiplication = ',arrmatmul)
# OR
arrdot = arr1.dot(arr2)
print('Dot = ',arrdot)
# OR
arrpy3dot5plus = arr1 @ arr2
print('Python 3.5+ support = ',arrpy3dot5plus)

Output:

Multiply Scalar = [[2 4 6]


[2 4 6]
[2 4 6]]
Multiply Matrix = [[ 4 10 18]
[ 4 10 18]
[ 4 10 18]]
Matrix Multiplication = [[24 30 36]
[24 30 36]
[24 30 36]]
Dot = [[24 30 36]
[24 30 36]
[24 30 36]]
Python 3.5+ support = [[24 30 36]
[24 30 36]
[24 30 36]]

Sorting Array

The sort() function returns a sorted copy of the input array.

import numpy as np
# arr = our ndarray
np.sort(arr,axis,kind,order)
# OR arr.sort()

arr = the array to sort (arr.sort() sorts in place; np.sort(arr) returns a sorted copy)
axis = the axis to sort along (default is -1, the last axis)
kind = the sorting algorithm to use ('quicksort' is the default; 'mergesort', 'heapsort')
order = the field(s) on which to sort (if the array has multiple fields)
import numpy as np
arr = np.array(['Daxa','Junagadh','Institute','of','BCA'])
print("Before Sorting = ", arr)
arr.sort() # or np.sort(arr)
print("After Sorting = ",arr)

Output:

Before Sorting = ['Daxa' 'Junagadh' 'Institute' 'of' 'BCA']
After Sorting = ['BCA' 'Daxa' 'Institute' 'Junagadh' 'of']

Conditional Selection

Similar to arithmetic operations, when we apply a comparison operator to a NumPy
array, it is applied to each element in the array, and a new Boolean NumPy
array is created with values True or False.

import numpy as np
arr = np.random.randint(1,100,10)
print(arr)
boolArr = arr > 50
print(boolArr)

Output:

[25 17 24 15 17 97 42 10 67 22]
[False False False False False True False False
True False]

import numpy as np
arr = np.random.randint(1,100,10)
print("All = ",arr)
boolArr = arr > 50
print("Filtered = ", arr[boolArr])

Output:

All = [31 94 25 70 23 9 11 77 48 11]


Filtered = [94 70 77]

What Is pandas?

pandas is an open-source software library built on Python for data analysis and
data manipulation. The pandas library provides data structures designed
specifically to handle tabular datasets with a simplified Python API. pandas is an
extension of Python to process and manipulate tabular data, implementing
operations such as loading, aligning, merging, and transforming datasets
efficiently.

The popularity of pandas as a data analysis tool might be attributed to its


versatility as well as efficient performance. The name "pandas" originates from
the term "panel data," referring to datasets that span multiple time periods,
emphasizing its focus on versatile data structures for handling real-world
datasets.

With its support for structured data formats like tables, matrices, and time series,
the pandas Python API provides tools to process messy or raw datasets into clean,
structured formats ready for analysis. To achieve high performance,
computationally intensive operations are implemented using C or Cython in the
back-end source code. The pandas library is inherently not multi-threaded, which
can limit its ability to take advantage of modern multi-core platforms and process
large datasets efficiently. However, new libraries and extensions in the Python
ecosystem can help address this limitation.

The pandas library integrates with other scientific tools within the broader Python
data analysis ecosystem.

How Does pandas Work?

At the core of the pandas open-source library is the DataFrame data structure for
handling tabular and statistical data. A pandas DataFrame is a two-dimensional,
array-like table where each column represents values of a specific variable, and
each row contains a set of values corresponding to those variables. The data
stored in a DataFrame can encompass numeric, categorical, or textual types,
enabling pandas to manipulate and process diverse datasets.

pandas facilitates importing and exporting datasets from various file formats,
such as CSV, SQL, and spreadsheets. These operations, combined with its data

manipulation capabilities, enable pandas to clean, shape, and analyze tabular and
statistical data.

Ultimately, the DataFrame serves as the backbone of pandas, enabling users to


manage and analyze structured datasets efficiently, from importing and exporting
raw data to performing advanced data manipulation tasks for machine
learning and beyond.

Pandas allows for importing and exporting tabular data in various formats, such
as CSV, SQL, and spreadsheet files.

pandas also allows for various data manipulation operations and data cleaning
features, including selecting a subset, creating derived columns, sorting, joining,
filling, replacing, summary statistics, and plotting.

According to organizers of the Python Package Index—a repository of software for
the Python programming language—pandas is well suited for working with several
kinds of data, including:

Tabular data with heterogeneously-typed columns, as in an SQL table or


spreadsheet.

Ordered and unordered (not necessarily fixed-frequency) time-series data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with row and


column labels.

Any other form of observational/statistical datasets. The data actually need not
be labeled at all to be placed into a pandas data structure.

What Are the Benefits of pandas?

The pandas library offers numerous benefits to data scientists and developers,
making it a valuable tool for data analysis and manipulation. Key benefits include:

Handling of missing data (NaN): pandas simplifies working with datasets


containing missing data, represented as NaN, whether the data is numeric or non-
numeric.

GroupBy functionality: pandas provides efficient GroupBy operations, enabling


users to perform split-apply-combine workflows for data aggregation and
transformation.

DataFrame size mutability: Columns can be added or removed from DataFrames


or higher-dimensional data structures.

Automated and explicit data alignment: pandas ensures data alignment by


automatically aligning objects like Series and DataFrames to their labels,
simplifying computations.

Thorough Documentation: The simplified API and fully documented features


lower the learning curve for pandas. The short, simple tutorials and code samples
enable new users to quickly start coding.

I/O tools: pandas supports importing and exporting data in various formats, such
as CSV, Excel, SQL, and HDF5.

Visualization-ready datasets: pandas has straightforward visualization that can be


plotted directly from the DataFrame object.

Flexible reshaping and pivoting: pandas simplifies reshaping and pivoting to single
function calls on datasets to further prepare them for analysis or visualization.

Hierarchical axis labeling: pandas supports hierarchical indexing, allowing users to


manage multi-level data structures within a single DataFrame.

Time-series functionality: pandas includes multiple time-series analysis functions,


offering tools for date-range generation, frequency conversion, moving window
calculations, and lag analysis.

How to Get Started With pandas?

Install pandas with either conda or pip:

conda install pandas

pip install pandas

Data Structures in Pandas module:

There are 3 data structures provided by the Pandas module, which are as follows:

• Series: It is a 1-D size-immutable array like structure having


homogeneous data.

• DataFrames: It is a 2-D size-mutable tabular structure with


heterogeneously typed columns.

• Panel: It is a 3-D, size-mutable array. (Note: Panel has been deprecated and removed in recent versions of pandas.)

Series

import pandas as pd
s = pd.Series(data,index,dtype,copy=False)

data = array like Iterable


index = array like index
dtype = data-type
copy = bool, default is False

import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)

Output:

0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
We can then access the elements inside Series just like array using square brackets
notation.

import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)

Output:

S[0] = 1
Sum = 4

We can specify the data type of Series using dtype parameter

import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11], dtype='str')
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)

Output

S[0] = 1
Sum = 13

We can specify index to Series with the help of index parameter

import numpy as np
import pandas as pd
i = ['name','address','phone','email','website']
d = ['noble','jnd','123','nu.com','noble.ac.in']
s = pd.Series(data=d,index=i)
print(s)

Output

name              noble
address             jnd
phone               123
email            nu.com
website     noble.ac.in
dtype: object
Creating Time Series

We can use some of pandas inbuilt date functions to create a time series.

import numpy as np
import pandas as pd
dates = pd.to_datetime("27th of July, 2020")
i = dates + pd.to_timedelta(np.arange(5), unit='D')
d = [50,53,25,70,60]
time_series = pd.Series(data=d,index=i)
print(time_series)

Output:

2020-07-27 50
2020-07-28 53
2020-07-29 25
2020-07-30 70
2020-07-31 60
dtype: int64

Pandas DataFrame
DataFrame is the most important and widely used data structure and is a standard
way to store data. DataFrame has data aligned in rows and columns like the SQL
table or a spreadsheet database. We can either hard code data into a DataFrame
or import a CSV file, tsv file, Excel file, SQL table, etc. We can use the below
constructor for creating a DataFrame object.

Data frames are a two-dimensional data structure, i.e., data is aligned in a tabular
format in rows and columns. A data frame also contains labelled axes on rows and
columns.

Structure: for example, a table whose row labels (index) run from 101 to 160 and
whose columns are PDS, Algo, SE, and INS.

pandas.DataFrame(data, index, columns, dtype, copy)

Below is a short description of the parameters:

• data - create a DataFrame object from the input data. It can be list, dict, series,
Numpy ndarrays or even, any other DataFrame.

• index - has the row labels

• columns - used to create column labels

• dtype - used to specify the data type of each column, optional parameter

• copy - used for copying data, if any

There are many ways to create a DataFrame. We can create a DataFrame object
from dictionaries or a list of dictionaries. We can also create it from a list of tuples,
a CSV or Excel file, etc. A small sketch of a DataFrame built from a list of
dictionaries follows; after that we create one from a NumPy array.
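
The following is a minimal sketch; the column names and values are illustrative and not taken from the book's datasets:

import pandas as pd

# each dictionary becomes one row; the keys become column names
records = [
    {"Name": "Abc", "Marks": 78},
    {"Name": "Xyz", "Marks": 91},
]
dfFromDicts = pd.DataFrame(records)
print(dfFromDicts)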

Example :

import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, np.arange(101,106,1), ['PDS','Algo','SE','INS'])
print(df)

Output:
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
103 35 34 6 89
104 66 83 70 50
105 65 88 87 87

Grabbing the column:
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, np.arange(101,106,1), ['PDS','Algo','SE','INS'])
print(df['PDS'])
OutPut:
101 0
102 85
103 35
104 66
105 65
Name: PDS, dtype: int32
Grabbing multiple columns

print(df[['PDS', 'SE']])

Output:
PDS SE
101 0 93
102 85 31
103 35 6
104 66 70
105 65 87

Grabbing a row
print(df.loc[101]) # using labels
#OR
print(df.iloc[0]) # using zero based index

Output:
PDS 0
Algo 23
SE 93
INS 46
Name: 101, dtype: int32

Grabbing Single Value

print(df.loc[101, 'PDS']) # using labels

Output:
0

Deleting Row

df.drop(103, inplace=True)
print(df)

Output:

PDS Algo SE INS


101 0 23 93 46
102 85 47 31 12
104 66 83 70 50
105 65 88 87 87

Creating new column

df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']


print(df)
Output:

PDS Algo SE INS total


101 0 23 93 46 162
102 85 47 31 12 175
103 35 34 6 89 164
104 66 83 70 50 269
105 65 88 87 87 327

Deleting a Column

df.drop('total',axis=1,inplace=True)
print(df)

Output
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
103 35 34 6 89
104 66 83 70 50
105 65 88 87 87
Getting Subset of Data Frame

print(df.loc[[101,104], ['PDS','INS']])

Output:

PDS INS
101 0 46
104 66 50

Selecting all cols except one

print(df.loc[:, df.columns != 'Algo' ])

Output:
PDS SE INS
101 0 93 46
102 85 31 12
103 35 6 89
104 66 70 50
105 65 87 87

Conditional Selection:

Similar to NumPy we can do conditional selection in pandas.

import numpy as np
import pandas as pd
np.random.seed(121)
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, np.arange(101,106,1), ['PDS','Algo','SE','INS'])
print(df)
print(df > 50)

Output:
PDS Algo SE INS
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
101 True True False True
102 True True True True
103 False False True True
104 True False True True
105 True True True False
Note: we have used the np.random.seed() method and set the seed to 121, so that
the random numbers you generate match the ones shown here.

We can then use this boolean DataFrame to get associated values.

dfBool = df > 50
print(df[dfBool])

Output:

PDS Algo SE INS


101 66 85 NaN 95
102 65 52 83 96
103 NaN NaN 52 60
104 54 NaN 94 52
105 57 75 88 NaN

Note : It will set NaN (Not a Number) in case of False

We can apply condition on specific column.

dfBool = df['PDS'] > 50


print(df[dfBool])

Output:

PDS Algo SE INS


101 66 85 8 95
102 65 52 83 96
104 54 3 94 52
105 57 75 88 39

Setting/Resetting index

In the previous examples our index does not have a name. If we want to give the
index a name, we can set it using the DataFrame.index.name property.

df.index.name = 'RollNo'

Output:

        PDS  Algo  SE  INS
RollNo
101      66    85   8   95
102      65    52  83   96
103      46    34  52   60
104      54     3  94   52
105      57    75  88   39

Note: Our index now has a name.
We can use pandas built-in methods to set or reset the index

 df.set_index('NewColumn', inplace=True) will set a new column as the index.

 df.reset_index() will reset the index to a zero-based numeric index.

set_index(new_index)

df.set_index('PDS') #inplace=True

Output:

      Algo  SE  INS
PDS
66      85   8   95
65      52  83   96
46      34  52   60
54       3  94   52
57      75  88   39

Note: We have PDS as our index now.
df.reset_index()

Output:

   RollNo  PDS  Algo  SE  INS
0     101   66    85   8   95
1     102   65    52  83   96
2     103   46    34  52   60
3     104   54     3  94   52
4     105   57    75  88   39

Note: Our RollNo (index) becomes a new column, and we now have a zero-based numeric index.

Multi-Index DataFrame
Hierarchical indexes (AKA multiindexes) help us to organize, find, and aggregate
information faster at almost no cost.

Example where we need Hierarchical indexes

Consider the flat data below and the same data after multi-indexing:

Col Dep Sem RN S1 S2 S3


0 ABC CE 5 101 50 60 70
1 ABC CE 5 102 48 70 25
2 ABC CE 7 101 58 59 51
3 ABC ME 5 101 30 35 39
4 ABC ME 5 102 50 90 48
5 xyz CE 5 101 88 99 77
6 xyz CE 5 102 99 84 76
7 xyz CE 7 101 88 77 99
8 xyz ME 5 101 44 88 99

RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
Xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99

Creating a multi-index is as simple as creating a single index with the set_index method;
the only difference is that for a multi-index we provide a list of columns instead of a
single string. Let us see an example:

dfMulti = pd.read_csv('MultiIndexDemo.csv')
dfMulti.set_index(['Col','Dep','Sem'],inplace=True)
print(dfMulti)

Output:

RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99

Now we have multi-indexed DataFrame from which we can access data using
multiple index.

 For Example

 Sub DataFrame for all the students of xyz

print(dfMulti.loc['xyz'])

Output:

RN S1 S2 S3
Dep Sem
CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99

Sub DataFrame for Computer Engineering students from xyz

print(dfMulti.loc['xyz','CE'])

OutPut:

RN S1 S2 S3
Sem
5 101 88 99 77
5 102 99 84 76
7 101 88 77 99

Reading in Multiindexed DataFrame directly from CSV

dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0,1,2])
#for multi-index in cols we can use header parameter
print(dfMultiCSV)

Output:

RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
Xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99

ME 5 101 44 88 99

Cross Sections in DataFrame:

The xs() function is used to get cross-section from the Series/DataFrame.


This method takes a key argument to select data at a particular level of a
MultiIndex.

=== Parameters ===

key : label

axis : Axis to retrieve cross section

level : level of key

drop_level : False if you want to preserve the level

syntax:

DataFrame.xs(key, axis=0, level=None, drop_level=True)

dfMultiCSV = pd.read_csv('MultiIndexDemo.csv',
index_col=[0,1,2])
print(dfMultiCSV)
print(dfMultiCSV.xs('CE',axis=0,level='Dep'))

RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99

Output:

RN S1 S2 S3
Col Sem
ABC 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
Xyz 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99

Dealing with Missing Data
There are many methods by which we can deal with missing data; some of the most
common are listed below (a short sketch follows the list):

 dropna, will drop (delete) the missing data (rows/cols)

 fillna, will fill specified values in place of missing data

 interpolate, will interpolate missing data and fill interpolated value


in place of missing data.
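
A minimal sketch of these three methods, assuming a small illustrative DataFrame with missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'PDS': [50, np.nan, 70], 'Algo': [55, 60, np.nan]})

print(df.dropna())        # drop rows that contain missing values
print(df.fillna(0))       # fill missing values with 0
print(df.interpolate())   # fill missing values by interpolation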

GroupBy in Pandas

Any groupby operation involves one or more of the following operations on the original
object:

 Splitting the Object

 Applying a function

 Combining the results

In many situations, we split the data into sets and we apply some functionality on
each subset.

we can perform the following operations

o Aggregation − computing a summary statistic

o Transformation − perform some group-specific


operation

o Filtration − discarding the data with some condition

Basic ways to use the groupby method

o df.groupby('key')

o df.groupby(['key1','key2'])

o df.groupby(key, axis=1)

College  Enno  CPI
Noble    123   8.9
Noble    124   9.2
ABC      212   6.2
ABC      215   3.2
ABC      218   4.2
XYZ      312   5.2
XYZ      315   6.5
XYZ      315   5.8

Grouping by College and taking the mean CPI gives:

College  Mean CPI
Noble    8.65
ABC      4.8
XYZ      5.83
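
A minimal sketch of this kind of aggregation, assuming the table above is stored in a DataFrame (the column names College, Enno and CPI are taken from the illustration):

import pandas as pd

dfCPI = pd.DataFrame({
    'College': ['Noble', 'Noble', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
    'Enno':    [123, 124, 212, 215, 218, 312, 315, 315],
    'CPI':     [8.9, 9.2, 6.2, 3.2, 4.2, 5.2, 6.5, 5.8]
})

# mean CPI for every college
print(dfCPI.groupby('College')['CPI'].mean())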

Example : Listing all the groups

dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby('Year').groups)

Output:
{2014: Int64Index([0, 2, 4, 9], dtype='int64'),
2015: Int64Index([1, 3, 5, 10], dtype='int64'),
2016: Int64Index([6, 8], dtype='int64'),
2017: Int64Index([7, 11], dtype='int64')}

Example : Group by multiple columns

dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby(['Year','Team']).groups)

Output:

{(2014, 'Devils'): Int64Index([2], dtype='int64'),


(2014, 'Kings'): Int64Index([4], dtype='int64'),
(2014, 'Riders'): Int64Index([0], dtype='int64'),
………
………
(2016, 'Riders'): Int64Index([8], dtype='int64'),
(2017, 'Kings'): Int64Index([7], dtype='int64'),
(2017, 'Riders'): Int64Index([11], dtype='int64')}

Example : Iterating through groups

dfIPL = pd.read_csv('IPLDataSet.csv')
groupIPL = dfIPL.groupby('Year')
for name,group in groupIPL :
print(name)
print(group)

Output:

2014
Team Rank Year Points
0 Riders 1 2014 876
2 Devils 2 2014 863
4 Kings 3 2014 741
9 Royals 4 2014 701
2015
Team Rank Year Points
1 Riders 2 2015 789
3 Devils 3 2015 673
5 kings 4 2015 812
10 Royals 1 2015 804
2016
Team Rank Year Points
6 Kings 1 2016 756
8 Riders 2 2016 694
2017
Team Rank Year Points
7 Kings 1 2017 788
11 Riders 2 2017 690
Example : Aggregating groups

dfSales = pd.read_csv('SalesDataSet.csv')
print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])

Output:
YEAR_ID
2003 1000
2004 1345
2005 478
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003 34612
2004 46824
2005 17631
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003 34.612000
2004 34.813383
2005 36.884937
Name: QUANTITYORDERED, dtype: float64

Example : Describe details

dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby('Year').describe()['Points'])

Output:

      count    mean        std    min    25%    50%     75%    max
Year
2014    4.0  795.25  87.439026  701.0  731.0  802.0  866.25  876.0
2015    4.0  769.50  65.035888  673.0  760.0  796.5  806.00  812.0
2016    2.0  725.00  43.840620  694.0  709.5  725.0  740.50  756.0
2017    2.0  739.00  69.296465  690.0  714.5  739.0  763.50  788.0
Concatenation in Pandas
Concatenation basically glues together DataFrames.
Keep in mind that dimensions should match along the axis you are concatenating
on.
You can use pd.concat and pass in a list of DataFrames to concatenate together:

dfCX = pd.read_csv('CX_Marks.csv',index_col=0)
dfCY = pd.read_csv('CY_Marks.csv',index_col=0)
dfCZ = pd.read_csv('CZ_Marks.csv',index_col=0)
dfAllStudent = pd.concat([dfCX,dfCY,dfCZ])
print(dfAllStudent)

Output:

PDS Algo SE
101 50 55 60
102 70 80 61
103 55 89 70
104 58 96 85
201 77 96 63
202 44 78 32
203 55 85 21
204 69 66 54
301 11 75 88
302 22 48 77
303 33 59 68
304 44 55 62

Note : We can use axis=1 parameter to concat columns.
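
For instance, a minimal sketch (df1 and df2 are hypothetical DataFrames that share the same row index):

dfWide = pd.concat([df1, df2], axis=1)   # columns of df2 are glued next to those of df1
print(dfWide)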

Join in Pandas

df.join() method will efficiently join multiple DataFrame objects by index(or column
specified) .

 some of important Parameters :

o dfOther : Right Data Frame

o on (Not recommended) : specify the column on which we want to join


(Default is index)

o how : How to handle the operation of the two objects.

▪ left: use calling frame’s index (Default).

▪ right: use dfOther index.

▪ outer: form union of calling frame’s index with other’s index (or
column if on is specified), and sort it lexicographically.

▪ inner: form intersection of calling frame’s index (or column if on is


specified) with other’s index, preserving the order of the calling’s one.

dfINS = pd.read_csv('INS_Marks.csv', index_col=0)
dfLeftJoin = dfAllStudent.join(dfINS)
print(dfLeftJoin)
dfRightJoin = dfAllStudent.join(dfINS, how='right')
print(dfRightJoin)

Output:1 Output:2

PDS Algo SE INS PDS Algo SE INS


101 50 55 60 55.0 301 11 75 88 11
102 70 80 61 66.0 302 22 48 77 22
103 55 89 70 77.0 303 33 59 68 33
104 58 96 85 88.0 304 44 55 62 44
201 77 96 63 66.0 101 50 55 60 55
202 44 78 32 NaN 102 70 80 61 66
203 55 85 21 78.0 103 55 89 70 77
204 69 66 54 85.0 104 58 96 85 88
301 11 75 88 11.0 201 77 96 63 66
302 22 48 77 22.0 203 55 85 21 78
303 33 59 68 33.0 204 69 66 54 85
304 44 55 62 44.0

Merge in Pandas

Merge DataFrame or named Series objects with a database-style join.


Similar to join method, but used when we want to join/merge with the columns
instead of index.
 some of important Parameters :

 dfOther : Right Data Frame

 on : specify the column on which we want to join (Default is index)

 left_on : specify the column of left Dataframe

 right_on : specify the column of right Dataframe

 how : How to handle the operation of the two objects.

▪ left: use only keys from the left frame.

▪ right: use only keys from the right frame.

▪ outer: use the union of keys from both frames, sorted lexicographically.

▪ inner: use the intersection of keys from both frames, preserving the
order of the left keys (Default for merge).

m1 = pd.read_csv('Merge1.csv')
print(m1)
m2 = pd.read_csv('Merge2.csv')
print(m2)
m3 = m1.merge(m2,on='EnNo')
print(m3)

Output:

RollNo EnNo Name


0 101 11112222 Abc
1 102 11113333 Xyz
2 103 22224444 Def
EnNo PDS INS
0 11112222 50 60
1 11113333 60 70
RollNo EnNo Name PDS INS
0 101 11112222 Abc 50 60
1 102 11113333 Xyz 60 70

Read CSV in Pandas

read_csv() is used to read Comma Separated Values (CSV) file into a pandas
DataFrame.

 some of important Parameters :

 filePath : str, path object, or file-like object

 sep : separator (Default is comma)

 header: Row number(s) to use as the column names.

 index_col : index column(s) of the data frame.

dfINS = pd.read_csv('Marks.csv',index_col=0,header=0)
print(dfINS)

Output:

PDS Algo SE INS


101 50 55 60 55.0
102 70 80 61 66.0
103 55 89 70 77.0
104 58 96 85 88.0
201 77 96 63 66.0

Read Excel in Pandas

Read an Excel file into a pandas DataFrame.

Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local
filesystem or URL. Supports an option to read a single sheet or a list of sheets.

 some of important Parameters :

 excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object

 sheet_name : sheet no in integer or the name of the sheet, can have


list of sheets.

 index_col : index column of the data frame.
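
A minimal sketch, assuming a workbook named 'Marks.xlsx' with a sheet called 'Sem5' (both names are hypothetical, and an Excel engine such as openpyxl must be installed):

import pandas as pd

dfExcel = pd.read_excel('Marks.xlsx', sheet_name='Sem5', index_col=0)
print(dfExcel)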

Read from MySQL Database

 We need two libraries for that,

 conda install sqlalchemy

 conda install pymysql

 After installing both the libraries, import create_engine from sqlalchemy and
import pymysql.

from sqlalchemy import create_engine


import pymysql

Then, create a database connection string and create engine using it.

db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)


After getting the engine, we can fire any sql query using pd.read_sql method.
read_sql is a generic method which can be used to read from any sql (MySQL,MSSQL,
Oracle etc…)

df = pd.read_sql('SELECT * FROM cities', con=db_connection)
print(df)

Output:

CityID CityName CityDescription CityCode


0 1 Rajkot Rajkot Description here RJT
1 2 Ahemdabad Ahemdabad Description here ADI
2 3 Surat Surat Description here SRT

Web Scrapping using Beautiful Soup
Beautiful Soup is a library that makes it easy to scrape information from web pages.

It sits atop an HTML or XML parser, providing Pythonic idioms for iterating,
searching, and modifying the parse tree.

Web scraping with Beautiful Soup in Python involves extracting data from HTML or
XML documents. This process typically follows these steps:

Installation: Install the necessary libraries using pip:
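
For example:

pip install requests beautifulsoup4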

requests is used to fetch the web page content, and beautifulsoup4 is the Beautiful
Soup library itself.

Fetch Web Page Content: Use the requests library to send an HTTP GET request to
the target URL and retrieve the HTML content.

Parse HTML with Beautiful Soup: Create a Beautiful Soup object by passing the
fetched HTML content and a parser (e.g., 'html.parser' or 'lxml') to
the BeautifulSoup constructor.


Inspect and Navigate HTML:

Use your browser's developer tools (e.g., "Inspect Element") to examine the
structure of the target website and identify the HTML tags, classes, and IDs
associated with the data you want to extract.

Extract Data:

Utilize Beautiful Soup's methods to locate and extract the desired elements:

find(): Retrieves the first matching element.

find_all(): Retrieves all matching elements.

You can filter by tag name, attributes (like class_ or id), and text content.

Access element text using .text and attributes using bracket notation
(e.g., element['href']).

Store Scraped Data: Process and store the extracted data in a suitable format, such
as a list of dictionaries, CSV file, or a database.

Important Considerations:

Respect robots.txt:

Check the website's robots.txt file to understand which parts of the site are allowed
to be crawled.

Terms of Service:

Be aware of the website's terms of service regarding web scraping.

Rate Limiting:

Implement delays between requests to avoid overwhelming the server and getting
blocked.

Error Handling:

Include error handling (e.g., try-except blocks) to manage potential issues like
network errors or missing elements.

import requests
import bs4

req = requests.get("https://nobleuniversity.ac.in/faculty-of-computer-applications/teaching-staff/")
soup = bs4.BeautifulSoup(req.text, "html.parser")
name_tags = soup.select("h4.elementor-heading-title")

for tag in name_tags:
    text = tag.get_text(strip=True)
    name = text.split("\n")[0]
    print(name)

Output:

Mr. Paresh Vora( B.com, PGDCA ,MSC(IT & CA))
Dr. Hirenkumar Thakor(BSC(Chem.) , DCS,PGDCA, ADCA, MCA, PhD)
Mr. Jigar Dave(BCA, MCA)
Ms.Sarika Odiya(BCA,PGDCA ,MCA)
Ms.Gunja Dave( B.Sc.IT , MCA,B.Ed)
Ms.Pooja Pandya(B.C.A, M.C.A)
Mrs. Khusbu Sama(BCA, MCA)
Mr.Bhavin Mehta(BCA ,MCA,Ph.D Pursuing)


What is Scrapy?

Scrapy is a fast, high-level Python framework used for web scraping and crawling
websites. It lets you extract data from HTML/XML pages using selectors like CSS or
XPath and output the results in CSV, JSON, or database formats.

First things first. Web scraping, also known as web data extraction, is a way to
collect information from websites. This can be done by using special software that
accesses the internet like a web browser, or by automating the process with a bot
or a so called web crawler. It is basically a method of copying specific data from the
web and saving it in a local database or spreadsheet for future use.

There are different types of data extraction techniques; I’ll name a few:

1. Static Web Scraping: This is the most basic form of web scraping, where data
is extracted from web pages that are primarily composed of HTML and CSS. It’s
used for collecting data from websites with fixed, as its name says — static,
unchanging content.

2. Dynamic Web Scraping: Dynamic web scraping involves the use of tools or
scripts that can interact with the page and extract data from elements that
load after the initial page load (for example: pages that use JavaScript to load
content dynamically).

3. API-based Web Scraping: Many websites provide Application Programming


Interfaces (APIs) that allow developers to access their data in a structured and
organized manner. API-based web scraping is a more reliable and efficient way
to gather data, as it’s the intended way to access information. Probably the
most enjoyable way to retrieve data.

4. Screen Scraping: In cases where a website doesn’t provide data in a machine-


readable format, screen scraping involves capturing the visual information
displayed on a website, often through capturing screenshots or reading text
from the screen.

5. Social Media Scraping: Collecting data from social media platforms, like
Twitter (I mean ‘X’) or Facebook, is a specialized form of web scraping. It’s
often used for sentiment analysis, trend tracking, or marketing research.

6. Image Scraping: This is the process of extracting data from images on the web,
such as text from images, logos, or other graphical elements.

Collecting information from websites? Why?

I did kind of partially answer my own question in the previous section; however, I
do want to go through the most common use cases of web scraping:

• Data collection for the purpose of market research, stock market analysis,
competitor analysis, generating leads for sales and marketing.

• Price comparison that is used by many e-commerce businesses to monitor


competitors prices and make informed decisions about their business.

• Social Media analysis for sentiment analysis, trend tracking, or monitoring


public opinion on specific topics.

• Content aggregation that is used to display information from multiple sources


in one place. It is usually used by news aggregators, job search engines, and
real estate listing websites.

• Search Engine Optimization (SEO) for analyzing website rankings, tracking


keyword performance, and identifying areas for improvement in search engine
results.

• Weather and Environmental Monitoring for collecting data from various


sources for weather forecasting and climate research.

“An open source and collaborative framework for extracting the data you need

from websites. In a fast, simple, yet extensible way.” — scrapy.org

How to use Scrapy

In this section, we’ll explain how to set up a Scrapy project for web scraping use
cases. Creating a Scrapy project for web scraping in Python is a simple three-step
procedure.

1. Install Scrapy;

2. Create a Scrapy project;

3. Generate a new Spider for your web-scraping target.

Let’s start by installing Scrapy. Open your Python command terminal and type the
following pip command:
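
pip install scrapy

Once installed, a new project can be created from the same terminal (the project name below is a placeholder):

scrapy startproject <project_name>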

Creating a Spider:

• Generate a Spider: Navigate into your project directory (cd <project_name>) and
generate a spider:
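
scrapy genspider <spider_name> <target-domain.com>

Here <spider_name> and <target-domain.com> are placeholders for your own spider name and the site you intend to crawl.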

This creates a Python file in the spiders directory, which will contain your scraping
logic.

3. Defining the Spider Logic:

• name: A unique identifier for your spider.

• allowed_domains: A list of domains your spider is allowed to crawl within.

• start_urls: A list of URLs where the spider will begin crawling.

• parse(self, response) method: This method is called for each URL


in start_urls and for subsequent requests. It receives a Response object
containing the HTML content of the page.

o Data Extraction: Use CSS selectors or XPath expressions within


the parse method to extract the desired data from
the response object.

o Following Links: Use response.follow() to create new requests to
follow links found on the current page, passing a callback function to
handle the response of the new request.

o Yielding Items: Create Item objects (defined in items.py) to hold the


extracted data and yield them from the parse method.

Example:
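
A minimal spider sketch, assuming the public practice site quotes.toscrape.com as the target (the spider name, selectors and field names are illustrative and would be adapted to your own site):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # follow the pagination link, if present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running scrapy crawl quotes -o quotes.json from the project directory stores the yielded items in a JSON file.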

3 Data Visualization

Data visualisation means graphical or pictorial representation of the data using


graph, chart, etc. The purpose of plotting data is to visualise variation or show
relationships between variables. Visualisation also helps to effectively communicate
information to intended users. Traffic symbols, ultrasound reports, Atlas book of
maps, speedometer of a vehicle, tuners of instruments are few examples of
visualisation that we come across in our daily lives. Visualisation of data is effectively
used in fields like health, finance, science, mathematics, engineering, etc. In this
chapter, we will learn how to visualise data using Matplotlib library of Python by
plotting charts such as line, bar, scatter with respect to the various types of data.

Plotting using Matplotlib


Matplotlib library is used for creating static, animated, and interactive 2D- plots or
figures in Python. It can be installed using the following pip command from the
command prompt:

pip install matplotlib

For plotting using Matplotlib, we need to import its Pyplot module using the
following command:

import matplotlib.pyplot as plt

Here, plt is an alias or an alternative name for matplotlib.pyplot. We can use any
other alias also.

The pyplot module of matplotlib contains a collection of functions that can be used to
work on a plot. The plot() function of the pyplot module is used to create a figure. A
figure is the overall window where the outputs of pyplot functions are plotted. A figure
contains a plotting area, legend, axis labels, ticks, title, etc. (Figure). Each function
makes some change to a figure: example, creates a figure, creates a plotting area in a
figure, plots some lines in a plotting area, decorates the plot with labels, etc.

It is always expected that the data presented through charts is easily understood.
Hence, while presenting data we should always give a chart title, label the axes of the
chart and provide a legend in case we have more than one plotted data series.

To plot x versus y, we can write plt.plot(x,y). The show() function is used to display
the figure created using the plot() function.

Let us consider that in a city, the maximum temperature of a day is recorded for
three consecutive days. Program 4-1 demonstrates how to plot temperature values
for the given dates. The output generated is a line chart.

Program 4-1 Plotting Temperature against Date

import matplotlib.pyplot as plt

# list storing date in string format
date = ["25/12", "26/12", "27/12"]

# list storing temperature values


temp = [8.5, 10.5, 6.8]

# create a figure plotting temp versus date


plt.plot(date, temp)

# show the figure


plt.show()

Output:

Line chart as output of Program 4-1

In program 4-1, plot() is provided with two parameters, which indicates values for
x-axis and y-axis, respectively. The x and y ticks are displayed accordingly. As shown
in Figure the plot() function by default plots a line chart. We can click on the save
button on the output window and save the plot as an image. A figure can also be
saved by using savefig() function. The name of the figure is passed to the function
as parameter.

For example: plt.savefig('x.png').

In the previous example, we used plot() function to plot a line graph. There are
different types of data available for analysis. The plotting methods allow for a

handful of plot types other than the default line plot, as listed in Table 4.1. Choice
of plot is determined by the type of data we have.

Table 4.1 List of Pyplot functions to plot different chart

plot(\*args[, scalex, scaley, data]) Plot x versus y as lines and/or markers.

bar(x, height[, width, bottom, align, data]) Make a bar plot.

boxplot(x[, notch, sym, vert, whis, ...]) Make a box and whisker plot.

hist(x[, bins, range, density, weights, ...]) Plot a histogram.

pie(x[, explode, labels, colors, autopct, ...]) Plot a pie chart.

scatter(x, y[, s, c, marker, cmap, norm, ...]) A scatter plot of x versus y.

Customisation of Plots

Pyplot library gives us numerous functions, which can be used to customise charts
such as adding titles or legends. Some of the customisation options are listed in
Table.

List of Pyplot functions to customise plots

grid([b, which, axis]) Configure the grid lines.

legend(\*args, \*\*kwargs) Place a legend on the axes.

savefig(\*args, \*\*kwargs) Save the current figure.

show(\*args, \*\*kw) Display all figures.

title(label[, fontdict, loc, pad]) Set a title for the axes.

xlabel(xlabel[, fontdict, labelpad]) Set the label for the x-axis.

xticks([ticks, labels]) Get or set the current tick locations and labels of the x-axis.

ylabel(ylabel[, fontdict, labelpad]) Set the label for the y-axis.

yticks([ticks, labels]) Get or set the current tick locations and labels of the y-axis.

Plotting a line chart of date versus temperature by adding Label on X and Y axis, and
adding a Title and Grids to the chart.

import matplotlib.pyplot as plt

date = ["25/12", "26/12", "27/12"]


temp = [8.5, 10.5, 6.8]

plt.plot(date, temp)
plt.xlabel("Date") # Add label on x-axis
plt.ylabel("Temperature") # Add label on y-axis
plt.title("Date wise Temperature") # Add title to chart
plt.grid(True) # Add gridlines
plt.yticks(temp)                      # Set y-axis ticks to temp values
plt.show()

Output:

line chart as output

In the above example, we have used the xlabel, ylabel, title and yticks functions.
We can see that compared to Figure 4.2, the Figure 4.3 conveys more meaning,
easily. We will learn about customizations of other plots in later sections.

Marker

We can make certain other changes to plots by passing various parameters to the
plot() function. In Figure 4.3, we plot temperatures day-wise. It is also possible to
specify each point in the line through a marker. A marker is any symbol that
represents a data value in a line chart or a scatter plot. The table referred to below
lists markers along with their corresponding symbol and description. These markers
can be used in program codes.

Table : Some of the Matplotlib Markers

Colour

It is also possible to format the plot further by changing the colour of the plotted data.
The table below shows the list of colours that are supported. We can either use character
codes or the colour names as values to the parameter color in the plot().

Colour abbreviations for plotting

Character  Colour
'b'        blue
'g'        green
'r'        red
'c'        cyan
'm'        magenta
'y'        yellow
'k'        black
'w'        white

Linewidth and Line Style:


The linewidth and linestyle property can be used to change the width and the style
of the line chart. Linewidth is specified in pixels. The default line width is 1 pixel
showing a thin line. Thus, a number greater than 1 will output a thicker line
depending on the value provided.

We can also set the line style of a line chart using the linestyle parameter. It can
take a string such as "solid", "dotted", "dashed" or "dashdot". Let us write the
Program 4-3 applying some of the customizations.

Consider the average heights and weights of persons aged 8 to 16 stored in the
following two lists:

height = [121.9,124.5,129.5,134.6,139.7,147.3,

152.4, 157.5,162.6]

weight= [19.7,21.3,23.5,25.9,28.5,32.1,35.7,39.6, 43.2]

Let us plot a line chart where:

i. x axis will represent weight

ii. y axis will represent height

iii. x axis label should be “Weight in kg”

iv. y axis label should be “Height in cm”

v. colour of the line should be green

vi. use * as marker

vii. Marker size as10

viii. The title of the chart should be “Average weight with respect to average
height”.

ix. Line style should be dashed

x. Linewidth should be 2.

import matplotlib.pyplot as plt
import pandas as pd

height = [121.9,124.5,129.5,134.6,139.7,147.3,152.4,157.5,162.6]
weight = [19.7,21.3,23.5,25.9,28.5,32.1,35.7,39.6,43.2]
df = pd.DataFrame({"height": height, "weight": weight})

plt.xlabel('Weight in kg')
plt.ylabel('Height in cm')
plt.title('Average weight with respect to average height')
plt.plot(df.weight, df.height, marker='*', markersize=10, color='green', linewidth=2, linestyle='dashed')
plt.savefig('height_weight_plot.png')
plt.show()

Output:

Figure 4.4: Line chart showing average weight against average height

The Pandas Plot function (Pandas Visualisation)

In Programs 4-1 and 4-2, we learnt that the plot() function of the pyplot module of
matplotlib can be used to plot a chart. However, starting from version 0.17.0,
Pandas objects Series and DataFrame come equipped with their own .plot()
methods. This plot() method is just a simple wrapper around the plot() function of
pyplot. Thus, if we have a Series or DataFrame type object (let's say 's' or 'df') we
can call the plot method by writing:

s.plot() or df.plot()

The plot() method of Pandas accepts a considerable number of arguments that can
be used to plot a variety of graphs. It allows customising different plot types by
supplying the kind keyword argument. The general syntax is df.plot(kind='...'), where
kind accepts a string indicating the type of plot, as listed in Table 4.5. In addition,
we can use the matplotlib.pyplot methods and functions along with the plot()
method of Pandas objects.

Table 4.5 Arguments accepted by kind for different plots

kind = Plot type


line Line plot (default)
bar Vertical bar plot
barh Horizontal bar plot
hist Histogram
box Boxplot
area Area plot
pie Pie plot
scatter Scatter plot

In the previous chapters, we have learned to store different types of data


in a two dimensional format using DataFrame. In the subsequent sections
we will learn to use plot() function to create various types of charts with
respect to the type of data stored in DataFrames.

4.4.1 Plotting a Line chart

A line plot is a graph that shows the frequency of data along a number line. It is
used to show continuous dataset. A line plot is used to visualise growth or decline
in data over a time interval. We have already plotted line charts through Programs
4-1 and 4-2. In this section, we will learn to plot a line chart for data stored in a
DataFrame.

Program 4-4 Smile NGO has participated in a three week cultural mela. Using
Pandas, they have stored the sales (in Rs) made day wise for every week in a CSV
file named “MelaSales.csv”, as shown in Table 4.6.

Table 4.6 Day-wise mela sales data

Week 1 Week 2 Week 3


5000 4000 4000
5900 3000 5800
6500 5000 3500
3500 5500 2500
4000 3000 3000
5300 4300 5300
7900 5900 6000

Depict the sales for the three weeks using a Line chart. It should have the following:
i. Chart title as “Mela Sales Report”.
ii. x axis label as “Days”.
iii. y axis label as “Sales in Rs”.
iv. Line colours are red for week 1, blue for week 2 and brown for week 3.

import pandas as pd
import matplotlib.pyplot as plt

# Read "MelaSales.csv" into df by giving the path to the file


df = pd.read_csv("MelaSales.csv")

# Create a line plot of different color for each week


df.plot(kind='line', color=['red', 'blue', 'brown'])

# Set title to "Mela Sales Report"


plt.title('Mela Sales Report')

# Label x-axis as "Days"


plt.xlabel('Days')

# Label y-axis as "Sales in Rs"


plt.ylabel('Sales in Rs')

# Display the figure


plt.show()

Figure 4.5 displays a line plot as output for Program 4-4. Note that the legend is
displayed by default, associating the colours with the plotted data.
Output:

Figure 4.5: Line plot showing mela sales figures

The line plot takes a numeric value to display on the x axis and hence uses the index
(row labels) of the DataFrame in the above example. Thus, the x tick values are the
index of the DataFrame df that contains the data stored in MelaSales.csv.

Customising Line Plot

We can substitute the ticks at x axis with a list of values of our choice by using
plt.xticks(ticks,label) where ticks is a list of locations(locs) on x axis at which ticks
should be placed, label is a list of items to place at the given ticks.
Program 4-5 Assuming the same CSV file, i.e., MelaSales. CSV, plot the line chart
with following customisations:

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file


df = pd.read_csv("MelaSales.csv")

# Create a line plot


df.plot(
kind='line',
color=['red', 'blue', 'brown'],
marker="*",
markersize=10,
linewidth=3,
linestyle="--"
)
# Add title and labels
plt.title('Mela Sales Report')
plt.xlabel('Days')
plt.ylabel('Sales in Rs')

# Set X-axis ticks to show the 'Day' column values


ticks = df.index.tolist()
plt.xticks(ticks, df.Day)

# Show the plot


plt.show()
Output:

Figure 4.6: Mela sales figures with day names

Plotting Bar Chart

The line plot in Figure 4.6 shows that the sales for all the weeks increased during the
weekend. Other than weekends, it also shows that the sales increased on Wednesday
for Week 1, on Thursday for Week 2 and on Tuesday for Week 3.
But, the lines are unable to efficiently depict comparison between the weeks for which
the sales data is plotted. In order to show comparisons, we prefer Bar charts. Unlike
line plots, bar charts can plot strings on the x axis. To plot a bar chart, we will specify
kind=’bar’. We can also specify the DataFrame columns to be used as x and y axes.

Let us now add a column “Days” consisting of day names to “MelaSales.csv” as


shown in Table 4.7.

Week 1 Week 2 Week 3 Day

5000 4000 4000 Monday

5900 3000 5800 Tuesday

6500 5000 3500 Wednesday

3500 5500 2500 Thursday

4000 3000 3000 Friday

5300 4300 5300 Saturday

7900 5900 6000 Sunday

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file


df = pd.read_csv('MelaSales.csv')

# Plot a bar chart with 'Day' as the x-axis


df.plot(kind='bar', x='Day', title='Mela Sales Report')

# Set y-axis label


plt.ylabel('Sales in Rs')

# Show the plot


plt.show()

Output:

Figure 4.7: A bar chart as output of Program 4-6

Customising Bar Chart

We can also customise the bar chart by adding certain parameters to the plot
function. We can control the edgecolor of the bar, linestyle and linewidth. We can
also control the color of the lines. The following example shows various
customisations on the bar chart of Figure 4.8
Program 4-7 Let us write a Python script to display Bar plot for the “MelaSales.csv”
file with column Day on x axis, and having the following customisation:

● Changing the color of each bar to red, yellow and purple.
● Edgecolor to green
● Linewidth as 2
● Line style as "--"

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file


df = pd.read_csv('MelaSales.csv')

# Plot bar chart


df.plot(
    kind='bar',
    x='Day',
    title='Mela Sales Report',
    color=['red', 'yellow', 'purple'],  # different colors for the bars
    edgecolor='green',                  # border color
    linewidth=2,                        # border width
    linestyle='--'                      # border line style
)

# Set y-axis label


plt.ylabel('Sales in Rs')

# Show the plot


plt.show()

Output:

Plotting Pie Charts

Pie is a type of graph in which a circle is divided into different sectors and each sector
represents a part of the whole. A pie plot is used to represent numerical data
proportionally. To plot a pie chart, either column label y or 'subplots=True' should be
set while using df.plot(kind='pie') . If no column reference is passed and
subplots=True, a 'pie' plot is drawn for each numerical column independently.

import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame
df = pd.DataFrame(
{
'mass': [0.330, 4.87, 5.97],
'radius': [2439.7, 6051.8, 6378.1]
},
index=['Mercury', 'Venus', 'Earth']
)

# Plot a pie chart of 'mass'


df.plot(
kind='pie',
y='mass',
autopct='%1.1f%%', # Show percentages
legend=False,
ylabel='' # Remove default y-label
)

plt.title('Mass of Planets')
plt.show()

Output:

It is important to note that the default label names are the index values of the DataFrame.

Let us consider the dataset of Table 4.10 showing the forest cover of north
eastern states that contains geographical area and corresponding forest cover
in sq km along with the names of the corresponding states.

Table 4.10 Forest cover of north eastern states

State GeoArea ForestCover


Arunachal Pradesh 83743 67353
Assam 78438 27692
Manipur 22327 17280
Meghalaya 22429 17321
Mizoram 21081 19240
Nagaland 16579 13464
Tripura 10486 8073

import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame
df = pd.DataFrame(
{
'GeoArea': [83743, 78438, 22327, 22429, 21081, 16579,
10486],
'ForestCover': [67353, 27692, 17280, 17321, 19240,
13464, 8073]
},
index=[
'Arunachal Pradesh',
'Assam',
'Manipur',
'Meghalaya',
'Mizoram',
'Nagaland',
'Tripura'
]
)

# Plot pie chart of forest cover


df.plot(
kind='pie',
y='ForestCover',
title='Forest cover of North Eastern states',
    legend=False,
    ylabel=''   # remove y-axis label
)

# Show the chart


plt.show()
Customisation of pie chart

To customise the pie plot, we have added the following two properties of the pie
chart in the program:
• explode—it specifies the fraction of the radius with which to offset (explode)
each slice.
• autopct—to display the percentage of each part as a label.

import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame
df = pd.DataFrame(
{
'GeoArea': [83743, 78438, 22327, 22429, 21081, 16579,
10486],
'ForestCover': [67353, 27692, 17280, 17321, 19240, 13464,
8073]
},
index=[
'Arunachal Pradesh',
'Assam',
'Manipur',
'Meghalaya',
'Mizoram',
'Nagaland',
'Tripura'
]
)

# Explode settings (highlight slices)


exp = [0.1, 0, 0, 0, 0.2, 0, 0]

# Colors for each wedge


c = ['r', 'g', 'm', 'c', 'brown', 'pink', 'purple']

# Create pie chart using matplotlib


plt.pie(
df['ForestCover'],
labels=df.index,
explode=exp,
colors=c,
autopct="%.2f%%",
shadow=True
)

plt.title('Forest cover of North Eastern states')


plt.show()

Case Study: Bio-Signal Plotting using Matplotlib/Pandas

Background

In healthcare and biomedical engineering, bio-signals such as ECG


(Electrocardiogram), EMG (Electromyogram), and EEG (Electroencephalogram) are
essential for diagnosing health conditions.
Visualizing these signals helps doctors, researchers, and engineers analyze
patterns, detect anomalies, and make informed decisions.

Objective
To plot and analyze bio-signal data (for example, ECG signal) using Pandas for data
handling and Matplotlib for visualization.

Data Source
• A CSV file (ecg_data.csv) containing:
o Time (seconds)
o Amplitude (mV)
• The data is collected from a patient monitoring device.

Example CSV format:

Time,Amplitude
0.00,0.1
0.01,0.15
0.02,0.2
0.03,0.18
...

import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Read ECG data


df = pd.read_csv("ecg_data.csv")

# Step 2: Plot ECG signal


plt.figure(figsize=(10, 5))
plt.plot(df['Time'], df['Amplitude'], color='blue', linewidth=1)

# Step 3: Add plot enhancements


plt.title("ECG Signal - Patient 01")
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude (mV)")
plt.grid(True)
plt.xlim(0, max(df['Time']))
plt.ylim(min(df['Amplitude']) - 0.1, max(df['Amplitude']) + 0.1)

# Step 4: Display
plt.show()

4 Exploratory Data Analysis

Introduction

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) refers to the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to test
hypothesis and to check assumptions with the help of summary statistics and
graphical representations.
EDA was developed at Bell Labs by John Tukey, a mathematician and statistician
who wanted to promote more questions and actions on data based on the data
itself.
In one of his famous writings, Tukey said:
“The only way humans can do BETTER than computers is to take a chance of doing
WORSE than them.”
Above statement explains why, as a data scientist, your role and tools aren’t
limited to automatic learning algorithms but also to manual and creative
exploratory tasks.
Computers are unbeatable at optimizing, but humans are strong at discovery by
taking unexpected routes and trying unlikely but very effective solutions.

Exploratory Data Analysis (EDA) is the initial, open-ended process of analyzing a


dataset to summarize its main characteristics, often with statistical graphics and
data visualization methods. Its primary goal is to help you understand the data,
uncover patterns, detect anomalies, test hypotheses, and inform the choice of
statistical models or machine learning algorithms for further analysis. EDA is a
crucial step that precedes more formal modeling and hypothesis testing.

 With EDA we,
 Describe data
 Closely explore data distributions
 Understand the relationships between variables
 Notice unusual or unexpected situations
 Place the data into groups
 Notice unexpected patterns within the group
 Take note of group differences

EDA - Measuring Central Tendency

 We can use many inbuilt pandas functions in order to explore the central tendency
and spread of numerical data. Some of these functions are listed below (a short
sketch follows the list):
 df.mean()

 df.median()
 df.std()
 df.max()
 df.min()
 df.quantile(np.array([0,0.25,0.5,0.75,1]))
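
A minimal sketch applying these functions, assuming df is any DataFrame with numeric columns (for example the marks DataFrame used earlier):

import numpy as np

print(df.mean())
print(df.median())
print(df.std())
print(df.max())
print(df.min())
print(df.quantile(np.array([0, 0.25, 0.5, 0.75, 1])))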

EDA Techniques:

EDA techniques are broadly categorized based on the number of variables being
analyzed.

Exploratory Data Analysis encompasses a diverse array of techniques, broadly


categorized into non-graphical and graphical methods, and further classified by
the number of variables examined simultaneously: univariate, bivariate, and
multivariate analyses. These techniques collectively enable analysts to
systematically investigate datasets, uncover underlying structures, and identify
patterns that inform subsequent modeling and decision-making.

Univariate Analysis

This type of analysis focuses on a single variable to understand its distribution and
central tendencies. It doesn't explore relationships between variables.

•Non-graphical: These techniques use descriptive statistics to


summarize the data.
o Measures of Central Tendency: Calculate the mean, median, and
mode to understand the "center" of the data. The median is often
preferred for skewed distributions or when outliers are present, as it
is less affected by extreme values.
o Measures of Spread: Use variance, standard deviation, and range to
understand how dispersed the data is. The interquartile range (IQR) is
also useful for identifying the spread of the middle 50% of the data.

Graphical: These techniques use visualizations to show the distribution of a single


variable. Graphical methods are essential for gaining a holistic understanding of
univariate data, as non-graphical summaries alone may not fully capture the

data's nuances. These visualizations offer unparalleled power to explore and gain
insight into the data's underlying patterns.

• Histograms: Bar charts that show the frequency of data within specific
ranges, revealing the shape of the distribution (e.g., normal, skewed,
bimodal).

• Box plots: Also known as box-and-whisker plots, these plots graphically


depict the five-number summary: minimum, first quartile, median, third
quartile, and maximum. They are excellent for quickly identifying the spread
and detecting outliers.

• Rootograms: Similar to histograms, rootograms plot the square roots of the


number of observations within different ranges of a quantitative variable.
The use of square roots aims to equalize the variance of deviations between
the bars and a fitted distribution curve, which might otherwise increase with
frequency. They are often plotted with a fitted distribution, sometimes with
bars suspended from the curve for easier visual comparison with a horizontal
zero line.

• Kernel Density Plots (Distribution Plots): These plots provide a smoothed


representation of the data's distribution, estimating the probability density
function. They are particularly useful for understanding the variance and
shape of the distribution, offering a continuous alternative to histograms.

• Swarm Plots: Swarm plots are designed to show the distribution of individual
data points, avoiding overlap and revealing the density of observations. They
are effective for visualizing the concentration of data points in certain areas
and for clearly highlighting isolated points that represent outliers.

• Bar Charts/Plots: Primarily used for categorical data, bar charts display the
frequency or count of observations within different categories. They are
useful for evaluating counts, such as the quality rate of wine.

• Violin Plots: These plots combine a box-and-whisker plot with a


nonparametric density estimator to display the data for a single quantitative
sample. They are highly effective for visualizing the shape of the probability
density function of the population from which the data was drawn, providing
a richer view of the distribution than a simple box plot.

Bivariate and Multivariate Analysis
These techniques explore the relationships between two or more variables.
• Non-graphical: These methods quantify the relationship between variables.
o Correlation Analysis: Calculates a correlation coefficient (like Pearson's or
Spearman's) to measure the strength and direction of the linear or
monotonic relationship between two variables. A correlation matrix is
often used to show the correlations for all pairs of variables in a dataset.
o Cross-tabulation (Contingency Tables): Used for categorical variables to
show the frequency distribution of two or more variables at the same
time (a short sketch appears after this list).
• Graphical: These visualizations help to see the relationships between variables.
o Scatter Plots: The most common tool for visualizing the relationship
between two continuous variables. They can reveal patterns, trends, and
clusters.
o Pair Plots: A grid of scatter plots that visualizes the relationships between
all pairs of variables in a dataset. They are a powerful way to quickly
understand multiple relationships at once.
o Heatmaps: A graphical representation of a correlation matrix where color
intensity represents the strength of the correlation, making it easy to
spot strong relationships.
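
As the sketch below shows (the survey and measurement values are illustrative), cross-tabulation and pair plots take only a line or two of pandas/seaborn code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# cross-tabulation of two categorical columns
survey = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M', 'F'],
                       'Likes_Python': ['Yes', 'Yes', 'No', 'Yes', 'Yes']})
print(pd.crosstab(survey['Gender'], survey['Likes_Python']))

# pair plot of every pair of numeric columns
measures = pd.DataFrame({'Height': [150, 160, 165, 170],
                         'Weight': [50, 55, 65, 70],
                         'Age': [20, 25, 30, 35]})
sns.pairplot(measures)
plt.show()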

Exploratory Data Analysis (EDA) leverages a wide array of tools and technologies,
ranging from versatile programming languages and their extensive libraries to
specialized automated frameworks and comprehensive business intelligence
platforms. These tools empower data professionals to effectively manipulate,
analyze, and visualize data to uncover insights.

Common Tools for EDA


• Python: The most popular choice, with powerful libraries:
o Pandas: For data manipulation and generating summary statistics.
o Matplotlib: A foundational library for creating static, animated, and interactive
visualizations.
o Seaborn: Built on top of Matplotlib, it provides a high-level interface for
drawing attractive and informative statistical graphics.
o Plotly: For creating interactive, publication-quality graphs.
• R: Another powerful language for statistical computing and graphics.
o dplyr: For data manipulation.
o ggplot2: A world-class data visualization package based on the "Grammar of
Graphics".

Plotting Quartiles and Box plot

Suppose an entrance examination of 200 marks is conducted at the national level,


and Mahi has topped the exam by scoring 120 marks. The result shows 100
percentile against Mahi’s name, which means all the candidates excluding Mahi
have scored less than Mahi. To visualise this kind of data, we use quartiles.
Quartiles are the measures which divide the data into four equal parts, and each
part contains an equal number of observations. Calculating quartiles requires
calculation of median. Quartiles are often used in educational achievement data,
sales and survey data to divide populations into groups. For example, you can use
Quartile to find the top 25 percent of students in that examination. A Box Plot is
the visual representation of the statistical summary of a given data set. The
summary includes the minimum value, first quartile (Q1), median (Q2), third
quartile (Q3) and maximum value. The whiskers are the two lines outside the box
that extend to the highest and lowest values. It also helps in identifying the outliers.
An outlier is an observation that is numerically distant from the rest of the data.

In order to assess the performance of students of a class in the annual


examination, the class teacher stored marks of the students in all the 5 subjects

in a CSV “Marks.csv” file as shown in Table. Plot the data using boxplot and
perform a comparative analysis of performance in each subject.

Marks obtained by students in five subjects

Name English Maths Hindi Science Social_Studies


Rishika Batra 95 95 90 94 95

Waseem Ali 95 76 79 77 89

Kulpreet Singh 78 81 75 76 88

Annie Mathews 88 63 67 77 80

Shiksha 95 55 51 59 80

Naveen Gupta 82 55 63 56 74

Taleem Ahmed 73 49 54 60 77

Pragati Nigam 80 50 51 54 76

Usman Abbas 92 43 51 48 69

Gurpreet Kaur 60 43 55 52 71

Sameer Murthy 60 43 55 52 71

Angelina 78 33 39 48 68

Angad Bedi 62 43 51 48 54

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read CSV file
data = pd.read_csv('Marks.csv')
# Create DataFrame
df = pd.DataFrame(data)
# Plot boxplot
df.plot(kind='box')
# Set chart title and labels
plt.title('Performance Analysis')
plt.xlabel('Subjects')
plt.ylabel('Marks')
# Show plot
plt.show()

Output:

The distance between the box and lower or upper whiskers in some boxplots are
more, and in some less. Shorter distance indicates small variation in data, and
longer distance indicates spread in data to mean larger variation.

To keep improving their services, XYZ group of hotels have asked all the three
hotels to get feedback form filled by their customers at the time of checkout.
After getting ratings on a scale of (1–5) on factors such as Food, Service,
Ambience, Activities, Distance from tourist spots they calculate the average rating
and store it in a CSV file. The data are given in Table

Year-wise average ratings on five parameters

Year Sunny Bunny Resort Happy Lucky Resort Breezy Windy Resort
2014 4.75 3 4.5
2015 2.5 4 2
2016 3.5 2.5 3
2017 4 2 3.5
2018 1.5 4.5 1

This year, to award the best hotel they have decided to analyze the ratings of the
past 5 years for each of the hotels. Plot the data using Boxplot.

import pandas as pd
import matplotlib.pyplot as plt
# Read the CSV file into 'data'
data = pd.read_csv('compareresort.csv')
# Convert 'data' into a DataFrame 'df'
df = pd.DataFrame(data)
# Plot a box plot for the DataFrame 'df' with a title
df.plot(kind='box', title='Compare Resorts')
# Set xlabel and ylabel
plt.xlabel('Resorts')
plt.ylabel('Rating (5 years)')
# Display the plot
plt.show()

Customizing Box plot

We can display the whiskers in the horizontal direction by adding the parameter
vert=False, as shown in the following line of code. We can change the colour of
the whiskers as well. The output of the modified program is shown in the figure.

df.plot(kind='box', title='Compare Resorts', color='red', vert=False)

Heatmap: A graphical representation of data where individual values in
a matrix are represented as colors. It is most commonly used to visualize
a correlation matrix of all numerical variables.

A heatmap is a 2D graphical representation of data where the individual


values contained in a matrix are represented as colors. It's essentially a
way to turn a table of numbers into a visual, making it much easier and
faster to spot patterns, extremes, and relationships.

• Matrix: The data is organized in a grid or table format.

• Color Scale: A color gradient (or "colormap") is used to represent


the values. For instance, low values might be a light color (like
white or yellow), and high values might be a dark color (like dark
blue or purple).

• Intensity: The intensity of the color in each cell corresponds to the


magnitude of the value in that cell of the matrix.

• While heatmaps can be used for any kind of matrix data, their most common
and powerful application in EDA is to visualize a correlation matrix.

Example 1: Correlation Heat Map

import seaborn as sns


import pandas as pd
import matplotlib.pyplot as plt
# Sample dataset
data = {
'Height': [150, 160, 165, 170, 175, 180],
'Weight': [50, 55, 65, 70, 72, 80],
'Age': [20, 25, 30, 35, 40, 45]
}
df = pd.DataFrame(data)
# Calculate correlation
corr = df.corr()
# Plot heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Heat Map")
plt.show()

The heatmap visually represents the correlation matrix, a table that shows the
pairwise correlations between all variables in the dataset.

• The values close to 1 (which are shown in the correlation matrix and on the
heatmap) indicate a very strong positive correlation between all pairs of
variables: Height, Weight, and Age. This means that as one variable
increases, the other two also tend to increase.
The heatmap below provides a clear, color-coded visual summary of these
relationships. The warmer, redder colors highlight the strong positive correlation,
making it easy to quickly identify patterns and relationships in the data.

Random Grid Data Heat Map

import numpy as np
import matplotlib.pyplot as plt
# Create a random 10x10 data grid
data = np.random.rand(10, 10)
plt.imshow(data, cmap='YlGn', interpolation='nearest')
plt.colorbar(label="Value")
plt.title("Grid Heat Map")
plt.show()

The plot shows a 10x10 grid of random data, with the color intensity representing
the value of each cell. The color bar on the side indicates the range of values, from
low (light yellow) to high (dark green).


Plotting Histogram

Histograms are column-charts, where each column represents a range of values, and
the height of a column corresponds to how many values are in that range.
To make a histogram, the data is sorted into "bins" and the number of data points in
each bin is counted. The height of each column in the histogram is then proportional
to the number of data points its bin contains.
The df.plot(kind='hist') function automatically selects the size of the bins based on the spread of values in the data.

import pandas as pd
import matplotlib.pyplot as plt
data = {'Name': ['Arnav', 'Sheela', 'Azhar', 'Bincy', 'Yash', 'Nazar'],
        'Height': [60, 61, 63, 65, 61, 60],
        'Weight': [47, 89, 52, 58, 50, 47]}
df = pd.DataFrame(data)
df.plot(kind='hist')
plt.show()
Output:

This plot shows the distributions of both the Height and Weight variables from
your data. This is a univariate graphical technique in EDA, where a single
variable's distribution is visualized.

• The Height histogram (blue) shows that most people have a height around 60-61.
• The Weight histogram (orange) shows a much more spread-out distribution, with a significant outlier in the 85-90 range.
• The program above displays the histogram corresponding to all attributes having numeric values, i.e., the 'Height' and 'Weight' attributes, as shown in the figure. On the basis of the height and weight values provided in the DataFrame, plot() calculated the bin values.

It is also possible to set value for the bins parameter, for example

df.plot(kind='hist',bins=20)
df.plot(kind='hist',bins=[18,19,20,21,22])
df.plot(kind='hist',bins=range(18,25))

Output:

Customizing Histogram
Taking the same data as above, let us now see how the histogram can be customised. Let us change the edgecolor, which is the border of each bar, to green. Also, let us change the line style to ":" and the line width to 2. Let us try another property called fill, which takes boolean values: the default True means each bar will be filled with colour, and False means each bar will be empty. Another property called hatch can be used to fill each bar with a pattern ('-', '+', 'x', '\\', '*', 'o', 'O', '.'). In the program below, we have used the hatch value "o".

import pandas as pd
import matplotlib.pyplot as plt
data = {'Name':['Arnav', 'Sheela', 'Azhar','Bincy','Yash', 'Nazar'],
'Height' : [60,61,63,65,61,60],
'Weight' : [47,89,52,58,50,47]}

df=pd.DataFrame(data)
df.plot(kind='hist',edgecolor='Green',linewidth=2,linestyle=':',fill=False,hatch='o')
plt.show()

Using Open Data

There are many websites that provide data freely for anyone to download and do
analysis, primarily for educational purposes. These are called Open Data as the data
source is open to the public. Availability of data for access and use promotes further
analysis and innovation. A lot of emphasis is being given to open data to ensure
transparency, accessibility and innovation. “Open Government Data (OGD)
Platform India” (data.gov.in) is a platform for supporting the Open Data initiative
of the Government of India. Large datasets on different projects and parameters
are available on the platform.

Let us consider a dataset called “Seasonal and Annual Min/Max Temp Series - India
from 1901 to 2017” from the URL https://data.gov.in/resources/seasonal-and-
annual-minmax-temp-series-india-1901-2017.

Our aim is to plot the minimum and maximum temperature and observe the
number of times (frequency) a particular temperature has occurred. We only need
to extract the 'ANNUAL - MIN' and 'ANNUAL - MAX' columns from the file. Also, let
us aim to display two Histogram plots:

i) Only for 'ANNUAL - MIN'


ii) For both 'ANNUAL - MIN' and 'ANNUAL - MAX'

import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file with only the required columns


data = pd.read_csv(
"Min_Max_Seasonal_IMD_2017.csv",
usecols=['ANNUAL - MIN', 'ANNUAL - MAX']
)

df = pd.DataFrame(data)

# Plot histogram for 'ANNUAL - MIN'


df.plot(
kind='hist',
y='ANNUAL - MIN',
title='Annual Minimum Temperature (1901-2017)'
)
plt.xlabel('Temperature')
plt.ylabel('Number of times')
plt.show()
Output:
# Plot histogram for both 'ANNUAL - MIN' and 'ANNUAL - MAX'
df.plot(
kind='hist',
title='Annual Min and Max Temperature (1901-2017)',
color=['blue', 'red']
)
plt.xlabel('Temperature')
plt.ylabel('Number of times')
plt.show()

Output:

df.plot(
kind='hist',
alpha=0.5,
title='Annual Min and Max Temperature (1901-2017)',
color=['blue', 'red']
)

Output:

Plot a frequency polygon for the 'ANNUAL - MIN' column of the Min/Max Temp data, overlaid on the histogram that depicts it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read CSV with only 'ANNUAL - MIN'
data = pd.read_csv(
"Min_Max_Seasonal_IMD_2017.csv",
usecols=['ANNUAL - MIN']
)
# Convert to DataFrame
df = pd.DataFrame(data)
# Convert the column to NumPy 1D array
minarray = np.array(df['ANNUAL - MIN'])
# Get histogram data
y, edges = np.histogram(minarray, bins=15)
# Calculate bin midpoints
mid = 0.5 * (edges[1:] + edges[:-1])
# Plot histogram
plt.hist(minarray, bins=15, color='blue', alpha=0.7)
# Overlay line plot of frequencies
plt.plot(mid, y, '-^', color='orange')
# Add labels and title
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Annual Min Temperature (1901–2017)')
plt.show()
Output:

Plotting Scatter Chart

A scatter chart is a two-dimensional data visualisation method that uses dots to represent the values obtained for two different variables, one plotted along the x-axis and the other plotted along the y-axis.
Scatter plots are used when you want to show the relationship between two variables. Scatter plots are sometimes called correlation plots because they show how two variables are correlated. Additionally, the size, shape or color of the dot could represent a third (or even a fourth) variable.

Prayatna sells designer bags and wallets. During the sales season, he gave
discounts ranging from 10% to 50% over a period of 5 weeks. He recorded his
sales for each type of discount in an array. Draw a scatter plot to show a
relationship between the discount offered and sales made.

import numpy as np
import matplotlib.pyplot as plt
discount= np.array([10,20,30,40,50])
saleInRs=np.array([40000,45000,48000,50000,100000])
plt.scatter(x=discount,y=saleInRs)
plt.title('Sales Vs Discount')
plt.xlabel('Discount offered')
plt.ylabel('Sales in Rs')
plt.show()

Output:

Customizing Scatter chart

The size of the bubble can also be used to reflect a value. For example, in the program below, we have opted for displaying the size of the bubble as 10 times the discount, as shown in the figure. The colour and markers can also be changed in the plot by adding the following statements:

import numpy as np
import matplotlib.pyplot as plt
discount= np.array([10,20,30,40,50])
saleInRs=np.array([40000,45000,48000,50000,100000])
size=discount*10
plt.scatter(x=discount,y=saleInRs,s=size,color='red',linewidth=3,marker='*',edgecolor='blue')
plt.title('Sales Vs Discount')
plt.xlabel('Discount offered')
plt.ylabel('Sales in Rs')
plt.show()

Output:

ETL (Extract, Transform, Load) process is a fundamental part of data
warehousing and data integration, involving three key stages: extracting
data from various sources, transforming it into a consistent and usable
format, and loading it into a target system like a data warehouse or data
lake. Logging is crucial for tracking the ETL process, ensuring data quality,
and troubleshooting issues.

Using an ETL pipeline to transform raw data to match the target system allows for systematic and accurate data analysis to take place in the target repository. Specifically, the key benefits are:
• More stable and faster data analysis on a single, pre-defined use case, because the data set has already been structured and transformed.
• Easier compliance with GDPR, HIPAA, and CCPA standards, because users can omit any sensitive data prior to loading it into the target system.
• The ability to identify and capture changes made to a database via the change data capture (CDC) process or technology. These changes can then be applied to another data repository or made available in a format consumable by ETL, EAI, or other types of data integration tools.

Extract > Transform > Load (ETL)


In the ETL process, transformation is performed in a staging area outside of
the data warehouse and before loading it into the data warehouse. The
entire data set must be transformed before loading, so transforming large
data sets can take a lot of time up front. The benefit is that analysis can take
place immediately once the data is loaded. This is why this process is
appropriate for small data sets which require complex transformations.

How ETL Works


The traditional ETL process is broken out as follows:
• Extract refers to pulling a predetermined subset of data from a source
such as an SQL or NoSQL database, a cloud platform or an XML file.
• Transform refers to converting the structure or format of a data set to
match that of the target system. This is typically performed in a staging
area in ways such as data mapping, applying concatenations or
calculations. Transforming the data before it is loaded is necessary to deal
with the constraints of traditional data warehouses.
• Load refers to the process of placing the data into the target system,

typically a data warehouse, where it is ready to be analyzed by BI tools or
data analytics tools.

ETL Use Cases


The ETL process helps eliminate data errors, bottlenecks, and latency to
provide for a smooth flow of data from one system to the other. Here are
some of the key use cases:
• Migrating data from a legacy system to a new repository.
• Centralizing data sources to gain a consolidated version of the data.
• Enriching data in one system with data from another system.
• Providing a stable dataset for data analytics tools to quickly access a single,
pre-defined analytics use case given that the data set has already been
structured and transformed.

ETL Tools
There are four primary types of ETL tools:
Batch Processing: Traditionally, on-premises batch processing was the primary
ETL process. In the past, processing large data sets impacted an organization’s
computing power and so these processes were performed in batches during
off-hours. Today’s ETL tools can still do batch processing, but since they’re
often cloud-based, they’re less constrained in terms of when and how quickly
the processing occurs.
Cloud-Native. Cloud-native ETL tools can extract and load data from sources
directly into a cloud data warehouse. They then use the power and scale of the
cloud to transform the data.
Open Source. Open-source tools such as Apache Kafka offer a low-cost
alternative to commercial ETL tools. However, some open source tools only
support one stage of the process, such as extracting data, and some are not
designed to handle data complexities or change data capture (CDC). Plus, it can
be tough to get support for open source tools.
Real-Time. Today’s business demands real-time access to data. This requires
organizations to process data in real time, with a distributed model and
streaming capabilities. Streaming ETL tools, both commercial and open source,
offer this capability.

ETL Logging:

Logging is essential for monitoring the ETL process and identifying potential
issues. Effective ETL logging should include:
• Process Start/End Times: Record when each ETL job begins and ends, including
timestamps.
• Data Source Information: Log which sources were accessed and the number
of records extracted.
• Transformation Details: Capture information about the transformations
applied, such as the number of records processed, errors encountered, and
any data quality issues identified.
• Load Details: Log the number of records loaded into the target system and any
errors during the loading process.
• Error Handling: Log all errors, warnings, and exceptions that occur during the
ETL process, including error messages and stack traces.
• Performance Metrics: Track the time taken for each stage of the ETL process
and identify any performance bottlenecks.
• User Information: Log which user initiated the ETL process (if applicable).
• Configuration Information: Log the ETL job configuration, including
parameters and settings.
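
The snippet below is a minimal sketch (not part of any particular ETL tool) showing how several of the items above, such as process start and end times, record counts, and error handling, could be captured with Python's standard logging module. The file names and the simple dropna() transformation are placeholders for illustration.

import logging
import pandas as pd

logging.basicConfig(filename='etl.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

def run_etl(source_file, target_file):
    logging.info("ETL job started; source=%s", source_file)              # start time + data source
    try:
        df = pd.read_csv(source_file)
        logging.info("Extracted %d records", len(df))                     # extraction details
        df = df.dropna()
        logging.info("Transformation done; %d records remain", len(df))   # transformation details
        df.to_csv(target_file, index=False)
        logging.info("Loaded %d records into %s", len(df), target_file)   # load details
    except Exception:
        logging.exception("ETL job failed")                               # error message + stack trace
        raise
    finally:
        logging.info("ETL job finished")                                  # end time

run_etl('sales_raw.csv', 'sales_clean.csv')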
Benefits of ETL Logging:
• Troubleshooting: Enables quick identification and resolution of issues during
the ETL process.
• Auditing: Provides a record of all ETL activity for auditing and compliance
purposes.
• Performance Monitoring: Helps identify performance bottlenecks and
optimize the ETL process.
• Data Quality Monitoring: Tracks data quality issues and helps ensure data
accuracy.
• Automation: Facilitates automation of the ETL process by providing insights
into its execution.
By implementing robust ETL logging, organizations can ensure the reliability,
accuracy, and efficiency of their data integration processes.

5 Web Framework: Django (Python)

Introduction

What does "Django" mean?

Django is named after Django Reinhardt, a jazz manouche guitarist active from the 1930s to the early 1950s. To this day, he is considered one of the best guitarists of all time. Django is pronounced JANG-oh; it rhymes with FANG-oh, and the "D" is silent. Django is a free, open-source web framework written in Python.

Django Reinhardt

Django is a web development framework that assists in building and maintaining quality web applications. Django helps eliminate repetitive tasks, making the development process an easy and time-saving experience. It is a server-side framework for developing dynamic websites. Django makes it easier to build better web apps quickly and with less code. Django is a Python-based free and open-source web framework that follows the model-view-template (MVT) architectural pattern.

History

Django was created in 2003, when the web programmers at the Lawrence Journal-World newspaper, Adrian Holovaty and Simon Willison, began using Python to build applications.
It was first released publicly in 2005 (version 1.0 followed in 2008). Official site: https://www.djangoproject.com
Django is now run by an international team of volunteers. Which sites use Django? djangosites.org contains a list of Django websites, and you can register yours.

Adrian Holovaty and Simon Willison

Well-known sites built with Django include Disqus, Bitbucket, Instagram, Mozilla (Firefox help pages and Add-ons), Pinterest (around 33 million visits per month), NASA, The Onion (satirical articles), The Washington Post, and Eventbrite.

2003 - Started by Adrian Holovaty and Simon Willison as an internal project at the Lawrence Journal-World newspaper.
2005 - Publicly released under a BSD license on 21 July 2005 and named Django.
Currently, the DSF (Django Software Foundation) maintains its development and release cycle.
Django is now an open-source project with contributors across the world.
Django is especially helpful for database-driven websites.

Advantages of Django

Object-Relational Mapping (ORM) Support - Django provides a bridge between the data model and the database engine, and supports a large set of database systems including SQLite3, MySQL, PostgreSQL, Oracle and others. It is very easy to switch databases in the Django framework.
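
For instance, switching from the default SQLite database to MySQL largely comes down to editing the DATABASES setting in settings.py (along with installing the appropriate database driver). The values below are illustrative placeholders, not settings from this chapter's project:

# settings.py (illustrative values only)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',   # a new project uses 'django.db.backends.sqlite3' by default
        'NAME': 'mydatabase',
        'USER': 'myuser',
        'PASSWORD': 'mypassword',
        'HOST': 'localhost',
        'PORT': '3306',
    }
}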
Administration GUI - Django provides a nice ready-to-use user interface for administrative activities. It has a built-in admin interface which makes it easy to work with.
Development Environment - Django comes with a lightweight web server to facilitate
end-to-end application development and testing.

Less Coding - Less code so in turn a quick development.

Don't Repeat Yourself (DRY) - Everything should be developed only in exactly one
place instead of repeating it again and again.

Fast Development - Django's philosophy is to do all it can to facilitate hyper-fast development.

How does Django Work?

Django follows the MVT design pattern (Model View Template).


• Model - The data you want to present, usually data from a database.
• View - A request handler that returns the relevant template and content -
based on the request from the user.
• Template - A text file (like an HTML file) containing the layout of the web
page, with logic on how to display the data.

Model

The model provides data from the database. In Django, the data is delivered as an Object Relational Mapping (ORM), which is a technique designed to make it easier to work with databases. The most common way to extract data from a database is SQL. One problem with SQL is that you have to have a pretty good understanding of the database structure to be able to work with it. Django, with ORM, makes it easier to communicate with the database without having to write complex SQL statements. The models are usually located in a file called models.py.
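
For example, assuming a Member model with a firstname field (like the one created later in this chapter), the same lookup can be written with the ORM instead of raw SQL. This is a sketch for illustration only; the value 'Arnav' is just a placeholder:

# Raw SQL requires knowing the table structure, e.g.:
#   SELECT * FROM members_member WHERE firstname = 'Arnav';

# The equivalent ORM query works with Python objects instead:
from members.models import Member

arnavs = Member.objects.filter(firstname='Arnav')   # returns a QuerySet of Member objects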

View

A view is a function or method that takes HTTP requests as arguments, imports the relevant model(s), finds out what data to send to the template, and returns the final result. The views are usually located in a file called views.py.

Template
A template is a file where you describe how the result should be represented. Templates are often .html files, with HTML code describing the layout of a web page, but they can also be in other file formats to present other results; here we will concentrate on .html files.
Django uses standard HTML to describe the layout, but uses Django template tags, such as {% for %} loops and {{ variable }} placeholders, to add logic.
The templates of an application are located in a folder named templates.
URLs
Django also provides a way to navigate around the different pages in a website.
When a user requests a URL, Django decides which view it will send it to.
This is done in a file called urls.py.
So, What is Going On?
When you have installed Django and created your first Django web application,
and the browser requests the URL, this is basically what happens:
1. Django receives the URL, checks the urls.py file, and calls the view that
matches the URL.
2. The view, located in views.py, checks for relevant models.
3. The models are imported from the models.py file.
4. The view then sends the data to a specified template in the template folder.
5. The template contains HTML and Django tags, and with the data it returns
finished HTML content back to the browser.

Django can do a lot more than this, but this is basically what you will learn in this
tutorial, and are the basic steps in a simple web application made with Django.

Django Getting Started

To install Django, you must have Python installed, and a package manager
like PIP.

PIP is included in Python from version 3.4.

Django Requires Python

To check if your system has Python installed, run this command in the command
prompt:

python --version

If Python is installed, you will get a result with the version number, like this

Python 3.13.2

PIP

To install Django, you must use a package manager like PIP, which is included in
Python from version 3.4.

To check if your system has PIP installed, run this command in the command
prompt:

pip --version

If PIP is installed, you will get a result with the version number.

For me, on a windows machine, the result looks like this:

Virtual Environment
It is suggested to have a dedicated virtual environment for each Django
project, and one way to manage a virtual environment is venv, which is
included in Python.

The name of the virtual environment is your choice, in this tutorial we will call
it myworld.

Type the following in the command prompt; remember to navigate to where you want to create your project:
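
Assuming the environment is to be named myworld, as in the rest of this chapter, the standard venv command is:

python -m venv myworld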

This will set up a virtual environment, and create a folder named "myworld" with subfolders and files, like this:

Myworld/ ← Your virtual environment root folder



├── Include/ ← C header files (rarely touched)

├── Lib/ ← Python standard libraries + installed packages
│ │
│ └── site-packages/ ← Installed packages (Django will be here)

├── Scripts/ ← Executables (Python, pip, activation scripts)
│ │
│ ├── activate ← Activate venv (Linux/Mac)
│ ├── activate.bat ← Activate venv (Windows CMD)
│ ├── activate.ps1 ← Activate venv (Windows PowerShell)
│ ├── pip.exe ← Pip installer
│ └── python.exe ← Python interpreter

├── .gitignore ← Optional Git ignore rules

└── pyvenv.cfg ← Virtual environment configuration file.

Then you have to activate the environment, by typing this command:


Windows:
myworld\Scripts\activate.bat
Once the environment is activated, you will see this result in the command prompt:

(myworld) C:\Users\Your Name>

Note: You must activate the virtual environment every time you open the command
prompt to work on your project.

Install Django
Now, that we have created a virtual environment, we are ready to install Django.

Note: Remember to install Django while you are in the virtual environment!

Django is installed using pip, with this command:

(myworld) ... $ python -m pip install Django

Which will give a result that looks like this (at least on my Windows machine):

That's it! Now you have installed Django in your new project, running in a virtual
environment!
Check Django Version
You can check if Django is installed by asking for its version number like this:
(myworld) C:\Users\Your Name>django-admin --version
If Django is installed, you will get a result with the version number:
5.1.7

My First Project
Once you have come up with a suitable name for your Django project, like mine: my_d1, navigate to where in the file system you want to store the code (in the virtual environment). I will navigate to the myworld folder, and run this command in the command prompt:

django-admin startproject my_d1

Django creates a my_d1 folder on my computer, with this content:

Myworld/ → Virtual Environment
→ my_d1/ → Your Django Project
→ manage.py → 🎛 Project management script
→ my_d1/ → Core project settings folder
→ __init__.py → Marks folder as a Python package
→ settings.py → Main configuration
→ urls.py → URL routing
→ asgi.py → Async server support
→ wsgi.py → Web server support

These are all files and folders with a specific meaning, you will learn about some of
them later in this tutorial, but for now, it is more important to know that this is the
location of your project, and that you can start building applications in it.

Run the Django Project


Now that you have a Django project, you can run it, and see what it looks like in a
browser.
Navigate to the /my_d1 folder and execute this command in the command prompt:
cd my_d1
python manage.py runserver
Which will produce this result:
D:\Python_project>cd my_d1

D:\Python_project\my_d1>python manage.py runserver


Watching for file changes with StatReloader
Performing system checks...

System check identified no issues (0 silenced).

You have 18 unapplied migration(s). Your project may not work properly until
you apply the migrations for app(s): admin, auth, contenttypes, sessions.
Run 'python manage.py migrate' to apply them.
August 23, 2025 - 12:58:13
Django version 5.1.3, using settings 'my_d1.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.

Open a new browser window and type 127.0.0.1:8000 in the address bar.

An app is a web application that has a specific meaning in your project, like a home
page, a contact form, or a members database.
In this tutorial we will create an app that allows us to list and register members in
a database.
But first, let's just create a simple Django app that displays "Hello World!".

I will name my app members.
Start by navigating to the selected location where you want to store the app, in my case the my_app folder, and run the command below.
If the server is still running and you are not able to write commands, press [CTRL] [BREAK] or [CTRL] [C] to stop the server, and you should be back in the virtual environment.

python manage.py startapp members

Django creates a folder named members in my project, with this content:

→ my_app/ → Your Django Project
→ manage.py → 🎛 Project management script

→ my_app/ → Core project settings folder


→ __init__.py → Marks folder as a Python package
→ settings.py → Main configuration file
→ urls.py → URL routing
→ asgi.py → Async server support
→ wsgi.py → Web server support

→ members/ → Django App (handles features like Members)


→ __init__.py → Marks folder as a Python package
→ admin.py → Admin panel registration
→ apps.py → App configuration
→ models.py → Database models
→ tests.py → Testing
→ views.py → Handles requests & responses
→ migrations/ → Database migrations
→ __init__.py → Marks folder as a Python package

These are all files and folders with a specific meaning.


First, take a look at the file called views.py.

This is where we gather the information we need to send back a proper response.

Views

Django views are Python functions that take http requests and return http response,
like HTML documents.
A web page that uses Django is full of views with different tasks and missions.
Views are usually put in a file called views.py located on your app's folder.
There is a views.py in your members folder that looks like this:
my_app/members/views.py:

from django.shortcuts import render

# Create your views here.

Find it and open it, and replace the content with this:
my_app/members/views.py:

from django.shortcuts import render
from django.http import HttpResponse

def members(request):
    return HttpResponse("Hello world!")

URLs

Create a file named urls.py in the same folder as the views.py file, and type this code
in it:
my_app/members/urls.py:

from django.urls import path


from . import views

urlpatterns = [
path('members/', views.members, name='members'),
]

The urls.py file you just created is specific for the members application. We have to
do some routing in the root directory my_app as well. This may seem complicated,
but for now, just follow the instructions below.

There is a file called urls.py in the my_app folder. Open that file and add the include module in the import statement, and also add a path() function in the urlpatterns[] list, with arguments that will route users that come in via 127.0.0.1:8000/.
Then your file will look like this:
my_app/ my_app /urls.py:

from django.contrib import admin


from django.urls import include, path

urlpatterns = [
path('', include('members.urls')),
path('admin/', admin.site.urls),
]

If the server is not running, navigate to the /my_app folder and execute this
command in the command prompt:

python manage.py runserver

In the browser window, type 127.0.0.1:8000/members/ in the address bar.

Templates

In Django, a template is basically an HTML file (with optional Django Template Language (DTL) code) that defines how the front-end (UI) of your web page should look.
Create a templates folder inside the members folder, and create a HTML file
named myfirst.html.
The file structure should be like this:

my_app/
    manage.py
    members/
        templates/
            myfirst.html

Open the HTML file and insert the following:
my_app/members/templates/myfirst.html:

<!DOCTYPE html>
<html>
<body>

<h1>Hello World!</h1>

<p>Welcome to my first Django project!</p>

</body>
</html>

Modify the View

Open the views.py file in the members folder, and replace its content with this:
my_app /members/views.py:

from django.http import HttpResponse


from django.template import loader

def members(request):
    template = loader.get_template('myfirst.html')
    return HttpResponse(template.render())

Change Settings

To be able to work with more complicated stuff than "Hello World!", we have to tell Django that a new app is created.
This is done in the settings.py file in the my_app folder.
Look up the INSTALLED_APPS[] list and add the members app like this:
my_app/my_app/settings.py:

INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'members'
]
Then run this command:

python manage.py migrate

Which will produce this output:

Operations to perform:
Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
Applying contenttypes.0001_initial... OK
Applying auth.0001_initial... OK
Applying admin.0001_initial... OK
Applying admin.0002_logentry_remove_auto_add... OK
Applying admin.0003_logentry_add_action_flag_choices... OK
Applying contenttypes.0002_remove_content_type_name... OK
Applying auth.0002_alter_permission_name_max_length... OK
Applying auth.0003_alter_user_email_max_length... OK
Applying auth.0004_alter_user_username_opts... OK
Applying auth.0005_alter_user_last_login_null... OK
Applying auth.0006_require_contenttypes_0002... OK
Applying auth.0007_alter_validators_add_error_messages... OK
Applying auth.0008_alter_user_username_max_length... OK
Applying auth.0009_alter_user_last_name_max_length... OK
Applying auth.0010_alter_group_name_max_length... OK
Applying auth.0011_update_proxy_permissions... OK
Applying auth.0012_alter_user_first_name_max_length... OK
Applying sessions.0001_initial... OK

(myworld) C:\Users\Your Name\myworld\my_app>

Start the server by navigating to the /my_app folder and execute this command:
python manage.py runserver

In the browser window, type 127.0.0.1:8000/members/ in the address bar.


The result should look like this:

Django Models

A Django model is a table in your database.

Up until now in this tutorial, the output has been static data from Python or HTML templates.
Now we will see how Django allows us to work with data, without having to change or upload files in the process.
In Django, data is handled through objects called Models, which are actually tables in a database.

A Model in Django is a class imported from the django.db library that acts as the
bridge between your database and server. This class is a representation of the data
structure used by your website. It will directly relate this data-structure with the
database. So that you don’t have to learn SQL for the database.

Create Table (Model)


To create a model, navigate to the models.py file in the /members/ folder.
Open it, and add a Member table by creating a Member class, and describe the table
fields in it:

my_app/members/models.py:

from django.db import models

class Member(models.Model):
    firstname = models.CharField(max_length=255)
    lastname = models.CharField(max_length=255)
The first field, firstname, is a Text field, and will contain the first name of the
members.
The second field, lastname, is also a Text field, with the member's last name.
Both firstname and lastname are set up to have a maximum of 255 characters.
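
Note that after defining or changing a model, the change must also be reflected in the database. The usual commands for this (they are also used later in this chapter) are run from the project folder:

python manage.py makemigrations members
python manage.py migrate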

Django Admin

Django Admin is a really great tool in Django, it is actually a CRUD* user interface of
all your models!

*CRUD stands for Create Read Update Delete.


It is free and comes ready-to-use with Django:

To enter the admin user interface, start the server by navigating to the /my_app/ folder and execute this command:
python manage.py runserver

In the browser window, type 127.0.0.1:8000/admin/ in the address bar.

The result should look like this:

The reason why this URL goes to the Django admin log in page can be found in
the urls.py file of your project:

my_app/my_app/urls.py:

from django.contrib import admin


from django.urls import include, path

urlpatterns = [
path('', include('members.urls')),
path('admin/', admin.site.urls),
]

The urlpatterns[] list takes requests going to admin/ and sends them
to admin.site.urls, which is part of a built-in application that comes with Django,
and contains a lot of functionality and user interfaces, one of them being the log-in
user interface.

Django Admin - Create User


Create User
To be able to log into the admin application, we need to create a user.
This is done by typing this command in the command view:

python manage.py createsuperuser

Which will give this prompt:


Username:

Here you must enter: username, e-mail address, (you can just pick a fake e-mail
address), and password:

Username: BCA
Email address: [email protected]
Password:
Password (again):
This password is too short. It must contain at least 8 characters.
This password is too common.
This password is entirely numeric.
Bypass password validation and create user anyway? [y/N]:

My password did not meet the criteria, but this is a test environment, and I choose
to create user anyway, by enter y:

Bypass password validation and create user anyway? [y/N]: y

If you press [Enter], you should have successfully created a user:

Superuser created successfully.

Now start the server again:

python manage.py runserver

In the browser window, type 127.0.0.1:8000/admin/ in the address bar.


And fill in the form with the correct username and password:

Which should result in this user interface:

Here you can create, read, update, and delete groups and users, but where is the
Members model?

The Member model is missing, as it should be: you have to tell Django which models should be visible in the admin interface.

Include Member in the Admin Interface


To include the Member model in the admin interface, we have to tell Django that
this model should be visible in the admin interface.
This is done in a file called admin.py, and is located in your app's folder, which in our
case is the members folder.
Open it, and it should look like this:

my_app/members/admin.py:

from django.contrib import admin

# Register your models here.

Insert a couple of lines here to make the Member model visible in the admin page:

my_app/members/admin.py:
from django.contrib import admin
from .models import Member

# Register your models here.


admin.site.register(Member)

Now go back to the browser (127.0.0.1:8000/admin/) and you should get this result:

Click Members to view and manage the records in the Member table:

This project provides a comprehensive Student Management System, enabling users to create, view, update, and delete (CRUD) student records. Built with Django, it demonstrates the core functionality of a CRUD-based web application while ensuring efficient handling of student data. The system is designed to
streamline the management of student information, making it easier to organize,
maintain, and update student data collections for educational institutions or
training centers.

CRUD Operations In Django

This project is a part of a CRUD (Create, Read, Update, Delete) application for student data. Here's a summary of its key functionality:
• Create: The project enables users to create new student records by providing details such as name, email, age and course.
• Read: Users can view a list of students along with their details. They can also search for students using a search form.
• Update: Users can update existing students by editing their details. This functionality is provided through a form that populates with the student's current details.
• Delete: The project allows users to delete students by clicking a "Delete" button associated with each student entry in the list.

Create Django Project and App

Project Structure

student_mgmt/
│── student_mgmt/
│ ├── settings.py
│ ├── urls.py
│ └── ...
│── students/
│ ├── models.py
│ ├── views.py
│ ├── urls.py
│ ├── forms.py
│ └── templates/
│ └── students/
│ ├── base.html
│ ├── student_list.html
│ ├── student_form.html
│ └── student_confirm_delete.html

⚙ 1. models.py

from django.db import models

class Student(models.Model):
    name = models.CharField(max_length=100)
    email = models.EmailField(unique=True)
    age = models.IntegerField()
    course = models.CharField(max_length=50)

    def __str__(self):
        return self.name

2. forms.py

from django import forms


from .models import Student

class StudentForm(forms.ModelForm):
    class Meta:
        model = Student
        fields = ['name', 'email', 'age', 'course']

3. views.py

from django.shortcuts import render, redirect, get_object_or_404


from .models import Student
from .forms import StudentForm

# List + Search
def student_list(request):
    query = request.GET.get("q")
    if query:
        students = Student.objects.filter(name__icontains=query)
    else:
        students = Student.objects.all()
    return render(request, "students/student_list.html", {"students": students})

# Create
def student_create(request):
    if request.method == "POST":
        form = StudentForm(request.POST)
        if form.is_valid():
            form.save()
            return redirect("student_list")
    else:
        form = StudentForm()
    return render(request, "students/student_form.html", {"form": form})

# Update
def student_update(request, pk):
    student = get_object_or_404(Student, pk=pk)
    if request.method == "POST":
        form = StudentForm(request.POST, instance=student)
        if form.is_valid():
            form.save()
            return redirect("student_list")
    else:
        form = StudentForm(instance=student)
    return render(request, "students/student_form.html", {"form": form})

# Delete
def student_delete(request, pk):
    student = get_object_or_404(Student, pk=pk)
    if request.method == "POST":
        student.delete()
        return redirect("student_list")
    return render(request, "students/student_confirm_delete.html", {"student": student})

🛣 4. students/urls.py

from django.urls import path


from . import views

urlpatterns = [
path('', views.student_list, name='student_list'),
path('new/', views.student_create, name='student_create'),
path('edit/<int:pk>/', views.student_update, name='student_update'),
path('delete/<int:pk>/', views.student_delete, name='student_delete'),
]

5. Main urls.py (student_mgmt/urls.py)

from django.contrib import admin


from django.urls import path, include

urlpatterns = [
path('admin/', admin.site.urls),
path('students/', include('students.urls')),
]

6. Templates
base.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Student Management</title>
<link
href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css"
rel="stylesheet">
</head>
<body class="bg-light">

<div class="container mt-5">


{% block content %}{% endblock %}
</div>

</body>
</html>

student_list.html
{% extends "students/base.html" %}
{% block content %}

<h2 class="text-center text-success mb-4">Student List</h2>

<form method="get" class="d-flex justify-content-center mb-3">


<input type="text" name="q" class="form-control w-50" placeholder="Search">
<button type="submit" class="btn btn-primary ms-2">Search</button>
</form>

<div class="mb-3 text-end">


<a href="{% url 'student_create' %}" class="btn btn-success">+ Add Student</a>
</div>

<table class="table table-bordered table-striped text-center">


<thead class="table-dark">
<tr>
<th>#</th>
<th>Name</th>
<th>Email</th>
<th>Age</th>
<th>Course</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
{% for student in students %}
<tr>
<td>{{ forloop.counter }}</td>
<td>{{ student.name }}</td>
<td>{{ student.email }}</td>
<td>{{ student.age }}</td>
<td>{{ student.course }}</td>
<td>
<a href="{% url 'student_update' student.id %}" class="btn btn-warning btn-sm">Edit</a>
<a href="{% url 'student_delete' student.id %}" class="btn btn-danger btn-sm">Delete</a>
</td>
</tr>
{% empty %}
<tr><td colspan="6">No students found.</td></tr>
{% endfor %}
</tbody>
</table>

{% endblock %}

student_form.html

{% extends "students/base.html" %}
{% block content %}

<div class="card shadow p-4">


<h2 class="text-success text-center">Add / Edit Student</h2>
<form method="post">
{% csrf_token %}
{{ form.as_p }}
<div class="d-grid">

<button type="submit" class="btn btn-success">Save
Student</button>
</div>
</form>
</div>

{% endblock %}

student_confirm_delete.html

{% extends "students/base.html" %}
{% block content %}

<div class="card shadow p-4 text-center">


<h2 class="text-danger">Are you sure?</h2>
<p>Do you really want to delete <b>{{ student.name }}</b>?</p>
<form method="post">
{% csrf_token %}
<button type="submit" class="btn btn-danger">Yes,
delete</button>
<a href="{% url 'student_list' %}" class="btn btn-secondary">Cancel</a>
</form>
</div>

{% endblock %}

Create and Run Migrations


Run these commands to apply the migrations:
python manage.py makemigrations
python manage.py migrate

Run the Development Server


Run the server with the help of following command:
python manage.py runserver

Output:
