Mastering Data Science
A Guide to Python's Key Libraries

Author
Khushbu N. Sama
Assistant Professor
Noble University, Junagadh
2025
Preface
In today’s digital age, data has become the most valuable resource, shaping decisions
across business, healthcare, education, governance, and beyond. The ability to
collect, process, and interpret data effectively defines the success of individuals and
organizations alike. This is where Data Science emerges—not just as a discipline, but
as a way of thinking, combining statistical reasoning, computational efficiency, and
domain expertise to unlock actionable insights.
Python, with its simplicity and versatility, has rapidly established itself as the
language of choice for data science. From data manipulation with NumPy and
Pandas, to visualization with Matplotlib and Seaborn, to machine learning with
scikit-learn, Python provides an integrated ecosystem that empowers learners and
professionals to translate concepts into solutions.
This book, Mastering Data Science: A Guide to Python, is crafted to offer readers a
clear and practical roadmap for their data science journey. Rather than presenting
Python as a collection of disjointed tools, it introduces Python as a cohesive platform
for solving real-world problems—covering data preprocessing, exploratory analysis,
model building, and deployment.
The guiding principle of this work lies in its focus on mastery through practice. Each
chapter builds upon the last, combining theoretical clarity with hands-on exercises,
case studies, and visual explanations. The goal is not only to teach how to use Python
for data science, but also to develop the why—the reasoning that leads to
meaningful, reliable, and scalable insights.
This book is intended for students beginning their exploration of data science, as well
as professionals aiming to deepen their skills with structured, best-practice
approaches. By blending foundational knowledge with advanced applications, it
equips readers with both confidence and competence to tackle data-driven
challenges in diverse domains.
1 Data Science Fundamentals with Python
Introduction
Data science is the process of using data to find solutions and to predict outcomes
for a problem statement.
Data science can be defined as a blend of mathematics, business acumen, tools,
algorithms, and machine learning techniques, all of which help us find hidden
insights or patterns in raw data that can be of major use in making big business
decisions. It is used in many industries these days, ranging from entertainment
to education.
Data science is an interdisciplinary field that utilizes scientific methods,
processes, algorithms, and systems to extract knowledge and insights from
structured, semi-structured, and unstructured data. Python, due to its simplicity,
extensive libraries, and strong community support, has emerged as a leading
programming language for data science.
• Data structures: Familiarize yourself with dictionaries, sets, lists, and tuples,
understanding when to apply each.
• Data cleaning: The process of preparing raw data for analysis by handling
missing values, correcting errors, and removing duplicates, ensuring accurate
and reliable results.
• Machine learning: Developing algorithms and models that learn from data
and make predictions or decisions without explicit programming.
4. Learning and development
Here's an overview of the key aspects of the data science process, the skills
required for data scientists, the applications of data science, and the associated
challenges.
Technical skills
Soft skills
• Retail and E-commerce: Customer segmentation, personalized
recommendations, inventory optimization, and sales forecasting.
• Manufacturing and Logistics: Predictive maintenance, supply chain
optimization, and demand forecasting.
• Marketing and Advertising: Targeted campaigns, customer analytics, and
advertising spend optimization.
The data science lifecycle is a structured, iterative process that guides data science
projects from their initial conception to the deployment and ongoing maintenance
of a solution. It ensures that data-driven insights are effectively leveraged to solve
business problems and drive value for organizations.
1. Problem Statement: Clearly defining the business objective or question the
project aims to solve, establishing the foundation and success metrics.
This is the foundation of any data science project. It defines what problem you are
solving and why it matters, whether from a business, academic, or technical
perspective.
Examples:
Tools/Skills:
• Domain understanding
• Discussions with stakeholders
• Framing the problem into a machine learning task: classification,
regression, clustering, etc.
2. Data Cleaning: Refining raw data by identifying and addressing errors,
inconsistencies, or missing values to ensure its quality and suitability for analysis
and modeling.
Gathering raw data from various sources. This could include structured data
(CSV, databases), unstructured data (text, images), or real-time data (APIs, web
scraping).
In Python, for example, data can be collected and prepared with Pandas. Raw
data is often messy; you need to clean and prepare it for modeling.
Example:
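A minimal sketch (the file and column names are assumed):
import pandas as pd

df = pd.read_csv('sales_data.csv')   # collect: load raw data from a CSV file
df = df.drop_duplicates()            # clean: remove duplicate rows
df['price'] = df['price'].fillna(df['price'].median())  # clean: fill missing values
print(df.head())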
Jupyter Notebook
Jupyter Notebook is an interactive web application that allows you to write and run code. The
independent processes launched by the notebook web application are known
as kernels, and they are used to execute user code in the specified language and
return results to the notebook web application.
Jupyter Notebook combines three components:
1. The notebook web application
2. Kernels
3. Notebook documents
Installation:
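pip install notebook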
Basic Usage:
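jupyter notebook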
This will open a new tab in your web browser, displaying the Jupyter Notebook
interface. You can then navigate to a directory and create a new notebook or
open an existing one.
Markdown Cells:
Code Cells:
Code cells are where you write and execute your programming code. The default
kernel for Jupyter Notebook runs Python code. You can write your Python code
in a code cell and execute it by pressing Shift + Enter or clicking the "Run" button
in the toolbar. The output of the code will be displayed directly below the cell.
A Raw NBConvert cell is a special type of cell that is not meant to be rendered
or executed within the Jupyter Notebook interface itself. Instead, it's used to
provide raw output that will be processed by nbconvert, the tool used to convert
Jupyter Notebooks into other formats like HTML, PDF, or LaTeX.
Heading Cell: The header cell is not supported by the Jupyter Notebook. The
panel displayed in the screenshot below will pop open when you choose the
heading from the drop-down menu.
Jupyter Notebook has many useful keyboard shortcuts that will speed up your
workflow. You can view all of them by going to Help > Keyboard Shortcuts in the
menu. Here are a few to get you started:
Homepage.
Browse to the folder in which you would like to create your first notebook, click
the “New” drop-down button in the top-right and select “Python 3(ipykernel)”:
Your first Jupyter Notebook will open in a new tab — each notebook uses its own
tab because you can open multiple notebooks simultaneously.
If you switch back to the dashboard, you will see the new file Untitled.ipynb, and
you should see some green text that tells you your notebook is running. Each
.ipynb file is one notebook, so each time you create a new notebook, a new
.ipynb file will be created.
Naming
You will notice that at the top of the page is the word Untitled. This is the title
for the page and the name of your Notebook. Since that isn’t a very descriptive
name, let’s change it!
Just move your mouse over the word Untitled and click on the text. You should
now see an in-browser dialog titled Rename Notebook. Let’s rename this one
to Hello Jupyter:
Running Cells
A Notebook’s cell defaults to using code whenever you first create one, and that
cell uses the kernel that you chose when you started your Notebook.
In this case, you started yours with Python 3 as your kernel, so that means you
can write Python code in your code cells. Since your initial Notebook has only
one empty cell in it, the Notebook can’t really do anything.
Thus, to verify that everything is working as it should, you can add some Python
code to the cell and try running its contents.
Running a cell means that you will execute the cell's contents. To execute a cell,
you can just select the cell and click the Run button that is in the row of buttons
along the top. It's towards the middle. If you prefer using your keyboard, you
can just press Shift + Enter.
If you have multiple cells in your Notebook, and you run the cells in order, you
can share your variables and imports across cells. This makes it easy to separate
out your code into logical chunks without needing to reimport libraries or
recreate variables or functions in every cell.
The Jupyter Notebook interface has a standard menu bar at the top, which
provides access to all the functions for managing, editing, and running your
notebook.
1. File Menu
2. Edit Menu
This menu contains functions for manipulating the content of the cells.
• Cut, Copy, Paste Cells: Standard actions for moving and duplicating cells.
• Delete Cells: Removes the selected cell(s).
• Split Cell: Splits a cell into two at the cursor's position.
• Merge Cell Above/Below: Combines the selected cell with the one above or
below it.
3. View Menu
This menu controls how the notebook and its cells are displayed.
4. Insert Menu
• Insert Cell Above: Creates a new empty cell above the currently selected cell.
• Insert Cell Below: Creates a new empty cell below the currently selected cell.
5. Cell Menu
6. Kernel Menu
The kernel is the computational engine that runs your code. This menu lets you
manage it.
• Interrupt: Stops the execution of the current cell. This is useful if a cell is taking
too long to run or is stuck in an infinite loop.
• Restart: Restarts the kernel. This clears all variables and memory, giving you a
fresh start without closing the notebook.
• Restart & Clear Output: Restarts the kernel and also clears all the output from
every cell in the notebook.
• Restart & Run All: Restarts the kernel and then runs every cell in the notebook
from top to bottom.
• Shutdown: Stops the kernel completely, freeing up resources. The notebook
itself remains open, but you can't run any code.
• Change Kernel: Allows you to switch to a different kernel if you have others
installed (e.g., R, Julia, or another Python environment).
7. Widgets Menu
This menu is for managing interactive widgets, which are special objects that can
be used to create interactive controls in a notebook.
• Save Notebook Widget State: Saves the current state of all widgets.
• Clear Widget State: Resets all widgets to their default state.
8. Help Menu
2 Data Collection & Data Manipulation
Introduction
Data collection is the process of gathering raw data from various sources for
further processing and analysis. In Python, data can be collected manually or
automatically using libraries and tools.
Files: CSV, Excel, JSON, and text files are readily handled by Python, especially
with the Pandas library.
Data manipulation
Reading and Writing Data: Pandas simplifies importing data from various formats
(CSV, Excel) and exporting processed data.
Exploring Data: Functions like head(), tail(), info(), and describe() allow for initial
data inspection and summarizing key characteristics.
Data Selection and Filtering: Data can be selected and filtered based on labels
(.loc), integer positions (.iloc), or boolean conditions.
Data Aggregation and Grouping: The groupby() function allows grouping data
based on certain criteria and applying aggregate functions
(e.g., mean, sum, count, min, max) to the groups.
Merging and Joining DataFrames: merge() and join() are used to combine data
from multiple DataFrames based on shared columns or indices.
Introduction to NumPy Arrays
The most important object defined in NumPy is an N-dimensional array type
called ndarray.
It describes a collection of items of the same type; items in the collection can
be accessed using a zero-based index. The main data structure in the NumPy
library is the NumPy array, which is an extremely fast and memory-efficient data
structure. The NumPy array is much faster than the common Python list and
provides vectorized matrix operations. In this chapter, you will see the different
data types that you can store in a NumPy array, the different ways to create the
NumPy arrays, how you can access items in a NumPy array, and how to add or
remove items from a NumPy array.
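For instance, assuming a simple array of six integers:
import numpy as np
my_array = np.array([1, 2, 3, 4, 5, 6])
print(my_array)
print(my_array.dtype)     # int32 (on Windows; int64 on many other platforms)
print(my_array.itemsize)  # 4 bytes for int32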
The script above defines a NumPy array with six integers. Next, the array type is
displayed via the dtype attribute. Finally, the size of each item in the array (in
bytes) is displayed via the itemsize attribute. The output below prints the array
and the type of the items in the array, i.e., int32 (integer type), followed by the
size of each item in the array, which is 4 bytes (32 bits).
The Python NumPy library supports the following data types including the default
Python types.
• i – integer
• b – boolean
• u – unsigned integer
• f – float
• c – complex float
• m – timedelta
• M – datetime
• O – object
• S – string
• U – Unicode string
Let’s see another example of how Python stores text. The following script creates
a NumPy array with three text items and displays the data type and size of each
item.
Script 2:
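A sketch with three assumed text items:
import numpy as np
colors = np.array(["Red", "Green", "Orange"])
print(colors)
print(colors.dtype)     # <U6: the longest item has six characters
print(colors.itemsize)  # 24 (4 bytes per Unicode character)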
The output below shows that NumPy stores text in the form of the Unicode string
data type, denoted by U. Here, the digit 6 represents the number of characters in
the longest item.
Though the NumPy array is intelligent enough to guess the data type of items
stored in it, this is not always the case. For instance, in the following script, you
store some dates in a NumPy array. Since the dates are stored in the form of texts
(enclosed in double quotations), by default, the NumPy array treats the dates as
text. Hence, if you print the data type of the items stored, you will see that it will
be a Unicode string (U10).
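For instance, with assumed date strings:
import numpy as np
dates = np.array(["2023-01-01", "2023-01-02", "2023-01-03"])
print(dates.dtype)  # <U10: the dates are stored as text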
You can convert data types in the NumPy array to other data types via the astype()
method. But first, you need to specify the target data type in the astype() method.
For instance, the following script converts the array you created in the previous
script to the datetime data type. You can see that “M” is passed as a parameter
value to the astype() function. “M” stands for the datetime data type as
aforementioned.
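Continuing the previous example:
converted = dates.astype('M')  # 'M' stands for the datetime data type
print(converted.dtype)         # datetime64[D]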
In addition to converting arrays from one type to another, you can also specify
the data type for a NumPy array at the time of definition via the dtype parameter.
For instance, in the following script, you specify "M" as the value for the dtype
parameter, which tells the Python interpreter that the items must be stored as
datetime values.
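For instance:
dates2 = np.array(["2023-01-01", "2023-01-02"], dtype='M')
print(dates2.dtype)  # datetime64[D]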
Depending on the type of data you need inside your NumPy array, different
methods can be used to create a NumPy array.
To create a NumPy array, you can pass a list to the array() method of the NumPy
module, as shown below:
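For example, with an assumed list of values:
import numpy as np
my_array = np.array([10, 20, 30, 40])
print(my_array)  # [10 20 30 40]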
You can also create a multi-dimensional NumPy array. To do so, you need to
create a list of lists where each internal list corresponds to the row in a two
dimensional array. Here is an example of how to create a two-dimensional array
using the array () method.
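For example:
import numpy as np
two_d = np.array([[1, 2, 3], [4, 5, 6]])  # each inner list is one row
print(two_d)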
With the arange() method, you can create a NumPy array that contains a range
of integers. The first parameter to the arange() method is the lower bound, and
the second parameter is the upper bound. The lower bound is included in the
array. However, the upper bound is not included. The following script creates a
NumPy array with integers 5 to 10.
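A sketch of the described call:
import numpy as np
print(np.arange(5, 11))  # [ 5  6  7  8  9 10]; the upper bound 11 is excluded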
You can also specify the step as a third parameter in the arange() function. A step
defines the distance between two consecutive points in the array. The following
script creates a NumPy array from 5 to 11 with a step size of 2.
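For example:
print(np.arange(5, 12, 2))  # [ 5  7  9 11]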
The ones() method can be used to create a NumPy array of all ones. Here is an
example.
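For example:
import numpy as np
print(np.ones(5))  # [1. 1. 1. 1. 1.]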
You can create a two-dimensional array of all ones by passing the number of rows
and columns as the first and second parameters of the ones() method, as shown
below:
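For example:
print(np.ones((3, 4)))  # three rows and four columns of ones; the shape is passed as a tuple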
The zeros() method can be used to create a NumPy array of all zeros. Here is an
example.
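For example:
import numpy as np
print(np.zeros(5))  # [0. 0. 0. 0. 0.]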
You can create a two-dimensional array of all zeros by passing the number of rows
and columns as the first and second parameters of the zeros() method, as shown
below:
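For example:
print(np.zeros((3, 3)))  # a 3 x 3 array of zeros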
The eye() method is used to create an identity matrix in the form of a two
dimensional NumPy array. An identity matrix contains 1s along the diagonal, while
the rest of the elements are 0 in the array.
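For example:
import numpy as np
print(np.eye(3))  # a 3 x 3 identity matrix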
Using Random Method
The random.rand() function from the NumPy module can be used to create a
NumPy array with uniform distribution.
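For example:
import numpy as np
print(np.random.rand(2, 3))  # a 2 x 3 array of uniform samples from [0, 1)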
The random.randn() function from the NumPy module can be used to create a
NumPy array with normal distribution, as shown in the following example.
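Continuing:
print(np.random.randn(2, 3))  # a 2 x 3 array of standard-normal samples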
Finally, the random.randint() function from the NumPy module can be used to
create a NumPy array with random integers between a certain range. The first
parameter to the randint() function specifies the lower bound, the second
parameter specifies the upper bound, and the last parameter specifies the
number of random integers to generate between the range. The following
example generates five random integers between 5 and 50.
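For example:
print(np.random.randint(5, 50, 5))  # five random integers; the upper bound 50 is excluded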
Depending on the dimensions, there are various ways to display the NumPy
arrays. The simplest way to print a NumPy array is to pass the array to the print
method, as you have already seen in the previous section. An example is given
below:
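For instance, reusing the six-integer array from before:
import numpy as np
my_array = np.array([1, 2, 3, 4, 5, 6])
print(my_array)  # [1 2 3 4 5 6]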
You can also use loops to display items in a NumPy array. It is a good idea to know
the dimensions of a NumPy array before printing the array on the console. To see
the dimensions of a NumPy array, you can use the ndim attribute, which prints
the number of dimensions for a NumPy array. To see the shape of your NumPy
array, you can use the shape attribute.
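For example:
print(my_array.ndim)   # 1
print(my_array.shape)  # (6,)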
The script shows that our array is one-dimensional. The shape is (6,), which means
our array is a vector with 6 items.
To print items in a one-dimensional NumPy array, you can use a single foreach
loop, as shown below:
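For example:
for item in my_array:
    print(item)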
Now, let’s see another example of how you can use the foreach loop to print items in
a two-dimensional NumPy array. The following script creates a two-dimensional
NumPy array with four rows and five columns. The array contains random integers
between 1 and 10. The array is then printed on the console
In the output below, you can see your newly created array.
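A sketch matching the description (the output varies because the values are random):
import numpy as np
two_d = np.random.randint(1, 11, size=(4, 5))  # integers 1-10, four rows, five columns
print(two_d)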
Let’s now try to see the number of dimensions and shape of our NumPy array.
The output below shows that our array has two dimensions and the shape of the
array is (4,5), which refers to four rows and five columns.
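For example:
print(two_d.ndim)   # 2
print(two_d.shape)  # (4, 5)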
The output shows all the rows from our two-dimensional NumPy array.
To traverse through all the items in the two-dimensional array, you can use a
nested for-each loop as follows:
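For example:
for row in two_d:
    for item in row:
        print(item)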
In the next section, you will see how to add, remove, and sort elements in a
NumPy array.
To add the items into a NumPy array, you can use the append() method from the
NumPy module. First, you need to pass the original array and the item that you
want to append to the array to the append() method. The append() method
returns a new array that contains newly added items appended to the end of the
original array. The following script adds a text item “Yellow” to an existing array
with three items.
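A sketch with assumed colour values:
import numpy as np
my_array = np.array(["Red", "Green", "Blue"])
extended = np.append(my_array, "Yellow")
print(extended)  # ['Red' 'Green' 'Blue' 'Yellow']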
In addition to adding one item at a time, you can also append an array of items to
an existing array. The method remains similar to appending a single item. You just
have to pass the existing array and the new array to the append () method, which
returns a concatenated array where items from the new array are appended at
the end of the original array.
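For instance (the appended items are assumed):
extended2 = np.append(my_array, ["Orange", "Pink"])
print(extended2)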
To add items in a two-dimensional NumPy array, you have to specify whether you
want to add the new item as a row or as a column. To do so, you can take the help
of the axis attribute of the append method.
The output shows all the rows from our two-dimensional NumPy array.
To add a new row to the above 3 x 3 array, you need to pass the original array and
the new array (in the form of a row vector), along with the axis attribute, to the
append() method. To add the new array as a row, you need to set 0 as the value
for the axis attribute.
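For example, with a 3 x 3 array of zeros (the row values are assumed):
import numpy as np
zeros_arr = np.zeros((3, 3))
with_row = np.append(zeros_arr, [[1, 2, 3]], axis=0)
print(with_row)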
In the output below, you can see that a new row has been appended to our
original 3 x 3 array of all zeros.
To append a new array as a column in the existing 2-D array, you need to set the
value of the axis attribute to 1.
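with_col = np.append(zeros_arr, [[1], [2], [3]], axis=1)  # column values assumed
print(with_col)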
To delete an item from an array, you may use the delete() method. You need to
pass the existing array and the index of the item to be deleted to the delete()
method. The following script deletes an item at index 1 (second item) from the
my_array array.
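For example:
import numpy as np
my_array = np.array(["Red", "Green", "Blue"])
print(np.delete(my_array, 1))  # ['Red' 'Blue']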
The output shows that the item at index 1, i.e., “Green,” is deleted.
If you want to delete multiple items from an array, you can pass the item indexes
in the form of a list to the delete() method. For example, the following script
deletes the items at index 1 and 2 from the NumPy array named my_array.
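For example:
print(np.delete(my_array, [1, 2]))  # ['Red']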
You can delete a row or column from a 2-D array using the delete method.
However, just as you did with the append() method for adding items, you need to
specify whether you want to delete a row or column using the axis attribute. The
following script creates an integer array with four rows and five columns. Next,
the delete() method is used to delete the row at index 1 (second row). Notice here
that to delete a row, the value of the axis attribute is set to 0.
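A sketch with random values (the output varies):
import numpy as np
int_arr = np.random.randint(1, 11, size=(4, 5))  # four rows, five columns
print(int_arr)
print(np.delete(int_arr, 1, axis=0))  # delete the row at index 1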
The output shows that the second row is deleted from the input 2-D array.
Finally, to delete a column, you can set the value of the axis attribute to 1, as
shown below:
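Reusing the array from the row example:
print(np.delete(int_arr, 1, axis=1))  # delete the column at index 1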
Aggregations
The min() function will return the minimum value from the ndarray. There are two
ways in which we can use the min() function; examples of both are given below.
import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Min way1 = ',a.min())
print('Min way2 = ',np.min(a))
Output:
Min way1 = 1
Min way2 = 1
The max() function will return the maximum value from the ndarray. There are
likewise two ways in which we can use the max() function; examples of both are
given below.
import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Max way1 = ',a.max())
print('Max way2 = ',np.max(a))
Output:
Max way1 = 11
Max way2 = 11
NumPy supports many aggregation functions, such as min, max, argmin, argmax,
sum, mean, std, etc.
import numpy as np
l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
a = np.array(l)
print('Min = ',a.min())
print('ArgMin = ',a.argmin())
print('Max = ',a.max())
print('ArgMax = ',a.argmax())
print('Sum = ',a.sum())
print('Mean = ',a.mean())
print('Std = ',a.std())
Output :
Min = 1
ArgMin = 3
Max = 11
ArgMax = 8
Sum = 122
Mean = 5.304347826086956
Std = 3.042235771223635
import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum = ',array2d.sum())
Output:
sum = 45
If we want the sum of rows or columns, we can use the axis argument with the
aggregate functions.
import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum (cols)= ',array2d.sum(axis=0)) #Vertical
print('sum (rows)= ',array2d.sum(axis=1)) #Horizontal
Output:
sum (cols)=  [12 15 18]
sum (rows)=  [ 6 15 24]
There are two ways in which you can access an element of a multi-dimensional
array; examples of both methods are given below.
import numpy as np
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print('double = ',arr[2][1]) # double bracket notation
print('single = ',arr[2,1]) # single bracket notation
Output:
double = h
single = h
Both methods are valid and provide exactly the same answer, but single-bracket
notation is recommended: double-bracket notation first creates a temporary
sub-array of the third row and then fetches the second column from it.
Single-bracket notation is also easier to read and write while programming.
Slicing ndarray
Slicing in Python means taking elements from one given index to another given
index.
Similar to a Python list, we can use the same syntax array[start:end:step] to slice an ndarray.
Default start is 0
Default end is the length of the array
Default step is 1
import numpy as np
arr = np.array(['a','b','c','d','e','f','g','h'])
print(arr[2:5])
print(arr[:5])
print(arr[5:])
print(arr[2:7:2])
print(arr[::-1])
Output:
['c' 'd' 'e']
['a' 'b' 'c' 'd' 'e']
['f' 'g' 'h']
['c' 'e' 'g']
['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']
[Worked example: a two-dimensional array a with rows R-1 to R-4, showing the
sub-arrays selected by a[2][3], a[2,3], a[2], a[0:2], a[0:2:2], a[::-1],
a[1:3,1:3], a[3:,:3], and a[:,::-1].]
Slicing a multi-dimensional array works the same way as for a single-dimensional
array, using the single-bracket notation we learned earlier. Let's see an example.
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print(arr[0:2 , 0:2]) #first two rows and cols
print(arr[::-1]) #reversed rows
print(arr[: , ::-1]) #reversed cols
print(arr[::-1,::-1]) #complete reverse
Output:
[['a' 'b']
['d' 'e']]
[['g' 'h' 'i']
['d' 'e' 'f']
['a' 'b' 'c']]
[['c' 'b' 'a']
['f' 'e' 'd']
['i' 'h' 'g']]
[['i' 'h' 'g']
['f' 'e' 'd']
['c' 'b' 'a']]
When we slice an array and apply an operation to the slice, the changes also appear
in the original array, because slicing does not create a copy of the array.
import numpy as np
arr = np.array([1,2,3,4,5])
arrsliced = arr[0:3]
arrsliced[:] = 2 # Broadcasting
print('Original Array = ', arr)
print('Sliced Array = ',arrsliced)
Output:
Original Array = [2 2 2 4 5]
Sliced Array = [2 2 2]
NumPy Arithmetic Operations:
import numpy as np
arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])
arradd1 = arr1 + 2 # addition of matrix with scalar
arradd2 = arr1 + arr2 # addition of two matrices
print('Addition Scalar = ', arradd1)
print('Addition Matrix = ', arradd2)
arrsub1 = arr1 - 2 # subtraction of matrix with scalar
arrsub2 = arr1 - arr2 # subtraction of two matrices
print('Substraction Scalar = ', arrsub1)
print('Substraction Matrix = ', arrsub2)
arrdiv1 = arr1 / 2 # division of matrix by scalar
arrdiv2 = arr1 / arr2 # division of two matrices
print('Division Scalar = ', arrdiv1)
print('Division Matrix = ', arrdiv2)
Output:
Addition Scalar = [[3 4 5]
[3 4 5]
[3 4 5]]
Addition Matrix = [[5 7 9]
[5 7 9]
[5 7 9]]
Substraction Scalar = [[-1 0 1]
[-1 0 1]
[-1 0 1]]
Substraction Matrix = [[-3 -3 -3]
[-3 -3 -3]
[-3 -3 -3]]
Division Scalar = [[0.5 1. 1.5]
[0.5 1. 1.5]
[0.5 1. 1.5]]
Division Matrix = [[0.25 0.4 0.5 ]
[0.25 0.4 0.5 ]
[0.25 0.4 0.5 ]]
import numpy as np
arrmul1 = arr1 * 2 # multiply matrix with scalar
arrmul2 = arr1 * arr2 # element-wise multiplication of two matrices
print('Multiply Scalar = ', arrmul1)
# Note: this is not matrix multiplication
print('Multiply Matrix = ', arrmul2)
# In order to do matrix multiplication
arrmatmul = np.matmul(arr1,arr2)
print('Matrix Multiplication = ',arrmatmul)
# OR
arrdot = arr1.dot(arr2)
print('Dot = ',arrdot)
# OR
arrpy3dot5plus = arr1 @ arr2
print('Python 3.5+ support = ',arrpy3dot5plus)
Output:
Multiply Scalar = [[2 4 6]
 [2 4 6]
 [2 4 6]]
Multiply Matrix = [[ 4 10 18]
 [ 4 10 18]
 [ 4 10 18]]
Matrix Multiplication = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Dot = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Python 3.5+ support = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Sorting Array
import numpy as np
# arr = our ndarray
np.sort(arr,axis,kind,order)
# OR arr.sort()
import numpy as np
arr = np.array(['Daxa','Junagadh','Institute','of','BCA'])
print("Before Sorting = ", arr)
arr.sort() # or np.sort(arr)
print("After Sorting = ", arr)
Output:
Before Sorting =  ['Daxa' 'Junagadh' 'Institute' 'of' 'BCA']
After Sorting =  ['BCA' 'Daxa' 'Institute' 'Junagadh' 'of']
import numpy as np
arr = np.random.randint(1,100,10)
print(arr)
boolArr = arr > 50
print(boolArr)
Output:
[25 17 24 15 17 97 42 10 67 22]
[False False False False False True False False
True False]
import numpy as np
arr = np.random.randint(1,100,10)
print("All = ",arr)
boolArr = arr > 50
print("Filtered = ", arr[boolArr])
Output:
What Is pandas?
pandas is an open-source software library built on Python for data analysis and
data manipulation. The pandas library provides data structures designed
specifically to handle tabular datasets with a simplified Python API. pandas is an
extension of Python to process and manipulate tabular data, implementing
operations such as loading, aligning, merging, and transforming datasets
efficiently.
With its support for structured data formats like tables, matrices, and time series,
the pandas Python API provides tools to process messy or raw datasets into clean,
structured formats ready for analysis. To achieve high performance,
computationally intensive operations are implemented using C or Cython in the
back-end source code. The pandas library is inherently not multi-threaded, which
can limit its ability to take advantage of modern multi-core platforms and process
large datasets efficiently. However, new libraries and extensions in the Python
ecosystem can help address this limitation.
The pandas library integrates with other scientific tools within the broader Python
data analysis ecosystem.
At the core of the pandas open-source library is the DataFrame data structure for
handling tabular and statistical data. A pandas DataFrame is a two-dimensional,
array-like table where each column represents values of a specific variable, and
each row contains a set of values corresponding to those variables. The data
stored in a DataFrame can encompass numeric, categorical, or textual types,
enabling pandas to manipulate and process diverse datasets.
pandas facilitates importing and exporting datasets from various file formats,
such as CSV, SQL, and spreadsheets. These operations, combined with its data
manipulation capabilities, enable pandas to clean, shape, and analyze tabular and
statistical data.
Pandas allows for importing and exporting tabular data in various formats, such
as CSV, SQL, and spreadsheet files.
pandas also allows for various data manipulation operations and data cleaning
features, including selecting a subset, creating derived columns, sorting, joining,
filling, replacing, summary statistics, and plotting.
According to organizers of the Python Package Index—a repository of software for
the Python programming language—pandas is well suited for working with several
kinds of data, including:
• Any other form of observational/statistical datasets. The data actually need not
be labeled at all to be placed into a pandas data structure.
What Are the Benefits of pandas?
The pandas library offers numerous benefits to data scientists and developers,
making it a valuable tool for data analysis and manipulation. Key benefits include:
I/O tools: pandas supports importing and exporting data in various formats, such
as CSV, Excel, SQL, and HDF5.
Flexible reshaping and pivoting: pandas simplifies reshaping and pivoting to single
function calls on datasets to further prepare them for analysis or visualization.
How to Get Started With pandas?
Install:
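pip install pandas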
There are 3 data structures provided by the Pandas module, which are as follows:
Series
import pandas as pd
s = pd.Series(data,index,dtype,copy=False)
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)
Output:
0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
We can then access the elements inside a Series just like an array, using
square-bracket notation.
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)
Output:
S[0] = 1
Sum = 4
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11], dtype='str')
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)
Output
S[0] = 1
Sum = 13
import numpy as np
import pandas as pd
i = ['name','address','phone','email','website']
d = ['noble','jnd','123','nu.com','noble.ac.in']
s = pd.Series(data=d,index=i)
print(s)
Output
name noble
address jnd
phone 123
email nu.com
website noble.ac.in
dtype: object
Creating Time Series
We can use some of pandas' inbuilt date functions to create a time series.
import numpy as np
import pandas as pd
dates = pd.to_datetime("27th of July, 2020")
i = dates + pd.to_timedelta(np.arange(5), unit='D')
d = [50,53,25,70,60]
time_series = pd.Series(data=d,index=i)
print(time_series)
Output:
2020-07-27 50
2020-07-28 53
2020-07-29 25
2020-07-30 70
2020-07-31 60
dtype: int64
Pandas DataFrame
DataFrame is the most important and widely used data structure and is a standard
way to store data. DataFrame has data aligned in rows and columns like the SQL
table or a spreadsheet database. We can either hard code data into a DataFrame
or import a CSV file, tsv file, Excel file, SQL table, etc. We can use the below
constructor for creating a DataFrame object.
DataFrames are two-dimensional data structures, i.e., data is aligned in a tabular
format in rows and columns.
• data - create a DataFrame object from the input data. It can be list, dict, series,
Numpy ndarrays or even, any other DataFrame.
• dtype - used to specify the data type of each column, optional parameter
There are many ways to create a DataFrame. We can create DataFrame object
from Dictionaries or list of dictionaries. We can also create it from a list of tuples,
CSV, Excel file, etc. Let's run a simple piece of code that creates a DataFrame from
a NumPy array.
Example:
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, np.arange(101,106,1), ['PDS','Algo','SE','INS'])
print(df)
Output:
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
103 35 34 6 89
104 66 83 70 50
105 65 88 87 87
Grabbing the column:
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, np.arange(101,106,1), ['PDS','Algo','SE','INS'])
print(df['PDS'])
Output:
101 0
102 85
103 35
104 66
105 65
Name: PDS, dtype: int32
Grabbing multiple columns
print(df[['PDS', 'SE']])
Output:
PDS SE
101 0 93
102 85 31
103 35 6
104 66 70
105 65 87
Grabbing a row
print(df.loc[101]) # using labels
#OR
print(df.iloc[0]) # using zero based index
Output:
PDS 0
Algo 23
SE 93
INS 46
Name: 101, dtype: int32
Deleting Row
df.drop(103, inplace=True)
print(df)
Output:
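(The 'total' column dropped below is assumed to have been created earlier, for
example:)
df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']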
df.drop('total',axis=1,inplace=True)
print(df)
Output
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
103 35 34 6 89
104 66 83 70 50
105 65 88 87 87
Getting Subset of Data Frame
print(df.loc[[101,104], ['PDS','INS']])
Output:
PDS INS
101 0 46
104 66 50
Output:
PDS SE INS
101 0 93 46
102 85 31 12
103 35 6 89
104 66 70 50
105 65 87 87
Conditional Selection:
import numpy as np
import pandas as pd
np.random.seed(121)
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, np.arange(101,106,1), ['PDS','Algo','SE','INS'])
print(df)
print(df>50)
Output:
PDS Algo SE INS
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
101 True True False True
102 True True True True
103 False False True True
104 True False True True
105 True True True False
Note: we have used the np.random.seed() method with the seed set to 121, so
that the random numbers you generate match the ones shown here.
dfBool = df > 50
print(df[dfBool])
Output:
Setting/Resetting index
In our previous example, we saw that our index does not have a name. If we want
to give our index a name, we can set it using the DataFrame.index.name property.
df.index.name = 'RollNo'
Output:
set_index(new_index)
df.set_index('PDS') # inplace=True
To undo it, we can call df.reset_index():
Output:
   RollNo PDS Algo SE INS
0     101  66   85  8  95
1     102  65   52 83  96
2     103  46   34 52  60
3     104  54    3 94  52
4     105  57   75 88  39
Note: Our RollNo (the old index) becomes a new column, and a default integer
index is created.
Multi-Index DataFrame
Hierarchical indexes (AKA multiindexes) help us to organize, find, and aggregate
information faster at almost no cost.
Output:
RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
Xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99
Creating multi-indexes is as simple as creating a single index using the set_index()
method; the only difference is that for multi-indexes we need to provide a list of
indexes instead of a single string index. Let's see an example.
dfMulti = pd.read_csv('MultiIndexDemo.csv')
dfMulti.set_index(['Col','Dep','Sem'],inplace=True)
print(dfMulti)
Output:
RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99
Now we have a multi-indexed DataFrame from which we can access data using
multiple indexes.
For example:
print(dfMulti.loc['xyz'])
Output:
RN S1 S2 S3
Dep Sem
CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99
Sub DataFrame for Computer Engineering
print(dfMulti.loc['xyz','CE'])
OutPut:
RN S1 S2 S3
Sem
5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0,1,2])
#for multi-index in cols we can use header parameter
#for multi-index in cols we can use header parameter
print(dfMultiCSV)
Output:
RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
Xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99
The xs() method returns a cross-section from a DataFrame with a multi-index; its
key parameter is the label to select, and the level parameter names the index level
to search in.
Syntax:
dfMultiCSV = pd.read_csv('MultiIndexDemo.csv',
index_col=[0,1,2])
print(dfMultiCSV)
print(dfMultiCSV.xs('CE',axis=0,level='Dep'))
Output:
RN S1 S2 S3
Col Dep Sem
ABC CE 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
ME 5 101 30 35 39
5 102 50 90 48
xyz CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99
Output:
RN S1 S2 S3
Col Sem
ABC 5 101 50 60 70
5 102 48 70 25
7 101 58 59 51
Xyz 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
Dealing with Missing Data
There are many methods by which we can deal with missing data; the most common
pandas tools are isnull(), dropna(), fillna(), and interpolate(), sketched below with
assumed sample data.
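import pandas as pd
import numpy as np

df = pd.DataFrame({'Marks': [50, np.nan, 70]})
print(df.isnull().sum())  # count missing values per column
print(df.dropna())        # drop rows containing missing values
print(df.fillna(0))       # replace missing values with a constant
print(df.interpolate())   # fill missing values from neighbouring values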
Any groupby operation involves one of the following operations on the original
object: splitting the object, applying a function, and combining the results.
In many situations, we split the data into sets and apply some functionality to
each subset.
o df.groupby('key')
o df.groupby(['key1','key2'])
o df.groupby(key, axis=1)
dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby('Year').groups)
Output:
{2014: Int64Index([0, 2, 4, 9], dtype='int64'),
2015: Int64Index([1, 3, 5, 10], dtype='int64'),
2016: Int64Index([6, 8], dtype='int64'),
2017: Int64Index([7, 11], dtype='int64')}
dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby(['Year','Team']).groups)
Output:
dfIPL = pd.read_csv('IPLDataSet.csv')
groupIPL = dfIPL.groupby('Year')
for name,group in groupIPL :
print(name)
print(group)
Output:
2014
Team Rank Year Points
0 Riders 1 2014 876
2 Devils 2 2014 863
4 Kings 3 2014 741
9 Royals 4 2014 701
2015
Team Rank Year Points
1 Riders 2 2015 789
3 Devils 3 2015 673
5 kings 4 2015 812
10 Royals 1 2015 804
2016
Team Rank Year Points
6 Kings 1 2016 756
8 Riders 2 2016 694
2017
Team Rank Year Points
7 Kings 1 2017 788
11 Riders 2 2017 690
Example: Aggregating groups
dfSales = pd.read_csv('SalesDataSet.csv')
print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])
Output:
YEAR_ID
2003 1000
2004 1345
2005 478
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003 34612
2004 46824
2005 17631
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003 34.612000
2004 34.813383
2005 36.884937
Name: QUANTITYORDERED, dtype: float64
dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby('Year').describe()['Points'])
Output:
dfCX = pd.read_csv('CX_Marks.csv',index_col=0)
dfCY = pd.read_csv('CY_Marks.csv',index_col=0)
dfCZ = pd.read_csv('CZ_Marks.csv',index_col=0)
dfAllStudent = pd.concat([dfCX,dfCY,dfCZ])
print(dfAllStudent)
Output:
PDS Algo SE
101 50 55 60
102 70 80 61
103 55 89 70
104 58 96 85
201 77 96 63
202 44 78 32
203 55 85 21
204 69 66 54
301 11 75 88
302 22 48 77
303 33 59 68
304 44 55 62
Join in Pandas
The df.join() method will efficiently join multiple DataFrame objects by index (or
by a specified column).
o how : How to handle the operation of the two objects.
▪ outer: form union of calling frame's index with other's index (or column if
on is specified), and sort it lexicographically.
dfINS = pd.read_csv('INS_Marks.csv', index_col=0)
dfLeftJoin = dfAllStudent.join(dfINS)
print(dfLeftJoin)
dfRightJoin = dfAllStudent.join(dfINS, how='right')
print(dfRightJoin)
Output 1:
Output 2:
Merge in Pandas
on : specify the column on which we want to join (default is index)
▪ outer: form union of calling frame's index with other's index (or
column if on is specified), and sort it lexicographically.
m1 = pd.read_csv('Merge1.csv')
print(m1)
m2 = pd.read_csv('Merge2.csv')
print(m2)
m3 = m1.merge(m2,on='EnNo')
print(m3)
Output:
read_csv() is used to read a Comma Separated Values (CSV) file into a pandas
DataFrame.
Some important parameters:
dfINS = pd.read_csv('Marks.csv',index_col=0,header=0)
print(dfINS)
Output:
The read_excel() function supports xls, xlsx, xlsm, xlsb, odf, ods and odt file
extensions read from a local filesystem or URL, with an option to read a single
sheet or a list of sheets.
We need two libraries for that: sqlalchemy and pymysql.
After installing both the libraries, import create_engine from sqlalchemy and
import pymysql.
Then, create a database connection string and create engine using it.
db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)
After getting the engine, we can fire any sql query using pd.read_sql method.
read_sql is a generic method which can be used to read from any sql (MySQL,MSSQL,
Oracle etc…)
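A minimal sketch (the table name is hypothetical):

import pandas as pd
from sqlalchemy import create_engine

db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)
df = pd.read_sql('SELECT * FROM student', con=db_connection)
print(df.head())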
Output:
Web Scraping using Beautiful Soup
Beautiful Soup is a library that makes it easy to scrape information from web pages.
It sits atop an HTML or XML parser, providing Pythonic idioms for iterating,
searching, and modifying the parse tree.
Web scraping with Beautiful Soup in Python involves extracting data from HTML or
XML documents. This process typically follows these steps:
requests is used to fetch the web page content, and beautifulsoup4 is the Beautiful
Soup library itself.
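pip install requests beautifulsoup4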
Fetch Web Page Content: Use the requests library to send an HTTP GET request to
the target URL and retrieve the HTML content.
Parse HTML with Beautiful Soup: Create a Beautiful Soup object by passing the
fetched HTML content and a parser (e.g., 'html.parser' or 'lxml') to
the BeautifulSoup constructor.
Use your browser's developer tools (e.g., "Inspect Element") to examine the
structure of the target website and identify the HTML tags, classes, and IDs
associated with the data you want to extract.
Extract Data:
Utilize Beautiful Soup's methods to locate and extract the desired elements:
You can filter by tag name, attributes (like class_ or id), and text content.
Access element text using .text and attributes using bracket notation
(e.g., element['href']).
Store Scraped Data: Process and store the extracted data in a suitable format, such
as a list of dictionaries, CSV file, or a database.
Important Considerations:
Respect robots.txt:
Check the website's robots.txt file to understand which parts of the site are allowed
to be crawled.
Terms of Service:
Rate Limiting:
Implement delays between requests to avoid overwhelming the server and getting
blocked.
Error Handling:
Include error handling (e.g., try-except blocks) to manage potential issues like
network errors or missing elements.
import requests
import bs4

req = requests.get("https://nobleuniversity.ac.in/faculty-of-computer-applications/teaching-staff/")
soup = bs4.BeautifulSoup(req.text, "html.parser")
name_tags = soup.select("h4.elementor-heading-title")
for tag in name_tags:
    name = tag.text.split("\n")[0]
    print(name)
Scrapy is a fast, high-level Python framework used for web scraping and crawling
websites. It lets you extract data from HTML/XML pages using selectors like CSS or
XPath and output the results in CSV, JSON, or database formats.
First things first. Web scraping, also known as web data extraction, is a way to
collect information from websites. This can be done by using special software that
accesses the internet like a web browser, or by automating the process with a bot
or a so called web crawler. It is basically a method of copying specific data from the
web and saving it in a local database or spreadsheet for future use.
There are different types of data extraction techniques; I'll name a few:
1. Static Web Scraping: This is the most basic form of web scraping, where data
is extracted from web pages that are primarily composed of HTML and CSS. It’s
used for collecting data from websites with fixed, as its name says — static,
unchanging content.
2. Dynamic Web Scraping: Dynamic web scraping involves the use of tools or
scripts that can interact with the page and extract data from elements that
load after the initial page load (for example: pages that use JavaScript to load
content dynamically).
5. Social Media Scraping: Collecting data from social media platforms, like
Twitter (I mean ‘X’) or Facebook, is a specialized form of web scraping. It’s
often used for sentiment analysis, trend tracking, or marketing research.
6. Image Scraping: This is the process of extracting data from images on the web,
such as text from images, logos, or other graphical elements.
I partially answered my own question in the previous section; however, I do want
to go through the most common use cases of web scraping:
• Data collection for the purpose of market research, stock market analysis,
competitor analysis, generating leads for sales and marketing.
"An open source and collaborative framework for extracting the data you need
from websites."
In this section, we'll explain how to set up a Scrapy project for web scraping use
cases. Creating a Scrapy project for web scraping in Python is a simple three-step
procedure.
1. Install Scrapy;
2. Create a new Scrapy project;
3. Generate a new Spider for your web-scraping target.
Let’s start by installing Scrapy. Open your Python command terminal and type the
following pip command:
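pip install scrapy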
Creating a Spider:
• Generate a Spider: Navigate into your project directory (cd <project_name>) and
generate a spider:
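scrapy genspider <spider_name> <domain>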
This creates a Python file in the spiders directory, which will contain your scraping
logic.
o Following Links: Use response.follow() to create new requests to
follow links found on the current page, passing a callback function to
handle the response of the new request.
Example:
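A minimal spider sketch, using the quotes.toscrape.com site from the official
Scrapy tutorial:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract data from the current page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # Follow the pagination link and handle it with the same callback
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)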
3 Data Visualization
For plotting using Matplotlib, we need to import its Pyplot module using the
following command:
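import matplotlib.pyplot as plt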
Here, plt is an alias or an alternative name for matplotlib.pyplot. We can use any
other alias also.
The pyplot module of matplotlib contains a collection of functions that can be used to
work on a plot. The plot() function of the pyplot module is used to create a figure. A
figure is the overall window where the outputs of pyplot functions are plotted. A figure
contains a plotting area, legend, axis labels, ticks, title, etc. Each function
makes some change to a figure: for example, it creates a figure, creates a plotting
area in a figure, plots some lines in a plotting area, or decorates the plot with labels.
It is always expected that data presented through charts is easily understood.
Hence, while presenting data, we should always give a chart title, label the axes of
the chart, and provide a legend in case we have more than one plotted data series.
To plot x versus y, we can write plt.plot(x,y). The show() function is used to display
the figure created using the plot() function.
Let us consider that in a city, the maximum temperature of a day is recorded for
three consecutive days. Program 4-1 demonstrates how to plot temperature values
for the given dates. The output generated is a line chart.
Program 4-1 Plotting Temperature against Dates
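A sketch with assumed sample values:
import matplotlib.pyplot as plt
date = ["25/12/2020", "26/12/2020", "27/12/2020"]  # assumed dates
temp = [8.5, 10.5, 6.8]                            # assumed maximum temperatures
plt.plot(date, temp)
plt.show()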
Output:
In Program 4-1, plot() is provided with two parameters, which indicate values for the
x-axis and y-axis, respectively. The x and y ticks are displayed accordingly. As shown
in Figure the plot() function by default plots a line chart. We can click on the save
button on the output window and save the plot as an image. A figure can also be
saved by using savefig() function. The name of the figure is passed to the function
as parameter.
In the previous example, we used plot() function to plot a line graph. There are
different types of data available for analysis. The plotting methods allow for a
handful of plot types other than the default line plot, as listed in Table 4.1. Choice
of plot is determined by the type of data we have.
boxplot(x[, notch, sym, vert, whis, ...]) Make a box and whisker plot.
Customisation of Plots
Pyplot library gives us numerous functions, which can be used to customise charts
such as adding titles or legends. Some of the customisation options are listed in
Table.
xticks([ticks, labels]) Get or set the current tick locations and labels of the x-axis.
yticks([ticks, labels]) Get or set the current tick locations and labels of the y-axis.
Plotting a line chart of date versus temperature by adding Label on X and Y axis, and
adding a Title and Grids to the chart.
plt.plot(date, temp)
plt.xlabel("Date") # Add label on x-axis
plt.ylabel("Temperature") # Add label on y-axis
plt.title("Date wise Temperature") # Add title to chart
plt.grid(True) # Add gridlines
plt.yticks(temp)                       # Set y-axis ticks to temp values
plt.show()
Output:
In the above example, we have used the xlabel, ylabel, title and yticks functions.
We can see that, compared to Figure 4.2, Figure 4.3 conveys more meaning easily.
We will learn about customisations of other plots in later sections.
Character Colour
‘b’ blue
‘g’ green
‘r’ red
‘c’ cyan
‘m’ magenta
‘y’ yellow
‘k’ black
‘w’ White
Marker
We can make certain other changes to plots by passing various parameters to the
plot() function. In Figure 4.3, we plot temperatures day-wise. It is also possible to
specify each point in the line through a marker. A marker is any symbol that
represents a data value in a line chart or a scatter plot. Table shows a list of markers
along with their corresponding symbol and description. These markers can be used
in program codes:
Colour
It is also possible to format the plot further by changing the colour of the plotted data.
Table shows the list of colours that are supported. We can either use character codes
or the color names as values to the parameter color in the plot().
Colour abbreviations for plotting
We can also set the line style of a line chart using the linestyle parameter. It can
take a string such as "solid", "dotted", "dashed" or "dashdot". Let us write the
Program 4-3 applying some of the customizations.
Consider the average heights and weights of persons aged 8 to 16 stored in the
following two lists:
height = [121.9,124.5,129.5,134.6,139.7,147.3,
152.4, 157.5,162.6]
viii. The title of the chart should be “Average weight with respect to average
height”.
x. Linewidth should be 2.
import matplotlib.pyplot as plt
import pandas as pd
height=[121.9,124.5,129.5,134.6,139.7,147.3,152.4,157.5,162.6]
weight=[19.7,21.3,23.5,25.9,28.5,32.1,35.7,39.6,43.2]
df=pd.DataFrame({"height":height,"weight":weight})
plt.xlabel('Weight in kg')
plt.ylabel('Height in cm')
plt.title('Average weight with respect to average height')
plt.plot(df.weight,df.height,marker='*',markersize=10,color='green',linewidth=2,linestyle='dashdot')
plt.savefig('height_weight_plot.png')
Output:
Figure 4.4: Line chart showing average weight against average height
In Programs 4-1 and 4-2, we learnt that the plot() function of the pyplot module of
matplotlib can be used to plot a chart. However, starting from version 0.17.0,
Pandas objects Series and DataFrame come equipped with their own .plot()
methods. This plot() method is just a simple wrapper around the plot() function of
pyplot. Thus, if we have a Series or DataFrame type object (let's say 's' or 'df') we
can call the plot method by writing:
s.plot() or df.plot()
The plot() method of Pandas accepts a considerable number of arguments that can
be used to plot a variety of graphs. It allows customising different plot types by
supplying the kind keyword argument. The general syntax is df.plot(kind=<type>),
where kind accepts a string indicating the type of plot, as listed in Table 4.5. In
addition, we can use the matplotlib.pyplot methods and functions along with the
plot() method of Pandas objects.
A line plot is a graph that shows the frequency of data along a number line. It is
used to show continuous dataset. A line plot is used to visualise growth or decline
in data over a time interval. We have already plotted line charts through Programs
4-1 and 4-2. In this section, we will learn to plot a line chart for data stored in a
DataFrame.
Program 4-4 Smile NGO has participated in a three week cultural mela. Using
Pandas, they have stored the sales (in Rs) made day wise for every week in a CSV
file named “MelaSales.csv”, as shown in Table 4.6.
Table 4.6 Day-wise mela sales data
Depict the sales for the three weeks using a Line chart. It should have the following:
i. Chart title as “Mela Sales Report”.
ii. x axis label as "Days".
iii. y axis label as "Sales in Rs".
Line colours are red for week 1, blue for week 2 and brown for week 3.
import pandas as pd
import matplotlib.pyplot as plt
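# The rest of the listing is a sketch; week columns in MelaSales.csv are assumed
df = pd.read_csv('MelaSales.csv')
df.plot(kind='line', color=['red', 'blue', 'brown'])  # one colour per week
plt.title('Mela Sales Report')
plt.xlabel('Days')
plt.ylabel('Sales in Rs')
plt.show()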
Figure 4.5 displays a line plot as output for Program 4-4. Note that the legend is
displayed by default, associating the colours with the plotted data.
Output:
Figure 4.5: Line plot showing mela sales figures
The line plot takes a numeric value to display on the x axis and hence uses the index
(row labels) of the DataFrame in the above example. Thus, x tick values are the
index of the DataFrame df that contains data stored in MelaSales.csv.
We can substitute the ticks at x axis with a list of values of our choice by using
plt.xticks(ticks,label) where ticks is a list of locations(locs) on x axis at which ticks
should be placed, label is a list of items to place at the given ticks.
Program 4-5 Assuming the same CSV file, i.e., MelaSales.csv, plot the line chart
with the following customisations:
import pandas as pd
import matplotlib.pyplot as plt
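# A sketch of the remaining listing; the tick labels are assumed day names
df = pd.read_csv('MelaSales.csv')
df.plot(kind='line')
plt.title('Mela Sales Report')
plt.xlabel('Days')
plt.ylabel('Sales in Rs')
plt.xticks(ticks=range(7), labels=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'])
plt.show()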
Plotting Bar Chart
The line plot in Figure 4.6 shows that the sales for all the weeks increased during the
weekend. Other than weekends, it also shows that the sales increased on Wednesday
for Week 1, on Thursday for Week 2 and on Tuesday for Week 3.
But, the lines are unable to efficiently depict comparison between the weeks for which
the sales data is plotted. In order to show comparisons, we prefer Bar charts. Unlike
line plots, bar charts can plot strings on the x axis. To plot a bar chart, we will specify
kind=’bar’. We can also specify the DataFrame columns to be used as x and y axes.
import pandas as pd
import matplotlib.pyplot as plt
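# A sketch of the remaining listing
df = pd.read_csv('MelaSales.csv')
df.plot(kind='bar')
plt.title('Mela Sales Report')
plt.xlabel('Days')
plt.ylabel('Sales in Rs')
plt.show()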
Output:
We can also customise the bar chart by adding certain parameters to the plot
function. We can control the edgecolor of the bar, linestyle and linewidth. We can
also control the color of the lines. The following example shows various
customisations on the bar chart of Figure 4.8
Program 4-7 Let us write a Python script to display Bar plot for the “MelaSales.csv”
file with column Day on x axis, and having the following customisation:
● Changing the color of each bar to red, yellow and purple.
● Edgecolor to green
● Linewidth as 2
● Line style as "--"
import pandas as pd
import matplotlib.pyplot as plt
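# A sketch of the remaining listing, applying the customisations listed above
df = pd.read_csv('MelaSales.csv')
df.plot(kind='bar', x='Day', color=['red', 'yellow', 'purple'],
        edgecolor='green', linewidth=2, linestyle='--')
plt.show()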
Output:
Plotting Pie Charts
Pie is a type of graph in which a circle is divided into different sectors and each sector
represents a part of the whole. A pie plot is used to represent numerical data
proportionally. To plot a pie chart, either column label y or 'subplots=True' should be
set while using df.plot(kind='pie') . If no column reference is passed and
subplots=True, a 'pie' plot is drawn for each numerical column independently.
import pandas as pd
import matplotlib.pyplot as plt
# Create DataFrame
df = pd.DataFrame(
{
'mass': [0.330, 4.87, 5.97],
'radius': [2439.7, 6051.8, 6378.1]
},
index=['Mercury', 'Venus', 'Earth']
)
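# Plotting call assumed: draw the 'mass' column as a pie chart
df.plot(kind='pie', y='mass')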
plt.title('Mass of Planets')
plt.show()
Output:
It is important to note that the default label names are the index values of the DataFrame.
Let us consider the dataset of Table 4.10 showing the forest cover of the north-eastern
states, which contains the geographical area and the corresponding forest cover
in sq km along with the names of the states.
import pandas as pd
import matplotlib.pyplot as plt
# Create DataFrame
df = pd.DataFrame(
{
'GeoArea': [83743, 78438, 22327, 22429, 21081, 16579,
10486],
'ForestCover': [67353, 27692, 17280, 17321, 19240,
13464, 8073]
},
index=[
'Arunachal Pradesh',
'Assam',
'Manipur',
'Meghalaya',
'Mizoram',
'Nagaland',
'Tripura'
]
)
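The plotting statement for this DataFrame did not survive extraction; with the y column label passed as described above, it might look like this:
# Slice labels default to the DataFrame index, i.e., the state names
df.plot(kind='pie', y='ForestCover')
plt.title('Forest Cover of North-Eastern States')
plt.show()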
To customise the pie plot, we add the following two properties of the pie chart in
the program:
• Explode: specifies the fraction of the radius with which to offset each slice.
• Autopct: displays the percentage of each part as a label.
import pandas as pd
import matplotlib.pyplot as plt
# Create DataFrame
df = pd.DataFrame(
{
'GeoArea': [83743, 78438, 22327, 22429, 21081, 16579,
10486],
'ForestCover': [67353, 27692, 17280, 17321, 19240, 13464,
8073]
},
index=[
'Arunachal Pradesh',
'Assam',
'Manipur',
'Meghalaya',
'Mizoram',
'Nagaland',
'Tripura'
]
)
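The customised plotting statement is also missing here; a sketch with illustrative values (the original explode fractions did not survive):
# Illustrative explode fractions: offset only the first slice
explode = [0.1, 0, 0, 0, 0, 0, 0]
df.plot(kind='pie', y='ForestCover', explode=explode, autopct='%1.1f%%')
plt.title('Forest Cover of North-Eastern States')
plt.show()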
Case Study: Bio-Signal Plotting using Matplotlib/Pandas
Objective
To plot and analyze bio-signal data (for example, ECG signal) using Pandas for data
handling and Matplotlib for visualization.
Data Source
• A CSV file (ecg_data.csv) containing:
o Time (seconds)
o Amplitude (mV)
• The data is collected from a patient monitoring device.
Time,Amplitude
0.00,0.1
0.01,0.15
0.02,0.2
0.03,0.18
...
import pandas as pd
import matplotlib.pyplot as plt
# Step 1: Read the ECG data (columns Time and Amplitude, as in the sample above)
df = pd.read_csv('ecg_data.csv')
# Step 2: Plot amplitude against time
plt.plot(df['Time'], df['Amplitude'])
# Step 3: Label the chart
plt.title('ECG Signal')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude (mV)')
# Step 4: Display
plt.show()
4 Exploratory Data Analysis
Introduction
Exploratory Data Analysis (EDA) refers to the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to test
hypotheses and to check assumptions with the help of summary statistics and
graphical representations.
EDA was developed at Bell Labs by John Tukey, a mathematician and statistician
who wanted to promote more questions and actions on data based on the data
itself.
In one of his famous writings, Tukey said:
“The only way humans can do BETTER than computers is to take a chance of doing
WORSE than them.”
The above statement explains why, as a data scientist, your role and tools are not
limited to automatic learning algorithms but extend to manual and creative
exploratory tasks.
Computers are unbeatable at optimizing, but humans are strong at discovery by
taking unexpected routes and trying unlikely but very effective solutions.
With EDA we:
• Describe the data
• Closely explore data distributions
• Understand the relationships between variables
• Notice unusual or unexpected situations
• Place the data into groups
• Notice unexpected patterns within groups
• Take note of group differences
We can use many inbuilt Pandas functions to find the central tendency and spread
of numerical data. Some of these functions are:
df.mean()
df.median()
df.std()
df.max()
df.min()
df.quantile(np.array([0,0.25,0.5,0.75,1]))
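As a quick sketch, applying these functions to the small Height/Weight dataset that appears later in this chapter:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Height': [60, 61, 63, 65, 61, 60],
                   'Weight': [47, 89, 52, 58, 50, 47]})
print(df.mean())    # column-wise arithmetic mean
print(df.median())  # middle value of each column
print(df.std())     # sample standard deviation (spread)
print(df.max() - df.min())  # range of each column
print(df.quantile(np.array([0, 0.25, 0.5, 0.75, 1])))  # five-number summary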
EDA Techniques:
EDA techniques are broadly categorized based on the number of variables being
analyzed.
Univariate Analysis
This type of analysis focuses on a single variable to understand its distribution and
central tendencies. It doesn't explore relationships between variables.
Graphical univariate techniques give a direct view of the data's nuances. These
visualizations offer unparalleled power to explore and gain insight into the data's
underlying patterns.
• Histograms: Bar charts that show the frequency of data within specific
ranges, revealing the shape of the distribution (e.g., normal, skewed,
bimodal).
• Swarm Plots: Swarm plots are designed to show the distribution of individual
data points, avoiding overlap and revealing the density of observations. They
are effective for visualizing the concentration of data points in certain areas
and for clearly highlighting isolated points that represent outliers.
• Bar Charts/Plots: Primarily used for categorical data, bar charts display the
frequency or count of observations within different categories. They are
useful for evaluating counts, such as the quality rate of wine.
Bivariate and Multivariate Analysis
These techniques explore the relationships between two or more variables.
• Non-graphical: These methods quantify the relationship between variables.
o Correlation Analysis: Calculates a correlation coefficient (like Pearson's or
Spearman's) to measure the strength and direction of the linear or
monotonic relationship between two variables. A correlation matrix is
often used to show the correlations for all pairs of variables in a dataset.
o Cross-tabulation (Contingency Tables): Used for categorical variables to
show the frequency distribution of two or more variables at the same
time.
• Graphical: These visualizations help to see the relationships between variables.
o Scatter Plots: The most common tool for visualizing the relationship
between two continuous variables. They can reveal patterns, trends, and
clusters.
o Pair Plots: A grid of scatter plots that visualizes the relationships between
all pairs of variables in a dataset. They are a powerful way to quickly
understand multiple relationships at once.
o Heatmaps: A graphical representation of a correlation matrix where color
intensity represents the strength of the correlation, making it easy to
spot strong relationships.
Exploratory Data Analysis (EDA) leverages a wide array of tools and technologies,
ranging from versatile programming languages and their extensive libraries to
specialized automated frameworks and comprehensive business intelligence
platforms. These tools empower data professionals to effectively manipulate,
analyze, and visualize data to uncover insights.
Suppose the marks scored by 12 students in five subjects are stored in a CSV file
"Marks.csv", one row per student as shown below. Plot the data using a boxplot and
perform a comparative analysis of performance in each subject.
Waseem Ali 95 76 79 77 89
Kulpreet Singh 78 81 75 76 88
Annie Mathews 88 63 67 77 80
Shiksha 95 55 51 59 80
Naveen Gupta 82 55 63 56 74
Taleem Ahmed 73 49 54 60 77
Pragati Nigam 80 50 51 54 76
Usman Abbas 92 43 51 48 69
Gurpreet Kaur 60 43 55 52 71
Sameer Murthy 60 43 55 52 71
Angelina 78 33 39 48 68
Angad Bedi 62 43 51 48 54
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read CSV file
data = pd.read_csv('Marks.csv')
# Create DataFrame
df = pd.DataFrame(data)
# Plot boxplot
df.plot(kind='box')
# Set chart title and labels
plt.title('Performance Analysis')
plt.xlabel('Subjects')
plt.ylabel('Marks')
# Show plot
plt.show()
Output:
The distance between the box and the lower or upper whisker is larger in some
boxplots than in others. A shorter distance indicates small variation in the data,
while a longer distance indicates that the data are spread out, i.e., larger variation.
To keep improving their services, the XYZ group of hotels has asked all three
hotels to have a feedback form filled in by their customers at the time of checkout.
After getting ratings on a scale of 1-5 on factors such as Food, Service,
Ambience, Activities and Distance from tourist spots, they calculate the average
rating and store it in a CSV file. The data are given in the table below.
Year Sunny Bunny Resort Happy Lucky Resort Breezy Windy Resort
2014 4.75 3 4.5
2015 2.5 4 2
2016 3.5 2.5 3
2017 4 2 3.5
2018 1.5 4.5 1
This year, to award the best hotel, they have decided to analyze the ratings of the
past 5 years for each hotel. Plot the data using a boxplot.
import pandas as pd
import matplotlib.pyplot as plt
# Read the CSV file into 'data'
data = pd.read_csv('compareresort.csv')
# Convert 'data' into a DataFrame 'df'
df = pd.DataFrame(data)
# Plot a box plot for the DataFrame 'df' with a title
df.plot(kind='box', title='Compare Resorts')
# Set xlabel and ylabel
plt.xlabel('Resorts')
plt.ylabel('Rating (5 years)')
# Display the plot
plt.show()
# A customised, horizontal variant of the same plot:
df.plot(kind='box', title='Compare Resorts', color='red', vert=False)
plt.show()
Heatmap: A graphical representation of data where individual values in
a matrix are represented as colors. It is most commonly used to visualize
a correlation matrix of all numerical variables.
• While heatmaps can be used for any kind of matrix data, their most common
and powerful application in EDA is to visualize a correlation matrix.
Example 1: Correlation Heat Map
The heatmap visually represents the correlation matrix, a table that shows the
pairwise correlations between all variables in the dataset.
• The values close to 1 (which are shown in the correlation matrix and on the
heatmap) indicate a very strong positive correlation between all pairs of
variables: Height, Weight, and Age. This means that as one variable
increases, the other two also tend to increase.
The heatmap below provides a clear, color-coded visual summary of these
relationships. The warmer, redder colors highlight the strong positive correlation,
making it easy to quickly identify patterns and relationships in the data.
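No code survived for this example; a sketch that produces such a heatmap with pandas and matplotlib, using hypothetical Height/Weight/Age values chosen to be strongly positively correlated, as the description above assumes:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data for illustration
df = pd.DataFrame({
    'Height': [150, 160, 165, 170, 175, 180],
    'Weight': [50, 58, 63, 68, 74, 80],
    'Age':    [20, 24, 27, 31, 35, 40]
})
corr = df.corr()  # pairwise Pearson correlations
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Correlation')
plt.xticks(range(len(corr.columns)), corr.columns)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title('Correlation Heat Map')
plt.show()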
Random Grid Data Heat Map
import numpy as np
import matplotlib.pyplot as plt
# Create a random 10x10 data grid
data = np.random.rand(10, 10)
plt.imshow(data, cmap='YlGn', interpolation='nearest')
plt.colorbar(label="Value")
plt.title("Grid Heat Map")
plt.show()
The plot shows a 10x10 grid of random data, with the color intensity representing
the value of each cell. The color bar on the side indicates the range of values, from
low (light yellow) to high (dark green).
Plotting Histogram
Histograms are column-charts, where each column represents a range of values, and
the height of a column corresponds to how many values are in that range.
To make a histogram, the data is sorted into "bins" and the number of data points in
each bin is counted. The height of each column in the histogram is then proportional
to the number of data points its bin contains.
The df.plot(kind='hist') function automatically selects the size of the bins based on
the spread of values in the data.
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name': ['Arnav', 'Sheela', 'Azhar', 'Bincy', 'Yash', 'Nazar'],
        'Height': [60, 61, 63, 65, 61, 60],
        'Weight': [47, 89, 52, 58, 50, 47]}
df = pd.DataFrame(data)
df.plot(kind='hist')
plt.show()
Output:
This plot shows the distributions of both the Height and Weight variables from
your data. This is a univariate graphical technique in EDA, where a single
variable's distribution is visualized.
• The Height histogram (blue) shows that most people have a height around 60-61.
• The Weight histogram (orange) shows a much more spread-out distribution,
with a significant outlier in the 85-90 range.
• The program above displays the histogram corresponding to all attributes
having numeric values, i.e., the 'Height' and 'Weight' attributes. On the basis
of the height and weight values provided in the DataFrame, plot() calculated
the bin values.
It is also possible to set the value of the bins parameter, for example:
df.plot(kind='hist', bins=20)
df.plot(kind='hist', bins=[18,19,20,21,22])
df.plot(kind='hist', bins=range(18,25))
Output:
Customizing Histogram
Taking the same data as above, let us now see how the histogram can be customised.
Let us change the edgecolor, which is the border of each bar, to green. Also, let us
change the line style to ":" and the line width to 2. Let us try another property
called fill, which takes boolean values: the default True means each bar will be filled
with colour, and False means each bar will be empty. Another property called hatch
can be used to fill each bar with a pattern ('-', '+', 'x', '\\', '*', 'o', 'O', '.'). In
Program 4-10, we have used the hatch value "o".
import pandas as pd
import matplotlib.pyplot as plt
data = {'Name':['Arnav', 'Sheela', 'Azhar','Bincy','Yash', 'Nazar'],
'Height' : [60,61,63,65,61,60],
'Weight' : [47,89,52,58,50,47]}
df=pd.DataFrame(data)
df.plot(kind='hist',edgecolor='Green',linewidth=2,linestyle=':',fill=False,hatch='o')
plt.show()
Using Open Data
There are many websites that provide data freely for anyone to download and do
analysis, primarily for educational purposes. These are called Open Data as the data
source is open to the public. Availability of data for access and use promotes further
analysis and innovation. A lot of emphasis is being given to open data to ensure
transparency, accessibility and innovation. “Open Government Data (OGD)
Platform India" (data.gov.in) is a platform for supporting the Open Data initiative
of the Government of India. Large datasets on different projects and parameters
are available on the platform.
Let us consider a dataset called “Seasonal and Annual Min/Max Temp Series - India
from 1901 to 2017” from the URL https://data.gov.in/resources/seasonal-and-
annual-minmax-temp-series-india-1901-2017.
Our aim is to plot the minimum and maximum temperature and observe the
number of times (frequency) a particular temperature has occurred. We only need
to extract the 'ANNUAL - MIN' and 'ANNUAL - MAX' columns from the file. Also, let
us display a histogram plot for each of the two columns.
import pandas as pd
import matplotlib.pyplot as plt
# Read only the two required columns from the open data file
data = pd.read_csv("Min_Max_Seasonal_IMD_2017.csv",
                   usecols=['ANNUAL - MIN', 'ANNUAL - MAX'])
df = pd.DataFrame(data)
df.plot(
kind='hist',
alpha=0.5,
title='Annual Min and Max Temperature (1901-2017)',
color=['blue', 'red']
)
plt.show()
Output:
Plot a frequency polygon for the 'ANNUAL - MIN' column of the Min/Max temperature
data, overlaid on the histogram that depicts it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read CSV with only 'ANNUAL - MIN'
data = pd.read_csv(
"Min_Max_Seasonal_IMD_2017.csv",
usecols=['ANNUAL - MIN']
)
# Convert to DataFrame
df = pd.DataFrame(data)
# Convert the column to NumPy 1D array
minarray = np.array(df['ANNUAL - MIN'])
# Get histogram data
y, edges = np.histogram(minarray, bins=15)
# Calculate bin midpoints
mid = 0.5 * (edges[1:] + edges[:-1])
# Plot histogram
plt.hist(minarray, bins=15, color='blue', alpha=0.7)
# Overlay line plot of frequencies
plt.plot(mid, y, '-^', color='orange')
# Add labels and title
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Annual Min Temperature (1901–2017)')
plt.show()
Output:
Plotting Scatter Chart
Prayatna sells designer bags and wallets. During the sales season, he gave
discounts ranging from 10% to 50% over a period of 5 weeks. He recorded his
sales for each type of discount in an array. Draw a scatter plot to show a
relationship between the discount offered and sales made.
import numpy as np
import matplotlib.pyplot as plt
discount= np.array([10,20,30,40,50])
saleInRs=np.array([40000,45000,48000,50000,100000])
plt.scatter(x=discount,y=saleInRs)
plt.title('Sales Vs Discount')
plt.xlabel('Discount offered')
plt.ylabel('Sales in Rs')
plt.show()
Output:
Customizing Scatter chart
The size of the bubble can also be used to reflect a value. For example, in Program
4-14 we have opted to display the size of the bubble as 10 times the discount, as
shown in the figure. The colour and markers can also be changed in the plot, as the
following statements show:
import numpy as np
import matplotlib.pyplot as plt
discount= np.array([10,20,30,40,50])
saleInRs=np.array([40000,45000,48000,50000,100000])
size=discount*10
plt.scatter(x=discount,y=saleInRs,s=size,color='red',linewidth=3,marker='*',edgecolor='blue')
plt.title('Sales Vs Discount')
plt.xlabel('Discount offered')
plt.ylabel('Sales in Rs')
plt.show()
Output:
ETL (Extract, Transform, Load) process is a fundamental part of data
warehousing and data integration, involving three key stages: extracting
data from various sources, transforming it into a consistent and usable
format, and loading it into a target system like a data warehouse or data
lake. Logging is crucial for tracking the ETL process, ensuring data quality,
and troubleshooting issues.
Using an ETL pipeline to transform raw data to match the target system allows
systematic and accurate data analysis to take place in the target repository.
Specifically, the key benefits are:
• More stable and faster data analysis on a single, pre-defined use case,
because the data set has already been structured and transformed.
• Easier compliance with GDPR, HIPAA, and CCPA standards, because users
can omit any sensitive data prior to loading into the target system.
• The ability to identify and capture changes made to a database via the change
data capture (CDC) process or technology. These changes can then be applied
to another data repository or made available in a format consumable by ETL,
EAI, or other types of data integration tools.
In the final stage, the transformed data is loaded into the target system,
typically a data warehouse, where it is ready to be analyzed by BI tools or
data analytics tools.
ETL Tools
There are four primary types of ETL tools:
Batch Processing: Traditionally, on-premises batch processing was the primary
ETL process. In the past, processing large data sets impacted an organization’s
computing power and so these processes were performed in batches during
off-hours. Today’s ETL tools can still do batch processing, but since they’re
often cloud-based, they’re less constrained in terms of when and how quickly
the processing occurs.
Cloud-Native: Cloud-native ETL tools can extract and load data from sources
directly into a cloud data warehouse. They then use the power and scale of the
cloud to transform the data.
Open Source: Open-source tools such as Apache Kafka offer a low-cost
alternative to commercial ETL tools. However, some open source tools only
support one stage of the process, such as extracting data, and some are not
designed to handle data complexities or change data capture (CDC). Plus, it can
be tough to get support for open source tools.
Real-Time: Today's business demands real-time access to data. This requires
organizations to process data in real time, with a distributed model and
streaming capabilities. Streaming ETL tools, both commercial and open source,
offer this capability.
ETL Logging:
Logging is essential for monitoring the ETL process and identifying potential
issues. Effective ETL logging should include the items below; a minimal code
sketch follows the list.
• Process Start/End Times: Record when each ETL job begins and ends, including
timestamps.
• Data Source Information: Log which sources were accessed and the number
of records extracted.
• Transformation Details: Capture information about the transformations
applied, such as the number of records processed, errors encountered, and
any data quality issues identified.
• Load Details: Log the number of records loaded into the target system and any
errors during the loading process.
• Error Handling: Log all errors, warnings, and exceptions that occur during the
ETL process, including error messages and stack traces.
• Performance Metrics: Track the time taken for each stage of the ETL process
and identify any performance bottlenecks.
• User Information: Log which user initiated the ETL process (if applicable).
• Configuration Information: Log the ETL job configuration, including
parameters and settings.
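As an illustration, here is a minimal sketch of such logging using Python's built-in logging module; the file names and the transform step are assumptions, not part of the original text:
import logging
import time
import pandas as pd

logging.basicConfig(filename='etl.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

def run_etl(source_file, target_file='warehouse_table.csv'):
    start = time.time()
    logging.info("ETL job started; source=%s", source_file)
    try:
        # Extract
        df = pd.read_csv(source_file)
        logging.info("Extracted %d records from %s", len(df), source_file)
        # Transform (illustrative: drop rows with missing values)
        cleaned = df.dropna()
        logging.info("Transformed data; %d records retained, %d dropped",
                     len(cleaned), len(df) - len(cleaned))
        # Load
        cleaned.to_csv(target_file, index=False)
        logging.info("Loaded %d records into %s", len(cleaned), target_file)
    except Exception:
        logging.exception("ETL job failed")
        raise
    finally:
        logging.info("ETL job finished in %.2f seconds", time.time() - start)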
Benefits of ETL Logging:
• Troubleshooting: Enables quick identification and resolution of issues during
the ETL process.
• Auditing: Provides a record of all ETL activity for auditing and compliance
purposes.
• Performance Monitoring: Helps identify performance bottlenecks and
optimize the ETL process.
• Data Quality Monitoring: Tracks data quality issues and helps ensure data
accuracy.
• Automation: Facilitates automation of the ETL process by providing insights
into its execution.
By implementing robust ETL logging, organizations can ensure the reliability,
accuracy, and efficiency of their data integration processes.
5 Web Framework: Django (Python)
Introduction
Django is named after Django Reinhardt, a jazz manouche guitarist from the 1930s
to early 1950s. To this day, he's considered one of the best guitarists of all time.
Django is pronounced JANG-oh. Rhymes with FANG-oh. The "D" is silent. Django is
a free, open-source web framework written in Python.
Django Reinhardt
History
2003 - started by Adrian Holovaty and Simon Willison as an internal project at the
Lawrence Journal-World newspaper.
2005 - publicly released under the BSD license on 21 July 2005 and named Django.
Currently, DSF (Django Software Foundation) maintains its development and
release cycle.
Django is now an open-source project with contributors across the world.
Django is especially helpful for database-driven websites.
Advantages of Django
Don't Repeat Yourself (DRY) - everything should be developed in exactly one
place instead of being repeated again and again.
Model
The model provides data from the database. In Django, the data is delivered as
an Object Relational Mapping (ORM), which is a technique designed to make
it easier to work with databases. The most common way to extract data from
a database is SQL. One problem with SQL is that you have to have a pretty good
understanding of the database structure to be able to work with it. Django,
with ORM, makes it easier to communicate with the database, without having
to write complex SQL statements. The models are usually located in a file
called models.py.
View
A view is a function that takes HTTP requests and returns HTTP responses; views
are described in detail later in this chapter.
Template
A template is a file where you describe how the result should be
represented. Templates are often .html files, with HTML code describing the
layout of a web page, but they can also be in other file formats to present other
results; here we will concentrate on .html files.
Django uses standard HTML to describe the layout, but uses Django
tags to add logic:
The templates of an application are located in a folder named templates.
URLs
Django also provides a way to navigate around the different pages in a website.
When a user requests a URL, Django decides which view it will send it to.
This is done in a file called urls.py.
So, What is Going On?
When you have installed Django and created your first Django web application,
and the browser requests the URL, this is basically what happens:
1. Django receives the URL, checks the urls.py file, and calls the view that
matches the URL.
2. The view, located in views.py, checks for relevant models.
3. The models are imported from the models.py file.
4. The view then sends the data to a specified template in the template folder.
5. The template contains HTML and Django tags, and with the data it returns
finished HTML content back to the browser.
Django can do a lot more than this, but this is basically what you will learn in this
tutorial, and are the basic steps in a simple web application made with Django.
To install Django, you must have Python installed, and a package manager
like PIP.
To check if your system has Python installed, run this command in the command
prompt:
python --version
If Python is installed, you will get a result with the version number, like this
Python 3.13.2
PIP
To install Django, you must use a package manager like PIP, which is included in
Python from version 3.4.
To check if your system has PIP installed, run this command in the command
prompt:
pip --version
If PIP is installed, you will get a result with the version number.
Virtual Environment
It is suggested to have a dedicated virtual environment for each Django
project, and one way to manage a virtual environment is venv, which is
included in Python.
The name of the virtual environment is your choice, in this tutorial we will call
it myworld.
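The creation command itself is missing from this copy; with the standard venv module, it is run from the folder where you want the environment:
python -m venv myworld
The environment is then activated with myworld\Scripts\activate (Windows) or source myworld/bin/activate (macOS/Linux).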
This will set up a virtual environment, and create a folder named "myworld"
with subfolders and files, like this:
Note: You must activate the virtual environment every time you open the command
prompt to work on your project.
Install Django
Now, that we have created a virtual environment, we are ready to install Django.
Note: Remember to install Django while you are in the virtual environment!
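The install command itself did not survive here; inside the activated environment it is:
(myworld) C:\Users\Your Name>pip install Django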
Which will give a result that looks like this (at least on my Windows machine):
That's it! Now you have installed Django in your new project, running in a virtual
environment!
Check Django Version
You can check if Django is installed by asking for its version number like this:
(myworld) C:\Users\Your Name>django-admin --version
If Django is installed, you will get a result with the version number:
5.1.7
My First Project
Once you have come up with a suitable name for your Django project, like
mine: my_d1, navigate to where in the file system you want to store the code (in the
virtual environment). I will navigate to the myworld folder, and run this command
in the command prompt:
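django-admin startproject my_d1
This creates the files and folders shown below.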
Myworld/ → Virtual Environment
→ my_d1/ → Your Django Project
→ manage.py → 🎛 Project management script
→ my_d1/ → Core project settings folder
→ __init__.py → Marks folder as a Python package
→ settings.py → Main configuration
→ urls.py → URL routing
→ asgi.py → Async server support
→ wsgi.py → Web server support
These are all files and folders with a specific meaning, you will learn about some of
them later in this tutorial, but for now, it is more important to know that this is the
location of your project, and that you can start building applications in it.
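To verify the setup, start Django's development server from the my_d1 folder:
python manage.py runserver
The first run produces output like the following, including a warning about unapplied migrations: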
You have 18 unapplied migration(s). Your project may not work properly until
you apply the migrations for app(s): admin, auth, contenttypes, sessions.
Run 'python manage.py migrate' to apply them.
August 23, 2025 - 12:58:13
Django version 5.1.3, using settings 'my_d1.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CTRL-BREAK.
Open a new browser window and type 127.0.0.1:8000 in the address bar.
An app is a web application that has a specific meaning in your project, like a home
page, a contact form, or a members database.
In this tutorial we will create an app that allows us to list and register members in
a database.
But first, let's just create a simple Django app that displays "Hello World!".
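The app is created with Django's startapp command, run from the project folder (the project here is named my_app and the app members, matching the paths used below):
python manage.py startapp members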
my_app/ → Your Django Project
    manage.py → 🎛 Project management script
    members/ → Your new app (created above)
This is where we gather the information we need to send back a proper response.
Views
Django views are Python functions that take HTTP requests and return HTTP
responses, like HTML documents.
A web page that uses Django is full of views with different tasks and missions.
Views are usually put in a file called views.py located in your app's folder.
There is already a views.py file in your members folder. Find it and open it, and
replace its content with this:
my_app/members/views.py:
from django.shortcuts import render
from django.http import HttpResponse
def members(request):
return HttpResponse("Hello world!")
URLs
Create a file named urls.py in the same folder as the views.py file, and type this code
in it:
my_app/members/urls.py:
from django.urls import path
from . import views

urlpatterns = [
    path('members/', views.members, name='members'),
]
The urls.py file you just created is specific for the members application. We have to
do some routing in the root directory my_app as well. This may seem complicated,
but for now, just follow the instructions below.
There is a file called urls.py in the my_app folder; open that file and add
the include module in the import statement, and also add a path() function in
the urlpatterns[] list, with arguments that will route users that comes in
via 127.0.0.1:8000/.
Then your file will look like this:
my_app/my_app/urls.py:
from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path('', include('members.urls')),
    path('admin/', admin.site.urls),
]
If the server is not running, navigate to the /my_app folder and execute this
command in the command prompt:
python manage.py runserver
Templates
Create a templates folder inside the members folder, and inside it create an HTML
file named myfirst.html, giving this structure:
my_app/
    manage.py
    members/
        templates/
            myfirst.html
Open the HTML file and insert the following:
my_app/members/templates/myfirst.html:
<!DOCTYPE html>
<html>
<body>
<h1>Hello World!</h1>
</body>
</html>
Open the views.py file in the members folder, and replace its content with this:
my_app/members/views.py:
from django.http import HttpResponse
from django.template import loader

def members(request):
    template = loader.get_template('myfirst.html')
    return HttpResponse(template.render())
Change Settings
To be able to work with more complicated stuff than "Hello World!", We have to tell
Django that a new app is created.
This is done in the settings.py file in the my_app folder.
Look up the INSTALLED_APPS[] list and add the members app like this:
my_app/my_app/settings.py:
INSTALLED_APPS = [
'django.contrib.admin',
'django.contrib.auth',
'django.contrib.contenttypes',
'django.contrib.sessions',
'django.contrib.messages',
'django.contrib.staticfiles',
'members'
]
Then run the migrate command:
python manage.py migrate
which gives this result:
Operations to perform:
Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
Applying contenttypes.0001_initial... OK
Applying auth.0001_initial... OK
Applying admin.0001_initial... OK
Applying admin.0002_logentry_remove_auto_add... OK
Applying admin.0003_logentry_add_action_flag_choices... OK
Applying contenttypes.0002_remove_content_type_name... OK
Applying auth.0002_alter_permission_name_max_length... OK
Applying auth.0003_alter_user_email_max_length... OK
Applying auth.0004_alter_user_username_opts... OK
Applying auth.0005_alter_user_last_login_null... OK
Applying auth.0006_require_contenttypes_0002... OK
Applying auth.0007_alter_validators_add_error_messages... OK
Applying auth.0008_alter_user_username_max_length... OK
Applying auth.0009_alter_user_last_name_max_length... OK
Applying auth.0010_alter_group_name_max_length... OK
Applying auth.0011_update_proxy_permissions... OK
Applying auth.0012_alter_user_first_name_max_length... OK
Applying sessions.0001_initial... OK
Start the server by navigating to the /my_app folder and execute this command:
python manage.py runserver
Django Models
Up until now in this content, output has been static data from Python or HTML
templates.
Now we will see how Django allows us to work with data, without having to change
or upload files in the process.
In Django, data is created in objects called Models, which are actually tables in a
database.
A Model in Django is a class imported from the django.db library that acts as the
bridge between your database and server. This class is a representation of the data
structure used by your website, and Django relates it directly to the database, so
you don't have to write SQL for the database yourself.
my_app/members/models.py:
from django.db import models

class Member(models.Model):
    firstname = models.CharField(max_length=255)
    lastname = models.CharField(max_length=255)
The first field, firstname, is a Text field, and will contain the first name of the
members.
The second field, lastname, is also a Text field, with the member's last name.
Both firstname and lastname are set up to have a maximum of 255 characters.
Django Admin
Django Admin is a really great tool in Django; it is actually a CRUD (Create, Read,
Update, Delete) user interface for all your models!
The reason why this URL goes to the Django admin log in page can be found in
the urls.py file of your project:
my_app/my_app/urls.py:
urlpatterns = [
path('', include('members.urls')),
path('admin/', admin.site.urls),
]
The urlpatterns[] list takes requests going to admin/ and sends them
to admin.site.urls, which is part of a built-in application that comes with Django,
and contains a lot of functionality and user interfaces, one of them being the log-in
user interface.
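The account that this log-in page expects is created with Django's createsuperuser command:
python manage.py createsuperuser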
Here you must enter: username, e-mail address, (you can just pick a fake e-mail
address), and password:
Username: BCA
Email address: [email protected]
Password:
Password (again):
This password is too short. It must contain at least 8 characters.
This password is too common.
This password is entirely numeric.
Bypass password validation and create user anyway? [y/N]:
My password did not meet the criteria, but this is a test environment, so I chose to
create the user anyway by entering y:
Superuser created successfully.
Here you can create, read, update, and delete groups and users, but where is the
Members model?
The Member model is missing, as it should be: you have to tell Django which models
should be visible in the admin interface.
Insert a couple of lines in the admin.py file to make the Member model visible in
the admin page:
my_app/members/admin.py:
from django.contrib import admin
from .models import Member

# Register the model so it shows up in the admin interface
admin.site.register(Member)
Now go back to the browser (127.0.0.1:8000/admin/) and you should get this result:
Click Members and see the five records we inserted earlier in this tutorial:
the "Recipe Update" project is a part of a CRUD (Create, Read, Update, Delete)
application for Student Dara. Here's a summary of its key functionality:
• Create: The project enables users to create new recipes by providing a name,
description, and an image upload.
• Read: Users can view a list of recipes with details, including names,
descriptions, and images. They can also search for Student using a search
form.
• Update: Users can update existing Student by editing their names,
descriptions, or images. This functionality is provided through a form that
populates with the Student's current details.
• Delete: The project allows users to delete Student by clicking a "Delete"
button associated with each student entry in the list.
Project Structure
student_mgmt/
│── student_mgmt/
│ ├── settings.py
│ ├── urls.py
│ └── ...
│── students/
│ ├── models.py
│ ├── views.py
│ ├── urls.py
│ ├── forms.py
│ └── templates/
│ └── students/
│ ├── base.html
│ ├── student_list.html
│ ├── student_form.html
│ └── student_confirm_delete.html
⚙ 1. models.py
from django.db import models

class Student(models.Model):
    name = models.CharField(max_length=100)
    email = models.EmailField(unique=True)
    age = models.IntegerField()
    course = models.CharField(max_length=50)

    def __str__(self):
        return self.name
2. forms.py
from django import forms
from .models import Student

class StudentForm(forms.ModelForm):
    class Meta:
        model = Student
        fields = ['name', 'email', 'age', 'course']
3. views.py
from django.shortcuts import render, redirect, get_object_or_404
from .models import Student
from .forms import StudentForm

# List + Search
def student_list(request):
    query = request.GET.get("q")
    if query:
        students = Student.objects.filter(name__icontains=query)
    else:
        students = Student.objects.all()
    return render(request, "students/student_list.html", {"students": students})

# Create
def student_create(request):
    if request.method == "POST":
        form = StudentForm(request.POST)
        if form.is_valid():
            form.save()
            return redirect("student_list")
    else:
        form = StudentForm()
    return render(request, "students/student_form.html", {"form": form})
# Update
def student_update(request, pk):
    student = get_object_or_404(Student, pk=pk)
    if request.method == "POST":
        form = StudentForm(request.POST, instance=student)
        if form.is_valid():
            form.save()
            return redirect("student_list")
    else:
        form = StudentForm(instance=student)
    return render(request, "students/student_form.html", {"form": form})

# Delete
def student_delete(request, pk):
    student = get_object_or_404(Student, pk=pk)
    if request.method == "POST":
        student.delete()
        return redirect("student_list")
    return render(request, "students/student_confirm_delete.html", {"student": student})
🛣 4. students/urls.py
from django.urls import path
from . import views

urlpatterns = [
    path('', views.student_list, name='student_list'),
    path('new/', views.student_create, name='student_create'),
    path('edit/<int:pk>/', views.student_update, name='student_update'),
    path('delete/<int:pk>/', views.student_delete, name='student_delete'),
]
5. Main urls.py (student_mgmt/urls.py)
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('admin/', admin.site.urls),
    path('students/', include('students.urls')),
]
6. Templates
base.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Student Management</title>
<link
href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/bootstrap.min.css"
rel="stylesheet">
</head>
<body class="bg-light">
    <!-- child templates inject their content into this block -->
    {% block content %}{% endblock %}
</body>
</html>
student_list.html
{% extends "students/base.html" %}
{% block content %}
{% endblock %}
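The body between the block tags was lost in this copy; a minimal sketch consistent with the student_list view above (search parameter q, context variable students) could be:
{% extends "students/base.html" %}
{% block content %}
<form method="get" class="mb-3">
    <input type="text" name="q" placeholder="Search by name">
    <button type="submit" class="btn btn-primary">Search</button>
</form>
<ul class="list-group">
    {% for s in students %}
    <li class="list-group-item">
        {{ s.name }} ({{ s.course }})
        <a href="{% url 'student_update' s.pk %}">Edit</a>
        <a href="{% url 'student_delete' s.pk %}">Delete</a>
    </li>
    {% endfor %}
</ul>
<a href="{% url 'student_create' %}" class="btn btn-success mt-3">Add Student</a>
{% endblock %}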
student_form.html
{% extends "students/base.html" %}
{% block content %}
{# The opening of this form was lost; csrf_token and form.as_p below are the standard Django pattern #}
<div class="card p-4">
    <form method="post">
        {% csrf_token %}
        {{ form.as_p }}
        <div>
            <button type="submit" class="btn btn-success">Save Student</button>
        </div>
    </form>
</div>
{% endblock %}
student_confirm_delete.html
{% extends "students/base.html" %}
{% block content %}
{% endblock %}
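Here too the block body was lost; a sketch matching the student_delete view above (deletion confirmed via POST) could be:
{% extends "students/base.html" %}
{% block content %}
<h3>Delete Student</h3>
<p>Are you sure you want to delete {{ student.name }}?</p>
<form method="post">
    {% csrf_token %}
    <button type="submit" class="btn btn-danger">Yes, delete</button>
    <a href="{% url 'student_list' %}" class="btn btn-secondary">Cancel</a>
</form>
{% endblock %}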
Output: