
GOVERNMENT POLYTECHNIC BRAMAHAPURI

DEPARTMENT OF COMPUTER TECHNOLOGY

INTERNSHIP REPORT
SESSION 2025-2026

IT-NetworkZ Infosystems Pvt. Ltd.


(Kavin India Pvt. Ltd.)

TECHNICAL HEAD: Mrs. Shalani Kharkate

PRESENTED BY: Rahil Khaparde



DECLARATION

During my internship at IT-NetworkZ Infosystems Pvt. Ltd. and the preparation
of this report, I realized that this work is the result of joint guidance,
assistance, and co-operation, and that it could not have been completed
without the help I received. It is a matter of great privilege to express my
deep sense of gratitude towards my guide, Mrs. Shalani Kharkate, at
IT-NetworkZ Infosystems Pvt. Ltd., Nagpur. I am extremely thankful to her for
the constant motivation and inspiration extended throughout the internship
work, which made it possible for me to complete the work in the scheduled
time. My sincere thanks to all the faculty members.

Submitted By:

Rahil Khaparde



ACKNOWLEDGEMENT
I have put great effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations, and I would
like to extend my sincere thanks to all of them.

I am highly indebted to Mrs. Shalani Kharkate for her guidance and constant
supervision, for providing the necessary information regarding the project, and for
her support in completing it. I would like to express my gratitude toward my parents
and the members of IT-NetworkZ Infosystems Pvt. Ltd. for their kind co-operation and
encouragement, which helped me complete this project.

I would like to give my special gratitude and thanks to the industry persons for
giving me their attention and time.

My thanks and appreciation also go to my colleagues who helped in developing the
project and to the people who willingly helped me with their abilities.

TECHNICAL HEAD: Mrs. Shalani Kharkate

PRESENTED BY: Rahil Khaparde



INDEX

Topics                                                                    Pages

Module 1: AI USING PYTHON

Chapter 1: INTRODUCTION TO ARTIFICIAL INTELLIGENCE                           12
Chapter 2: PYTHON BASICS FOR AI                                              15
Chapter 3: DATA HANDLING USING PANDAS AND NUMPY                              21
Chapter 4: DATA VISUALIZATION WITH MATPLOTLIB AND SEABORN                    28
Chapter 5: ROLE OF AI IN REAL LIFE AND FUTURE TRENDS                         34
Chapter 6: REFERENCE                                                         36

Module 2: DATA SCIENCE

Chapter 1: INTRODUCTION TO DATA SCIENCE                                      37
Chapter 2: PYTHON BASICS FOR DATA SCIENCE                                    42
Chapter 3: STATISTICS AND PROBABILITY                                        50
Chapter 4: DATA WRANGLING USING PANDAS AND NUMPY                             61
Chapter 5: DATA VISUALIZATION AND DATA ANALYSIS                              66
Chapter 6: MACHINE LEARNING WITH SCIKIT-LEARN AND ML PIPELINE ALGORITHMS     73
Chapter 7: REFERENCE                                                         79


COMPANY INFORMATION
Name: IT-NetworkZ Infosystems Pvt. Ltd.

INTRODUCTION
M/s IT-NetworkZ is an IT company, part of the Information Technology industry, set up
on 22 May 2007. IT-NetworkZ started its operations with Information Technology services
including IT Infrastructure Management and Professional IT Training & Online Exam
Facility; later, in December 2009, the company established a Software Development wing,
and on 26 June 2015 it was incorporated as IT-NetworkZ Infosystems Pvt. Ltd. The company
has a presence in India and in South Africa: its head office is situated in Laxmi Nagar,
and branch offices are at Nandanvan (Nagpur), Cape Town, and Johannesburg (South Africa).
The company runs all its operations independently per geographic area; India operations
are handled by a team of 20-25 professionals.

SERVICE CATEGORIES:
1) IT Training & International Assessment:

IT-NetworkZ has very strong bonds with educational institutions and has
established around 28 MoUs with esteemed institutions under MSBTE, UGC,
and AICTE; these MoUs cover student and faculty development. The technical
team members (GENwork) have fine experience and a keen interest in teaching.
The company has trained around 4000+ candidates under its banner, and
usually 7200+ students attend IT-NetworkZ tech sessions every year.
IT-NetworkZ is an ex-authorized Prometric and Pearson VUE test centre for
international IT exams. Currently, it is authorized by the Kryterion Testing
Network for the reputed Salesforce and other international IT exams; the IT
giant Persistent Systems Ltd. took an initiative with the company to start
this facility for needy candidates.

2) Live Project & Internship


Live projects and internships turn students into professionals. Being a part of
the IT industry, IT-NetworkZ management started this initiative to produce more
quality, industry-ready professionals. The company provides 6-week, 6-month, and
1-year internships/live projects for final-year and graduate candidates.

3) Software Development:
IT-NetworkZ has a team of enthusiastic and creative developers and
designers. The company provides standalone, web-application, and mobile-app
development for various clients; till now the company has completed various
projects and is working for some esteemed clients in the hospitality,
education, and government sectors. As per market demand and its own
strengths, the company has planned service-based solutions in matrimony,
employment, education listing, electronic test systems, venue searching,
etc. The company is also planning to develop a few solutions for the
healthcare industry and for professionals. It provides 6-week, 6-month, and
1-year internships/live projects for final-year and graduate candidates.

AWARDS:

1) Awarded as one of the TOP TEN Prometric Test Centres in the world, out of 5600+
centres, for the year 2013-14.

2) Microsoft Network Partner


3) 700 Tech Sessions Delivered to 20,000 Students

PRODUCTS:

Microsoft Dot Net

Virtualization Cloud

CCNA

Linux

Hardware

Security



List of Tables

Module 1: AI USING PYTHON

Table                                                                      Page

1.1: Types of AI and Their Description                                       11
1.2: Comparison Between Human Intelligence and AI                            13
2.1: Python In-Built Data Types                                              15
2.2: Common Python Libraries and Their Purpose                               19
3.1: Sampled Grouped Dataset Summary                                         26
5.1: AI Applications by Industry                                             33



Module 2: DATA SCIENCE

Table                                                                      Page

2.1: Built-In Data Types in Python                                           43
2.2: Common Python Libraries                                                 48
3.1: Sample Dataset with Calculation                                         51
3.2: Difference Between Binomial, Poisson and Normal Distribution            56
4.1: Common Imputation Methods: Advantages, Drawbacks, and Typical Use Cases 61
4.2: Example of Encoding a “color” Column                                    64
5.1: Summary Statistics Example                                              71
6.1: Difference Between Supervised and Unsupervised Learning                 72



List of Figures
Module 1: AI Using Python

Figure                                                                     Page

2.1: Flowchart of Conditional Statement                                      15
3.1: Data Cleaning Cycle                                                     24
4.1: Line Plot                                                               28
4.2: Bar Plot                                                                28
4.3: Scatter Plot                                                            29
4.4: Histogram                                                               29
4.5: Heatmap – Subject Correlation                                           30
4.6: Box Plot – Score Distribution by Subject                                30
4.7: Violin Plot – Value Distribution by Group                               31
4.8: Pair Plot – Iris Dataset                                                31
4.9: Pie Chart – Global GDP Share by Country (2025)                          32



List of Figures
Module 2:Data Science

Figure                                                                     Page

1.1: Life Cycle of a Data Science Project                                    40
3.1: Bell Curve – Normal Distribution                                        52
3.2: Skewness Diagram – Left, Right, and Normal Skew                         53
3.3: Boxplot with Outliers                                                   59
4.1: Encoding Table – Label & One-Hot Encoding                               64
5.1: Line Chart – Monthly Sales Trend                                        65
5.2: Bar Chart – Fruit Preference Survey                                     66
5.3: Heatmap – Iris Feature Correlation                                      67
5.4: Pairplot – Titanic Dataset                                              67
5.5: Correlation Heatmap – Titanic Features                                  70
5.6: Age Histogram                                                           71
5.7: Age Boxplot                                                             71
5.8: Fare Histogram                                                          71
5.9: Fare Boxplot                                                            71
6.1: Confusion Matrix – Breast Cancer Classification                         75
6.2: ML Pipeline Diagram – Training and Inference Flow                       77



CHAPTER 1
INTRODUCTION TO ARTIFICIAL INTELLIGENCE

What is AI:
Artificial intelligence is a branch of computer science that enables machines and software
to perform tasks once thought to require human intelligence—learning from data,
recognizing patterns, understanding and generating natural language, reasoning, and
making decisions. By leveraging techniques like machine learning, deep learning, natural
language processing, and computer vision, AI systems can process vast amounts of
information, adapt to new inputs, and continually improve their accuracy. From powering
virtual assistants and recommendation engines to driving medical diagnostics and
autonomous vehicles, AI transforms how we analyze data, solve problems, and interact
with technology.

History and Evolution of AI:


Artificial Intelligence emerged in the mid-20th century when Alan Turing posed the
question “Can machines think?” in 1950 and the Dartmouth workshop of 1956 coined the
term AI, launching early research into symbolic reasoning and problem solving. The 1970s
and 1980s saw promising rule-based expert systems like MYCIN and XCON, but
enthusiasm waned during periods known as “AI winters” when funding and interest
declined. A resurgence began in the 1990s as statistical methods and increased computing
power enabled machine-learning algorithms to process larger datasets, highlighted by
IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997. The 2010s
ushered in a deep-learning revolution with breakthroughs such as AlexNet’s success in
image recognition and Google’s AlphaGo mastering the game of Go. Today, AI spans
diverse applications—from natural language models like GPT-3 to autonomous vehicles—
while research focuses on ethical, explainable, and general intelligence.

Types of AI:

Type         Description
Narrow AI    Systems designed for a specific task, excelling within a limited scope
General AI   Hypothetical systems with human-level cognitive abilities across domains
Super AI     Future concept where AI surpasses human intelligence in virtually all areas

Table 1.1: Types of AI and Their Description



Applications of AI in Real Life:

1)Healthcare Diagnostics:
AI models analyze medical imaging (X-rays, MRIs) and patient data to detect
diseases early, assist in treatment planning, and predict patient outcomes.

2)Autonomous Transportation
Self-driving cars and drones use computer vision, sensor fusion, and
reinforcement learning to navigate roads and airspace safely, reducing human
error.

3)Customer Service Chatbots


Virtual assistants powered by natural language processing handle common
inquiries, troubleshoot issues, and provide 24/7 support, freeing human agents for
complex cases.

4)Financial Fraud Detection


Machine learning algorithms monitor transactions in real time, flag anomalous
patterns, and prevent fraudulent activities, safeguarding banks and customers.

Challenges and Limitations of AI:


1. Data Quality and Availability
AI models require large, high-quality datasets, but obtaining sufficiently diverse and
unbiased data can be difficult. Poor or unrepresentative data leads to inaccuracies and
reinforces societal biases.

2. Interpretability and Transparency


Complex models—especially deep neural networks—often operate as “black boxes,”
making it hard to understand how decisions are made. This lack of explainability
undermines trust in critical applications.

3. Computational Resources and Scalability


Training state-of-the-art AI demands vast computational power, specialized hardware
(GPUs/TPUs), and significant energy, which can be cost-prohibitive and have
environmental impacts.

4. Robustness and Generalization


AI systems can fail when encountering data that differ from their training distribution or
under adversarial conditions. Ensuring reliable performance in dynamic, real-world
environments remains a major hurdle.



5. Ethical and Legal Concerns
Issues around data privacy, algorithmic fairness, accountability, and job displacement pose
ethical dilemmas. Regulatory frameworks often struggle to keep pace with rapid AI
advancements.

6. Security and Adversarial Attacks


Malicious actors can exploit vulnerabilities—through data poisoning or adversarial
examples—to deceive or subvert AI systems, threatening their safety and integrity.

Future of AI:
The future of AI involves smarter, more adaptive systems that work alongside humans in
areas like healthcare, education, and industry. AI will become more ethical and transparent,
with better rules to guide its use. It will be part of everyday life—powering smart homes,
personalized services, and faster decision-making. As AI grows, it will also help in
scientific discoveries and create new job opportunities, though it may replace some tasks.

Table 1.2: Comparison Between Human Intelligence and AI



CHAPTER 2
PYTHON BASICS FOR AI

Why Python for AI:


Python is widely used for AI because it is simple, readable, and has a large collection of
powerful libraries like TensorFlow, PyTorch, scikit-learn, and NumPy that make building
AI models easier. Its clear syntax allows developers to focus on solving problems rather
than dealing with complex code. Python also has strong community support, making it easy
to find resources, tutorials, and help. Additionally, it integrates well with other tools and
platforms, making it ideal for both research and production-level AI applications.

Python Setup:

• Steps to setup python

Step 1: Download the installer from https://www.python.org/downloads and save it locally.

Step 2: Run the downloaded installer.

• Check "Add Python 3.13 (your version) to PATH" so the interpreter is added to the environment variables.
• Click "Install Now".

Step 3: After installation completes, open Command Prompt.

Step 4: Verify Python is installed:

By using “python --version”

Step 5: Now install any libraries you need.

For example: pip install numpy pandas scikit-learn

Recommended IDEs

1. Visual Studio Code: Free, highly extensible, integrated terminal, excellent Python
extensions.
2. PyCharm: Feature-rich (code analysis, refactoring tools), Community (free) and
Professional editions.



3. Jupyter Notebook: Interactive cells for data exploration and visualization,
great for learning and prototyping.
4. Spyder: Scientific IDE with integrated console, variable explorer, and
plotting—ideal for data science.

Data Types and Variables in Python:

Variables in Python are names that reference values. You assign a value to a
variable using the equals sign, for example count = 10. Python infers the
variable’s type automatically based on the value you provide.

Data Type   Description                                Example
int         Whole numbers                              age = 25
float       Decimal (floating-point) numbers           price = 9.99
str         Text (string of characters)                name = "Rahil"
bool        Boolean values (True or False)             active = True
list        Ordered, mutable sequence                  fruits = ["apple", "banana"]
tuple       Ordered, immutable sequence                coords = (10, 20)
dict        Unordered collection of key-value pairs    student = {"name": "Rahil", "grade": "A"}
set         Unordered collection of unique items       unique_ids = {1, 2, 3}
NoneType    Represents "no value"                      result = None

Table 2.1: Python In-Built Data Types

Control Statements in Python:


In Python, control statements are used to manage the flow of execution in a program. They
allow decisions to be made, loops to be executed, and specific actions to be taken based on
conditions. These statements are essential for writing dynamic and responsive code.



Figure 2.1:Flowchart of Conditional Statement

1.Conditional Statement: These let the program choose between different paths
based on conditions.

2.Transfer Statement: These modify the behaviour of loops.

3.Iterative Statements: These repeat a block of code multiple times.
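Example (a minimal sketch combining all three kinds of control statements; the values are illustrative):

score = 72
if score >= 90:                # conditional statement
    grade = "A"
elif score >= 60:
    grade = "B"
else:
    grade = "C"
print(grade)                   # B

for n in range(10):            # iterative statement
    if n == 5:
        break                  # transfer statement: exit the loop early
    print(n)                   # prints 0 to 4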

Functions and Modules:

o Function: A function in Python is a named block of code designed to


perform a specific task. You define a function once and call it wherever
you need that functionality, which makes your code more modular and
easier to maintain.
1. Defining a function: Use def followed by the function name and
parameters in parentheses, then a colon.

Example:
def greet(name):
    print("Hello " + name)

2. Calling a function: Invoke it by its name and pass the required
arguments.

Example: greet("Rahil")

3. Returning values: Use return to send a result back to the caller.

Example:
def add(a, b):
    return a + b

o Modules: A module is a file containing Python definitions—functions, classes, or


variables—that you can import into other scripts. Modules help you organize related
code into separate namespaces.

1.Creating a Module: Save your code in a .py file (e.g., math_utils.py) and
place it in your project directory

2.Importing a module: Use import to bring its contents into another script

Example:
import math_utils

result = math_utils.add(5, 3)

o Built-in modules: Python includes modules like math, random, and datetime.

Example: import math


print(math.sqrt(16))

o Third-party modules: Install via pip (e.g., pip install numpy) and import them the
same way.

File Handling:

File handling lets you create, read, update, and delete files on disk. Python’s
built-in functions and methods make it easy to work with text and binary files
in just a few lines of code.
1. Opening a File: Use the open() function, which returns a file object. You
must specify the filename and mode

Example: f = open("example.txt", "r")

2. Reading from a File: Once open, you can read the entire content or
iterate line by line:



Example: content = f.read() # read whole file as one string
line = f.readline() # read a single line
lines = f.readlines() # read all lines into a list

3.Writing to a file: Open in "w" or "a" mode, then use write() or writelines()
Example: f = open("output.txt", "w")
f.write("Hello, World!\n") # write a single string
f.writelines(["Line 1\n", "Line 2\n"])

Object Oriented Programming Basics:

Object-Oriented Programming (OOP) is a programming paradigm focused on


organizing code into objects, which are instances of classes. This helps structure complex
programs in a more modular and reusable way.

1. Class: A blueprint for creating objects. It defines the properties (attributes) and
behaviors (methods) an object can have.

Example:
class Car:
    def __init__(self, brand, color):
        self.brand = brand
        self.color = color

2. Object: An instance of a class. It represents a specific entity with unique data.

Example: my_car = Car("Toyota", "Red")

3. Constructor (__init__):A special method that initializes an object’s properties


when it’s created.

Example:
    def __init__(self, brand, color):
        self.brand = brand
        self.color = color

4. Methods: Functions defined inside a class that describe behaviors

Example:
    def drive(self):
        print(self.brand + " is driving")

5. Inheritance: One class can inherit properties and methods from another.
Example:
class ElectricCar(Car):
    def charge(self):
        print("Charging battery")



6. Encapsulation: Protects internal state by restricting access to certain attributes
using underscores or property decorators.

Example: self._speed = 0
7. Polymorphism: Allows objects of different classes to be treated as if they are
instances of the same class (usually by method overriding).

Example:
class Bike:
    def drive(self):
        print("Bike is driving")

vehicle = Car("Honda", "Blue")
bike = Bike()

for v in [vehicle, bike]:
    v.drive()
8. Abstraction: Abstraction means hiding complex internal details and showing only
the essential features to the user. It helps simplify implementation and enhances
security by exposing only the necessary parts of the code.

• In Python, abstraction is often achieved using abstract classes and methods


via the abc module (Abstract Base Classes). You define a general template
and force subclasses to implement specific behaviors.
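A minimal sketch of this pattern (the class names are illustrative):

Example:
from abc import ABC, abstractmethod

class Shape(ABC):                      # general template
    @abstractmethod
    def area(self):                    # every subclass must implement this
        pass

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2

print(Circle(2).area())                # Shape() itself would raise TypeError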

Introduction to Libraries:
In Python, a library is a collection of modules containing pre-written code that helps
you perform common tasks without having to write everything from scratch. Libraries
speed up development, promote code reuse, and are often optimized for performance and
reliability.

Library        Purpose
NumPy          Numerical computations
pandas         Data analysis and manipulation
matplotlib     Data visualization
seaborn        Statistical plotting built on matplotlib
scikit-learn   Machine learning

Table 2.2: Common Python Libraries and Their Purpose



Chapter 3
Data Handling Using Pandas and Numpy

Introduction to Numpy
NumPy (short for Numerical Python) is a powerful library used for numerical computing in
Python. It provides efficient array operations, mathematical functions, and tools for
working with large datasets—making it a foundation for scientific and data-driven tasks.

• Key Features

1. ndarray: Core data structure that allows for fast, multi-


dimensional array manipulation.

2. Broadcasting: Apply operations on arrays of different shapes


without explicit looping

3. Vectorized Operations: Speed up calculations by operating on


entire arrays at once.

4. Mathematical Functions: Includes tools for linear algebra,


statistics, trigonometry, etc.

5. Integration with other libraries: Works seamlessly with


pandas, matplotlib, scikit-learn, and more.

Example:
import numpy as np

a = np.array([1, 2, 3])
b = a * 2                        # element-wise (broadcast) multiplication

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
result = np.dot(x, y)            # matrix multiplication



Array Manipulation in Numpy:
Array manipulation refers to the ability to reshape, join, split, and modify arrays to
suit the structure and flow of your data operations. NumPy makes these tasks fast,
clean, and expressive.

• Indexing in NumPy: Indexing lets you access individual elements of an


array by specifying their position. In a one-dimensional array, you use a
single integer inside square brackets. In higher dimensions, you provide a
tuple of indices, one for each axis.

Example:
import numpy as np
arr = np.array([10, 20, 30, 40])
value = arr[2]
last = arr[-1]

matrix = np.array([[1, 2, 3],[4, 5, 6]])


a = matrix[0, 1]
b = matrix[1, 2]

• Slicing in NumPy: Slicing extracts a subarray without copying data (it


returns a view). You specify a start:stop:step slice for each axis. Omitted
values default to the beginning, end, or a step of 1.

Example:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
sub = arr[1:4]
skip = arr[::2]

matrix = np.array([[ 1, 2, 3],[ 4, 5, 6],[ 7, 8, 9]])


upper_left = matrix[:2, :2]
middle_col = matrix[:, 1]
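• Reshaping, joining, and splitting: the other manipulations mentioned at the

start of this section follow the same style. A minimal sketch:

Example:
import numpy as np

arr = np.arange(6)                     # [0 1 2 3 4 5]
matrix = arr.reshape(2, 3)             # 2 rows x 3 columns

a = np.array([1, 2])
b = np.array([3, 4])
joined = np.concatenate([a, b])        # [1 2 3 4]
stacked = np.vstack([a, b])            # 2 x 2 matrix

left, right = np.split(joined, 2)      # two halves: [1 2] and [3 4]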



Introduction to Pandas:
Pandas is a powerful open-source library built on NumPy that provides two core data
structures—Series (one-dimensional labeled arrays) and DataFrame (two- dimensional
tables with labeled rows and columns). It makes it easy to load data from sources like
CSV, Excel, SQL or JSON; inspect, filter and transform data using intuitive indexing;
handle missing values; compute group-by aggregations; merge or join datasets; and work
with time-series through resampling and rolling operations. With its concise,
expressive API, pandas is the go-to tool for data analysis in Python.

• Methods in Pandas
1. Series: A Series is a one-dimensional, size-immutable labeled array
capable of holding any data type (integers, strings, floats, Python
objects, etc.). Each element in a Series is associated with an index
label, which you can use to access or slice data by label rather than
only by integer position. Series supports vectorized operations,
automatic alignment of data by index, and built-in methods for
descriptive statistics, making it a versatile structure for handling and
analyzing single columns of data

Example:
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])


print(s)
print(s["b"])
print(s * 2)

2. DataFrame: A DataFrame is a two-dimensional, size-mutable


tabular structure with labeled rows and columns. You can think of it
as a spreadsheet or SQL table, where each column can hold a
different data type. It supports intuitive indexing and slicing,
vectorized arithmetic across columns, group-by aggregations, joins
and merges, and built-in methods for descriptive statistics—all with
minimal code.
Example:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Score": [85.5, 92.0, 78.0]}
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
print(df["Age"] * 2)



3. Reading and writing Files: Pandas makes loading and saving data a
one-liner with functions like read_csv/to_csv, read_excel/to_excel,
read_json/to_json, and read_sql/to_sql. These methods auto-detect
headers, infer data types (including dates), handle missing values,
and offer parameters for delimiters, compression, chunked
processing, and index column control.

Example:
import pandas as pd
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)
df_excel = pd.read_excel("report.xlsx", sheet_name="Sheet1")
df.to_json("data.json", orient="records")

Data Cleaning And Preparation:


Data cleaning is the systematic process of detecting and correcting or removing inaccurate,
incomplete, or irrelevant data from a dataset to ensure its quality and reliability for analysis.
It involves handling missing values—either by dropping records or imputing with statistical
estimates—identifying and removing duplicate entries, and standardizing formats and data
types, such as converting date strings to datetime objects and ensuring numeric fields
contain valid numbers. Data cleaning also addresses inconsistencies in categorical values,
encodes categories into numerical representations for modeling, and detects outliers
through statistical methods to decide whether to transform or exclude them. By normalizing
numerical features and validating data ranges, this process turns raw, messy input into a
consistent and accurate foundation for visualization, statistical analysis, and machine
learning, ultimately leading to more trustworthy insights and better decision-making.

1. Removing Null: Remove rows or columns with missing values


using dropna, or replace nulls with a specific value using fillna.

Example:
import pandas as pd

df = pd.DataFrame({
"A": [1, None, 3],
"B": [4, 5, None]
})
df_cleaned = df.dropna() # drops any row with at least one null
df_filled = df.fillna(0) # replaces all nulls with 0
print(df_cleaned)
print(df_filled)



2. Encoding: Convert categorical text into numeric form. Use
astype('category').cat.codes for label encoding and get_dummies for
one-hot encoding.

Example:
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})


df["ColorEncoded"] = df["Color"].astype("category").cat.codes
df_onehot = pd.get_dummies(df, columns=["Color"])
print(df["ColorEncoded"])
print(df_onehot)

3. Normalizing: Scale numeric columns to a common range, often 0–1,


by applying min-max normalization.

Example:
import pandas as pd

df = pd.DataFrame({
"Score": [50, 80, 90],
"Age": [20, 25, 30]
})
df_norm = (df - df.min()) / (df.max() - df.min())
print(df_norm)

Figure 3.1:Data Cleaning Cycle



Merging, Joining, and Grouping Data
• Merging and Joining Data: Merging and joining combine two or more
DataFrames into a single structure based on common key columns or
indexes. The most flexible method is pd.merge, which lets you specify the
type of join—inner, left, right, or outer—to control whether you keep only
matching rows or all rows from one or both tables. You identify one or more
columns to match on (on, left_on, right_on) and pandas aligns rows
accordingly. You can also join on the index using df.join() for simpler one-
to-one or one-to-many joins.

Example:
import pandas as pd
df1 = pd.DataFrame({"order_id": [1, 2, 3], "customer": ["Alice", "Bob", "Charlie"]})

df2 = pd.DataFrame({ "order_id": [2, 3, 4],"total": [150, 200, 50]})

merged = pd.merge(df1, df2, on="order_id", how="inner")

• Grouping Data: Grouping partitions a DataFrame into subsets based on the


values of one or more columns, then applies an aggregation, transformation,
or filter to each group. The groupby operation follows the split-apply-
combine paradigm: split your data into groups, apply a function (like sum,
mean, count), and combine the results back into a new DataFrame or Series.
You can aggregate multiple functions at once or apply custom lambda
functions

Example:
import pandas as pd

df = pd.DataFrame({
"department": ["HR", "Engineering", "HR", "Engineering"],
"salary": [50000, 80000, 52000, 82000]})

grouped = df.groupby("department").agg(
count=("salary", "size"),
average_salary=("salary", "mean"))



Table 3.1:Sampled Grouped Dataset Summary



Chapter 4
Data Visualization with Matplotlib and Seaborn

Importance of Data Visualization:


Data visualization turns raw numbers into graphical representations—charts, maps, and
dashboards—that make complex information more accessible and understandable. By
translating data into visual formats, it leverages our natural ability to spot patterns, trends,
and outliers at a glance.

Key Benefits of Data Visualization:

1. Enhancing comprehension: Visuals simplify dense datasets,


helping users grasp insights without wading through tables or code.

2. Revealing patterns and correlations: Charting relationships


uncovers trends or anomalies that might remain hidden in raw data.

3. Facilitating communication: Clear graphs bridge gaps between


technical and non-technical audiences, making presentations more
persuasive.

4. Accelerating decision-making: Quick visual cues speed up


analysis, enabling faster, more informed business or research
choices.

5. Driving engagement: Interactive dashboards invite exploration,


fostering curiosity and deeper understanding.

Basic Plotting with Matplotlib:

Matplotlib is a popular Python library for creating static, animated, and


interactive visualizations. It provides full control over your plots and
integrates well with NumPy and pandas, making it perfect for data analysis.



Visualization with Matplotlib:

1. Line Plot: A line plot connects data points with straight lines, ideal
for showing trends or changes over time. It’s commonly used in
time-series analysis to visualize growth, patterns, or relationships
between variables.

Example: Figure 4.1 Line Plot

2. Bar Plot: A bar plot displays categorical data using rectangular bars
with lengths proportional to the values they represent. It’s useful for
comparing quantities across categories.

Example: Figure 4.2 Bar Plot



3. Scatter Plot: A scatter plot shows the relationship between two
continuous variables as individual points. It helps identify patterns,
correlations, or clusters.

Example: Figure 4.3 Scatter Plot

4. Histogram: A histogram groups numeric data into bins and counts


how many values fall into each bin. It’s used to understand the
distribution of a single continuous variable.

Example: Figure 4.4 : Histogram
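The figures above come from the original report; a minimal sketch that reproduces the
same four plot types from illustrative synthetic data:

Example:
import matplotlib.pyplot as plt
import numpy as np

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 130, 170]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(months, sales, marker="o")          # line plot: trend over time
axes[0, 0].set_title("Line Plot")

axes[0, 1].bar(months, sales)                       # bar plot: compare categories
axes[0, 1].set_title("Bar Plot")

x = np.random.rand(50)
axes[1, 0].scatter(x, 2 * x + np.random.rand(50))   # scatter plot: relationship
axes[1, 0].set_title("Scatter Plot")

axes[1, 1].hist(np.random.randn(500), bins=20)      # histogram: distribution
axes[1, 1].set_title("Histogram")

plt.tight_layout()
plt.show()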



Advanced Visualization with Seaborn:
Seaborn:

Seaborn is a high-level data visualization library built on top of Matplotlib,


designed specifically for statistical graphics. It simplifies the creation of complex
plots and enhances them with attractive default styles and informative visual elements.

Visualization With Seaborn:

1. Heatmap: Visualizes correlations or frequency in a grid layout using color


intensities.

Example: Figure 4.5 Heatmap- Subject Correlation

2. Box Plot: Displays distribution, median, and outliers per category.

Example: Figure 4.6 Boxplot- Score Distribution by Subject



3. Violin Plot: Combines box plot with a density curve to show distribution shape.

Example: Figure 4.7 Violin Plot – Value Distribution by Group

4. Pair Plot: Plots pairwise relationships between multiple variables.

Example: Figure 4.8 Pair Plot – Iris Dataset
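A minimal sketch of these Seaborn plots, using the tips dataset that ships with the
library (the column names come from that dataset):

Example:
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                          # small built-in sample dataset

sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True)   # correlation heatmap
plt.show()

sns.boxplot(data=tips, x="day", y="total_bill")          # distribution per category
plt.show()

sns.violinplot(data=tips, x="day", y="total_bill")       # box plot plus density curve
plt.show()

sns.pairplot(tips[["total_bill", "tip", "size"]])        # pairwise relationships
plt.show()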



Customization Of Graphs:

Customizing graphs helps improve readability and presentation quality. Two important
aspects are adding legends and adjusting colors, both supported in Matplotlib and Seaborn.

• Methods Used to Customize The Graphs

1. Legends: Legends help identify what each line, shape, or color in your
plot represents. You can assign a label to each element and then display
the legend using plt.legend().

2. Colors: You can change colors using named color strings, hex codes, or
predefined palettes

3. Title: Use plt.title() to give your plot a descriptive heading.

4. Axis Labels: Use plt.xlabel() and plt.ylabel() to label the X and Y axes.
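A minimal sketch applying all four customizations to one plot (the data values are illustrative):

Example:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
plt.plot(x, [1, 4, 9, 16], color="#1f77b4", label="squares")   # hex colour + label
plt.plot(x, [1, 2, 3, 4], color="green", label="linear")       # named colour + label
plt.title("Customized Plot")        # descriptive heading
plt.xlabel("x value")               # X-axis label
plt.ylabel("y value")               # Y-axis label
plt.legend()                        # show the labels assigned above
plt.show()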

Plotting Real-world AI Data:

Plotting real-world AI data involves visualizing metrics, predictions, and patterns from
datasets used in machine learning models. This helps analysts, engineers, and stakeholders
interpret results, track performance, and make informed decisions.

Example: Figure 4.9 Pie Chart – Global GDP Share by Country (2025)



Chapter 5

ROLE OF AI IN REAL LIFE AND FUTURE TRENDS

AI in Healthcare:

AI is transforming healthcare by turning vast amounts of patient data and medical imagery
into actionable insights. Deep learning algorithms now outperform humans in detecting
conditions such as diabetic retinopathy and certain cancers from imaging scans, enabling
earlier and more accurate diagnoses. In the operating room, robotic surgery platforms
guided by AI deliver sub-millimeter precision, reducing recovery times and complication
rates. Beyond diagnosis and surgery, AI-driven tools personalize treatment plans by
predicting individual responses to medications and flagging potential adverse reactions.

AI in Transportation:

The transportation sector is rapidly shifting from human-driven to AI-driven mobility. Self-
driving cars and trucks leverage sensor fusion—combining lidar, radar, and cameras—with
reinforcement learning to navigate complex road environments. Smart traffic control
systems analyze real-time vehicle and pedestrian flows to optimize signal timing, slash
congestion, and cut emissions. In logistics, AI plans routes that minimize fuel consumption
and delivery times, while predictive maintenance flags mechanical issues before
breakdowns occur.

AI in Education and Agriculture:

In education, AI creates adaptive learning platforms that tailor content, pacing, and
assessments to each student’s strengths and weaknesses, boosting engagement and
outcomes. Automated grading and AI-powered tutoring free instructors to focus on
mentorship and creativity. In agriculture, machine learning models analyze satellite
imagery, soil sensors, and weather forecasts to predict crop yields, detect nutrient
deficiencies, and optimize irrigation schedules. Drones equipped with computer vision
identify pests or diseases at an early stage, enabling targeted intervention and reducing
chemical usage.

• AI Applications by Industry
Industry         Key Applications
Healthcare       Disease detection, robotic surgery, personalized medicine
Transportation   Autonomous vehicles, smart traffic management
Education        Adaptive learning, automated grading
Agriculture      Crop yield prediction, precision farming, pest detection

Table 5.1: AI Applications by Industry



Ethical Implications and Job Automation:

As AI systems gain autonomy, they raise critical ethical questions. Algorithmic bias—
stemming from unbalanced training data—can perpetuate discrimination in hiring, lending,
and law enforcement. Widespread surveillance capabilities threaten privacy, while AI-
driven automation puts entire job categories at risk, from customer service to factory work.
Balancing these challenges calls for transparent model audits, robust data-privacy
regulations, and large-scale workforce upskilling programs to prepare people for new roles
created by AI’s growth.

Future Trends in AI:

The next frontier of AI blends cutting-edge research with practical applications. Quantum
AI promises to accelerate complex optimizations and simulations by harnessing quantum
computing’s parallelism. Emotion AI, or affective computing, aims to recognize and
respond to human feelings through voice, facial expression, and physiological signals,
ushering in more empathetic interfaces. Hybrid models that combine neural networks with
symbolic reasoning will offer both the pattern-recognition power of deep learning and the
logical transparency of rule-based systems. As these trends converge, AI will become more
powerful, adaptable, and human-centric.



Chapter 6
REFERENCE

1. https://www.cogniteq.com/blog/how-python-powers-artificial-intelligence-tools-libraries-and-use-cases

2. https://datashark.academy/comparison-of-different-python-frameworks-for-artificial-intelligence-development

3. https://www.pythoncentral.io/tensorflow-pytorch-or-scikit-learn-a-guide-to-python-ai-frameworks

4. https://developers.google.com/machine-learning

5. https://learn.microsoft.com/en-us/azure/machine-learning/

6. https://realpython.com/tutorials/machine-learning/



Chapter 1
INTRODUCTION TO DATA SCIENCE
What is Data Science:
Data science is an interdisciplinary field that blends statistical analysis, computer
programming, and domain expertise to transform raw data into meaningful insights and
decision‐making tools. By gathering and cleaning diverse datasets—from customer
transactions to sensor logs—data scientists uncover hidden patterns through exploratory
analysis and predictive modelling. They leverage machine learning algorithms to build
models that forecast trends, recommend actions, or automate complex processes, then
deploy these models into real‐world applications where performance is continuously
monitored and refined. Ultimately, data science empowers organizations to understand
behaviour, optimize operations, and innovate products and services in an increasingly data‐
driven world.

Evolution and History of Data Science:


• Origins in Statistics and Early Computing:
Data science began as a marriage between statistics and emerging
computing power in the mid-20th century. Pioneers like Ronald Fisher and
John Tukey laid the statistical foundations, while scientists leveraged
mainframe computers for complex calculations. Early work focused on
experimental design, sampling theory, and numerical methods—crucial first
steps toward extracting insights from large datasets.

• Coining the Term “Data Science”:


Although data analysis existed for decades, the phrase “data science” was
popularized only in the early 2000s. In 1962, John Tukey’s paper “The
Future of Data Analysis” hinted at this new discipline, but it wasn’t until
William Cleveland’s 2001 keynote that data science was framed as a
standalone field blending statistics, informatics, and visualization.

• Rise of Big Data and Data Mining:


The internet boom of the 1990s and 2000s flooded organizations with
unprecedented volumes of information. Database technologies, data
warehouses, and mining algorithms (like decision trees and clustering)
emerged to handle petabytes of logs, customer records, and sensor feeds.
This era birthed scalable storage (Hadoop) and parallel processing
techniques that underpin today’s data lakes.



• Modern Era: Machine Learning and AI:
From around 2010 onward, advances in machine learning and deep learning
transformed data science into a predictive powerhouse. Frameworks like
Scikit-learn, TensorFlow, and PyTorch democratized access to sophisticated
algorithms, enabling practitioners to build recommendation engines, natural
language models, and computer vision systems with minimal code.

• Professionalization and Future Directions:


Today, data science is a recognized profession defined by roles such as data
engineer, analyst, and scientist. Automated Machine Learning (AutoML),
augmented analytics, and ethical AI practices are shaping its next phase. As
we look ahead, the field will continue evolving around explainability,
fairness, and real-time decisioning—ensuring data science remains at the
heart of innovation.

Importance and Impact of Data Science:


• Importance of Data Science:
Data science equips organizations to turn raw data into clear, actionable
insights. By applying statistical analysis, machine learning, and domain
expertise, companies can forecast trends, optimize operations, and tailor
products to customer needs. This capability underpins smarter decision-
making, drives innovation, and creates measurable business value.

➢ Enables data‐driven strategy instead of intuition

➢ Automates routine tasks and processes for greater efficiency

➢ Uncovers hidden patterns and opportunities in complex datasets

➢ Supports real-time analytics for rapid response to market changes

Impact of Data Science:


Data science delivers profound effects across industries and society, reshaping how we live
and work. From improving patient outcomes in healthcare to enhancing supply-chain
resilience, its reach spans every sector. As adoption grows, so does its potential to solve
global challenges and spur economic growth.



Real-World Applications Of Data Science:
1. Healthcare: Data science transforms patient care by mining electronic health
records to predict hospital readmissions and adverse events. Machine learning
models also analyze medical images—like MRIs and CT scans—to detect tumors
and other anomalies faster and with high accuracy.

2. Finance: In banking and investment, data-driven algorithms monitor transaction


streams in real time to flag fraudulent behavior and money-laundering schemes.
Quantitative trading strategies leverage historical market data and sentiment
analysis to optimize trade execution and portfolio allocation.

3. Retail: Retailers apply demand-forecasting models to anticipate sales spikes and


prevent stockouts or overstock situations. Recommendation engines use
collaborative and content-based filtering to personalize product suggestions,
boosting conversion rates and customer loyalty.

4. Agriculture: Satellite imagery combined with ground sensor data powers


predictive models for crop yield estimation and soil moisture assessment. Data
analytics also forecast pest infestations and disease outbreaks, enabling farmers to
deploy targeted interventions and reduce chemical use.

5. Cybersecurity: Security operations centers utilize anomaly detection techniques


and user-behavior analytics to spot suspicious access patterns before breaches
occur. Predictive threat intelligence models sift through vast threat feeds—malware
hashes, IP blacklists, phishing signatures—to prioritize defenses and automate
incident response.

Life Cycle of a Data Science Project

1. Business Understanding
In this phase, the team works closely with stakeholders to define the
primary goals, success criteria, and constraints of the project. A clear
problem statement is articulated, detailing the decisions the model will
support and the metrics by which its impact will be measured. Securing
alignment on objectives ensures that subsequent efforts remain focused
on delivering real business value

2. Data Acquisition
Once objectives are set, data scientists identify and gather the necessary datasets
from internal systems, third-party providers, APIs, and public repositories. This
stage involves establishing data pipelines, ensuring data quality standards, and



documenting the sources and collection methods. Proper governance and access
controls are put in place to safeguard sensitive information and maintain
compliance.

3. Data Preparation
Raw data often contains missing values, inconsistencies, and noise, so it must be
cleaned and transformed before analysis. Techniques such as imputation,
normalization, and encoding are applied to handle anomalies and convert data into
machine-readable formats. Feature engineering then creates new variables or
aggregates existing ones to capture domain-specific insights and bolster model
performance

4. Exploratory Data Analysis


Analysts dive into the prepared dataset to uncover patterns, correlations, and
outliers through statistical summaries and visualizations. Hypotheses are formulated
and tested, guiding the selection of promising features and revealing potential
pitfalls. This iterative exploration informs both the modeling strategy and any
additional data requirements.

5. Modeling
Data scientists select appropriate algorithms—ranging from linear regression and
decision trees to deep neural networks—to address the problem at hand. Models are
trained, validated, and tuned via techniques like cross-validation and grid search to
strike a balance between bias and variance. Performance metrics aligned with
business objectives (for example, accuracy, F1 score, or RMSE) drive iterative
refinement.

6. Evaluation
Before deployment, models undergo rigorous testing on holdout datasets to assess
generalization performance and robustness. Detailed error analysis highlights
weaknesses, such as model bias or sensitivity to certain data segments. Only once
the model consistently meets or exceeds predefined thresholds does it move to the
production stage.

7. Deployment
The validated model is integrated into production environments, whether as a
RESTful API, batch pipeline, or embedded component within an application.
Engineering teams handle containerization, scaling, and orchestration to ensure
reliability and low latency. Comprehensive documentation and version control
practices facilitate maintenance and future enhancements.

8. Monitoring and Maintenance


After going live, continuous monitoring tracks key metrics such as prediction
accuracy, data drift, and system performance. Alerts and dashboards enable rapid
detection of anomalies or degradation. Periodic retraining schedules and model
updates keep the solution aligned with evolving data distributions and business
requirements.



Figure 1.1: Life Cycle of a Data Science Project



Chapter 2
Python Basics for Data Science

Overview of Python and Its Relevance in Data Science:

• Why Python for Data Science


Python’s simplicity accelerates experimentation and iteration in data-driven
projects. Interactive environments such as Jupyter Notebooks allow inline
code execution, visualization, and narrative text, which is ideal for
exploratory analysis. Its ability to interface with C, C++ and Java libraries
delivers the performance needed for compute-intensive tasks. A vibrant
community contributes to continuous improvement, peer support, and rapid
troubleshooting.

❖ Seamless integration with SQL databases, Hadoop, and cloud


services

❖ Built-in support for parallel and distributed computing frameworks

❖ Mature tooling for version control, testing, and deployment

Setting Up the Python Environment:

1. Install Python
Begin by installing either the official Python release from python.org or the
Anaconda distribution for a data-science–ready setup. Follow the installer prompts
and ensure Python is added to your system PATH. After installation, verify that the
interpreter is accessible.

❖ python --version

2. Create and Activate a Virtual Environment


Isolate project dependencies by creating a virtual environment. Navigate to your
project directory before running the creation command. Activation ensures that
subsequent package installs are scoped to this environment.
▪ python -m venv venv
▪ venv\Scripts\activate (Windows)
▪ source venv/bin/activate (macOS/Linux)



3. Configure Your IDE
Use an IDE such as Visual Studio Code or PyCharm to enhance productivity. In VS
Code, install the Python extension and select your virtual environment’s interpreter
via the status bar. In PyCharm, add the same environment under Settings ▶ Project
▶ Python Interpreter. Enabling linters and formatters guards against errors and
enforces style.

4. Install JupyterLab
With your virtual environment active, install JupyterLab to gain access to both the
classic Notebook interface and the more modular Lab workspace.
▪ pip install jupyterlab

5. Launch and Verify JupyterLab


Start JupyterLab from your terminal. A browser window will open displaying your
workspace. Create a new notebook, confirm the kernel matches your active
environment, and test core imports.
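A minimal check to run in the first notebook cell, confirming the kernel points at your
virtual environment and that the core libraries import cleanly:

import sys
import numpy as np
import pandas as pd
import matplotlib

print(sys.executable)                                    # should point inside your venv folder
print(np.__version__, pd.__version__, matplotlib.__version__)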

Python Syntax, Variables, and Data Types:


• Python Syntax
Python syntax is designed to be clean, readable, and intuitive. Unlike many other
languages, Python uses indentation (whitespace) to define blocks of code instead
of braces {} or keywords

▪ Key Syntax Rules:

❖ Indentation matters: Code blocks must be indented consistently


(usually 4 spaces).

❖ Statements end without semicolons: You don’t need ; at the end


of lines.

❖ Comments start with # and are ignored by the interpreter.

❖ Case-sensitive: Variable, variable, and VARIABLE are treated as


different identifiers

Example:
# This is a comment
if 5 > 3:
    print("Five is greater than three")



• Variables in Python
Variables are used to store data. You don’t need to declare their type
explicitly—Python infers it automatically.

▪ Rules for Naming Variables:

❖ Must start with a letter or underscore (_)


❖ Can contain letters, digits, and underscores
❖ Cannot start with a digit
❖ Avoid using Python keywords (like for, if, class)

▪ Assigning Variables:

❖ x = 10

❖ name = "Rahil"

❖ is_valid = True

❖ a, b, c = 1, 2, 3

Python Data Types:

Python has several built-in data types. Here are the most commonly used:

Data Type   Example                        Description
int         x = 5                          Integer numbers
float       pi = 3.14                      Decimal numbers
str         name = "Rahil"                 Text or string
bool        is_ready = True                Boolean values: True or False
list        nums = [1, 2, 3]               Ordered, mutable collection
tuple       coords = (4, 5)                Ordered, immutable collection
dict        user = {"Name": "Rahil"}       Key-value pairs
set         uni = {1, 2, 3}                Unordered collection of unique elements

Table 2.1: Built-In Data Types in Python



Control Flow Statements
1. Conditional Statements(if, elif,else): Conditional statements let you execute code
based on Boolean tests. The if block runs when its condition evaluates to True. You
can chain additional checks with one or more elif clauses, and use a final else to
catch all other cases. This structure makes multi-way branching clear and readable.

Example:
age = 20

if age >= 18:
    print("Adult")
elif age > 13:
    print("Teenager")
else:
    print("Child")

2. Loops(for,while)

▪ for loops: A for loop iterates directly over the elements of any
iterable—such as lists, tuples, strings, or the output of range().
Python automatically assigns each item in turn to the loop variable
and executes the loop body once per element. This concise approach
is ideal for applying the same operation to every member of a
collection.

Example:
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print(fruit.upper())

▪ while loop: A while loop repeats its block as long as a specified


condition remains True. Before each iteration, Python checks the
condition; if it’s False, execution jumps to the code following the
loop. To avoid infinite loops, update variables within the loop body
so the condition will eventually become False.

Example:
count = 0

while count < 5:
    print("Count is", count)
    count += 1



3. Loop Control Statements (break, continue, else)
Loop control statements adjust the normal flow of iteration:
▪ break
Immediately exits the innermost loop, skipping any remaining
iterations.

▪ continue
Skips the rest of the current iteration and jumps to the next cycle of
the loop.

▪ else on loops
Executes its block only if the loop ends normally (no break was
encountered).

Example:
for num in range(5):
    if num == 3:
        break              # Exit loop when num is 3
    print(num)
else:
    print("Loop completed without break")

Function, Lambda Expressions and Modules


1. Functions
Functions are reusable blocks of code that perform a specific task. They help
organize code, reduce repetition, and improve readability. You define a function
using the def keyword, followed by the function name, parameters in parentheses,
and a colon. The function body is indented, and you can optionally return a value
using the return statement.

Syntax:
def greet(name):
    return f"Hello, {name}!"

Example:
message = greet("Rahil")
print(message) # Output: Hello, Rahil!

Function can Take:


▪ Positional arguments: passed in order
▪ Keyword arguments: passed by name
▪ Default values: used when no argument is provided
▪ Variable-length arguments: using *args and **kwargs
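A minimal sketch of these four argument styles (the function and parameter names are illustrative):

Example:
def order(item, quantity=1, *args, **kwargs):
    # item: positional; quantity: default value;
    # *args: extra positional arguments; **kwargs: extra keyword arguments
    print(item, quantity, args, kwargs)

order("tea")                                    # positional + default
order("tea", quantity=3)                        # keyword argument
order("tea", 2, "large", "iced", sugar=False)   # *args and **kwargs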



2. Lambda Expressions
Lambda expressions are anonymous functions defined using the lambda keyword.
They are typically used for short, throwaway functions—especially as arguments to
higher-order functions like map(), filter(), or sorted().

Syntax:
lambda arguments: expression

Example:
square = lambda x: x ** 2
print(square(5)) # Output: 25

3. Modules
Modules are Python files that contain definitions—functions, classes, variables—
that you can import and reuse in other programs. Python comes with a rich standard
library of modules (like math, random, datetime), and you can also create your
own.

Importing Modules:
import <module_name>

Example:
import math

Popular Python Libraries for Data Science


Python’s popularity in data science is largely due to its rich ecosystem of libraries that
simplify everything from data manipulation to machine learning and visualization. Below is
an overview of the most widely used libraries and their core functionalities.

1. NumPy: NumPy (Numerical Python) is the foundation for numerical computing in


Python. It provides support for multi-dimensional arrays and matrices, along with a
collection of mathematical functions to operate on them efficiently

▪ Create arrays
import numpy as np
arr = np.array([1, 2, 3])

▪ Perform Operations
arr.mean(), arr.sum(), np.dot(arr, arr)

2. Pandas: pandas is used for data manipulation and analysis. It introduces two
powerful data structures: Series (1D) and DataFrame (2D), which make handling
tabular data intuitive and efficient
▪ Load Dataset
import pandas as pd
df = pd.read_csv("data.csv")



▪ Analyse Data:
df.head(), df.describe(), df["column"].value_counts()

3. Matplotlib: Matplotlib is a versatile plotting library for creating static, animated,


and interactive visualizations. It’s often used for line plots, bar charts, scatter plots,
and histograms.

▪ Basic plot:
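The original figure for this step is not reproduced here; a minimal line-plot sketch with illustrative data:

import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [10, 20, 15, 25])   # x values, y values
plt.title("Basic Line Plot")
plt.xlabel("x")
plt.ylabel("y")
plt.show()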

4. Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for


drawing attractive and informative statistical graphics. It simplifies complex
visualizations like heatmaps, violin plots, and pair plots.

▪ Example
import seaborn as sns
sns.histplot(data=df, x="age", kde=True)

5. Scikit-learn: scikit-learn is a comprehensive machine learning library that supports


classification, regression, clustering, dimensionality reduction, and model
evaluation.

▪ Train a Model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)



Library      | Purpose                          | Key Features
NumPy        | Numerical computing              | Arrays, matrix operations, linear algebra
Pandas       | Data manipulation and analysis   | DataFrames, CSV/Excel I/O, filtering, grouping
Matplotlib   | Data visualization (basic plots) | Line plots, bar charts, histograms, customization
Seaborn      | Statistical data visualization   | Heatmaps, violin plots, pair plots, aesthetics
Scikit-learn | Machine learning                 | Classification, regression, clustering, metrics

Table 2.2: Common Python Libraries



Chapter 3
STATISTICS AND PROBABILITY

Importance of Statistics in Data Science


Statistics is the backbone of data science. It provides the mathematical foundation for
understanding data, making predictions, and validating results. Whether you're
cleaning data, building models, or interpreting outcomes, statistical thinking ensures
that your conclusions are reliable and meaningful.

• Why Statistics Matters in Data Science

1. Understanding Data Distributions


Statistics helps describe how data is spread, centered, and shaped.
Measures like mean, median, mode, variance, and standard deviation
reveal patterns and anomalies.

Example: Use histograms and box plots to visualize distributions


and detect outliers.

2. Making Informed Decisions


Statistical inference allows data scientists to draw conclusions about
populations from samples. Techniques like hypothesis testing and
confidence intervals quantify uncertainty

Example: A/B testing uses statistical significance to compare two


versions of a product or feature.

3. Feature Selection and Engineering


Correlation, covariance, and statistical tests help identify which
variables are most relevant to a predictive model.

Example: Use Pearson correlation to find relationships between


features and target variables.

4. Model Evaluation
Metrics such as accuracy, precision, recall, F1-score, and ROC-AUC
are rooted in statistical concepts. They help assess how well a model
performs and whether it generalizes to new data.

Example: Confusion matrices summarize classification performance


using true/false positives and negatives.



5. Handling Uncertainty and Noise
Real-world data is messy. Statistics provides tools to deal with
missing values, noisy observations, and sampling errors.

Example: Use imputation techniques and bootstrapping to improve


model robustness.

Types of Data: Nominal, Ordinal, Interval, Ratio:


1. Nominal Scale
Nominal data classify observations into distinct groups or categories
without any quantitative value or inherent order. You can count
frequencies or compute the mode, but arithmetic operations
(addition, subtraction) and ordering are meaningless.

Example: Assigning survey respondents to “Male” or “Female.”

2. Ordinal Scale
Ordinal data impose a meaningful order on categories, but the
intervals between ranks are not uniform or known. You can compare
which is higher or lower, yet you cannot quantify how much higher.

Example: Movie ratings from 1 (worst) to 5 (best).

3. Interval Scale
Interval data feature ordered values with consistent, meaningful
differences between them, yet lack a true zero point. You can add
and subtract values, compute averages and standard deviations, but
ratios (e.g., “twice as hot”) are not valid.

Example: The difference between 20 °C and 30 °C is the same as the difference

between 30 °C and 40 °C, but 0 °C does not mean "no temperature," so 30 °C is not "twice as hot" as 15 °C.

4. Ratio Scale
Ratio data inherit all properties of interval scales and introduce an
absolute zero that signifies the total absence of the measured
attribute. You can perform all arithmetic operations—including
meaningful ratios.

Example: A weight of 0 kg means no mass; 10 kg is twice as heavy


as 5 kg.



Measures of Central Tendency: Mean, Median, Mode:
Measures of central tendency are statistical metrics that summarize a dataset by identifying
its central point.

• Mean is the arithmetic average, sensitive to every value in the dataset.

• Median is the middle value when data are ordered, robust against outliers.

• Mode is the most frequently occurring value, useful for categorical or


discrete data.

Below is a sample dataset and step-by-step calculations for each measure.

Observation    | Value
1              | 4
2              | 8
3              | 6
4              | 5
5              | 3
6              | 5
7              | 7
Count (n)      | 7
Sum (Σx)       | 38
Mean (Σx / n)  | 5.43
Sorted Values  | 3, 4, 5, 5, 6, 7, 8
Median         | 5
Mode           | 5

Table 3.1: Sample Dataset with Calculations

• Mean calculation: (4 + 8 + 6 + 5 + 3 + 5 + 7) / 7 = 38 / 7 ≈ 5.43

• Median determination: Order the values (3, 4, 5, 5, 6, 7, 8); the middle (4th) value
is 5.

• Mode identification: The value 5 appears twice, more often than any other.

Measures of Dispersion: Range, Variance, Standard Deviation:


Measures of dispersion describe how spread out the values in a dataset are around its
central tendency. The three most common metrics are listed below, followed by a short NumPy sketch:

• Range: The difference between the maximum and minimum values.


• Variance: The average of the squared deviations from the mean.
• Standard Deviation: The square root of the variance, expressing dispersion in the
same units as the original data.
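
Using the sample dataset from Table 3.1, the three measures can be computed with NumPy (a minimal sketch; ddof=0 selects the population formulas):

import numpy as np

data = np.array([4, 8, 6, 5, 3, 5, 7])

data_range = data.max() - data.min()   # 8 - 3 = 5
variance = data.var(ddof=0)            # mean of squared deviations from the mean
std_dev = data.std(ddof=0)             # square root of the variance

print(data_range, round(variance, 2), round(std_dev, 2))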



Data Distribution and Normal Curve:
A data distribution describes how values in a dataset are spread across the possible range.
When you plot the frequency of observations—often using a histogram—you reveal
patterns like clustering, skewness, or the presence of outliers. For example, exam scores
might cluster around a central band but taper off toward very high or very low values.
Recognizing the shape of your distribution helps you choose appropriate statistical methods
and anticipate how new data are likely to behave.

1. The Normal (Gaussian) Distribution


The normal distribution, or bell curve, is a continuous probability
distribution characterized by its symmetric, single-peaked shape. It’s
defined by two parameters: the mean (µ), which locates the peak,
and the standard deviation (σ), which controls the spread. Its
probability density function is:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

Figure 3.1: Bell Curve – Normal Distribution

Key Properties:
• Symmetry about the mean: mean = median = mode

• Infinitely long tails that approach—but never touch—the


horizontal axis

• The total area under the curve equals 1



2. The Empirical (68–95–99.7) Rule
In a perfectly normal distribution:
• About 68% of values lie within one standard deviation of the
mean (µ ± σ)

• Around 95% fall within two standard deviations (µ ± 2σ)

• Nearly 99.7% are within three standard deviations (µ ± 3σ)

Skewness and Kurtosis:


Skewness:

Skewness measures the asymmetry of a distribution around its mean. A perfectly


symmetric distribution (like a true normal curve) has zero skewness.

• Positive Skew (Right-Skewed): The right tail is longer or fatter than the left. Most
values cluster on the left, with a few extreme high values stretching the tail.

• Negative Skew (Left-Skewed): The left tail is longer or fatter. Most values cluster
on the right, with some extreme low values dragging the tail left.

Figure 3.2: Left, Right, and Normal Skew Examples

You can compute sample skewness as:

Skewness = (1/n) · Σᵢ ((xᵢ − x̄) / s)³

where s = √((1/n) · Σᵢ (xᵢ − x̄)²) is the sample standard deviation and the sums run over i = 1 … n.



Kurtosis:

Kurtosis quantifies the “tailedness” or peakedness of a distribution compared to a normal


curve. It’s based on the fourth standardized moment.

• Leptokurtic (Positive Excess Kurtosis): Sharp peak and heavy tails. There’s a
higher probability of extreme values.
• Mesokurtic (Zero Excess Kurtosis): Similar peak and tails to a normal
distribution.
• Platykurtic (Negative Excess Kurtosis): Flatter peak and thinner tails. Extreme
values are less likely.

The excess kurtosis formula is:

Excess Kurtosis = (1/n) · Σᵢ ((xᵢ − x̄) / s)⁴ − 3

with the sum running over i = 1 … n.

Basics of Probability: Classical and Conditional

1. Classical probability
Classical probability applies when all outcomes in a
sample space are equally likely. If you can list every
possible outcome, you compute the probability of an event
by counting.

• Formula:
P(A) = (number of favorable outcomes for A) / (total number of outcomes)

• Key points:

1. Sample space S is the set of all possible outcomes.

2. Event A is any subset of S.

• Example (rolling a fair die):

❖ S = {1, 2, 3, 4, 5, 6}
❖ Let A = {even results} = {2, 4, 6}
❖ P(A) = 3/6 = 0.5



2. Conditional Probability
Conditional probability measures the likelihood of an event A given
that another event B has occurred. It refines your assessment by
restricting the sample space to B.

• Formula:

P(A ∣ B) = P(A ∩ B) / P(B),  provided P(B) > 0
• Interpretation
❖ 𝑃(𝐴 ∩ 𝐵) is the probability both A and B occur
❖ P(B) is the probability B occurs.

• Example (drawing cards without replacement):

1. Draw one card from a standard 52-card deck. Let


B = “first card is a heart.”

❖ P(B) = 13/52 = ¼

2. Without putting it back, draw a second card. Let


A = “second card is a heart.”

❖ Now the deck has 51 cards; if the first was a


heart, there are 12 hearts left.

❖ 𝑃( 𝐴 ∣ 𝐵 ) = 12/51 ≈ 0.2353

Probability Distributions: Binomial, Poisson, Normal:


Probability distributions describe how likely different outcomes are in a random process.
Below is a brief overview of three fundamental distributions

1. Binomial Distribution
The binomial distribution models the number of successes in a fixed number of
independent trials, each with the same probability of success. It’s discrete and
defined by two parameters

▪ n: number of trials
▪ p: probability of success on each trial

Probability mass function (PMF):

P(X = k) = C(n, k) · p^k · (1 − p)^(n−k),  where C(n, k) = n! / (k! (n − k)!)



2. Poisson Distribution
The Poisson distribution models the count of events occurring in a fixed interval of
time or space, under the assumption that events happen independently and at a
constant average rate. It’s discrete with a single parameter:

▪ λ: average rate (mean number of events per interval)

Probability mass function (PMF):

P(X = k) = (λ^k · e^(−λ)) / k!

3. Normal (Gaussian) Distribution


The normal distribution is a continuous distribution characterized by its symmetric
“bell curve.” It’s defined by two parameters

▪ µ: mean (center of the distribution)


▪ σ: standard deviation (controls spread)

Probability density function (PDF):

f(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Feature           | Binomial                              | Poisson                      | Normal
Type              | Discrete                              | Discrete                     | Continuous
Support           | k = 0, 1, …, n                        | k = 0, 1, 2, …               | x ∈ (−∞, ∞)
Parameters        | n (trials), p (success prob.)         | λ (rate)                     | µ (mean), σ (std. dev.)
PMF/PDF           | P(X = k) = C(n, k) p^k (1 − p)^(n−k)  | P(X = k) = λ^k e^(−λ) / k!   | f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
Mean              | np                                    | λ                            | μ
Variance          | np(1 − p)                             | λ                            | σ²
Typical Use Cases | Success counts in trials              | Rare events per interval     | Measurements, errors, averages

Table 3.2: Differences Between the Binomial, Poisson, and Normal Distributions
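
These three distributions can also be evaluated numerically with scipy.stats (a sketch; SciPy is not covered elsewhere in this report, and the parameter values are illustrative):

from scipy.stats import binom, poisson, norm

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
print(binom.pmf(k=3, n=10, p=0.5))

# Poisson: probability of exactly 2 events when the average rate is 4 per interval
print(poisson.pmf(k=2, mu=4))

# Normal: density at x = 0 for a standard normal curve (mean 0, std 1)
print(norm.pdf(0, loc=0, scale=1))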



Central Limit Theorem(CLT):
The Central Limit Theorem is a foundational result in probability theory and
statistics. It states that when you take sufficiently large samples from any
population with a finite mean and variance, the sampling distribution of the
sample mean (or sum) approaches a normal (Gaussian) distribution—regardless of
the population’s original shape. In practical terms, even if individual observations
are skewed or irregular, their average will be approximately normally distributed
once you aggregate enough of them

Key Conditions

▪ Samples must be independent of each other.

▪ Each observation comes from the same distribution (identically


distributed).

▪ The underlying distribution has a finite mean (µ) and finite variance (σ²).

▪ The sample size, n, should be “large enough.” A common rule of thumb is


n ≥ 30, though highly skewed populations may require larger n.

Figure 3.3: Sampling Distribution (Central Limit Theorem)
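
A small NumPy simulation can make the theorem concrete: sample means drawn from a skewed (exponential) population come out approximately normal (a sketch; the population and sample sizes are illustrative):

import numpy as np

rng = np.random.default_rng(42)

# Skewed population: exponential distribution with mean 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw 1,000 samples of size n = 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]

print(np.mean(sample_means))  # close to the population mean (about 1)
print(np.std(sample_means))   # close to sigma / sqrt(n) = 1 / sqrt(50) ≈ 0.14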



Outliers, Boxplots, and Z-Scores:
1. Outliers
Outliers are observations that lie an abnormal distance from other values in a
dataset. They can arise from measurement errors, data entry mistakes, or genuine
variability. Detecting outliers is crucial because they can skew summary statistics—
like the mean—and distort analyses or model training. Before deciding to remove or
adjust outliers, investigate their cause: if they reflect real phenomena, they may
carry important information rather than being mere noise.

Key Consideration for outliers:


• They can inflate variance and mislead regression or
clustering algorithms.
• Context matters: a salary of $1 000 000 is an outlier in most
employee datasets but valid in executive compensation
surveys.
• You can handle outliers by transformation (log, square root),
capping (winsorizing), or exclusion, but document your
approach for reproducibility.

2. Boxplots
A boxplot—also called a box-and-whisker plot—summarizes a distribution using its
five-number summary and highlights potential outliers.

Components of a boxplot:

• The box spans the interquartile range (IQR), from the first quartile (Q1, 25th
percentile) to the third quartile (Q3, 75th percentile).

• A line inside the box marks the median (50th percentile).

• Whiskers extend to the smallest and largest values within 1.5 × IQR below
Q1 and above Q3, respectively.

• Points beyond the whiskers are plotted individually as outliers.

3. Z-Score
A z-score standardizes each data point by measuring its distance from the mean in
units of standard deviation. For a value x, its z-score is calculated as:

z = (x − μ) / σ

where μ is the sample mean and σ is the sample standard deviation. Z-scores allow
you to:



• Compare values from different distributions on a common scale

• Identify outliers: observations with |z| greater than about 2 or 3 are


unusually far from the mean.

• Facilitate statistical tests and probability calculations under the


normality assumption.

Figure 3.4: Boxplot with Outlier
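
A short pandas sketch that flags outliers with both the z-score rule and the 1.5 × IQR rule (the Titanic "fare" column used in later chapters is assumed here for illustration):

import pandas as pd

df = pd.read_csv("titanic.csv")
fare = df["fare"].dropna()

# Z-score rule: |z| > 3 marks extreme values
z = (fare - fare.mean()) / fare.std()
z_outliers = fare[z.abs() > 3]

# IQR rule: values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
iqr_outliers = fare[(fare < q1 - 1.5 * iqr) | (fare > q3 + 1.5 * iqr)]

print(len(z_outliers), len(iqr_outliers))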



Chapter 4
DATA WRANGLING USING PANDAS AND NUMPY

Introduction of Data Wrangling:


Data wrangling—also known as data munging—is the process of transforming raw, messy
datasets into clean, structured forms ready for analysis. It bridges the gap between data
collection and data modeling, ensuring that downstream tasks (visualization, machine
learning, reporting) yield reliable, interpretable results.

Why Data Wrangling Matters

Data in the real world often arrives with missing values, inconsistent formats, duplicates, or
errors. Without systematic wrangling, analyses can be biased or invalid. Thoughtful
wrangling:

• Improves data quality and integrity


• Eliminates anomalies that skew results
• Standardizes formats across multiple sources
• Saves time by automating repetitive cleaning steps

Core Steps in a Typical Wrangling Workflow

1. Data Ingestion
Collect data from files, databases, APIs, or web scraping

2. Profiling & Assessment


Explore distributions, data types, null rates, and unique values to diagnose issues.

3. Cleaning
• Handle missing values (imputation, removal)
• Correct or remove duplicates
• Standardize formats (dates, strings, numeric scales)

4. Transformation
• Create new features (e.g., extract year from date)
• Normalize or scale numerical data
• Encode categorical variables

5. Validation & Documentation


• Confirm that transformations preserve data integrity



• Log all changes for reproducibility

6. Export & Storage


Save the cleaned dataset to the desired format (CSV, Parquet, database) for
analysis.

Common Tools and Libraries:

1. Pandas: DataFrame operations, missing‐value handling, grouping, merging


2. NumPy: Fast numerical transformations and array operations
3. SQL: Filtering, joining, and aggregating structured data at scale

Handling Missing Data (Imputation Techniques):


Missing values can distort analyses, reduce statistical power, and bias machine-learning
models. Imputation replaces those gaps with estimated values to preserve dataset size and
integrity. The choice of technique depends on the nature of the data, the missingness
mechanism, and the intended analysis.

Below is Table 4.1 summarizing common imputation methods, their advantages,


drawbacks, and typical use cases.

Technique             | Description                                                                | Advantage                                      | Disadvantage                                         | Use Case
Mean Imputation       | Replace missing entries with the column's average value                   | Simple; preserves the mean                     | Underestimates variance; distorts distribution       | Continuous data with a low missing rate
Median Imputation     | Use the median of the non-missing values instead of the mean              | Robust to outliers; preserves central tendency | Doesn't account for data relationships               | Skewed distributions
Mode Imputation       | Fill in missing categories or values with the most frequent entry         | Works for categorical data; easy               | May create artificial clusters                       | Nominal or ordinal variables
Constant Value        | Substitute a constant (e.g., 0, "Unknown", or −1) for all missing entries | Flags imputed records explicitly               | Can bias models if the constant is unrealistic       | Categorical flags; binary indicators
Regression Imputation | Predict missing values using a regression model trained on other features | Utilizes correlations among variables          | Can overfit; ignores uncertainty in estimates        | When strong predictors exist

Table 4.1: Common Imputation Techniques
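
A few of these techniques expressed in pandas (a sketch; the column names are illustrative, not from a specific dataset):

import pandas as pd

df = pd.read_csv("data.csv")

# Mean and median imputation for numeric columns
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Constant-value imputation that flags missing entries explicitly
df["comments"] = df["comments"].fillna("Unknown")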



Data Cleaning: Removing Duplicates, Invalid Entries:
Data cleaning is the first and most critical step in wrangling. It focuses on detecting and
correcting—or removing—errors and inconsistencies that would otherwise skew your
analysis or models.

• Removing Duplicates
Duplicates occur when the same record appears multiple times in your
dataset. They can inflate counts, bias analyses, and waste storage.

In pandas, you can identify and drop duplicates easily:

import pandas as pd

df = pd.read_csv("data.csv")

dupe_mask = df.duplicated()

df_clean = df.drop_duplicates(keep="first")

df_clean = df.drop_duplicates(subset=["id", "date"], keep="last")

• Handling Invalid Entries


Invalid entries are values that don’t conform to expected formats or
ranges—like negative ages, malformed dates, or stray text in numeric
columns.

Cleaning them involves detection, correction or removal:

# Example: ensure 'age' is between 0 and 120


valid_age = df_clean["age"].between(0, 120)
df_clean = df_clean[valid_age]

# Example: convert a column to numeric, coercing errors to NaN


df_clean["salary"] = pd.to_numeric(df_clean["salary"],
errors="coerce")

# Remove rows where critical fields are now NaN


df_clean = df_clean.dropna(subset=["salary", "join_date"])



Encoding Categorical Variables (Label, One-Hot):
Categorical variables represent qualitative attributes—like color, brand, or category—that
machine-learning algorithms cannot process directly. Two common techniques convert
these text labels into numeric formats: label encoding and one-hot encoding.

1. Label Encoding
Label encoding assigns each unique category an integer value. It’s simple and
produces a single column, but it introduces an arbitrary ordering that may mislead
some algorithms.

• Suitable for ordinal categories where order matters


• Uses fewer columns and keeps feature space small
• Can impose unintended rank relationships for nominal data

Example in Python:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df["color_label"] = le.fit_transform(df["color"])

2. One-Hot Encoding
One-hot encoding creates a new binary column for each category and marks the
presence (1) or absence (0) of that category in each row. This avoids ordering issues
but increases the number of features.

• Ideal for nominal categories without inherent order


• Prevents algorithms from interpreting numeric labels as ranks
• Can lead to high dimensionality if many unique categories

Example in Python:
df_onehot = pd.get_dummies(df, columns=["color"], prefix="color")



Index | Colour | colour_label (LabelEncoder) | color_Blue | color_Green | color_Red
0     | Red    | 2                           | 0          | 0           | 1
1     | Green  | 1                           | 0          | 1           | 0
2     | Blue   | 0                           | 1          | 0           | 0
3     | Green  | 1                           | 0          | 1           | 0
4     | Red    | 2                           | 0          | 0           | 1

Table 4.2: Example of Encoding a "color" Column



Chapter 5
DATA VISUALIZATION AND DATA ANALYSIS

Introduction To Data Visualization:


Data visualization is the practice of converting raw data into graphical formats—charts,
plots, and maps—that make it easier to see patterns, trends, and anomalies. By leveraging
the human brain’s innate ability to process visual information, visualization turns numbers
and tables into clear, actionable insights.

Why Data Visualization Matters

• Enhances understanding: Visuals distill large datasets into digestible summaries,


helping viewers grasp the key message immediately.

• Reveals hidden patterns: Trends, clusters, and outliers that might be buried in raw
numbers become obvious when plotted.

• Improves communication: Well-designed graphics support storytelling, making it


easier to share findings with non-technical audiences.

• Drives decision making: Decision makers can compare scenarios, spot risks, and
validate hypotheses through interactive dashboards and reports.

Common Types of Visualization

• Bar charts for comparing categorical values


• Line graphs for tracking changes over time
• Scatter plots for exploring relationships between two variables
• Histograms for understanding data distributions
• Boxplots for summarizing spread and detecting outliers
• Heatmaps for visualizing matrix-style data and correlations
• Geographic maps for spatial data analysis

Plotting with Matplotlib:


Matplotlib is the foundational Python library for creating static, publication-quality plots.
Below are two common examples—a line chart to track trends over a sequence and a bar
chart to compare categorical values.



1. Line Chart
A line chart connects data points in order, making it ideal for visualizing time series
or any ordered sequence.

Example: Figure 5.1: Line Chart – Monthly Sales Trend
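
A sketch of how such a chart could be produced (the monthly sales figures are invented for illustration):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 145, 170, 190]

plt.plot(months, sales, marker="o", color="steelblue")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.title("Monthly Sales Trend")
plt.grid(True)
plt.show()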

2. Bar Chart
A bar chart displays categorical data with rectangular bars; bar heights correspond
to values, making comparisons straightforward.

Example: Figure 5.2: Bar Chart – Fruit Preference Survey
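
A matching sketch (the survey counts are invented):

import matplotlib.pyplot as plt

fruits = ["Apple", "Banana", "Mango", "Orange"]
votes = [25, 18, 32, 15]

plt.bar(fruits, votes, color="seagreen")
plt.xlabel("Fruit")
plt.ylabel("Number of Votes")
plt.title("Fruit Preference Survey")
plt.show()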



Advanced Visualizations with Seaborn:
Seaborn builds on Matplotlib to offer high‐level interfaces for statistical graphics. Below
are two examples—a heatmap for visualizing correlations and a pairplot for exploring
pairwise relationships across multiple variables.

1. Heatmap of a Correlation Matrix


A heatmap displays a matrix of values as colors, making it easy to spot strong
positive or negative correlations.

Example: Figure 5.3: Heatmap – Iris Feature Correlation
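
A sketch matching the figure, using the Iris sample dataset bundled with Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# Correlation matrix of the four numeric features
corr = iris.drop(columns="species").corr()

sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Iris Feature Correlation")
plt.show()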

2. Pairplot for Pairwise Relationships


A pairplot (scatterplot matrix) shows each variable plotted against every other, with
optional histograms or density plots on the diagonal.

Example: Figure 5.4: Pairplot – Titanic Dataset



Exploratory Data Analysis (EDA) Process:
Exploratory Data Analysis is an iterative workflow that helps you understand the structure,
patterns, and quirks of your dataset before jumping into modeling. Below is a step-by-step
guide—with code snippets using the Titanic dataset—to illustrate each phase.

1. Define Objectives
Clarify what you want to learn or predict.
▪ Predict whether a passenger survives
▪ Identify key factors (age, fare, class) that influence survival

2. Data Ingestion
Load your dataset into a DataFrame.
Example:
import pandas as pd
titanic = pd.read_csv("titanic.csv")

3. Data Profiling
Get a high-level overview of your data’s shape, types, and missingness.
Example:
# Structure and types
print(titanic.info())
# Summary statistics for numeric features
print(titanic.describe())

# Missing values per column


print(titanic.isnull().sum())



4. Data Cleaning
Handle duplicates, invalid entries, and missing values.
Example:

# Remove duplicate rows


titanic = titanic.drop_duplicates()

# Example: Fill missing ages with median


titanic["age"] = titanic["age"].fillna(titanic["age"].median())

# Drop rows where ‘embarked’ is missing


titanic = titanic.dropna(subset=["embarked"])

5. Univariate Analysis
Study each variable in isolation to understand its distribution.
Example:
import seaborn as sns
import matplotlib.pyplot as plt

# Age distribution
sns.histplot(titanic["age"], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()

# Fare boxplot
sns.boxplot(x=titanic["fare"])
plt.title("Fare Spread and Outliers")
plt.show()

6. Bivariate Analysis
Examine pairwise relationships to spot trends or dependencies.
Example:
# Survival rate by Pclass
sns.countplot(x="pclass", hue="survived", data=titanic)
plt.title("Survival by Passenger Class")
plt.show()

# Age vs. Fare colored by survival


sns.scatterplot(x="age", y="fare", hue="survived", data=titanic, alpha=0.6)
plt.title("Age vs. Fare by Survival")
plt.show()



7. Multivariate Analysis
Explore interactions among three or more variables.
Example:
# Pairwise plots
sns.pairplot(
    titanic,
    vars=["age", "fare", "sibsp"],
    hue="survived",
    diag_kind="kde",
    palette="Set1"
)
plt.suptitle("Pairwise Relationships", y=1.02)
plt.show()

# Correlation heatmap
corr = titanic[["age", "fare", "sibsp", "parch", "survived"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation")
plt.show()

8. Feature Engineering
Create new variables that may boost insight or predictive power, as sketched below:
▪ Family size = sibsp + parch + 1
▪ Title extracted from the name (Mr, Mrs, etc.)
▪ Age bands (child/adult/senior)
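
A pandas sketch of these derived features, continuing the titanic DataFrame loaded in step 2 (the lowercase "name" column and the regular expression are assumptions about the file's layout):

# Family size from siblings/spouses and parents/children aboard
titanic["family_size"] = titanic["sibsp"] + titanic["parch"] + 1

# Title extracted from the name column, e.g. "Mr", "Mrs", "Miss"
titanic["title"] = titanic["name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Age bands: child / adult / senior
titanic["age_band"] = pd.cut(
    titanic["age"],
    bins=[0, 12, 60, 120],
    labels=["child", "adult", "senior"]
)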

9. Summarize Insights
Compile your key findings:
▪ Higher fare and first-class travel correlate with survival
▪ Very young and very old passengers have lower survival rates.
▪ Family size shows non-linear effects (very large families fare worse).

10. Communicate Findings


Present your results through concise reports or dashboards.
▪ Highlight the most impactful visuals
▪ Document any assumptions (e.g., median imputation).
▪ Outline next steps—feature selection, modeling, validation.



Basic Statistical Analysis with Visuals:
Basic statistical analysis involves summarizing key properties of your data—its central
tendency, dispersion, and distribution shape—then using simple plots to reveal patterns or
anomalies. Below is an example workflow using the Titanic dataset, followed by a table of
summary statistics and code snippets to generate accompanying visuals.

1. Compute Summary Statistics


Summary statistics quantify a dataset’s central tendency, spread, and range. In
pandas, the describe() method quickly returns key metrics—count, mean,
standard deviation, min/max, and quartiles—for each numeric column.

Example:
import pandas as pd

# Load sample data


titanic = pd.read_csv("titanic.csv")

# Select numeric features


cols = ["age", "fare", "sibsp", "parch"]

# Compute summary statistics


summary = titanic[cols].describe().round(2)
print(summary)

Table 5.1: Summary Statistics Example


Feature Count Mean Std Min 25% 50% 75% Max
Age 714 29.70 14.52 0.17 22.00 28.00 35.00 80.00
Fare 891 32.30 49.69 0.00 7.91 14.45 31.00 512.33
Sibsp 891 0.52 1.10 0.00 0.00 0.00 1.00 8.00
parch 891 0.38 0.81 0.00 0.00 0.00 0.00 6.00

2. Visualize Distributions
Visualizing distributions helps you see skew, modality, and outliers at a glance.
Below are examples using Seaborn and Matplotlib.
Example:
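Two distribution plots that go beyond the univariate examples above, continuing the titanic DataFrame (a sketch):

import seaborn as sns
import matplotlib.pyplot as plt

# Overlaid age densities for survivors vs. non-survivors
sns.kdeplot(data=titanic, x="age", hue="survived", fill=True)
plt.title("Age Density by Survival")
plt.show()

# Violin plot: fare spread within each passenger class
sns.violinplot(data=titanic, x="pclass", y="fare")
plt.title("Fare Distribution by Class")
plt.show()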



Chapter 6
MACHINE LEARNING WITH SCIKIT-LEARN AND ML PIPELINE
ALGORITHMS

Supervised vs Unsupervised Learning:


Machine learning tasks generally fall into two broad categories based on the presence
or absence of labeled data. Supervised learning uses input–output pairs to train models,
while unsupervised learning discovers hidden structure in unlabeled data.

Aspect             | Supervised Learning                                            | Unsupervised Learning
Definition         | Learn a mapping from inputs to known outputs (labels)          | Discover patterns or groupings in data without labels
Data Requirement   | Requires labeled training data                                 | Works on unlabeled data
Primary Goal       | Predict or classify unseen examples                            | Find hidden structure, clusters, or dimensionality reductions
Common Algorithms  | Linear/Logistic Regression, Decision Trees, SVM, k-NN          | k-Means Clustering, Hierarchical Clustering, PCA, t-SNE
Typical Output     | Continuous value (regression) or class label (classification)  | Cluster assignments, principal components, anomaly scores
Evaluation Metrics | Accuracy, Precision/Recall, RMSE, F1-score                     | Silhouette Score, Within-Cluster Sum of Squares, Reconstruction Error
Use Cases          | Spam detection, house-price prediction, image classification   | Customer segmentation, anomaly detection, data visualization

Table 6.1: Differences between Supervised and Unsupervised Learning

Common ML Algorithms: Linear Regression, Decision Trees:

Machine learning offers a toolbox of algorithms for different tasks and data
types. Two of the foundational methods—linear regression and decision
trees—illustrate contrasting approaches: one fits a global linear model, the
other builds a piecewise, rule-based structure.



• Linear Regression
Linear regression models the relationship between one or more input
features and a continuous target by fitting a straight line (or
hyperplane).

Example:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Example: Predict house prices based on square footage


X = df[["sqft_living"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Coefficient (slope) and intercept


print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

# Make predictions
y_pred = model.predict(X_test)

▪ Key Points
❖ Learns coefficients β to minimize the sum of squared errors:
min_β Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − ⋯)²

❖ Assumes a linear relationship, no multicollinearity, homoscedasticity


(constant variance), and normally distributed errors.

❖ Highly interpretable—each coefficient shows the change in target


per unit change in a feature.

❖ Fast to train and predict, but struggles when relationships are


nonlinear or when key assumptions fail.



• Decision Trees
Decision trees recursively split the feature space into homogeneous regions by
asking a series of yes/no questions.

Example:
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Example: Predict house prices with multiple features


features = ["sqft_living", "bedrooms", "bathrooms"]
X = df[features]
y = df["price"]

model = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X, y)

# Visualize the tree


plt.figure(figsize=(12, 6))
plot_tree(model, feature_names=features, filled=True, rounded=True)
plt.show()

▪ Key Points:
❖ Splits nodes by selecting features and thresholds that minimize
impurity (variance for regression; Gini or entropy for classification).

❖ Makes no assumptions about data distribution or feature


relationships.

❖ Captures complex, nonlinear interactions and can handle mixed data


types.

❖ Prone to overfitting—depth and minimum-sample parameters must


be tuned.

❖ Easy to interpret: each path from root to leaf describes a decision


rule.

Model Evaluation Techniques:


Model evaluation ensures your predictions align with reality. In classification tasks, the
confusion matrix is a foundational tool: it breaks down true vs. predicted labels and drives
metrics like accuracy, precision, recall, and F1-score.



• Confusion Matrix
A confusion matrix for a binary classifier has four cells:

                | Predicted Negative   | Predicted Positive
Actual Negative | True Negative (TN)   | False Positive (FP)
Actual Positive | False Negative (FN)  | True Positive (TP)

▪ True Negative (TN): Correctly predicted negative

▪ False Positive (FP): Incorrectly predicted positive (Type I error)

▪ False Negative (FN): Incorrectly predicted negative (Type II error)

▪ True Positive (TP): Correctly predicted positive

Example:

Figure 6.1: Confusion Matrix – Breast Cancer Classification
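
A sketch of computing the confusion matrix and its derived metrics with scikit-learn, using the bundled breast-cancer dataset to mirror Figure 6.1 (the choice of logistic regression is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows: actual, columns: predicted
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class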



Train-Test Split and Cross Validation:
1. Train-Test Split: Train–test splitting divides your dataset into two disjoint
subsets—one to train the model and one to test its performance on unseen
data. This simple hold-out approach helps estimate how well your model will
generalize.

Why it Matters

▪ Prevents overfitting: ensures your model isn’t evaluated on the same data it learned.
▪ Provides a quick sanity check of performance before more robust validation.

Common Practice

▪ Typical split ratios: 70/30, 80/20, or 90/10 (train/test).


▪ For classification, use stratified splits to preserve class proportions:

Example:

from sklearn.model_selection import train_test_split

# X = feature matrix, y = target vector


X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% data for testing
    random_state=42,   # reproducibility
    stratify=y         # preserve label distribution
)

2. Cross-Validation: Cross-validation (CV) mitigates the variance of a single train–


test split by repeatedly partitioning the data and averaging performance across folds

K-Fold Cross-Validation

1. Split data into K equal-sized folds

2. For each fold i:


▪ Train on all folds except i.
▪ Test on fold i

3. Aggregate metrics (mean ± std) across the K runs.



Example:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier

kf = KFold(n_splits=5, shuffle=True, random_state=42)


model = RandomForestClassifier(random_state=42)

scores = cross_val_score(
    model, X, y,
    cv=kf,
    scoring="accuracy"
)
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

• Variations of CV
▪ Stratified K-Fold: preserves label distribution in each
fold (useful for classification).

▪ Leave-One-Out (LOO): K = N (each sample


becomes its own test set).

▪ Time-Series Split: respects temporal order (no


shuffling), splitting along time.

Creating an ML Pipeline with Scikit-Learn:

An ML pipeline strings together all preprocessing and modeling steps into a


single, reproducible workflow. Below is both a code example and a block-
diagram illustrating the flow.
Example: Figure 6.2: ML Pipeline Diagram – Training and Inference Flow
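
A sketch of such a pipeline (the preprocessing steps and the final model are illustrative choices, and X, y stand for a feature matrix and target as in the earlier examples):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # fill missing values
    ("scaler", StandardScaler()),                    # standardize features
    ("model", LogisticRegression(max_iter=1000)),    # final estimator
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline.fit(X_train, y_train)           # preprocessing and model fit together
print(pipeline.score(X_test, y_test))    # evaluated on unseen data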



Chapter 7
REFERENCE

1. https://www.coursera.org/specializations/jhu-data-science-pro

2. https://www.edx.org/micromasters/mitx-statistics-and-data-science

3. https://www.udemy.com/course/python-for-data-science-and-machine-
learning-bootcamp/

4. https://www.datacamp.com/tracks/data-scientist-with-python

5. https://www.kaggle.com/learn/overview

6. https://nptel.ac.in/courses/106/106/106106178/
