DATA SCIENCE NOTES
UNIT 1
What is Data?
Data is raw facts, figures and information collected for analysis and
decision making. It can be used across various fields and industries to
optimize processes and uncover valuable insights.
Examples:
• A list of students with their grades
• Sales numbers of a business
• Weather records
What is Data Science?
Definition: Data Science is the field that uses mathematics,
statistics, programming, and machine learning to analyze data,
extract insights, and make predictions.
Key Components of Data Science:
• Data Collection & Cleaning – Removing errors & organizing
data.
• Data Analysis – Finding trends & patterns.
• Machine Learning & AI – Making predictions.
• Data Visualization – Representing data through
graphs/charts.
Importance of Data & Data Science
Better Decision-Making: Helps companies and organizations
make informed decisions.
Predicting Trends: Forecasts future outcomes (e.g., stock market,
disease spread).
Automation & AI: Powers self-driving cars, chatbots, and
personalized recommendations.
Improves Business Growth: Optimizes sales, marketing, and
customer service.
Real-World Applications of Data Science
• Healthcare – Disease prediction, personalized medicine (e.g., AI diagnosing diseases).
• E-Commerce – Product recommendations (e.g., Amazon, Flipkart).
• Finance – Fraud detection, stock market predictions.
• Social Media – Instagram, YouTube suggest videos based on user behavior.
• Self-Driving Cars – AI analyzes road data to navigate safely.
• Cybersecurity – Detecting online fraud and hacking threats.
Types of Data:
Data is classified into two major categories: by organization and by
measurement scale.
Types of Data Based on Organization
Structured Data
Structured data is well-organized and stored in a fixed format, like
tables in relational databases. It follows a predefined schema, making
it easy to search and analyze.
Example: Employee records, banking transactions, student
databases.
Unstructured Data
Unstructured data does not have a fixed format and cannot be stored
easily in traditional databases. It includes text, images, videos, and
audio files.
Example: Social media posts, emails, YouTube videos, WhatsApp
chats.
Semi-Structured Data
Semi-structured data has some structure but does not follow a rigid
format like structured data. It includes tags, metadata, or markers for
organization.
Example: JSON files, XML data, IoT sensor data, email messages
(structured headers + unstructured body).
Types of Data Based on Measurement Scale:
Qualitative (Categorical) Data – Descriptive & Non-Numeric
Nominal Data
Nominal data consists of categories that have no specific order or
ranking. It is qualitative (non-numeric) and used for labeling or
classification.
Example: Gender (Male/Female), Blood Type (A, B, O), Eye color
(Blue, Brown, Green).
Ordinal Data
Ordinal data represents categories with a meaningful order, but the
difference between the values is not measurable. It is also
qualitative.
Example: Customer satisfaction levels (Poor, Average, Good),
Clothing sizes (S, M, L, XL), Education levels (High School, Bachelor’s,
Master’s).
Quantitative (Numerical) Data – Measurable & Numeric
Discrete Data
Discrete data is numerical and consists of countable values. It cannot
take decimal or fractional values.
Example: Number of students in a classroom, number of mobile
phones sold, number of goals in a football match.
Continuous Data
Continuous data is numerical and can take any value within a given
range, including decimals. It is measurable rather than countable.
Example: Height of a person (in cm), temperature of a city (in °C),
speed of a moving car (in km/h).
Types of Data Collection:
Data collection is the process of gathering information for analysis
and decision-making. It is broadly classified into Primary Data
Collection and Secondary Data Collection.
1. Primary Data Collection
Definition: Primary data is collected first-hand directly from the
source for a specific purpose. It is original and has not been used
before.
Advantages:
• More accurate and relevant to the research.
• Collected based on specific requirements.
Disadvantages:
• Time-consuming and expensive.
• Requires effort and resources.
Methods of Primary Data Collection:
• Surveys & Questionnaires – Asking people specific
questions (e.g., customer feedback).
• Interviews – One-on-one conversations to get detailed
information.
• Observations – Watching and recording behaviors in real-
time (e.g., studying customer behavior in a mall).
• Experiments – Conducting tests under controlled conditions
(e.g., clinical trials for a new medicine).
Example: A company conducts a survey to understand customer
preferences for a new product.
2. Secondary Data Collection
Definition: Secondary data is already collected by someone else
for a different purpose but used for new research. It is pre-existing
data from various sources.
Advantages:
• Saves time and cost.
• Easily available from books, reports, and online sources.
Disadvantages:
• May not be completely relevant or up-to-date.
• Can be biased or inaccurate.
Sources of Secondary Data:
• Government Reports & Census Data – Population statistics,
economic reports.
• Research Papers & Books – Information from universities
and libraries.
• Websites & Online Databases – Market trends, Wikipedia,
business reports.
• Newspapers & Magazines – Articles, business news,
historical records.
Example: A student uses a government report on population
statistics for their research on urban development.
Data Cleaning:
Definition:
Data cleaning is the process of identifying and correcting errors in a
dataset to improve its accuracy and quality. It involves handling
missing values, outliers, noise, and inconsistencies to ensure better
analysis and decision-making. It comes before the validation stage in the
data pipeline.
Importance:
Unclean data can lead to incorrect analysis, misleading insights, and
flawed decision-making. Cleaning data eliminates errors,
inconsistencies, and missing values, ensuring a high-quality dataset.
Key Steps in Data Cleaning:
1️. Removing Duplicates – Eliminates repeated records to prevent
skewed analysis.
2️. Removing Irrelevant Data – Filters out unnecessary information
that doesn't contribute to insights.
3️. Standardizing Capitalization – Ensures uniform text formatting
(e.g., "new york" → "New York") to avoid inconsistencies.
4️. Data Type Conversion – Converts incorrect formats (e.g., text to
date, integer to float) to ensure proper calculations.
5️. Handling Outliers – Detects and corrects extreme values using Z-
score, IQR, or capping methods to prevent distortions.
6️. Fixing Errors – Corrects typos, spelling mistakes, and numerical
inaccuracies that affect data integrity.
7️. Language Translation – Converts multilingual data into a common
language for uniformity in global datasets.
8️. Handling Missing Values – Fills gaps using mean, median, mode,
or predictive models to maintain completeness.
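A minimal sketch of several of the steps above in Python, assuming the pandas library and a small made-up dataset (column names and values are purely illustrative):
import pandas as pd

# Hypothetical messy records (illustrative data only)
df = pd.DataFrame({
    "city": ["new york", "New York", "chicago", None],
    "sales": ["100", "100", "250", "300"],
    "date": ["2024-01-01", "2024-01-01", "2024-02-15", "2024-03-10"],
})

df["city"] = df["city"].str.title()          # 3. standardize capitalization
df = df.drop_duplicates().copy()             # 1. remove duplicate records
df["sales"] = df["sales"].astype(float)      # 4. convert text to numbers
df["date"] = pd.to_datetime(df["date"])      # 4. convert text to dates
df["city"] = df["city"].fillna("Unknown")    # 8. handle missing values
print(df)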
Data Extraction:
Data extraction is the process of gathering data from various
sources, transforming it into a usable format, and storing it for
analysis. It acts like a filter, ensuring only relevant data is collected
while removing unnecessary information.
Why is Data Extraction Important?
1️ Facilitates Decision-Making – Helps organizations analyze past
trends, current patterns, and future predictions to make informed
choices.
2️ Empowers Business Intelligence – Provides timely, relevant data
for better insights and strategic planning.
3️ Enables Data Integration – Combines data from different sources
into a unified format for a comprehensive view.
4️ Enhances Automation & Efficiency – Reduces manual effort
through automated extraction, ensuring consistency and speed.
Steps in the Data Extraction Process:
1. Filtering – Identifying and selecting only relevant data based
on specific criteria.
2. Parsing – Breaking down and analyzing data to understand its
structure, making it easier to work with.
3. Structuring – Organizing and formatting raw data to prepare it
for analysis.
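As a rough illustration of filtering, parsing, and structuring, the hypothetical Python sketch below parses made-up JSON log lines, keeps only the purchase events, and organizes them into a table (assumes the pandas library):
import json
import pandas as pd

# Hypothetical raw log lines (one JSON record per line)
raw_lines = [
    '{"user": "alice", "action": "purchase", "amount": 120}',
    '{"user": "bob", "action": "view", "amount": 0}',
    '{"user": "carol", "action": "purchase", "amount": 75}',
]

records = [json.loads(line) for line in raw_lines]             # parsing
purchases = [r for r in records if r["action"] == "purchase"]  # filtering
df = pd.DataFrame(purchases)                                   # structuring
print(df)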
Data Transformation:
Data transformation is the process of converting, cleaning, and
structuring raw data into a format suitable for analysis. It ensures
data is consistent, accurate, and ready for use in business
intelligence, analytics, and machine learning. This process involves
changing data formats, standardizing values, and integrating multiple
data sources.
Why is Data Transformation Important?
1️⃣ Improves Data Consistency – Converts data into a uniform format,
making it easier to process and analyze.
2️⃣ Enhances Data Quality – Fixes errors, removes duplicates, and fills
missing values, ensuring reliable results.
3️⃣ Facilitates Data Integration – Merges data from different sources
into a single structured dataset for better insights.
4️⃣ Optimizes Performance – Helps in organizing large datasets
efficiently, improving storage, retrieval, and processing speed.
5️⃣ Supports Decision-Making – Transformed data provides better
insights for business strategies and predictive analysis.
Steps in Data Transformation Process:
1. Data Cleaning – Identifies and corrects errors, removes
duplicates, and fills missing values to maintain accuracy.
2. Data Formatting – Converts data types (e.g., text to date,
integer to float) to maintain uniformity.
3. Data Normalization – Standardizes data values to maintain
consistency (e.g., converting all date formats to YYYY-MM-DD).
4. Data Aggregation – Summarizes large data sets into
meaningful insights (e.g., calculating monthly sales from daily sales
data).
5. Data Integration – Combines data from multiple sources,
ensuring a unified and complete dataset.
6. Data Enrichment – Enhances existing data by adding additional
relevant information for deeper insights.
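A small, hypothetical pandas sketch of the formatting, normalization, and aggregation steps above (the data and column names are made up):
import pandas as pd

# Hypothetical daily sales records
df = pd.DataFrame({
    "date": ["01/01/2024", "02/01/2024", "15/02/2024"],
    "region": ["north", "north", "south"],
    "sales": [100.0, 150.0, 200.0],
})

# Data formatting / normalization: one standard date format (YYYY-MM-DD)
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Data aggregation: summarize daily sales into monthly totals per region
monthly = (df.assign(month=df["date"].dt.to_period("M"))
             .groupby(["region", "month"])["sales"].sum()
             .reset_index())
print(monthly)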
Data Visualization:
Data visualization is the process of representing data graphically
using charts, graphs, and plots. It helps in understanding patterns,
trends, and relationships in data, making complex information easier
to interpret.
Why is Data Visualization Important?
1️. Simplifies Complex Data – Converts raw data into easy-to-
understand visual formats.
2️. Identifies Trends & Patterns – Helps in spotting insights that might
be missed in raw data.
3️. Enhances Decision-Making – Provides a clear visual representation
for quick and informed decisions.
4️. Improves Communication – Makes it easier to present data
findings to stakeholders.
Types of Data Visualization:
1. Bar Charts – Used to compare categories with rectangular
bars. Example: Comparing sales of different products.
2. Line Charts – Shows trends over time using connected data
points. Example: Stock price movements.
3. Histograms – Displays frequency distributions for numerical
data by grouping values into bins. Example: Exam score distributions.
4. Heatmaps – Uses colors to represent values, commonly used
for correlation matrices. Example: Website click heatmaps.
5. Boxplots (Whisker Plots) – Summarizes data distribution,
highlighting outliers. Example: Visualizing salary distributions in a
company.
6. Pie Charts – Represents data as proportional slices of a circle.
Example: Market share of different brands.
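A minimal matplotlib sketch (made-up numbers) showing two of the chart types above, a bar chart for comparing categories and a line chart for a trend over time:
import matplotlib.pyplot as plt

products = ["A", "B", "C"]
sales = [120, 90, 150]               # made-up sales figures
months = [1, 2, 3, 4, 5]
price = [10, 12, 11, 15, 14]         # made-up stock prices

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(products, sales)             # bar chart: compare categories
ax1.set_title("Sales by product")
ax2.plot(months, price, marker="o")  # line chart: trend over time
ax2.set_title("Price over time")
plt.tight_layout()
plt.show()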
Applications of Data Visualization:
1️⃣ Exploratory Data Analysis (EDA) – Helps in understanding dataset
structure, identifying missing values, and spotting patterns before
applying machine learning models.
2️⃣ Trend Analysis & Forecasting – Line charts and time-series plots
help predict future trends in finance, sales, and weather forecasting.
3️⃣ Anomaly Detection – Boxplots and scatter plots reveal outliers in
datasets, useful for fraud detection and cybersecurity.
4️⃣ Business Intelligence (BI) – Dashboards with bar charts, pie charts,
and heatmaps provide real-time insights for decision-making in
industries like healthcare, marketing, and finance.
5️⃣ Machine Learning Model Evaluation – Visualizations like ROC
curves, confusion matrices, and feature importance plots help assess
model performance.
6️⃣ Geospatial Data Analysis – Maps and heatmaps are used in
logistics, urban planning, and climate studies to analyze geographical
trends.
7️⃣ Social Media & Sentiment Analysis – Word clouds and network
graphs help analyze customer sentiment and engagement trends.
Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is the process of analyzing and
summarizing datasets to discover patterns, detect anomalies, and
check assumptions before applying machine learning models. It helps
in understanding data distributions, relationships, and hidden
insights.
Steps in Exploratory Data Analysis (EDA):
1️⃣ Understanding the Dataset – Loading the dataset and checking its
structure (rows, columns, data types).
• Example: Using head() in Python to view the first few rows of
data.
2️⃣ Handling Missing Values – Identifying and dealing with missing or
null values.
• Methods: Removing rows, filling with mean/median/mode, or
using predictive models.
3️⃣ Identifying Duplicates & Inconsistencies – Removing duplicate
records and correcting data entry errors.
4️⃣ Descriptive Statistics & Summary – Calculating mean, median,
standard deviation, and percentiles to understand data distribution.
5️⃣ Detecting Outliers – Identifying extreme values using Boxplots, Z-
score, or IQR method to ensure they don’t distort analysis.
6️⃣ Data Visualization – Using plots like histograms, scatter plots, bar
charts, heatmaps to understand relationships and trends.
7️⃣ Feature Engineering & Transformation – Creating new features,
normalizing/scaling data, and encoding categorical variables for
better analysis.
8️⃣ Correlation Analysis – Checking relationships between variables
using correlation matrices to remove redundant features.
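A minimal EDA sketch in pandas covering the first few steps above; students.csv is a hypothetical file name, and corr(numeric_only=True) assumes a recent pandas version:
import pandas as pd

df = pd.read_csv("students.csv")     # hypothetical dataset

print(df.head())                     # 1. inspect structure
df.info()                            # column types and non-null counts
print(df.isnull().sum())             # 2. count missing values per column
df = df.drop_duplicates()            # 3. remove duplicate rows
print(df.describe())                 # 4. mean, std, percentiles
print(df.corr(numeric_only=True))    # 8. correlation between numeric columns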
Types of EDA
1️⃣ Univariate Analysis – Examines a single variable to understand its
distribution and characteristics.
• Methods: Histograms, Boxplots, Pie Charts, Mean, Median,
Mode.
• Example: Analyzing the distribution of students' marks in a
subject.
2️⃣ Bivariate Analysis – Analyzes the relationship between two
variables.
• Methods: Scatter plots, Correlation coefficient, Line charts.
• Example: Studying the relationship between study time and
exam scores.
3️⃣ Multivariate Analysis – Examines relationships between multiple
variables simultaneously.
• Methods: Heatmaps, Pair plots, Principal Component Analysis
(PCA).
• Example: Understanding how multiple factors (age, income,
education) impact loan approval rates.
4️⃣ Graphical EDA – Uses visual representations to identify patterns
and trends in data.
• Methods: Bar charts, Line charts, Boxplots, Heatmaps.
• Example: Visualizing sales trends over time using a line chart.
5️⃣ Descriptive Statistics – Summarizes data using numerical measures.
• Methods: Measures of central tendency (Mean, Median,
Mode), Measures of dispersion (Variance, Standard Deviation,
Range).
• Example: Summarizing a dataset of customer ages using mean
and standard deviation.
6️⃣ Dimensionality Reduction – Reduces the number of variables while
preserving important information.
• Methods: Principal Component Analysis (PCA), t-SNE, Feature
Selection.
• Example: Reducing a dataset with 50 features to 10 important
features for faster computation.
Importance of EDA:
1️⃣ Understanding Data Distribution – Helps in identifying data trends,
skewness, and normality.
• Example: Checking if a dataset follows a normal distribution
before applying machine learning algorithms like Linear
Regression.
2️⃣ Handling Missing Values – Ensures that missing values are either
filled or removed to avoid biased results.
• Example: Using mean/median imputation to fill missing salary
values in an employee dataset.
3️⃣ Detecting and Handling Outliers – Identifies extreme values that
can negatively impact model performance.
• Example: Removing extreme house prices in a real estate
dataset before training a price prediction model.
4️⃣ Feature Selection & Engineering – Helps in selecting relevant
features and creating new meaningful features.
• Example: Converting date of birth into an "age" column to
make it more useful for prediction.
5️⃣ Identifying Relationships Between Variables – Helps in
understanding correlations and dependencies among variables.
• Example: Using a heatmap to see the correlation between
study hours and exam scores.
6️⃣ Reducing Dimensionality – Simplifies the dataset by removing
redundant or less important features.
• Example: Using PCA (Principal Component Analysis) to reduce
100 features to 10 important ones for faster training.
7️⃣ Selecting the Right Machine Learning Model – Provides insights
into which models might work best for the data.
• Example: If data is linearly distributed, Linear Regression is a
good choice; for complex patterns, Decision Trees or Neural
Networks might be better.
Integrated Development Environment(IDE):
An Integrated Development Environment (IDE) is a software
application that provides a comprehensive workspace for
developers to write, test, debug, and run code efficiently. It combines
essential tools like a code editor, compiler, debugger, and
automation tools in one platform to streamline the software
development process.
Key Features of an IDE:
Code Editor – Provides syntax highlighting, auto-completion, and
indentation.
Compiler/Interpreter – Translates code into machine-executable
form.
Debugger – Helps detect and fix errors in the code.
Build Automation Tools – Automates repetitive tasks like
compiling and testing.
Version Control Integration – Works with Git/GitHub for
collaborative coding.
Examples of IDEs:
1️⃣ PyCharm – Best for Python development, widely used in data
science and AI.
2️⃣ Visual Studio Code (VS Code) – Lightweight, supports multiple
languages, and has powerful extensions.
3️⃣ Eclipse – Popular for Java development, supports plugins for
multiple programming languages.
4️⃣ IntelliJ IDEA – Best for Java and Kotlin development, widely used in
Android development.
5️⃣ NetBeans – Supports Java, PHP, C++, and more, great for enterprise
applications.
6️⃣ Jupyter Notebook – Used in Data Science and Machine Learning for
interactive Python coding.
7️⃣ RStudio – Used in Data Science and Statistics with the R language.
Role of IDE in Development:
1️⃣ Increases Developer Productivity – Provides features like syntax
highlighting, auto-completion, and code suggestions, making coding
faster and easier.
2️⃣ Efficient Code Debugging – Includes built-in debuggers that help
detect and fix errors, reducing time spent on troubleshooting.
3️⃣ Compilation & Execution – IDEs integrate compilers and
interpreters that allow developers to quickly run and test their code
within the same environment.
4️⃣ Project Management – Helps manage multiple files, libraries, and
dependencies efficiently, making it easier to work on large projects.
5️⃣ Version Control Integration – Many IDEs support Git/GitHub,
enabling developers to track changes and collaborate on code.
6️⃣ Cross-Platform Development – IDEs like Eclipse, Visual Studio, and
IntelliJ IDEA allow developers to write code for different platforms
(Windows, macOS, Linux) using a single environment.
7️⃣ Automation & Code Refactoring – Helps automate repetitive tasks
like code formatting, compiling, and testing, improving overall
efficiency.
High-Level Programming Languages:
A High-Level Programming Language (HLL) is a programming
language that is designed to be easier for humans to read, write,
and understand compared to low-level languages (like Assembly or
Machine code). These languages are closer to human language and
abstract away the complexities of hardware interactions.
Key Features of High-Level Languages:
Easy to Read & Write – Uses English-like syntax (e.g.,
print("Hello, World!")).
Portable – Can run on different machines without modification.
Memory Management – Handles memory allocation
automatically.
Abstraction from Hardware – Developers don't need to worry
about system architecture.
Examples of High-Level Programming Languages:
1️⃣ Python – Popular for web development, data science, AI, and
automation.
2️⃣ Java – Used for enterprise applications, Android development, and
backend systems.
3️⃣ C++ – Used in game development, system programming, and high-
performance applications.
4️⃣ JavaScript – Essential for web development, both frontend (React,
Angular) and backend (Node.js).
5️⃣ Ruby – Known for simplicity, used in web development (Ruby on
Rails).
6️⃣ Swift – Developed by Apple for iOS and macOS application
development.
7️⃣ PHP – Widely used for server-side web development (WordPress,
Laravel).
Role of HLL in Software Development:
1️⃣ Simplifies Coding & Improves Productivity – HLLs use English-like
syntax, making it easier for developers to write and understand code
quickly. Examples: Python (print("Hello, World!")) vs. Assembly
(complex opcode instructions).
2️⃣ Platform Independence – Most HLLs (e.g., Java, Python) are
portable, meaning they can run on different operating systems
without modification. Java’s "Write Once, Run Anywhere" principle
is a great example.
3️⃣ Easier Debugging & Maintenance – HLLs have built-in error
handling features, making it easier to debug and maintain large-scale
applications.
4️⃣ Supports Rapid Application Development (RAD) – HLLs provide
frameworks and libraries that speed up development (e.g., Django
for Python, Spring Boot for Java).
5️⃣ Enhances Software Security – High-level languages handle memory
management and security features, reducing the risk of issues like
buffer overflows (common in low-level languages like C).
6️⃣ Encourages Code Reusability & Modularity – HLLs support Object-
Oriented Programming (OOP), which promotes code reusability
through concepts like classes and inheritance (e.g., Java, C++,
Python).
7️⃣ Widely Used in AI, Web, and Enterprise Applications – HLLs are
crucial in modern fields like Artificial Intelligence (AI), Machine
Learning, Web Development, and Enterprise Software.
UNIT 2
SQL vs NoSQL:
Feature-by-feature comparison of SQL (Relational Databases) and NoSQL (Non-Relational Databases):
• Full Form – SQL: Structured Query Language; NoSQL: Not Only SQL.
• Data Structure – SQL: structured, organized in tables (rows & columns); NoSQL: flexible, organized as Key-Value, Document, Column, or Graph stores.
• Schema – SQL: fixed schema with a predefined structure; NoSQL: dynamic schema, allowing flexible data formats.
• Scalability – SQL: vertically scalable (adding more CPU/RAM to a single server); NoSQL: horizontally scalable (adding more servers for load distribution).
• Data Storage Model – SQL: stores data in relations (tables); NoSQL: stores data in JSON, XML, key-value pairs, or graphs.
• Query Language – SQL: uses SQL (Structured Query Language); NoSQL: uses NoSQL query methods (MongoDB Query Language, CQL for Cassandra, etc.).
• Transactions – SQL: follows ACID (Atomicity, Consistency, Isolation, Durability) for strict data integrity; NoSQL: follows BASE (Basically Available, Soft state, Eventually consistent) for high availability.
• Joins & Relationships – SQL: supports JOINs for complex relationships; NoSQL: does not support JOINs but uses denormalization and embedded documents.
• Performance – SQL: efficient for structured queries and complex transactions; NoSQL: faster for large-scale unstructured or semi-structured data.
• Flexibility – SQL: less flexible due to rigid schema; NoSQL: highly flexible for evolving data models.
• Use Cases – SQL: ideal for banking, ERP, CRM, finance, and inventory systems where data consistency is critical; NoSQL: best for Big Data, real-time applications, IoT, social media, and recommendation systems.
• Examples – SQL: MySQL, PostgreSQL, Oracle, SQL Server; NoSQL: MongoDB, Cassandra, Firebase, DynamoDB.
Introduction to MongoDB:
What is MongoDB?
MongoDB is a popular NoSQL database that stores data in a
document-oriented format instead of relational tables. Unlike SQL
databases, which store data in rows and columns, MongoDB
organizes information using BSON (Binary JSON) documents, making
it flexible, scalable, and faster for large datasets.
Key Features of MongoDB:
Document-Oriented: Stores data in a flexible, JSON-like format.
Schema-Free: No need to define a rigid structure (unlike SQL).
Horizontally Scalable: Can handle large volumes of data
efficiently.
High Performance: Optimized for big data and real-time
applications.
Easy Integration: Works well with modern web applications and
cloud storage.
Basic MongoDB Commands for Database Management
MongoDB provides simple commands to create, view, and delete
databases using the MongoDB shell or a programming language like
Python or Node.js.
1⃣ Creating a Database in MongoDB
Unlike SQL, MongoDB does not use an explicit CREATE
DATABASE command. Instead, a database is created automatically the first
time data is inserted into one of its collections (the MongoDB equivalent of tables).
To create and switch to a new database:
use myDatabase
If myDatabase does not exist, MongoDB will create it only after
data is added to a collection.
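The same behaviour can be seen from Python with the pymongo driver; this is a minimal sketch that assumes a MongoDB server running locally on the default port:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB server
db = client["myDatabase"]

# The database and collection are created only when data is inserted
db["students"].insert_one({"name": "Asha", "grade": "A"})
print(client.list_database_names())   # myDatabase now appears in the list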
2⃣ Viewing Available Databases
To list all databases in MongoDB, use:
show dbs
Note: Only databases that contain data will be listed. Empty
databases do not appear until they store at least one collection.
3⃣ Deleting a Database
To delete an existing database, follow these steps:
use myDatabase       # First, switch to the database you want to delete
db.dropDatabase()    # Delete the selected database
This permanently deletes myDatabase and all its collections, so
use it carefully.
Real-World Applications of MongoDB
MongoDB is widely used in:
✔ Web Development: Storing user profiles, product catalogs (e.g., e-
commerce sites like Amazon, eBay).
✔ Big Data Applications: Processing large-scale data in real-time
(e.g., IoT, social media).
✔ Content Management Systems (CMS): Handling dynamic content
in blogs, news sites, etc.
✔ Gaming Applications: Storing real-time player stats, scores, and
leaderboards.
Authentication vs Authorization:
What is Authentication?
Authentication is the process of verifying a user’s identity before
granting access. It ensures that the person trying to access the
system is who they claim to be.
Example: Logging into a system using a username & password,
OTP, or fingerprint.
What is Authorization?
Authorization is the process of granting or restricting access to
specific data or resources after authentication is successful. It
determines what a user is allowed to do in a system.
Example: After logging in, a normal user cannot access admin
controls, but an admin can.
Authentication vs. Authorization – Comparison Table
Each point below lists Authentication first, then Authorization:
• Definition – Authentication: confirms a user’s identity; Authorization: determines user permissions.
• Purpose – Authentication: ensures the user is who they claim to be; Authorization: ensures the user has the right access to resources.
• Process – Authentication: user provides credentials (password, biometrics, OTP); Authorization: system checks if the user is allowed to access specific data/actions.
• Timing – Authentication happens before authorization; Authorization happens after authentication.
• Security Type – Authentication: protects identity verification; Authorization: protects data and resource access.
• Example – Authentication: logging in with email & password; Authorization: a logged-in user accessing only their files, while an admin can access all files.
• Implementation – Authentication: handled via login forms, MFA (Multi-Factor Authentication), biometrics, etc.; Authorization: controlled via roles, permissions, access control lists (ACLs).
• Dependency – Authentication is independent and happens first; Authorization depends on authentication and cannot happen without it.
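A minimal, purely illustrative Python sketch of the two concepts; the user store, roles, and permissions are hypothetical, and a real system would use hashed passwords and a proper access-control library:
# Hypothetical user store: username -> (password, role)
USERS = {"alice": ("secret123", "admin"), "bob": ("pass456", "user")}
PERMISSIONS = {"admin": {"read", "write", "delete"}, "user": {"read"}}

def authenticate(username, password):
    """Authentication: verify the user's identity."""
    record = USERS.get(username)
    return record is not None and record[0] == password

def authorize(username, action):
    """Authorization: check what the authenticated user may do."""
    role = USERS[username][1]
    return action in PERMISSIONS[role]

if authenticate("bob", "pass456"):        # identity is verified first
    print(authorize("bob", "delete"))     # False: a normal user cannot delete
    print(authorize("bob", "read"))       # True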
Real Life Use Cases:
Example 1: User Login vs. Admin Access (Website/App)
Authentication:
• When you log in to a website (e.g., Gmail, Facebook, Netflix),
you enter your email and password (or OTP, fingerprint, etc.).
• The system verifies your identity before letting you in.
Authorization:
• After logging in, a regular user can only access their own
profile, emails, or watch history.
• An admin user can edit, delete, or manage other users’
accounts.
• The system checks what you are allowed to do based on your
role.
Example 2: University System (Student vs. Professor)
Authentication:
• A student and professor both log in to the university portal
using their credentials.
• The system checks their identity before granting access.
Authorization:
• A student can view their own grades, timetable, and
assignments.
• A professor can access students' marks, upload results, and
modify course content.
Big Data and Hadoop Framework:
What is Big Data?
Big Data refers to huge volumes of structured, semi-structured, and
unstructured data that traditional databases cannot handle
efficiently. This data is generated at high speed and needs
specialized processing frameworks like Hadoop to store, process,
and analyze it.
What is Hadoop?
Hadoop is an open-source framework developed by Apache for
processing large datasets in a distributed computing environment. It
enables scalable, fault-tolerant, and cost-effective storage and
processing of Big Data.
Example: Companies like Google, Facebook, Amazon, and Netflix
use Hadoop to manage massive amounts of user data.
Features of Hadoop
1⃣ Open-Source & Cost-Effective
• Hadoop is free to use and open-source, meaning anyone can
modify and improve it.
• It eliminates the need for expensive hardware and database
solutions.
2⃣ Distributed Storage (HDFS - Hadoop Distributed File System)
• Hadoop stores data across multiple machines instead of a
single server.
• Even if one machine fails, data is still safe due to data
replication.
3⃣ Fault Tolerance & High Availability
• If a node (computer) fails, Hadoop automatically reallocates the
task to another node.
• Data is replicated across multiple nodes to ensure reliability.
4⃣ Scalability & Flexibility
• Hadoop can easily scale from a single node to thousands of
nodes.
• It can handle structured, unstructured, and semi-structured
data, including text, images, and videos.
5⃣ Parallel Processing with MapReduce
• Hadoop processes data using MapReduce, which splits a task
into smaller sub-tasks and runs them in parallel on multiple
nodes.
• This significantly reduces processing time for large datasets.
6⃣ Supports Multiple Programming Languages
• Unlike traditional databases that mostly use SQL, Hadoop
supports Java, Python, R, Scala, and more for data processing.
7️⃣ Integration with Big Data Tools
• Hadoop works with tools like Apache Spark, Apache Hive,
Apache Pig, and HBase to enhance data processing, querying,
and analysis.
Hadoop Architecture:
Hadoop follows a distributed computing model where large datasets
are processed in parallel across multiple machines. The Hadoop
Architecture is based on the MapReduce framework, which breaks
down tasks into smaller, manageable chunks and processes them
efficiently.
Hadoop Architecture Components:
1⃣ Input – Big Data Ingestion
• The process begins with huge datasets (structured, semi-
structured, unstructured) being loaded into the Hadoop
Distributed File System (HDFS).
• Data can come from databases, logs, IoT sensors, social media,
or other sources.
Example: A company collects millions of website logs, user
activity, and transactions.
2⃣ Map() – Data Splitting & Processing (Mapper Phase)
• The input data is divided into smaller chunks and distributed
across multiple nodes.
• Each chunk is processed in parallel by a Mapper function,
which extracts useful information and converts it into key-value
pairs.
Example: If processing a huge sales dataset, the Mapper can
extract (Product, SalesAmount) as key-value pairs.
3⃣ Shuffle & Sort (Intermediate Phase)
• The intermediate key-value pairs from multiple Mappers are
sorted and grouped.
• Similar keys are brought together for further processing in the
Reducer phase.
Example: If multiple Mapper nodes generate key-value pairs like
(Apple, 50), (Apple, 30), and (Banana, 40), they are grouped
together:
Apple → (50, 30)
Banana → (40)
4⃣ Reduce() – Aggregation & Final Processing (Reducer Phase)
• The Reducer function takes grouped key-value pairs and
performs final calculations like summing, averaging, or filtering.
• The final processed output is generated.
Example: Summing up the sales of each product:
Apple → 50 + 30 = 80
Banana → 40
5⃣ Output – Final Processed Data
• The final results are stored in HDFS or exported to databases,
dashboards, or visualization tools for further analysis.
• The processed data is now ready for business insights,
reporting, or machine learning models.
Example: A company now has a final report of total product
sales, which can be used for decision-making and future predictions.
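The following is a plain-Python simulation of the Map → Shuffle & Sort → Reduce flow on the sales example above; it is only a sketch of the idea, not actual Hadoop/MapReduce code:
from collections import defaultdict

# Input: raw sales records (product, amount)
records = [("Apple", 50), ("Apple", 30), ("Banana", 40)]

# Map: emit key-value pairs
mapped = [(product, amount) for product, amount in records]

# Shuffle & Sort: group values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)        # Apple -> [50, 30], Banana -> [40]

# Reduce: aggregate each group (here, sum the sales)
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)                        # {'Apple': 80, 'Banana': 40}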
Cloud Computing in Data Science (AWS and Google Cloud):
What is Cloud Computing in Data Science?
Cloud computing provides on-demand access to storage, computing
power, and machine learning tools over the internet. Instead of
buying expensive hardware, companies use cloud platforms like AWS
(Amazon Web Services) and Google Cloud to process and analyze Big
Data efficiently.
Why is Cloud Computing Important for Data Science?
• Handles large-scale data storage
• Provides high-speed processing power
• Supports AI & machine learning model training
• Reduces cost & infrastructure maintenance
• Enables scalability & flexibility
AWS Services in Data Science (Common Examples)
AWS (Amazon Web Services) is the most widely used cloud platform
for data storage, processing, and machine learning. Below are some
key AWS services:
1⃣ AWS S3 (Simple Storage Service) – Data Storage
• Stores huge datasets (structured/unstructured)
• Used for data lakes, backups, and distributed storage
Example: Netflix stores and processes user watch history &
recommendations using AWS S3.
2⃣ AWS EC2 (Elastic Compute Cloud) – Virtual Machines
• Provides scalable virtual servers to run machine learning
models
• Allows high-performance computing (HPC) for data processing
tasks
Example: Financial companies use EC2 to predict stock
market trends.
3⃣ AWS Lambda – Serverless Computing
• Runs data science functions without managing servers
• Used for real-time data processing & automation
Example: Used in IoT applications to analyze sensor data
instantly.
4⃣ AWS RDS (Relational Database Service) – Managed Databases
• Provides fully managed databases like MySQL, PostgreSQL, and
SQL Server
• Handles large-scale structured data
Example: E-commerce websites use RDS to store user
purchases & product data.
5⃣ AWS SageMaker – Machine Learning Platform
• Helps train, build, and deploy ML models
• Supports deep learning frameworks like TensorFlow & PyTorch
Example: Amazon uses SageMaker for product
recommendations & fraud detection.
Google Cloud Services in Data Science
Google Cloud provides AI-powered services for data storage,
processing, and analytics. Key services include:
1⃣ Google BigQuery – Data Warehousing
• Fast and scalable data analytics service
Example: Used for real-time analytics & dashboards in
businesses.
2⃣ Google Cloud ML Engine – Machine Learning
• Trains and deploys machine learning models at scale
Example: Used in Google Photos for face recognition.
3⃣ Google Cloud Storage – Secure Data Storage
• Stores large datasets for processing and analysis
Example: Used for archiving, backups, and AI datasets.
4⃣ Google Dataflow – Real-Time Data Processing
• Analyzes streaming data from IoT devices & applications
Example: Used in self-driving cars for real-time decision-
making.
AWS Billing and Pricing Models:
AWS follows a pay-as-you-go model, meaning you only pay for the
resources you use. The pricing structure is designed to be flexible
and cost-efficient for different workloads. Here are the key
components of AWS billing and pricing models:
1⃣ Compute Costs
• Charges for processing power used in virtual machines (EC2),
serverless functions (Lambda), or containers (ECS/EKS).
• Billed based on time used (per second/minute/hour) or
number of requests (serverless).
Example: Running an EC2 instance for 2 hours will be charged
based on the instance type and hourly rate.
2⃣ Storage Costs
• Cost for storing data in AWS services like S3 (object storage),
EBS (block storage), and Glacier (archival storage).
• Charges depend on storage type, data size, and access
frequency.
Example: S3 Standard costs more than S3 Glacier, which is
cheaper but has a longer retrieval time.
3⃣ Data Transfer Costs
• Incoming data (ingress) is free, but outgoing data (egress) is
charged based on the amount transferred.
• Data transfer within the same region is often free, but cross-
region transfers cost extra.
Example: If a website hosted on AWS serves 1TB of data to global
users, egress charges apply.
4⃣ Request & API Call Costs
• Some services charge per request or per API call, such as AWS
Lambda, S3, DynamoDB, and API Gateway.
• The cost depends on the number of function executions,
database queries, or HTTP requests.
Example: AWS Lambda charges per execution and duration, so a
heavily used function will have higher costs.
5⃣ AWS Support & Licensing Costs
• Extra charges apply for premium AWS support plans
(Developer, Business, Enterprise).
• AWS also charges for licensed software usage (e.g., Windows
Server, SQL Server, Oracle).
Example: Using Windows-based EC2 instances incurs extra
licensing costs compared to Linux.
Structuring Unstructured Data:
Unstructured data includes text, images, videos, and audio, which
do not follow a predefined format like traditional structured
databases (e.g., relational databases). Since over 80% of enterprise
data is unstructured, structuring it is essential for storage, retrieval,
and analysis in machine learning, artificial intelligence, and business
intelligence applications.
To convert unstructured data into a structured format, various
Natural Language Processing (NLP), Computer Vision (CV), and Data
Processing techniques are used.
1⃣ Handling Text Data
Text data is highly unstructured and requires processing techniques
to structure it for analysis. It is commonly found in emails, social
media posts, customer reviews, documents, chat logs, and web
content.
Techniques to Structure Text Data:
✔ Tokenization – Splitting text into words or phrases (e.g., "Data
Science is awesome!" → ["Data", "Science", "is", "awesome"]).
✔ Stemming & Lemmatization – Reducing words to their root forms
(e.g., "running" → "run", "better" → "good").
✔ Stopword Removal – Eliminating common words like "is," "the,"
and "and" to focus on meaningful terms.
✔ Named Entity Recognition (NER) – Identifying entities like names,
locations, dates, and brands in text.
✔ TF-IDF (Term Frequency-Inverse Document Frequency) –
Converting words into numerical values to understand their
importance in a document.
✔ Word Embeddings (Word2Vec, GloVe, BERT) – Converting words
into vector representations for sentiment analysis and text
classification.
Example: A company analyzing customer reviews can categorize
feedback into positive, negative, or neutral sentiments using TF-IDF
or Word2Vec models.
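A minimal scikit-learn sketch of turning raw text into structured TF-IDF features (the two reviews are made up; assumes scikit-learn is installed):
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "The product is awesome and the delivery was fast",
    "The product is terrible and the delivery was slow",
]

# Tokenization, stopword removal, and TF-IDF weighting in one step
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # structured vocabulary
print(tfidf_matrix.toarray())              # numeric features per review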
2⃣ Handling Image Data
Images contain visual information that must be converted into
structured formats for analysis. Image structuring is widely used in
facial recognition, medical imaging, self-driving cars, and security
surveillance.
Techniques to Structure Image Data:
✔ Feature Extraction – Identifying edges, colors, shapes, and
textures within an image.
✔ Object Detection – Using AI models like YOLO (You Only Look
Once), SSD (Single Shot MultiBox Detector), and Faster R-CNN to
detect objects in images.
✔ Image Segmentation – Dividing an image into regions or segments
to analyze different parts separately.
✔ Optical Character Recognition (OCR) – Extracting text from images
(e.g., reading number plates, digitizing printed documents).
✔ Convolutional Neural Networks (CNNs) – Deep learning models
used for image classification, recognition, and enhancement.
✔ Metadata Extraction – Extracting additional information like
timestamps, camera type, geolocation, and device details from
image files.
Example: Self-driving cars process real-world images using CNNs
and object detection to identify traffic signals, pedestrians, and lane
markings for autonomous navigation.
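A minimal OpenCV sketch of basic feature extraction, grayscale conversion followed by edge detection; street_scene.jpg is a hypothetical file and the example assumes the opencv-python package:
import cv2

# Hypothetical image file
img = cv2.imread("street_scene.jpg")

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # simplify colors to intensity
edges = cv2.Canny(gray, 100, 200)              # feature extraction: edges

# The edge map is now a structured numeric array that models can use
print(edges.shape, edges.dtype)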
3⃣ Handling Video Data
Videos contain a sequence of images (frames) and audio, requiring
advanced structuring techniques to extract meaningful insights.
Video structuring is commonly used in security surveillance, content
moderation, sports analytics, and social media.
Techniques to Structure Video Data:
✔ Frame Extraction – Breaking videos into individual image frames
for further processing.
✔ Motion Detection & Tracking – Identifying movements within a
video to analyze suspicious behavior, sports analytics, or animation
tracking.
✔ Optical Character Recognition (OCR) in Videos – Extracting license
plates, subtitles, or handwritten text from video content.
✔ Speech-to-Text (STT) & Natural Language Processing (NLP) –
Converting spoken words into structured text for subtitles,
transcriptions, or voice commands.
✔ Face & Object Recognition – Detecting people, objects, or actions
using AI-driven models.
✔ Sentiment & Content Analysis – Categorizing videos based on
mood, theme, or explicit content (e.g., YouTube’s content
moderation).
Example: YouTube and Netflix use speech-to-text and facial
recognition to improve content recommendations, automatic
captioning, and content moderation.
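A minimal OpenCV sketch of the frame-extraction step; surveillance.mp4 is a hypothetical file, and keeping every 30th frame is an arbitrary illustrative choice:
import cv2

cap = cv2.VideoCapture("surveillance.mp4")   # hypothetical video file

frame_count = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_count % 30 == 0:                # keep roughly one frame per second
        cv2.imwrite(f"frame_{frame_count}.jpg", frame)
    frame_count += 1
cap.release()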
4⃣ Handling Audio Data
Audio data, like voice recordings, podcasts, or music, is unstructured
but can be converted into structured formats for AI applications
such as virtual assistants (Alexa, Siri), call center analytics, and fraud
detection.
Techniques to Structure Audio Data:
✔ Speech Recognition (ASR - Automatic Speech Recognition) –
Converting spoken words into structured text (e.g., Google Voice,
Siri, Alexa).
✔ Speaker Diarization – Identifying who is speaking in an audio
recording.
✔ Sentiment Analysis on Audio – Understanding tone, pitch, and
emotions from voice recordings.
✔ Keyword Spotting – Extracting specific keywords or phrases for
chatbots and customer service applications.
✔ Noise Filtering & Enhancement – Removing background noise to
improve clarity and quality.
Example: Call centers use speech-to-text and sentiment analysis
to analyze customer satisfaction and agent performance from
recorded calls.
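A minimal librosa sketch that converts a hypothetical recording into structured numeric features (MFCCs), which downstream speech-analysis models can then use:
import librosa

# Hypothetical recorded call
y, sr = librosa.load("support_call.wav")        # raw waveform + sample rate

# MFCCs: compact numeric features commonly used for speech analysis
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)                               # (13, number_of_frames)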
Tools for Structuring Data:
Structuring data involves transforming raw, unstructured data (like
text, images, videos, and audio) into an organized, structured format
for analysis and machine learning. Various tools help in data
preprocessing, transformation, and structuring, making data usable
for AI and analytics.
1. Tools for Structuring Text Data (NLP & Processing)
Unstructured text data is structured using Natural Language
Processing (NLP) tools, which help in text cleaning, tokenization,
entity recognition, and sentiment analysis.
Popular Tools:
✔ NLTK (Natural Language Toolkit) – Python library for tokenization,
stopword removal, and stemming.
✔ SpaCy – Advanced NLP tool for named entity recognition (NER)
and dependency parsing.
✔ TextBlob – Simplifies sentiment analysis, language translation,
and POS tagging.
✔ Google Cloud NLP & AWS Comprehend – Cloud-based text
processing and sentiment analysis tools.
✔ OpenAI GPT – Processes and structures text using large language
models (LLMs).
2. Tools for Structuring Image Data (Computer Vision)
Images contain visual information that needs to be structured for
classification, object detection, and feature extraction.
Popular Tools:
✔ OpenCV (Open Source Computer Vision Library) – Used for image
processing, object detection, and feature extraction.
✔ TensorFlow & PyTorch (Deep Learning Libraries) – Helps in image
classification using CNNs (Convolutional Neural Networks).
✔ Google Vision AI & AWS Rekognition – Cloud-based image
structuring tools for facial recognition and OCR.
✔ Tesseract OCR – Extracts text from images for document
digitization.
3. Tools for Structuring Video Data (Processing & Analysis)
Video structuring involves frame extraction, object detection, and
speech-to-text conversion.
Popular Tools:
✔ FFmpeg – Extracts frames from video, resizes, and processes
video content.
✔ YOLO (You Only Look Once) & Faster R-CNN – Detects objects in
video streams.
✔ Google Video AI & AWS Rekognition Video – Cloud-based tools
for video classification, motion tracking, and face recognition.
✔ Kaldi & DeepSpeech – Converts speech from videos into
structured text (speech-to-text processing).
4. Tools for Structuring Audio Data (Speech Processing &
Analysis)
Audio structuring focuses on transcribing, speaker identification,
and keyword extraction.
Popular Tools:
✔ CMU Sphinx & DeepSpeech – Converts speech to text for
transcription.
✔ Google Speech-to-Text API & AWS Transcribe – Cloud-based
services for audio structuring and sentiment analysis.
✔ Praat – Analyzes voice pitch, tone, and phonetics for linguistic
studies.
✔ Librosa – Processes audio signals for machine learning
applications.
5. Tools for Structuring General Unstructured Data (Data
Engineering & Transformation)
These tools help in structuring text, images, videos, and audio by
transforming, filtering, and organizing data into structured formats.
Popular Tools:
✔ Pandas & NumPy – Used for data cleaning, transformation, and
structuring in Python.
✔ Apache Hadoop & Apache Spark – Used for processing and
structuring big data efficiently.
✔ SQL & NoSQL Databases (MongoDB, PostgreSQL) – Store and
structure raw data into organized formats.
✔ ETL Tools (Talend, Apache Nifi, Alteryx) – Extract, Transform, and
Load (ETL) processes for structuring enterprise data.
UNIT 3
Bias and Variance Trade Off in Model Selection
The Bias-Variance Tradeoff is a crucial concept in machine learning
model selection. It describes the balance between bias and variance
to minimize total prediction error and achieve the best model
performance.
What is Bias?
Bias refers to the error introduced by approximating a real-world
problem with a simplified model. It occurs when a model makes
strong assumptions and fails to capture the data's underlying
patterns.
✔ High Bias → Underfitting (Model is too simple)
✔ Low Bias → Better Fit
🛠 Example:
• Using Linear Regression to model complex non-linear
relationships leads to high bias because it oversimplifies the
data.
What is Variance?
Variance refers to the model’s sensitivity to small fluctuations in
training data. A model with high variance fits the training data too
well, including noise, and performs poorly on new data.
✔ High Variance → Overfitting (Model is too complex)
✔ Low Variance → Better Generalization
🛠 Example:
• A Deep Neural Network memorizing every detail of training
data but failing on unseen test data.
The Tradeoff: Finding the Sweet Spot
• If a model has high bias and low variance, it makes simplistic
assumptions and does not learn well.
• If a model has high variance and low bias, it learns noise from
training data and fails on test data.
• The best model minimizes both bias and variance, achieving
low total error.
• So, it is required to make a balance between bias and variance
errors, and this balance between the bias error and variance
error is known as the Bias-Variance trade-off.
🛠 Solution:
✔ Use regularization techniques (L1/L2, dropout) to prevent
overfitting.
✔ Choose an optimal model complexity based on cross-validation.
✔ Use ensemble methods (Bagging, Boosting) to reduce variance.
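A small synthetic-data sketch of the tradeoff using scikit-learn: a degree-1 model underfits (high bias), a very high-degree model tends to overfit (high variance), and cross-validation helps pick a balanced complexity. The data and degrees are illustrative only:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=100)   # non-linear data + noise

for degree in [1, 4, 15]:   # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()  # cross-validated R²
    print(degree, round(score, 3))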
Regularization:
Regularization is a technique used in machine learning to prevent
overfitting by adding a penalty term to the loss function. It helps
control model complexity, ensuring that the model generalizes well
on unseen data.
Why is Regularization Needed?
In regression models, if the number of features (predictors) is high,
the model may become too complex and start memorizing noise,
leading to overfitting. Regularization helps by reducing the impact of
less important features and keeping the model simple yet effective.
Types of Regularization Techniques
There are two main types:
1⃣ Lasso Regression (Least Absolute Shrinkage
and Selection Operator) (L1 Regularization)
✔ Uses the L1 norm (sum of absolute values of coefficients).
✔ Shrinks some coefficients exactly to zero, effectively performing
feature selection.
✔ Helps in removing irrelevant or redundant features.
🛠 Formula: Cost = RSS + λ × Σ|wᵢ|
where RSS is the residual sum of squares, wᵢ are the coefficients, and λ
(lambda) is the regularization parameter.
✔ Higher λ → More coefficients shrink to zero (simpler model).
✔ Lower λ → Model behaves like ordinary regression.
Example:
• Used in high-dimensional datasets, such as text processing (e.g.,
NLP), where many features are irrelevant.
2⃣ Ridge Regression (L2 Regularization)
✔ Uses the L2 norm (sum of squared values of coefficients).
✔ Shrinks coefficients toward zero, but never exactly zero.
✔ Helps when all features contribute to the prediction but need
controlled impact.
🛠 Formula: Cost = RSS + λ × Σwᵢ²
where wᵢ are the coefficients and λ controls the regularization strength.
✔ Higher λ → Coefficients become smaller, reducing complexity.
✔ Lower λ → Model behaves like ordinary regression.
Example:
• Used in finance and medical prediction models, where all
features contribute but should not dominate the model.
Elastic Net (L1 + L2 Combination)
✔ Elastic Net combines Lasso and Ridge to get the best of both
worlds.
✔ Helps when data has high dimensionality and multicollinearity.
✔ Adds both L1 and L2 penalties to the loss function.
🛠 Formula: Cost = RSS + λ₁ × Σ|wᵢ| + λ₂ × Σwᵢ²
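A minimal scikit-learn sketch (synthetic data) comparing the three regularizers; note that sklearn's alpha parameter plays the role of λ, and the exact coefficient values are illustrative:
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter; the rest are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, size=200)

for model in [Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)]:
    model.fit(X, y)
    # Lasso tends to drive the noise coefficients exactly to zero;
    # Ridge only shrinks them toward zero
    print(type(model).__name__, np.round(model.coef_, 2))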
Role in preventing overfitting:
What is Overfitting?
Overfitting happens when a machine learning model memorizes the
training data instead of learning its general patterns. This results in:
✔ High accuracy on training data
✔ Poor performance on test/unseen data
Example: A model that perfectly fits a training dataset but fails to
predict new data accurately.
How Regularization Prevents Overfitting?
Regularization techniques add a penalty term to the loss function
that prevents the model from assigning too much importance to any
single feature.
Shrinks large coefficients (reducing model complexity)
Prevents learning noise or random variations in training data
Forces the model to focus on essential patterns
Role of Lasso (L1) in Preventing Overfitting
✔ Feature selection: Shrinks irrelevant feature coefficients to exactly
zero
✔ Creates a simpler, interpretable model
✔ Best for high-dimensional datasets where only a few features
matter
Example: In a house price prediction model, Lasso removes
irrelevant features like "house color" while keeping key ones like
"area" and "number of bedrooms."
Role of Ridge (L2) in Preventing Overfitting
✔ Reduces complexity without eliminating features
✔ Best when all features contribute but need control
✔ Helps in cases of multicollinearity (high correlation between
variables)
Example: In stock market prediction, Ridge ensures no single
feature dominates, improving generalization.
Elastic Net: Best of Both Worlds
✔ Combines L1 and L2 for balanced regularization
✔ Useful when some features need elimination (Lasso) and others
need reduction (Ridge)
Example: In medical diagnosis, where some test results may be
completely irrelevant while others need tuning.
Comparison Table:
Feature-by-feature comparison of Lasso Regression (L1), Ridge Regression (L2), and Elastic Net (L1 + L2):
• Regularization Type – Lasso: L1 (absolute values of coefficients); Ridge: L2 (squared values of coefficients); Elastic Net: combination of L1 & L2.
• Effect on Coefficients – Lasso: some coefficients shrink exactly to zero; Ridge: coefficients shrink but never to zero; Elastic Net: some coefficients are zero, others shrink.
• Feature Selection? – Lasso: yes (removes irrelevant features); Ridge: no (keeps all features); Elastic Net: yes, but with controlled shrinking.
• Best for Overfitting Prevention? – Lasso: when many irrelevant features exist; Ridge: when all features matter but need control; Elastic Net: when data has high-dimensional & correlated features.
• Handles Multicollinearity? – Lasso: no, may select one feature and ignore the rest; Ridge: yes, distributes weight across features; Elastic Net: yes, balances feature selection and weight reduction.
• Best Use Cases – Lasso: high-dimensional data (e.g., NLP, gene expression analysis); Ridge: regression with correlated features (e.g., stock market, weather forecasting); Elastic Net: datasets with both irrelevant & important but correlated features (e.g., medical diagnosis).
PCA and Dimensionality Reduction:
Dimensionality Reduction:
What is Dimensionality Reduction?
Dimensionality reduction is a data transformation technique that
reduces the number of features (dimensions) in a dataset while
preserving essential information. It helps in:
✔ Removing redundant or irrelevant features
✔ Improving computational efficiency
✔ Enhancing visualization of high-dimensional data
✔ Avoiding the "Curse of Dimensionality" (where too many features
reduce model performance)
Example: Imagine an e-commerce dataset with 100+ customer
features. Many features may be correlated (like "total purchases" &
"purchase frequency"). Dimensionality reduction helps transform
this data into fewer, more meaningful features.
Why is Dimensionality Reduction a Data Transformation
Technique?
Dimensionality reduction doesn’t just remove features—it
transforms them into a new set of informative features. These new
features may be:
✔ Linear Combinations of Original Features (e.g., Principal
Component Analysis - PCA)
✔ Compressed Representations (e.g., Autoencoders in Deep
Learning)
✔ Feature Selection-Based Reductions (e.g., removing less
significant features)
Types of Dimensionality Reduction Techniques
1⃣ Feature Selection (Selecting Most Important Features)
• Removes irrelevant, redundant, or highly correlated features
• Does not change the nature of features, only selects a subset
• Examples:
✔ Variance Thresholding – Removes features with low variance
✔ Correlation-Based Selection – Removes correlated features
✔ Chi-Square Test, Mutual Information – Used for feature
selection in machine learning
Example: If you are predicting house prices, keeping "square
footage" but removing "number of windows" (if it's irrelevant).
2⃣ Feature Extraction (Transforming Features into New Ones)
• Transforms data into a new set of features
• Commonly used when there are highly correlated features
• Examples:
✔ Principal Component Analysis (PCA) – Converts features into
principal components that capture the most variance
✔ Linear Discriminant Analysis (LDA) – Used for supervised
learning to maximize class separation
✔ t-SNE & UMAP – Used for non-linear transformations &
clustering
Example: Instead of keeping "height" and "weight" separately,
PCA may create a new feature like "body size index", combining
them.
Real-World Applications of Dimensionality Reduction
✔ Image Compression & Face Recognition – Reducing pixels while
maintaining key features
✔ Medical Diagnosis – Transforming thousands of genetic features
into key disease indicators
✔ Speech & Text Processing – Reducing vocabulary size in NLP
models
✔ Stock Market Prediction – Identifying key financial indicators from
large datasets
Example: In handwriting recognition, a dataset with 10,000
pixels per image can be reduced to 50 key features using PCA,
making it easier for machine learning models to process.
PCA to visualize high-dimensional data:
Why is High-Dimensional Data Hard to Visualize?
When datasets have many features (dimensions), it becomes
difficult to:
✔ Plot them in 2D or 3D (since we can't visualize 4D, 5D, etc.)
✔ Understand relationships & patterns
✔ Identify clusters or anomalies
Example: Imagine trying to visualize customer behavior based on
100 features like age, income, purchase history, etc.
What is PCA (Principal Component Analysis)?
PCA is a popular linear dimensionality reduction technique that
transforms high-dimensional data into a lower-dimensional form
while preserving the most important patterns.
How PCA Helps in Visualization?
PCA reduces high-dimensional data to 2D or 3D while keeping the
most important information.
✔ Finds new axes (Principal Components) that capture the most
variance
✔ Projects data into 2D/3D space for visualization
✔ Helps in identifying clusters, trends, and anomalies
Example: PCA can take a 100-feature dataset and convert it into
two principal components (PC1 & PC2), which can then be plotted in
a 2D scatter plot.
Steps for PCA-based Visualization
1️⃣ Preprocess the data – Standardize features for equal importance
2️⃣ Apply PCA – Reduce dimensions while keeping maximum variance
3️⃣ Select Top 2 or 3 Principal Components
4️⃣ Plot the data – Use scatter plots (2D) or 3D plots to visualize
relationships
5️⃣ Analyze patterns & clusters – Identify groups, outliers, and trends
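A minimal scikit-learn sketch of these steps using the built-in Iris dataset (4 features reduced to 2 principal components for a 2D scatter plot):
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)              # 4 features per flower

X_scaled = StandardScaler().fit_transform(X)   # 1. standardize features
pca = PCA(n_components=2)                      # 2-3. keep top 2 components
X_2d = pca.fit_transform(X_scaled)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)       # 4. 2D scatter plot
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
print(pca.explained_variance_ratio_)           # variance captured by PC1, PC2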
Real-World Applications of PCA in Visualization
✔ Customer Segmentation – Grouping customers based on shopping
habits
✔ Genomics & Bioinformatics – Visualizing genetic variations
✔ Stock Market Analysis – Identifying patterns in stock price
movements
✔ Handwriting Recognition – Visualizing letter similarities in 2D
K-Nearest Neighbors (KNN) Algorithm
What is the K-Nearest Neighbors (KNN) Algorithm?
K-Nearest Neighbors (KNN) is a supervised learning algorithm used
for classification and regression. It predicts the class or value of a
data point by analyzing the 'k' closest points (neighbors) in the
dataset.
Example: Imagine you are identifying whether a new student is
introverted or extroverted based on their behavior. KNN will check
the k most similar students and assign the majority label.
How Does KNN Work?
1️⃣ Choose a value for ‘k’ (number of neighbors)
2️⃣ Measure the distance between the new data point and existing
points (using Euclidean, Manhattan, or other distance metrics)
3️⃣ Find the ‘k’ nearest neighbors (points closest to the new data
point)
4️⃣ Classify (for classification problems) – Assign the most common
class among neighbors
5️⃣ Predict (for regression problems) – Take the average of ‘k’
neighbors’ values
Example: If k = 5 and 3 out of 5 nearest points belong to the
"Extrovert" category, the new data point is classified as Extrovert.
Choosing the Right 'k' in KNN
✔ A small k (e.g., k=1, k=3) is sensitive to noise and may overfit
✔ A large k (e.g., k=15, k=20) smoothens predictions but may
underfit
✔ The best k is chosen using cross-validation
Tip: Choosing an odd value of k helps avoid ties in binary
classification.
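A common way to pick k is to compare cross-validated accuracy across candidate values; a small sketch using scikit-learn's built-in Iris dataset:
```python
# Choosing k by cross-validation (sketch; Iris is only a placeholder dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7, 9, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")
```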
Real-World Applications
✔ KNN: Handwriting recognition, fraud detection, recommendation
systems
✔ Clustering: Customer segmentation, anomaly detection, gene
expression analysis
Example:
• Netflix uses KNN to recommend movies based on user
preferences
• E-commerce platforms use clustering to group similar
customers for targeted marketing
Clustering (k-means):
• K-Means Clustering is an unsupervised learning algorithm used to
solve clustering problems in machine learning and data science.
• It groups an unlabeled dataset into K clusters, where K is the
number of pre-defined clusters to be created: if K=2 there will be
two clusters, if K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into
K clusters such that each data point belongs to only one group, and
points within the same group share similar properties.
The working of the K-Means algorithm is explained in the steps
below:
• Step-1: Select the number K to decide how many clusters to create.
• Step-2: Select K random points as the initial centroids (these may
or may not be points from the input dataset).
• Step-3: Assign each data point to its closest centroid, forming the
K clusters.
• Step-4: Recompute the centroid (mean) of each cluster.
• Step-5: Repeat Step-3, i.e., reassign each data point to its new
closest centroid.
• Step-6: If any reassignment occurred, go back to Step-4; otherwise,
stop.
• Step-7: The model is ready.
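A minimal sketch of these steps using scikit-learn's KMeans on a tiny made-up 2D dataset:
```python
# K-Means sketch (K=2); assumes scikit-learn is installed.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],      # one group of nearby points
     [10, 2], [10, 4], [10, 0]]   # another group far away

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)    # runs the assign/recompute loop until centroids stop moving

print("Cluster labels:", labels)
print("Final centroids:\n", kmeans.cluster_centers_)
```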
Regression Models: Linear & Logistic Regression
Regression models are used in machine learning to analyze
relationships between variables and make predictions. The two
most common types are Linear Regression and Logistic Regression.
What is Linear Regression?
Linear Regression is a supervised learning algorithm used for
predicting continuous values based on independent variables.
Example: Predicting house prices based on features like size,
number of rooms, and location.
How It Works?
It establishes a linear relationship between dependent (Y) and
independent (X) variables using the equation:
Y = mX + b
✔ Y → Predicted value (dependent variable)
✔ X → Feature/input (independent variable)
✔ m → Slope (weight)
✔ b → Intercept
Example Formula:
If House Price = 50,000 + (10,000 × Area in sq. ft), then for a 1,500 sq.
ft house, the price will be:
50,000 + (10,000 × 1,500) = 15,050,000
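A minimal sketch that recovers m and b from data following this formula, using scikit-learn's LinearRegression (the area values are made up):
```python
# Linear-regression sketch: learn price = m * area + b from a few illustrative points
# (assumes scikit-learn and numpy are installed).
import numpy as np
from sklearn.linear_model import LinearRegression

area  = np.array([[1000], [1200], [1500], [1800], [2000]])   # sq. ft
price = 50_000 + 10_000 * area.ravel()                       # follows the formula above

model = LinearRegression().fit(area, price)
print("m (slope):    ", model.coef_[0])      # ≈ 10,000
print("b (intercept):", model.intercept_)    # ≈ 50,000
print("Predicted price for 1,500 sq. ft:", model.predict([[1500]])[0])
```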
What is Logistic Regression?
Logistic Regression is used for classification problems where the
output is binary (0 or 1, Yes or No, True or False).
Example: Spam detection – Classifying emails as Spam (1) or Not
Spam (0).
How It Works?
Instead of fitting a straight line, Logistic Regression applies a Sigmoid
(S-shaped) function, P = 1 / (1 + e^(-z)) where z = mX + b, to map
predictions to probabilities between 0 and 1:
✔ Output is a probability – If P > 0.5, the model predicts 1 (Spam),
otherwise 0 (Not Spam).
Example:
In Diabetes prediction, if a patient’s data gives P = 0.78, the model
predicts "Diabetic" (1).
Real-World Applications
✔ Linear Regression:
• Stock Market Prediction
• Sales Forecasting
• Predicting Student Scores
✔ Logistic Regression:
• Medical Diagnosis (Diabetes, Cancer detection)
• Credit Card Fraud Detection
• Customer Churn Prediction
K-Fold Cross Validation
What is K-Fold Cross Validation?
K-Fold Cross Validation is a resampling technique used in machine
learning to evaluate model performance more reliably. Instead of
using a single train-test split, it divides the dataset into multiple folds
(subsets), ensuring every part of the data is used for both training
and testing.
Purpose of K-Fold Cross Validation:
✔ Ensures model evaluation is more stable
✔ Reduces the risk of overfitting or underfitting
✔ Uses entire dataset efficiently for training and testing
✔ Works well for small datasets with limited data
Steps Involved in K-Fold Cross Validation
1️⃣ Shuffle the dataset randomly to ensure fairness.
2️⃣ Split the dataset into K equal-sized folds (subsets).
• If K = 5, the dataset is split into 5 parts.
3️⃣ Train the model K times:
• In each iteration, (K-1) folds are used for training
• The remaining 1 fold is used for testing
4️⃣ Repeat for K iterations, so each fold is used as a test set once.
5️⃣ Compute the average performance metric across all K iterations
(e.g., accuracy, RMSE, F1-score).
Example (5-Fold Cross Validation)
Iteration   Training Folds      Testing Fold
1           Folds 2, 3, 4, 5    Fold 1
2           Folds 1, 3, 4, 5    Fold 2
3           Folds 1, 2, 4, 5    Fold 3
4           Folds 1, 2, 3, 5    Fold 4
5           Folds 1, 2, 3, 4    Fold 5
Final Model Performance = Average of all test results
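A minimal sketch of 5-fold cross-validation with scikit-learn (the built-in Iris dataset and a logistic-regression model are placeholders):
```python
# 5-fold cross-validation sketch; assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)     # Steps 1-2: shuffle, split into 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)  # Steps 3-4

print("Accuracy per fold:", scores.round(3))
print("Final performance (average):", scores.mean().round(3))             # Step 5
```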
Choosing K in K-Fold
✔ Common values: K = 5 or K = 10
✔ Higher K (e.g., 10) → each model trains on more data, but
computation is slower
✔ Lower K (e.g., 5) → faster, but each model sees less training data,
so the estimate may be less reliable
✔ Leave-One-Out Cross Validation (LOOCV) → Special case where K
= N (one sample is left out each time)
Advantages of K-Fold Cross Validation
More reliable model evaluation than simple train-test splits
Reduces bias in model selection
Helps in tuning hyperparameters ⚙
Works with imbalanced datasets when combined with stratified
folds (Stratified K-Fold), which preserve class proportions in every fold
Parsimony:
Parsimony in data science refers to the principle of keeping models
as simple as possible while maintaining accuracy. It follows the
Occam's Razor principle, which states that simpler solutions are
preferred over complex ones when both perform similarly.
Why is Parsimony Important?
✔ Avoids Overfitting – Complex models may memorize noise instead
of learning real patterns.
✔ Enhances Interpretability – Simpler models are easier to
understand and explain.
✔ Improves Generalization – A parsimonious model works better on
unseen data.
✔ Reduces Computational Cost – Fewer parameters mean faster
training and execution.
Parsimony in Data Science Practices
1️⃣ Feature Selection – Removing unnecessary features to keep the
model simple.
2️⃣ Regularization (Lasso & Ridge Regression) – Penalizing large
coefficients to prevent unnecessary complexity (see the sketch after this list).
3️⃣ Dimensionality Reduction (PCA, t-SNE) – Reducing the number of
input variables while retaining information.
4️⃣ Choosing Simple Algorithms – Preferring logistic regression or
decision trees over deep learning when possible.
5️⃣ Hyperparameter Tuning – Avoiding excessive tuning that leads to
overly complex models.
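A minimal sketch of point 2️⃣ (Lasso regularization) on synthetic data where only one feature truly matters, so the other coefficients shrink toward zero:
```python
# Parsimony via L1 regularization (Lasso); synthetic data, assumes scikit-learn and numpy.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # 5 candidate features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)    # only feature 0 drives y

lasso = Lasso(alpha=0.1).fit(X, y)
print("Coefficients:", lasso.coef_.round(2))           # irrelevant features shrink toward 0
```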
Real-World Example
Spam Detection Model:
• A complex neural network may achieve 99% accuracy, but with
high computation and less interpretability.
• A simple logistic regression model with a few key features (e.g.,
word frequency, sender info) may achieve 95% accuracy but is
faster and easier to explain.
Conclusion
Parsimony ensures that data science models are efficient,
interpretable, and generalizable without unnecessary complexity.
The goal is to find the simplest model that performs well on real-
world data.
Data Smoothing:
Data smoothing is a technique used in data preprocessing to remove
noise and irregularities from datasets, making patterns more
noticeable. It helps in improving data quality, trend analysis, and
model performance.
Why is Data Smoothing Important?
✔ Removes Noise – Helps eliminate random fluctuations in data.
✔ Enhances Trends – Makes it easier to identify underlying patterns.
✔ Improves Model Performance – Reduces variability for better
predictions.
✔ Aids in Data Visualization – Smoother data makes graphs and
reports clearer.
Common Data Smoothing Techniques
1⃣ Moving Average (Rolling Mean)
• Computes the average of a fixed number of past data points.
• Commonly used in time-series forecasting (e.g., stock prices,
weather trends).
• Example: a 3-day moving average of stock prices smooths short-
term fluctuations (see the sketch after this list).
2⃣ Exponential Smoothing
• Assigns higher weights to recent observations while reducing
the impact of older data.
• Useful for real-time forecasting (e.g., demand prediction, sales
forecasting).
3⃣ Binning Method 🏗
• Groups data into bins (intervals) and replaces values with
mean, median, or boundary values.
• Helps in handling noisy data in classification and regression
problems.
4⃣ Regression-Based Smoothing
• Fits a regression model (e.g., linear, polynomial) to the dataset
to reduce noise.
• Commonly used in trend analysis and predictive modeling.
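A minimal sketch of techniques 1⃣ and 2⃣ with pandas, on a made-up noisy price series:
```python
# Smoothing sketch: 3-point moving average and exponential smoothing
# (illustrative numbers; assumes pandas is installed).
import pandas as pd

prices = pd.Series([100, 104, 98, 107, 103, 110, 95, 112, 108, 115])

moving_avg = prices.rolling(window=3).mean()   # 1) rolling mean over the last 3 points
exp_smooth = prices.ewm(alpha=0.5).mean()      # 2) recent points weighted more heavily

print(pd.DataFrame({"raw": prices, "moving_avg": moving_avg, "exp_smooth": exp_smooth}))
```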
Real-World Example
Stock Market Analysis
• A raw stock price graph has frequent fluctuations due to
market volatility.
• Using Moving Average (MA) smooths the price curve, making
trends easier to analyze.
Weather Forecasting
• Temperature data is often noisy due to sudden changes.
• Applying exponential smoothing helps in making predictions
more stable.
Time Series Analysis:
Time Series Analysis is a statistical technique used to analyze data
points collected over time at regular intervals. It helps in
understanding patterns, trends, and seasonal variations to make
future predictions.
Key Components of Time Series
1⃣ Trend – Long-term increase or decrease in data (e.g., rising
global temperatures).
2⃣ Seasonality – Recurring patterns at regular intervals (e.g.,
holiday sales spikes).
3⃣ Cyclic Patterns – Fluctuations that occur over time but without
fixed periods (e.g., economic cycles).
4⃣ Irregular Variations ⚠ – Random, unpredictable changes due to
unforeseen events (e.g., natural disasters).
Time Series Analysis Methods
✔ Moving Averages – Smooths fluctuations to highlight trends
(used in stock price forecasting).
✔ Exponential Smoothing – Assigns greater importance to recent
observations for trend forecasting.
✔ Autoregressive Integrated Moving Average (ARIMA) – A powerful
model for forecasting time series data.
✔ Seasonal Decomposition (STL Decomposition) – Splits time series
into trend, seasonality, and residual components.
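A minimal sketch of seasonal decomposition with statsmodels on a synthetic monthly sales series (the numbers are generated, not real data):
```python
# Time-series decomposition sketch: split a synthetic monthly series into trend,
# seasonality, and residual (assumes pandas, numpy, and statsmodels are installed).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")        # 4 years of monthly data
trend = np.linspace(100, 160, 48)                                # long-term increase
seasonal = 10 * np.sin(2 * np.pi * idx.month.to_numpy() / 12)    # yearly pattern
noise = np.random.default_rng(1).normal(scale=2, size=48)        # irregular variation
sales = pd.Series(trend + seasonal + noise, index=idx)

result = seasonal_decompose(sales, model="additive", period=12)
print(result.trend.dropna().head())     # estimated trend component
print(result.seasonal.head(12))         # estimated repeating seasonal pattern
```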
Real-World Applications
Stock Market Prediction – Identifying price trends and
fluctuations.
Weather Forecasting – Predicting temperature, rainfall, and
climate changes.
Sales Forecasting – Analyzing customer demand trends.
Healthcare Analytics – Predicting disease outbreaks based on
historical patterns.
Conclusion
Time Series Analysis is crucial for making data-driven decisions in
finance, healthcare, business, and more. By understanding
historical patterns, we can predict future trends and optimize
strategies.