Data Science and Data Pre-Processing
Presented by Anshul Sharma, Anmol Sharma, and Aditya Sikarwar
CONTENTS
Introduction to Data Science
Stages in a Data Science Project
Applications of Data Science
Data Pre-Processing Techniques
Challenges in Data Pre-Processing
Conclusion
Data Science is an interdisciplinary field that uses
scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured and
unstructured data.
1 Decision making: Enables data-driven decision making
2 AI applications: Powers artificial intelligence and machine learning applications
3 Pattern discovery: Helps uncover hidden patterns and trends
4 Efficiency: Improves operational efficiency across industries
EVOLUTION OF DATA SCIENCE
1 1960s: Early statistical analysis
2 1980s-1990s: Data mining concepts
3 2000s: Emergence of big data technologies
4 Current era: AI-driven analytics
ROLES IN DATA SCIENCE
Data Scientist
Data Analyst
Data Engineer
Machine Learning (ML) Engineer
Business Analyst
Data collection: Gathering raw data from various sources
Sources: APIs, databases, web scraping, sensors
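As a rough illustration of the collection step, the sketch below gathers records from two kinds of sources using only Python's standard library. The CSV text, the `readings` table, and all field names are made-up stand-ins for a real file, API response, or database.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export, standing in for a file or an API response.
CSV_TEXT = "sensor_id,reading\ns1,20.5\ns2,21.0\n"


def collect_from_csv(text):
    """Parse CSV text into a list of dict records (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def collect_from_database(conn):
    """Read rows from the hypothetical 'readings' table as dict records."""
    cursor = conn.execute("SELECT sensor_id, reading FROM readings")
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory database so the sketch runs without any external setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, reading REAL)")
conn.execute("INSERT INTO readings VALUES ('s3', 19.8)")

raw_records = collect_from_csv(CSV_TEXT) + collect_from_database(conn)
conn.close()
```

Note that the CSV readings arrive as strings while the database readings arrive as floats. Mismatches like this are what the pre-processing stage has to resolve.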
Please Slide
PRE-PROCESSING
1 Data preparation: Cleaning and preparing data for analysis
2 Time consumption: Usually the most time-consuming phase (often estimated at ≈60-80% of project time)
MODELING
1 Statistical techniques: Applying statistical and machine learning techniques
2 Algorithm training: Algorithm selection and training
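A minimal sketch of training and selection, assuming a toy dataset invented for illustration: a least-squares line is fitted on a training split and compared against a mean-value baseline on held-out data. Real projects would normally use a dedicated library and a separate validation set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

# Hold out the last two points for evaluation.
train_x, test_x = xs[:4], xs[4:]
train_y, test_y = ys[:4], ys[4:]

# Training: fit the candidate model on the training split only.
slope, intercept = fit_line(train_x, train_y)
linear_error = mean_squared_error([slope * x + intercept for x in test_x], test_y)

# Baseline: always predict the mean of the training targets.
baseline = sum(train_y) / len(train_y)
baseline_error = mean_squared_error([baseline] * len(test_y), test_y)

# Selection: keep whichever candidate has the lower held-out error.
selected = "linear" if linear_error < baseline_error else "baseline"
```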
DEPLOYMENT
1 Model implementation: Deploying models in production environments
2 Dashboard creation: Building dashboards or APIs for end-users
APPLICATIONS OF DATA SCIENCE
1 Healthcare
•Disease prediction models
•Medical image analysis
•Drug discovery
•Personalized treatment plans
2 Finance
•Fraud detection
•Algorithmic trading
•Credit scoring
•Risk management
3 Marketing
•Customer segmentation
•Churn prediction
•Sentiment analysis
•Recommendation systems
4 Cybersecurity
•Anomaly detection
•Threat intelligence
•Network security monitoring
•Malware analysis
Cleaning
1 Handling missing values
2 Removing duplicates
3 Correcting inconsistencies
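The three cleaning tasks can be sketched in plain Python as follows. The records and field names are invented, and mean imputation is only one of several possible strategies for missing values. The steps run in a different order from the slide (inconsistencies first) so that duplicates differing only in formatting are detected and do not skew the imputed mean.

```python
from statistics import mean

# Made-up records; None marks a missing value.
records = [
    {"name": "Alice", "city": "new york ", "age": 30},
    {"name": "Bob", "city": "Boston", "age": None},
    {"name": "Alice", "city": "New York", "age": 30},
    {"name": "Cara", "city": "BOSTON", "age": 40},
]


def clean(rows):
    # Correct inconsistencies: unify whitespace and capitalization of city names.
    normalized = [{**r, "city": r["city"].strip().title()} for r in rows]

    # Remove duplicates: keep the first occurrence of each (name, city, age).
    seen, unique = set(), []
    for r in normalized:
        key = (r["name"], r["city"], r["age"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Handle missing values: impute missing ages with the mean of the known ages.
    fill = mean(r["age"] for r in unique if r["age"] is not None)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in unique]


cleaned = clean(records)
```

Here the two "Alice" rows collapse into one after normalization, and Bob's missing age is filled with the mean of the remaining known ages.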
Integration
1 Data Combination: Combining data from multiple sources
2 Schema Resolution: Resolving schema conflicts
3 Entity Matching: Entity resolution
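A small sketch of integration, assuming two hypothetical sources (`crm_rows` and `billing_rows`) that describe the same customers under different column names. The matching key used here is deliberately crude; real entity resolution usually needs fuzzier comparison.

```python
# Two hypothetical sources describing the same customers with different schemas.
crm_rows = [
    {"customer_name": "Acme Corp.", "phone": "555-0100"},
    {"customer_name": "Globex", "phone": "555-0199"},
]
billing_rows = [
    {"client": "ACME CORP", "balance": 120.0},
    {"client": "Initech", "balance": 75.5},
]

# Schema resolution: map each source's column names onto one shared schema.
CRM_MAP = {"customer_name": "name", "phone": "phone"}
BILLING_MAP = {"client": "name", "balance": "balance"}


def to_shared_schema(row, mapping):
    return {mapping[col]: value for col, value in row.items()}


def entity_key(name):
    """Crude matching key: lowercase, keeping only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def integrate(*sources):
    """Combine (rows, mapping) pairs into one record per matched entity."""
    merged = {}
    for rows, mapping in sources:
        for row in rows:
            record = to_shared_schema(row, mapping)
            key = entity_key(record["name"])
            # Earlier sources win on conflicting fields; later ones fill gaps.
            merged[key] = {**record, **merged.get(key, {})}
    return merged


customers = integrate((crm_rows, CRM_MAP), (billing_rows, BILLING_MAP))
```

"Acme Corp." and "ACME CORP" resolve to the same key, so their phone and balance fields end up in a single record.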
Transformation
1 Normalization
2 Aggregation
3 Feature engineering
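Each transformation can be illustrated on a tiny made-up sales table. Min-max scaling is used here as one common form of normalization, and `price_per_unit` is a hypothetical derived feature.

```python
from collections import defaultdict

# Made-up sales records.
sales = [
    {"region": "north", "units": 10, "revenue": 200.0},
    {"region": "north", "units": 30, "revenue": 450.0},
    {"region": "south", "units": 20, "revenue": 500.0},
]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_sum(rows, group_key, value_key):
    """Sum value_key within each group defined by group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


# Normalization: put unit counts on a common 0-1 scale.
units_scaled = min_max_normalize([r["units"] for r in sales])

# Aggregation: total revenue per region.
revenue_by_region = aggregate_sum(sales, "region", "revenue")

# Feature engineering: derive a new attribute from existing ones.
for r in sales:
    r["price_per_unit"] = r["revenue"] / r["units"]
```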
Reduction
Reduction: Dimensionality reduction
Sampling: Sampling techniques
Selection: Feature selection
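Two of these reductions are easy to sketch with the standard library: simple random sampling (fewer rows) and variance-threshold feature selection (fewer columns). The dataset and feature names below are invented. Dimensionality reduction methods such as PCA are not shown, since they typically rely on a linear algebra library.

```python
import random
from statistics import pvariance

# Made-up dataset: each row is a list of feature values.
FEATURES = ["height", "constant_flag", "weight"]
rows = [
    [1.70, 1, 65.0],
    [1.80, 1, 80.0],
    [1.60, 1, 55.0],
    [1.75, 1, 72.0],
]


def sample_rows(data, k, seed=0):
    """Simple random sampling without replacement, seeded for repeatability."""
    return random.Random(seed).sample(data, k)


def select_by_variance(data, names, threshold=0.0):
    """Keep only the features whose population variance exceeds the threshold."""
    columns = list(zip(*data))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in data]
    return reduced, [names[i] for i in keep]


subset = sample_rows(rows, k=2)
reduced_rows, kept_features = select_by_variance(rows, FEATURES)
```

The `constant_flag` column has zero variance, so it carries no information and is dropped.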
Discretization
1 Data processing: Converting continuous attributes to discrete intervals
2 Algorithm boost: Improving algorithm efficiency
3 Model clarity: Enhancing interpretability
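Converting a continuous attribute to discrete intervals can be sketched with equal-width binning, one of several possible binning schemes. The ages and the three group labels are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # min() keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 37, 52, 64, 78]
labels = ["young", "middle", "senior"]  # hypothetical interval names

bins = equal_width_bins(ages, n_bins=3)
age_groups = [labels[b] for b in bins]
```

With a range of 18 to 78 and three bins, each interval is 20 years wide, so the ages fall into bins 0, 0, 0, 1, 2, 2.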
Future Trends
AutoML: Automated machine learning (AutoML)
AI cleaning: AI-powered data cleaning
Real-time: Real-time data processing
Data quality: Increased focus on data quality
Importance of Clean Data
1 Accurate analytics: Foundation for accurate analytics
2 Model performance: Critical for model performance
3 Reliable insights: Ensures reliable business insights
4 Error reduction: Reduces downstream errors and costs
Over