
Basics of Data Science

Dr. Md. Asraful Haque


Syllabus
1. Introduction: Provides an overview of data science
2. Data Handling & Preprocessing: Covers strategies for data acquisition and essential preprocessing techniques
3. Statistics & Exploratory Data Analysis: Establishes a statistical foundation for data analysis
4. Machine Learning Techniques: Introduces core machine learning concepts and applications
Unit-2:
Data Handling & Preprocessing
Sources of Data

 The common sources of data: Databases, APIs, Web Scraping, Sensors, Files, Surveys, etc.
 Databases and APIs for structured, dynamic access;
 Web scraping and files for unstructured or third-party data;
 Sensors for real-time physical data;
 Surveys for customized, human-centric insights.
Databases

 Databases are organized collections of data that are stored and accessed electronically.
 They are widely used for structured data storage in enterprises, web
applications, and information systems.
 Types:
1. Relational Databases (RDBMS): Use tables (rows and
columns) to store data (e.g., MySQL, PostgreSQL, Oracle, SQL
Server).
2. NoSQL Databases: Handle unstructured or semi-structured
data; more flexible schema (e.g., MongoDB, Cassandra,
CouchDB).
Databases

 Advantages:
1. Efficient data storage and retrieval
2. Support for complex queries and indexing
3. Transaction support and concurrency control
 Use Cases:
1. Customer relationship management (CRM)
2. E-commerce platforms
3. Financial and healthcare systems
 Data Retrieval: Data is typically accessed using SQL (Structured
Query Language) or database drivers/interfaces like JDBC, ODBC.
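A minimal Python sketch of SQL-based retrieval using the built-in sqlite3 driver; the database file, table, and columns below are illustrative assumptions, and the same pattern applies through JDBC/ODBC drivers for enterprise RDBMS.

```python
import sqlite3

# Illustrative database file and schema (assumptions, not from the slides).
conn = sqlite3.connect("crm.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS customers "
            "(id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Delhi"))
conn.commit()

# Structured retrieval with a SQL query and a parameterised filter.
cur.execute("SELECT id, name, city FROM customers WHERE city = ?", ("Delhi",))
for row in cur.fetchall():
    print(row)
conn.close()
```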
APIs (Application Programming Interfaces)

 APIs are defined protocols and tools that allow software applications
to communicate and share data.
 APIs are one of the most efficient and scalable ways to gather real-
time or batch data from external systems.
 Examples: Twitter API (for tweets, user profiles), Google Maps API (for
geolocation and maps).
 Advantages:
1. Access to real-time data
2. Efficient data integration
3. No need for direct database access
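A hedged sketch of collecting data from a REST API with the requests library; the endpoint and parameters are hypothetical, and real services such as the Twitter or Google Maps APIs additionally require an API key and enforce rate limits.

```python
import requests

# Hypothetical endpoint and parameters; real APIs need authentication.
url = "https://api.example.com/v1/geocode"
params = {"address": "Aligarh Muslim University", "format": "json"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()       # raise an error for non-2xx status codes
data = response.json()            # most modern APIs return JSON
print(data)
```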
Web Scraping

 Web scraping is the process of automatically extracting data from websites using software, often referred to as bots or web crawlers.
 It involves collecting information from the website's underlying code
(HTML) and sometimes from databases, rather than just copying
what's displayed on the screen. This extracted data can then be
organized and stored in a more usable format for various applications.
 Use Cases:
1. Price comparison engines
2. News aggregation
3. Collecting public reviews or comments
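A minimal scraping sketch with requests and BeautifulSoup; the URL and CSS selectors are assumptions about a hypothetical product page, and any real scraper should respect the site's robots.txt and terms of service.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product-listing page and CSS classes (assumptions).
html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract fields from the underlying HTML rather than the rendered view.
for item in soup.select("div.product"):
    name = item.select_one("h2.title").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    print(name, price)
```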
Sensors

 Sensors are physical devices that collect data from the environment or
machinery and convert it into a digital signal for further processing.
 Examples of Sensors: Temperature sensors, Accelerometers, GPS
modules, IoT (Internet of Things) devices.
 Data Characteristics: High volume (often real-time streaming), Time-
stamped, Often requires edge computing or cloud platforms.
 Data Collection Platforms: Arduino, Raspberry Pi, Cloud IoT
platforms (AWS IoT, Google Cloud IoT).
 Use Cases:
1. Smart cities and traffic monitoring
2. Environmental data collection (weather, pollution)
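A small sketch of what time-stamped sensor data looks like, using a simulated temperature reading; on real hardware (Arduino, Raspberry Pi, IoT platforms) the value would come from a device driver or a serial/I2C interface rather than a random number.

```python
import random
import time
from datetime import datetime, timezone

def read_temperature():
    # Simulated sensor value; replace with a real driver call on hardware.
    return round(random.uniform(20.0, 30.0), 2)

stream = []
for _ in range(5):
    stream.append({"timestamp": datetime.now(timezone.utc).isoformat(),
                   "temp_c": read_temperature()})
    time.sleep(0.5)            # real deployments stream continuously

print(stream)                  # in practice: forward to an IoT/cloud platform
```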
Files

 Files are among the most basic and widespread sources of data. They
can be structured (Ex. CSV, XLSX), semi-structured (Ex. JSON, XML),
or unstructured (Ex. Text files, PDFs, images).
 Files are easy to store and share and can be generated by many
applications (e.g., exports from software tools).
 Use Cases:
1. Financial reports in spreadsheets
2. User logs in text files
3. Social media metadata in JSON format
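A brief sketch of loading the three kinds of files with pandas and plain Python; the file names are placeholders.

```python
import pandas as pd

# Placeholder file names for the three categories of files.
report = pd.read_csv("financial_report.csv")       # structured (CSV)
metadata = pd.read_json("social_metadata.json")    # semi-structured (JSON)

with open("user_log.txt", encoding="utf-8") as f:  # unstructured text
    log_lines = f.readlines()

print(report.head())
print(metadata.head())
print(len(log_lines), "log lines")
```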
Surveys

 Surveys are structured questionnaires designed to collect specific information from a target population. They are a primary method of collecting first-hand, customized data.
 Types:
1. Online Surveys: Google Forms, SurveyMonkey, Typeform
2. Face-to-Face or Telephonic Surveys
3. Paper-based Surveys
 Data Characteristics:
1. Usually structured or categorical
2. Designed to meet specific research or business goals
3. May include quantitative (ex. rating scales) or qualitative responses
Surveys

 Challenges:
1. Bias in question design or respondent selection
2. Low response rates
3. Data cleaning required before analysis
 Use Cases:
1. Customer satisfaction measurement
2. Public health data collection
3. Market research
Data Acquisition Strategies

 Data acquisition is not just about collecting data: it is about collecting the right data in the right way. Successful data science projects start with thoughtful acquisition strategies that ensure data quality, reliability, and relevance.
 Best Practices in Data Acquisition
1. Understand the data needs before starting collection.
2. Respect privacy and legal constraints (e.g., GDPR, HIPAA).
3. Clean and preprocess data soon after acquisition.
4. Automate repetitive collection tasks when possible.
5. Ensure versioning and backup of collected data.
6. Monitor and audit data pipelines for failures or anomalies.
Data Preprocessing

 Data preprocessing in Data Science is the crucial step of transforming raw data into a clean, consistent, and usable format before applying machine learning, statistical modeling, or analysis techniques.
 Since real-world data is often incomplete, noisy, inconsistent, or
unstructured, preprocessing ensures higher model performance and
reliable insights.
 Data preprocessing is like polishing raw gemstones before turning them
into jewelry — it doesn’t change the essence of the data, but it makes it
shine and become usable for analysis.
Why Data Preprocessing?
 Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
 No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data.
Data Quality Measures
 A well-accepted multidimensional view:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how trustworthy the data are
• Interpretability: how easily the data can be understood?
 Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for
numerical data
Data Cleaning
 Goal: Remove errors, inconsistencies, noise.
 Common tasks include:
 Handling Missing Data
 Removing Noise (Smooth data using moving averages or binning.)
 Handling inconsistent formats (e.g., "Male/Female" vs. "M/F").
 Outlier detection & treatment (using statistical methods, clustering,
or domain rules).
 Importance of Data Cleaning
 Reduces bias and errors in analysis.
 Improves model accuracy and efficiency.
 Ensures consistency, reliability, and trustworthiness of insights.
 Saves cost and time in downstream processing.
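A minimal pandas sketch of two of the cleaning tasks listed above, harmonising an inconsistent gender format and flagging an outlier with the interquartile-range rule; the toy data are invented for illustration.

```python
import pandas as pd

# Toy data with inconsistent category labels and one extreme income value.
df = pd.DataFrame({"gender": ["Male", "F", "M", "Female"],
                   "income": [42000, 45000, 39000, 990000]})

# Handle inconsistent formats: map "Male"/"M" -> "M", "Female"/"F" -> "F".
df["gender"] = df["gender"].str[0].str.upper()

# Outlier detection with the IQR rule (a common statistical method).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(df[is_outlier])     # the 990,000 row is flagged for treatment
```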
Incomplete or Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 not register history or changes of the data
 Missing data may need to be inferred
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing
values per attribute varies considerably.
 Fill in the missing value manually: tedious and often infeasible.
 Fill in it automatically with:
 a global constant: e.g., "unknown" (effectively a new class)
 the attribute mean
 the attribute mean for all samples belonging to the same class:
smarter
 the most probable value: inference-based such as Bayesian formula
or decision tree
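A short pandas sketch of the automatic fill-in options above (global constant, attribute mean, class-wise mean); the small DataFrame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B"],
                   "income": [50000, None, 30000, None]})

dropped = df.dropna()                                   # ignore the tuple
constant = df.fillna({"income": -1})                    # global constant
mean_fill = df["income"].fillna(df["income"].mean())    # attribute mean

# Smarter: mean of samples belonging to the same class.
class_fill = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(df.assign(income=class_fill))
```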
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with
possible outliers)
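A sketch of equal-frequency binning followed by smoothing by bin means, using a toy sorted price list; pd.qcut performs the equal-frequency partition.

```python
import pandas as pd

# Toy sorted price list, partitioned into 3 equal-frequency bins.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.qcut(prices, q=3)

# Smooth by bin means: each value is replaced by the mean of its bin.
smoothed = prices.groupby(bins, observed=True).transform("mean")
print(pd.DataFrame({"original": prices, "smoothed": smoothed.round(1)}))
```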
Data Integration
 Data integration is the process of combining data from multiple sources
into a single, unified view so it can be used for analysis or modeling.
 Issues in Data Integration:
1. Schema Mismatch: One source calls a column "Customer_ID", another
calls it "CustID".
2. Data Type Differences: One database stores dates as YYYY-MM-DD,
another as DD/MM/YYYY.
3. Unit Inconsistency: One dataset measures weight in kilograms, another
in pounds.
4. Duplicate/Redundant Records: Same customer appears twice with
slightly different spellings.
5. Data Conflicts: Two sources give different phone numbers for the same
customer.
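A pandas sketch that resolves three of the issues above (schema mismatch, date-format differences, unit inconsistency) before merging two hypothetical sources.

```python
import pandas as pd

crm = pd.DataFrame({"Customer_ID": [1, 2], "signup": ["2024-01-05", "2024-02-11"],
                    "weight_kg": [70.0, 82.5]})
web = pd.DataFrame({"CustID": [1, 2], "signup": ["05/01/2024", "11/02/2024"],
                    "weight_lb": [154.3, 181.9]})

web = web.rename(columns={"CustID": "Customer_ID"})               # schema mismatch
crm["signup"] = pd.to_datetime(crm["signup"], format="%Y-%m-%d")  # date formats
web["signup"] = pd.to_datetime(web["signup"], format="%d/%m/%Y")
web["weight_kg"] = (web["weight_lb"] / 2.2046).round(1)           # unit conversion

merged = crm.merge(web[["Customer_ID", "signup", "weight_kg"]],
                   on="Customer_ID", suffixes=("_crm", "_web"))
print(merged)
```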
Handling Redundant Data in Data Integration
 Redundant data occur often when integration of multiple databases
 Object identification: The same attribute or object may have
different names in different databases
 Derivable data: One attribute may be a "derived" attribute in another
table, e.g., annual revenue.
 Redundant attributes may be able to be detected by correlation analysis
and covariance analysis.
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:

    χ² = Σ (Observed − Expected)² / Expected

 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction     50 (210)    1000 (840)        1050
Sum (col.)                  300          1200              1500

• Χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• It shows that like_science_fiction and play_chess are correlated in the group.
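The same result can be reproduced with scipy.stats.chi2_contingency (continuity correction disabled so the statistic matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = like / not like science fiction,
#                  cols = play chess / not play chess.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)          # [[ 90. 360.], [210. 840.]] -- the parenthesised counts
print(chi2, p_value)     # chi2 ~ 507.9, tiny p-value => strongly correlated
```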
Data Transformation
 Goal: Convert data into a form that models can understand.
 Scaling / Normalization
Normalization (Min-Max Scaling): Scales data to a range [0, 1].
Standardization (Z-score Scaling): Scales data to have mean 0 and
standard deviation 1.
 Encoding Categorical Data
Label Encoding: Assign a unique integer to each category.
Ordinal Encoding: Encode categories with meaningful order.
 Feature Engineering
Create new features from existing data (e.g., extracting "age" from date of birth)
 Dimensionality Reduction
Reduce the number of variables using PCA (Principal Component
Analysis) or feature selection.
 Data Generalization/Specialization, etc.
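A small pandas sketch of the encoding and feature-engineering steps listed above; the column values and the reference year used to derive age are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"dob": ["1995-04-20", "2001-11-03"],
                   "size": ["small", "large"],
                   "city": ["Delhi", "Mumbai"]})

# Feature engineering: derive "age" from date of birth (reference year assumed).
df["age"] = 2025 - pd.to_datetime(df["dob"]).dt.year

# Ordinal encoding for a category with a meaningful order.
df["size_code"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# One-hot encoding for a nominal category with no inherent order.
df = pd.get_dummies(df, columns=["city"])
print(df)
```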
Normalization
• Min-max normalization: to [new_minA, new_maxA]

    v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to:

    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μA: mean, σA: standard deviation):

    v' = (v − μA) / σA

  – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
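The two worked examples can be checked with plain Python arithmetic:

```python
# Min-max normalization of v = 73,600 from [12,000, 98,000] to [0.0, 1.0].
v, min_a, max_a = 73_600, 12_000, 98_000
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))          # 0.716

# Z-score normalization with mu = 54,000 and sigma = 16,000.
mu, sigma = 54_000, 16_000
print(round((v - mu) / sigma, 3))  # 1.225
```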
Data Reduction
 Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the same)
analytical results
 Why data reduction?
— A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data
set. The goal is to make datasets smaller, simpler, and faster to
process without losing critical patterns or relationships.
 Data reduction is the process of minimizing the volume of data while
preserving as much meaningful information as possible.
Data Reduction Strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
 Lossless compression (no data lost): ZIP, run-length encoding.
 Lossy compression (some data lost but acceptable): JPEG for
images, MP3 for audio.
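A brief sketch of two of the strategies above: dimensionality reduction with PCA (scikit-learn) and numerosity reduction by random sampling; the synthetic data are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 samples, 10 attributes

# Dimensionality reduction: keep components explaining 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Numerosity reduction: simple random sampling of 10% of the rows.
sample = X[rng.choice(len(X), size=20, replace=False)]
print(sample.shape)
```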
Discretization
 Data discretization is a data preprocessing technique used in data
science to transform continuous numerical data into a finite set of
discrete intervals or "bins."
 Bins map nicely to human concepts (e.g., "low/medium/high").
 Instead of working with an infinite number of possible values, you work
with a limited number of defined categories. This process simplifies the
data, making it easier to analyze and interpret.
 Discretization can also reduce the effect of noise and outliers.
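A short pandas sketch of discretization into equal-width and equal-frequency bins; the age values and labels are invented for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 46, 52, 67, 78])

# Equal-width bins mapped to human-friendly concepts.
equal_width = pd.cut(ages, bins=3, labels=["low", "medium", "high"])

# Equal-frequency (quantile) bins: roughly the same number of values per bin.
equal_freq = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```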
Thank you
