
Basics of Data Science

Dr. Md. Asraful Haque


Syllabus
1. Introduction: Provides an overview of data science
2. Data Handling & Preprocessing: Covers strategies for data acquisition and essential preprocessing techniques
3. Statistics & Exploratory Data Analysis: Establishes a statistical foundation for data analysis
4. Machine Learning Techniques: Introduces core machine learning concepts and applications
Unit-2:
Data Handling & Preprocessing
Sources of Data

 The common sources of data: Databases, APIs, Web Scraping, Sensors, Files, Surveys, etc.
 Databases and APIs for structured, dynamic access;
 Web scraping and files for unstructured or third-party data;
 Sensors for real-time physical data;
 Surveys for customized, human-centric insights.
Databases

 Databases are organized collections of data that are stored and accessed electronically.
 They are widely used for structured data storage in enterprises, web
applications, and information systems.
 Types:
1. Relational Databases (RDBMS): Use tables (rows and
columns) to store data (e.g., MySQL, PostgreSQL, Oracle, SQL
Server).
2. NoSQL Databases: Handle unstructured or semi-structured
data; more flexible schema (e.g., MongoDB, Cassandra,
CouchDB).
Databases

 Advantages:
1. Efficient data storage and retrieval
2. Support for complex queries and indexing
3. Transaction support and concurrency control
 Use Cases:
1. Customer relationship management (CRM)
2. E-commerce platforms
3. Financial and healthcare systems
 Data Retrieval: Data is typically accessed using SQL (Structured
Query Language) or database drivers/interfaces like JDBC, ODBC.
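A minimal Python sketch of SQL-based retrieval using the built-in sqlite3 driver; the database file, table, and columns below are illustrative assumptions, and the same pattern applies through JDBC/ODBC drivers for enterprise RDBMS.

```python
import sqlite3

# Illustrative database file and schema (assumptions, not from the slides).
conn = sqlite3.connect("crm.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS customers "
            "(id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Delhi"))
conn.commit()

# Structured retrieval with a SQL query and a parameterised filter.
cur.execute("SELECT id, name, city FROM customers WHERE city = ?", ("Delhi",))
for row in cur.fetchall():
    print(row)
conn.close()
```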
APIs (Application Programming Interfaces)

 APIs are defined protocols and tools that allow software applications
to communicate and share data.
 APIs are one of the most efficient and scalable ways to gather real-
time or batch data from external systems.
 Examples: Twitter API (for tweets, user profiles), Google Maps API (for
geolocation and maps).
 Advantages:
1. Access to real-time data
2. Efficient data integration
3. No need for direct database access
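A hedged sketch of collecting data from a REST API with the requests library; the endpoint and parameters are hypothetical, and real services such as the Twitter or Google Maps APIs additionally require an API key and enforce rate limits.

```python
import requests

# Hypothetical endpoint and parameters; real APIs need authentication.
url = "https://api.example.com/v1/geocode"
params = {"address": "Aligarh Muslim University", "format": "json"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()       # raise an error for non-2xx status codes
data = response.json()            # most modern APIs return JSON
print(data)
```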
Web Scraping

 Web scraping is the process of automatically extracting data from websites using software, often referred to as bots or web crawlers.
 It involves collecting information from the website's underlying code
(HTML) and sometimes from databases, rather than just copying
what's displayed on the screen. This extracted data can then be
organized and stored in a more usable format for various applications.
 Use Cases:
1. Price comparison engines
2. News aggregation
3. Collecting public reviews or comments
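A minimal scraping sketch with requests and BeautifulSoup; the URL and CSS selectors are assumptions about a hypothetical product page, and any real scraper should respect the site's robots.txt and terms of service.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical product-listing page and CSS classes (assumptions).
html = requests.get("https://example.com/products", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Extract fields from the underlying HTML rather than the rendered view.
for item in soup.select("div.product"):
    name = item.select_one("h2.title").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    print(name, price)
```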
Sensors

 Sensors are physical devices that collect data from the environment or
machinery and convert it into a digital signal for further processing.
 Examples of Sensors: Temperature sensors, Accelerometers, GPS
modules, IoT (Internet of Things) devices.
 Data Characteristics: High volume (often real-time streaming), Time-
stamped, Often requires edge computing or cloud platforms.
 Data Collection Platforms: Arduino, Raspberry Pi, Cloud IoT
platforms (AWS IoT, Google Cloud IoT).
 Use Cases:
1. Smart cities and traffic monitoring
2. Environmental data collection (weather, pollution)
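A small sketch of what time-stamped sensor data looks like, using a simulated temperature reading; on real hardware (Arduino, Raspberry Pi, IoT platforms) the value would come from a device driver or a serial/I2C interface rather than a random number.

```python
import random
import time
from datetime import datetime, timezone

def read_temperature():
    # Simulated sensor value; replace with a real driver call on hardware.
    return round(random.uniform(20.0, 30.0), 2)

stream = []
for _ in range(5):
    stream.append({"timestamp": datetime.now(timezone.utc).isoformat(),
                   "temp_c": read_temperature()})
    time.sleep(0.5)            # real deployments stream continuously

print(stream)                  # in practice: forward to an IoT/cloud platform
```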
Files

 Files are among the most basic and widespread sources of data. They
can be structured (Ex. CSV, XLSX), semi-structured (Ex. JSON, XML),
or unstructured (Ex. Text files, PDFs, images).
 Files are easy to store and share and can be generated by many
applications (e.g., exports from software tools).
 Use Cases:
1. Financial reports in spreadsheets
2. User logs in text files
3. Social media metadata in JSON format
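A brief sketch of loading the three kinds of files with pandas and plain Python; the file names are placeholders.

```python
import pandas as pd

# Placeholder file names for the three categories of files.
report = pd.read_csv("financial_report.csv")       # structured (CSV)
metadata = pd.read_json("social_metadata.json")    # semi-structured (JSON)

with open("user_log.txt", encoding="utf-8") as f:  # unstructured text
    log_lines = f.readlines()

print(report.head())
print(metadata.head())
print(len(log_lines), "log lines")
```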
Surveys

 Surveys are structured questionnaires designed to collect specific information from a target population. They are a primary method of collecting first-hand, customized data.
 Types:
1. Online Surveys: Google Forms, SurveyMonkey, Typeform
2. Face-to-Face or Telephonic Surveys
3. Paper-based Surveys
 Data Characteristics:
1. Usually structured or categorical
2. Designed to meet specific research or business goals
3. May include quantitative (ex. rating scales) or qualitative responses
Surveys

 Challenges:
1. Bias in question design or respondent selection
2. Low response rates
3. Data cleaning required before analysis
 Use Cases:
1. Customer satisfaction measurement
2. Public health data collection
3. Market research
Data Acquisition Strategies

 Data acquisition is not just about collecting data: it is about collecting the right data in the right way. Successful data science projects start with thoughtful acquisition strategies that ensure data quality, reliability, and relevance.
 Best Practices in Data Acquisition
1. Understand the data needs before starting collection.
2. Respect privacy and legal constraints (e.g., GDPR, HIPAA).
3. Clean and preprocess data soon after acquisition.
4. Automate repetitive collection tasks when possible.
5. Ensure versioning and backup of collected data.
6. Monitor and audit data pipelines for failures or anomalies.
Data Preprocessing

 Data preprocessing in Data Science is the crucial step of transforming raw data into a clean, consistent, and usable format before applying machine learning, statistical modeling, or analysis techniques.
 Since real-world data is often incomplete, noisy, inconsistent, or
unstructured, preprocessing ensures higher model performance and
reliable insights.
 Data preprocessing is like polishing raw gemstones before turning them
into jewelry — it doesn’t change the essence of the data, but it makes it
shine and become usable for analysis.
Why Data Preprocessing?
 Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
 No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data.
Data Quality Measures
 A well-accepted multidimensional view:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how trustworthy the data are
• Interpretability: how easily the data can be understood?
 Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same or
similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially for
numerical data
Data Cleaning
 Goal: Remove errors, inconsistencies, noise.
 Common tasks include:
 Handling Missing Data
 Removing Noise (Smooth data using moving averages or binning.)
 Handling inconsistent formats (e.g., "Male/Female" vs. "M/F").
 Outlier detection & treatment (using statistical methods, clustering,
or domain rules).
 Importance of Data Cleaning
 Reduces bias and errors in analysis.
 Improves model accuracy and efficiency.
 Ensures consistency, reliability, and trustworthiness of insights.
 Saves cost and time in downstream processing.
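A minimal pandas sketch of two of the cleaning tasks listed above, harmonising an inconsistent gender format and flagging an outlier with the interquartile-range rule; the toy data are invented for illustration.

```python
import pandas as pd

# Toy data with inconsistent category labels and one extreme income value.
df = pd.DataFrame({"gender": ["Male", "F", "M", "Female"],
                   "income": [42000, 45000, 39000, 990000]})

# Handle inconsistent formats: map "Male"/"M" -> "M", "Female"/"F" -> "F".
df["gender"] = df["gender"].str[0].str.upper()

# Outlier detection with the IQR rule (a common statistical method).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
print(df[is_outlier])     # the 990,000 row is flagged for treatment
```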
Incomplete or Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of entry
 not register history or changes of the data
 Missing data may need to be inferred
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the
task is classification); not effective when the percentage of missing
values per attribute varies considerably.
 Fill in the missing value manually: tedious and often infeasible.
 Fill in it automatically with:
 a global constant: e.g., "unknown" (effectively a new class)
 the attribute mean
 the attribute mean for all samples belonging to the same class:
smarter
 the most probable value: inference-based such as Bayesian formula
or decision tree
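A short pandas sketch of the automatic fill-in options above (global constant, attribute mean, class-wise mean); the small DataFrame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B"],
                   "income": [50000, None, 30000, None]})

dropped = df.dropna()                                   # ignore the tuple
constant = df.fillna({"income": -1})                    # global constant
mean_fill = df["income"].fillna(df["income"].mean())    # attribute mean

# Smarter: mean of samples belonging to the same class.
class_fill = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(df.assign(income=class_fill))
```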
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitation
 inconsistency in naming convention
 Other data problems which require data cleaning
 duplicate records
 incomplete data
 inconsistent data
How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
 Regression
 smooth by fitting the data into regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with
possible outliers)
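A sketch of equal-frequency binning followed by smoothing by bin means, using a toy sorted price list; pd.qcut performs the equal-frequency partition.

```python
import pandas as pd

# Toy sorted price list, partitioned into 3 equal-frequency bins.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.qcut(prices, q=3)

# Smooth by bin means: each value is replaced by the mean of its bin.
smoothed = prices.groupby(bins, observed=True).transform("mean")
print(pd.DataFrame({"original": prices, "smoothed": smoothed.round(1)}))
```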
Data Integration
 Data integration is the process of combining data from multiple sources
into a single, unified view so it can be used for analysis or modeling.
 Issues in Data Integration:
1. Schema Mismatch: One source calls a column "Customer_ID", another
calls it "CustID".
2. Data Type Differences: One database stores dates as YYYY-MM-DD,
another as DD/MM/YYYY.
3. Unit Inconsistency: One dataset measures weight in kilograms, another
in pounds.
4. Duplicate/Redundant Records: Same customer appears twice with
slightly different spellings.
5. Data Conflicts: Two sources give different phone numbers for the same
customer.
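A pandas sketch that resolves three of the issues above (schema mismatch, date-format differences, unit inconsistency) before merging two hypothetical sources.

```python
import pandas as pd

crm = pd.DataFrame({"Customer_ID": [1, 2], "signup": ["2024-01-05", "2024-02-11"],
                    "weight_kg": [70.0, 82.5]})
web = pd.DataFrame({"CustID": [1, 2], "signup": ["05/01/2024", "11/02/2024"],
                    "weight_lb": [154.3, 181.9]})

web = web.rename(columns={"CustID": "Customer_ID"})               # schema mismatch
crm["signup"] = pd.to_datetime(crm["signup"], format="%Y-%m-%d")  # date formats
web["signup"] = pd.to_datetime(web["signup"], format="%d/%m/%Y")
web["weight_kg"] = (web["weight_lb"] / 2.2046).round(1)           # unit conversion

merged = crm.merge(web[["Customer_ID", "signup", "weight_kg"]],
                   on="Customer_ID", suffixes=("_crm", "_web"))
print(merged)
```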
Handling Redundant Data in Data Integration
 Redundant data occur often when integration of multiple databases
 Object identification: The same attribute or object may have
different names in different databases
 Derivable data: One attribute may be a "derived" attribute in another
table, e.g., annual revenue.
 Redundant attributes may be able to be detected by correlation analysis
and covariance analysis.
 Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:

    χ² = Σ (Observed − Expected)² / Expected

 The larger the Χ2 value, the more likely the variables are related
 The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
 Correlation does not imply causality
 # of hospitals and # of car-theft in a city are correlated
 Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction     50 (210)    1000 (840)        1050
Sum (col.)                  300          1200              1500

• Χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories):

    χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• It shows that like_science_fiction and play_chess are correlated in the group.
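The same result can be reproduced with scipy.stats.chi2_contingency (continuity correction disabled so the statistic matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = like / not like science fiction,
#                  cols = play chess / not play chess.
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)          # [[ 90. 360.], [210. 840.]] -- the parenthesised counts
print(chi2, p_value)     # chi2 ~ 507.9, tiny p-value => strongly correlated
```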
Data Transformation
 Goal: Convert data into a form that models can understand.
 Scaling / Normalization
Normalization (Min-Max Scaling): Scales data to a range [0, 1].
Standardization (Z-score Scaling): Scales data to have mean 0 and
standard deviation 1.
 Encoding Categorical Data
Label Encoding: Assign a unique integer to each category.
Ordinal Encoding: Encode categories with meaningful order.
 Feature Engineering
Create new features from existing data (e.g., extracting "age" from date of birth)
 Dimensionality Reduction
Reduce the number of variables using PCA (Principal Component
Analysis) or feature selection.
 Data Generalization/Specialization, etc.
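A small pandas sketch of the encoding and feature-engineering steps listed above; the column values and the reference year used to derive age are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"dob": ["1995-04-20", "2001-11-03"],
                   "size": ["small", "large"],
                   "city": ["Delhi", "Mumbai"]})

# Feature engineering: derive "age" from date of birth (reference year assumed).
df["age"] = 2025 - pd.to_datetime(df["dob"]).dt.year

# Ordinal encoding for a category with a meaningful order.
df["size_code"] = df["size"].map({"small": 0, "medium": 1, "large": 2})

# One-hot encoding for a nominal category with no inherent order.
df = pd.get_dummies(df, columns=["city"])
print(df)
```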
Normalization
• Min-max normalization: to [new_minA, new_maxA]

    v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

  – Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to:

    ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μA: mean, σA: standard deviation):

    v' = (v − μA) / σA

  – Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
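The two worked examples can be checked with plain Python arithmetic:

```python
# Min-max normalization of v = 73,600 from [12,000, 98,000] to [0.0, 1.0].
v, min_a, max_a = 73_600, 12_000, 98_000
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))          # 0.716

# Z-score normalization with mu = 54,000 and sigma = 16,000.
mu, sigma = 54_000, 16_000
print(round((v - mu) / sigma, 3))  # 1.225
```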
Data Reduction
 Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the same)
analytical results
 Why data reduction?
— A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data
set. The goal is to make datasets smaller, simpler, and faster to
process without losing critical patterns or relationships.
 Data reduction is the process of minimizing the volume of data while
preserving as much meaningful information as possible.
Data Reduction Strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)
 Regression and Log-Linear Models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
 Lossless compression (no data lost): ZIP, run-length encoding.
 Lossy compression (some data lost but acceptable): JPEG for
images, MP3 for audio.
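A brief sketch of two of the strategies above: dimensionality reduction with PCA (scikit-learn) and numerosity reduction by random sampling; the synthetic data are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 samples, 10 attributes

# Dimensionality reduction: keep components explaining 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)

# Numerosity reduction: simple random sampling of 10% of the rows.
sample = X[rng.choice(len(X), size=20, replace=False)]
print(sample.shape)
```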
Discretization
 Data discretization is a data preprocessing technique used in data
science to transform continuous numerical data into a finite set of
discrete intervals or "bins."
 Bins map nicely to human concepts (e.g., "low/medium/high").
 Instead of working with an infinite number of possible values, you work
with a limited number of defined categories. This process simplifies the
data, making it easier to analyze and interpret.
 Discretization can also reduce the effect of noise and outliers.
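A short pandas sketch of discretization into equal-width and equal-frequency bins; the age values and labels are invented for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 31, 46, 52, 67, 78])

# Equal-width bins mapped to human-friendly concepts.
equal_width = pd.cut(ages, bins=3, labels=["low", "medium", "high"])

# Equal-frequency (quantile) bins: roughly the same number of values per bin.
equal_freq = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```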
Thank you
