Basics of Data Science
Dr. Md. Asraful Haque
Syllabus
1. Introduction: Provides an overview of data science
2. Data Handling & Preprocessing: Covers strategies for data acquisition and essential preprocessing techniques
3. Statistics & Exploratory Data Analysis: Establishes a statistical foundation for data analysis
4. Machine Learning Techniques: Introduces core machine learning concepts and applications
Unit-2:
Data Handling & Preprocessing
Sources of Data
Common sources of data include databases, APIs, web scraping,
sensors, files, and surveys.
Databases and APIs for structured, dynamic access;
Web scraping and files for unstructured or third-party data;
Sensors for real-time physical data;
Surveys for customized, human-centric insights.
Databases
Databases are organized collections of data that are stored and
accessed electronically.
They are widely used for structured data storage in enterprises, web
applications, and information systems.
Types:
1. Relational Databases (RDBMS): Use tables (rows and
columns) to store data (e.g., MySQL, PostgreSQL, Oracle, SQL
Server).
2. NoSQL Databases: Handle unstructured or semi-structured
data; more flexible schema (e.g., MongoDB, Cassandra,
CouchDB).
Databases
Advantages:
1. Efficient data storage and retrieval
2. Support for complex queries and indexing
3. Transaction support and concurrency control
Use Cases:
1. Customer relationship management (CRM)
2. E-commerce platforms
3. Financial and healthcare systems
Data Retrieval: Data is typically accessed using SQL (Structured
Query Language) or database drivers/interfaces like JDBC, ODBC.
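As a quick illustration of programmatic retrieval, here is a minimal sketch using Python's built-in sqlite3 driver; the database file name, the customers table, and its columns are assumptions made for this example.

```python
import sqlite3

# Connect to a local SQLite database file (illustrative file name).
conn = sqlite3.connect("sales.db")
cursor = conn.cursor()

# A typical parameterized SQL query; the 'customers' table and columns are assumed.
cursor.execute(
    "SELECT customer_id, name, income FROM customers WHERE income > ?",
    (50000,),
)
for row in cursor.fetchall():
    print(row)

conn.close()
```

The same pattern applies to other RDBMSs through drivers/interfaces such as JDBC, ODBC, or the corresponding Python libraries for MySQL and PostgreSQL.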
APIs (Application Programming Interfaces)
APIs are defined protocols and tools that allow software applications
to communicate and share data.
APIs are one of the most efficient and scalable ways to gather real-
time or batch data from external systems.
Examples: Twitter API (for tweets, user profiles), Google Maps API (for
geolocation and maps).
Advantages:
1. Access to real-time data
2. Efficient data integration
3. No need for direct database access
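A minimal sketch of pulling JSON data over HTTP with the requests library; the endpoint URL and query parameters are hypothetical placeholders, and real APIs usually also require an API key or token.

```python
import requests

# Hypothetical REST endpoint returning JSON; replace with a real API and credentials.
url = "https://api.example.com/v1/records"
params = {"limit": 100, "format": "json"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()      # fail early on HTTP errors (4xx/5xx)

data = response.json()           # parse the JSON payload into Python objects
print(len(data), "records retrieved")
```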
Web Scraping
Web scraping is the process of automatically extracting data from
websites using software, often referred to as bots or web crawlers.
It involves collecting information from the website's underlying code
(HTML) and sometimes from databases, rather than just copying
what's displayed on the screen. This extracted data can then be
organized and stored in a more usable format for various applications.
Use Cases:
1. Price comparison engines
2. News aggregation
3. Collecting public reviews or comments
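A minimal scraping sketch with requests and BeautifulSoup; the target URL and the CSS selector for review text are assumptions, and real sites should only be scraped within their terms of service and robots.txt rules.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page listing product reviews.
html = requests.get("https://example.com/reviews", timeout=10).text

# Parse the underlying HTML and extract text from elements assumed to hold reviews.
soup = BeautifulSoup(html, "html.parser")
reviews = [tag.get_text(strip=True) for tag in soup.select("div.review-text")]

print(f"Collected {len(reviews)} reviews")
```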
Sensors
Sensors are physical devices that collect data from the environment or
machinery and convert it into a digital signal for further processing.
Examples of Sensors: Temperature sensors, Accelerometers, GPS
modules, IoT (Internet of Things) devices.
Data Characteristics: High volume (often real-time streaming), Time-
stamped, Often requires edge computing or cloud platforms.
Data Collection Platforms: Arduino, Raspberry Pi, Cloud IoT
platforms (AWS IoT, Google Cloud IoT).
Use Cases:
1. Smart cities and traffic monitoring
2. Environmental data collection (weather, pollution)
Files
Files are among the most basic and widespread sources of data. They
can be structured (e.g., CSV, XLSX), semi-structured (e.g., JSON, XML),
or unstructured (e.g., text files, PDFs, images).
Files are easy to store and share and can be generated by many
applications (e.g., exports from software tools).
Use Cases:
1. Financial reports in spreadsheets
2. User logs in text files
3. Social media metadata in JSON format
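A minimal sketch of loading structured and semi-structured files with pandas; the file names are placeholders.

```python
import pandas as pd

# Structured: comma-separated values into a DataFrame (placeholder file name).
sales = pd.read_csv("financial_report.csv")

# Structured: an Excel workbook (needs an engine such as openpyxl installed).
budget = pd.read_excel("budget.xlsx")

# Semi-structured: JSON records, e.g., exported social media metadata.
posts = pd.read_json("posts.json")

print(sales.shape, budget.shape, posts.shape)
```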
Surveys
Surveys are structured questionnaires designed to collect specific
information from a target population. They are a primary method of
collecting first-hand, customized data.
Types:
1. Online Surveys: Google Forms, SurveyMonkey, Typeform
2. Face-to-Face or Telephonic Surveys
3. Paper-based Surveys
Data Characteristics:
1. Usually structured or categorical
2. Designed to meet specific research or business goals
3. May include quantitative (e.g., rating scales) or qualitative responses
Surveys
Challenges:
1. Bias in question design or respondent selection
2. Low response rates
3. Data cleaning required before analysis
Use Cases:
1. Customer satisfaction measurement
2. Public health data collection
3. Market research
Data Acquisition Strategies
Data acquisition is not just about collecting data—it's about collecting
the right data in the right way. Successful data science projects start
with thoughtful acquisition strategies that ensure data quality,
reliability, and relevance.
Best Practices in Data Acquisition
1. Understand the data needs before starting collection.
2. Respect privacy and legal constraints (e.g., GDPR, HIPAA).
3. Clean and preprocess data soon after acquisition.
4. Automate repetitive collection tasks when possible.
5. Ensure versioning and backup of collected data.
6. Monitor and audit data pipelines for failures or anomalies.
Data Preprocessing
Data preprocessing in Data Science is the crucial step of transforming
raw data into a clean, consistent, and usable format before applying
machine learning, statistical modeling, or analysis techniques.
Since real-world data is often incomplete, noisy, inconsistent, or
unstructured, preprocessing ensures higher model performance and
reliable insights.
Data preprocessing is like polishing raw gemstones before turning them
into jewelry — it doesn’t change the essence of the data, but it makes it
shine and become usable for analysis.
Why Data Preprocessing?
Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data.
Data Quality Measures
A well-accepted multidimensional view:
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much are the data trusted to be correct?
• Interpretability: how easily the data can be understood?
Broad categories:
• intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or
similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Data Cleaning
Goal: Remove errors, inconsistencies, noise.
Common tasks include:
Handling Missing Data
Removing Noise (smooth data using moving averages or binning)
Handling inconsistent formats (e.g., "Male/Female" vs. "M/F")
Outlier detection & treatment (using statistical methods, clustering,
or domain rules).
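A small pandas illustration of two of the tasks above, using made-up column names: unifying inconsistent gender codes and flagging outliers with the common 1.5 × IQR rule.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "F", "female", "M"],
                   "age": [25, 31, 29, 240]})        # 240 looks like a data-entry error

# Handle inconsistent formats: map all gender variants to a single coding (M/F).
df["gender"] = df["gender"].str.lower().str[0].map({"m": "M", "f": "F"})

# Flag outliers with the 1.5*IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age_outlier"] = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

print(df)
```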
Importance of Data Cleaning
Reduces bias and errors in analysis.
Improves model accuracy and efficiency.
Ensures consistency, reliability, and trustworthiness of insights.
Saves cost and time in downstream processing.
Incomplete or Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to:
equipment malfunction
deletion because the value was inconsistent with other recorded data
data not entered due to misunderstanding
certain data not considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming a
classification task); not effective when the percentage of missing values per
attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
a global constant, e.g., "unknown" (effectively a new class)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, e.g., a Bayesian formula or a
decision tree
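A minimal pandas sketch of three of these options (ignore the tuple, fill with a global constant, fill with the attribute mean); the toy DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [45000, np.nan, 61000, np.nan, 52000],
                   "city":   ["Delhi", "Mumbai", None, "Delhi", "Pune"]})

# 1. Ignore the tuple: drop rows containing any missing value.
dropped = df.dropna()

# 2. Fill a categorical attribute with a global constant.
df["city"] = df["city"].fillna("unknown")

# 3. Fill a numeric attribute with the attribute mean.
df["income"] = df["income"].fillna(df["income"].mean())
# 3b. Smarter variant: the mean within the same class/group, e.g.
# df["income"] = df.groupby("city")["income"].transform(lambda s: s.fillna(s.mean()))

print(dropped.shape)
print(df)
```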
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
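Continuing the binning item above, a small sketch of equal-frequency partitioning followed by smoothing by bin means with pandas; the price values are a toy series.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency (equal-depth) partitioning into 3 bins.
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value with the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")

print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```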
Data Integration
Data integration is the process of combining data from multiple sources
into a single, unified view so it can be used for analysis or modeling.
Issues in Data Integration:
1. Schema Mismatch: One source calls a column "Customer_ID", another
calls it "CustID".
2. Data Type Differences: One database stores dates as YYYY-MM-DD,
another as DD/MM/YYYY.
3. Unit Inconsistency: One dataset measures weight in kilograms, another
in pounds.
4. Duplicate/Redundant Records: Same customer appears twice with
slightly different spellings.
5. Data Conflicts: Two sources give different phone numbers for the same
customer.
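A brief pandas sketch of resolving two of these issues (schema mismatch and unit inconsistency) before merging; the column names, values, and sources are illustrative.

```python
import pandas as pd

crm = pd.DataFrame({"Customer_ID": [1, 2], "weight_kg": [70.0, 82.5]})
web = pd.DataFrame({"CustID": [1, 2], "weight_lb": [154.3, 181.9]})

# Schema mismatch: align column names before integration.
web = web.rename(columns={"CustID": "Customer_ID"})

# Unit inconsistency: convert pounds to kilograms (1 lb ≈ 0.4536 kg).
web["weight_kg"] = web["weight_lb"] * 0.4536
web = web.drop(columns="weight_lb")

# Combine the two sources into a single, unified view.
unified = crm.merge(web, on="Customer_ID", suffixes=("_crm", "_web"))
print(unified)
```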
Handling Redundant Data in Data Integration
Redundant data often occur when integrating multiple databases:
Object identification: the same attribute or object may have
different names in different databases
Derivable data: one attribute may be a "derived" attribute in another
table, e.g., annual revenue.
Redundant attributes can often be detected by correlation analysis
and covariance analysis.
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality.
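For numeric attributes, such redundancy screening can be done directly with pandas; the toy data below is invented, with annual revenue derivable (12×) from monthly revenue.

```python
import pandas as pd

df = pd.DataFrame({"monthly_revenue": [10, 12, 9, 15, 11],
                   "annual_revenue":  [120, 144, 108, 180, 132],  # 12 x monthly (derivable)
                   "num_branches":    [3, 4, 2, 6, 3]})

# Pearson correlation matrix: values near +/-1 suggest redundant (derivable) attributes.
print(df.corr())

# Covariance matrix: the scale-dependent counterpart of correlation.
print(df.cov())
```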
Correlation Analysis (Nominal Data)
Χ² (chi-square) test:
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual
count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)      200 (360)         450
Not like science fiction     50 (210)    1000 (840)        1050
Sum (col.)                  300          1200              1500
• Χ² (chi-square) calculation (numbers in parentheses are the expected counts
  computed from the data distribution in the two categories):
$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
• Since this value is large, it shows that like_science_fiction and play_chess are
  correlated in the group.
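The same statistic can be reproduced with scipy; the observed counts are the table above, and correction=False disables Yates' continuity correction so the plain chi-square formula is applied.

```python
from scipy.stats import chi2_contingency

# Observed counts: rows = likes science fiction (yes/no), columns = plays chess (yes/no).
observed = [[250, 200],
            [50, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))   # ~507.93
print(expected)         # [[ 90. 360.] [210. 840.]]
```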
Data Transformation
Goal: Convert data into a form that models can understand.
Scaling / Normalization
Normalization (Min-Max Scaling): Scales data to a range [0, 1].
Standardization (Z-score Scaling): Scales data to have mean 0 and
standard deviation 1.
Encoding Categorical Data (see the sketch after this list)
Label Encoding: Assign a unique integer to each category.
Ordinal Encoding: Encode categories with a meaningful order.
Feature Engineering
Create new features from existing data (e.g., extracting "age" from date of birth)
Dimensionality Reduction
Reduce the number of variables using PCA (Principal Component
Analysis) or feature selection.
Data generalization/specialization, etc.
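A short sketch of the two encoding schemes with scikit-learn; the category values are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi", "Mumbai"],
                   "size": ["small", "large", "medium", "small"]})

# Label encoding: an arbitrary unique integer per category (order carries no meaning).
df["city_code"] = LabelEncoder().fit_transform(df["city"])

# Ordinal encoding: integers that respect a meaningful order we specify.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_code"] = ordinal.fit_transform(df[["size"]]).ravel()

print(df)
```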
Normalization
• Min-max normalization: maps values to [new_min_A, new_max_A]
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$
– Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
  Then $73,600 is mapped to:
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
• Z-score normalization (μ: mean, σ: standard deviation):
$$v' = \frac{v - \mu_A}{\sigma_A}$$
– Ex. Let μ = 54,000 and σ = 16,000. Then
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
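The same two normalizations computed with numpy on the slide's income figures, just to confirm the arithmetic.

```python
import numpy as np

income = np.array([12000.0, 73600.0, 98000.0])

# Min-max normalization to [0, 1].
minmax = (income - income.min()) / (income.max() - income.min())
print(round(minmax[1], 3))              # 0.716 for $73,600

# Z-score normalization with the slide's mean and standard deviation.
mu, sigma = 54000.0, 16000.0
print(round((73600 - mu) / sigma, 3))   # 1.225
```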
Data Reduction
Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the same)
analytical results
Why data reduction?
— A database/data warehouse may store terabytes of data. Complex
data analysis may take a very long time to run on the complete data
set. The goal is to make datasets smaller, simpler, and faster to
process without losing critical patterns or relationships.
Data reduction is the process of minimizing the volume of data while
preserving as much meaningful information as possible.
Data Reduction Strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Lossless compression (no data lost): ZIP, run-length encoding.
Lossy compression (some data lost but acceptable): JPEG for
images, MP3 for audio.
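A compact sketch of two of these strategies, PCA for dimensionality reduction and random sampling for numerosity reduction, using scikit-learn and pandas; the synthetic data (10 columns driven by 3 hidden factors) exists only to make the reduction visible.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: 10 correlated columns generated from 3 underlying factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
X = pd.DataFrame(latent @ rng.normal(size=(3, 10))
                 + 0.05 * rng.normal(size=(1000, 10)),
                 columns=[f"x{i}" for i in range(10)])

# Dimensionality reduction: keep enough principal components for 95% of the variance.
reduced = PCA(n_components=0.95).fit_transform(X)
print(X.shape, "->", reduced.shape)     # expect roughly (1000, 10) -> (1000, 3)

# Numerosity reduction: keep a 10% random sample of the rows.
sample = X.sample(frac=0.1, random_state=0)
print(sample.shape)                     # (100, 10)
```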
Discretization
Data discretization is a data preprocessing technique used in data
science to transform continuous numerical data into a finite set of
discrete intervals or "bins."
Bins map nicely to human concepts (e.g., "low/medium/high").
Instead of working with an infinite number of possible values, you work
with a limited number of defined categories. This process simplifies the
data, making it easier to analyze and interpret.
Discretization can also reduce the effect of noise and outliers.
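A brief sketch of equal-width and equal-frequency discretization with pandas; the ages and bin labels are toy values.

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width binning into labelled intervals ("low/medium/high"-style concepts).
width_bins = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each bin holds roughly the same number of values.
freq_bins = pd.qcut(ages, q=3)

print(pd.DataFrame({"age": ages, "equal_width": width_bins, "equal_freq": freq_bins}))
```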
Thank you