UNIT - II
Data Mining
Data mining is the process of finding useful patterns, trends, or knowledge from large sets of
data. Think of it like searching for hidden treasure in a big pile of information. Businesses,
researchers, and organizations collect huge amounts of data every day — from sales records,
website visits, customer feedback, social media, and more. But just having data isn’t enough.
Data mining helps turn that data into something meaningful. For example, a supermarket might
use data mining to find that people who buy bread often buy butter too. This kind of information
can help in making better decisions, like how to arrange products on shelves or what products to
advertise together.
Data mining involves many steps. First, the data is collected from different sources. Then, it’s
cleaned to remove errors or missing values. After that, tools and algorithms are used to find
patterns or trends. Some of the common tasks in data mining include classification (putting
things into categories), clustering (grouping similar items), association (finding relationships
between items), and prediction (guessing future outcomes based on past data).
Data Mining Task Primitives
Data mining task primitives are the basic steps or building blocks that help define what exactly a
user wants to do with the data. Think of them as settings or instructions you give to the data
mining system so that it knows how to carry out the task correctly. These task primitives help
guide the system by telling it what kind of patterns to look for, where to look, and under what
conditions.
Here are the main types of task primitives:
1. The kind of data to be mined: This tells the system what type of data you are working
with. It could be a database, data warehouse, spreadsheets, or even text files. You also
define what specific attributes (columns or fields) you want to analyze.
2. The kind of patterns to be discovered: Data mining can be used for many purposes.
You need to specify what you’re looking for. For example:
○ Classification: Sorting data into predefined groups or classes (like spam or not
spam).
○ Clustering: Grouping similar items together (like customer segments).
○ Association: Finding relationships between items (like "people who buy chips
often buy soda").
○ Prediction: Guessing future values based on current data (like predicting next
month's sales).
3. Background knowledge to be used: This includes any existing information or rules that
the system can use to improve its results. For example, a store might already know some
sales rules that help guide the mining process.
4. Interestingness measures: Not all patterns found are useful. You can set rules to tell the
system which results are interesting or valuable to you. These rules help filter out the less
useful patterns and focus only on what matters.
5. Use of constraints: Sometimes, you want to limit the mining to a certain part of the data
or focus only on specific results. For example, you might only want results related to a
particular product or customer group.
Data, Information, and Knowledge
Data is raw, unprocessed facts and figures. It has no meaning on its own and is just a collection
of values or numbers. For example, a list of numbers like 30, 45, 50, 42 doesn’t tell us anything
unless we know what they represent. In a retail store, data could include things like sales
amounts, product names, dates, customer IDs, or locations. Data is the starting point in data
mining, and it must be cleaned, organized, and analyzed to become useful.
Information is data that has been processed and given context so that it has meaning. For
example, if we say that 30, 45, 50, 42 are the number of items sold in the past four days,
now we have useful information. It tells us something about how the business is performing.
Information answers basic questions like who, what, when, and where. In data mining,
information helps identify what is happening in the data, such as which product sells the most.
Knowledge is the deeper understanding or insight gained from analyzing information. It answers
questions like why or how, and helps in making decisions or predictions. For example, if data
mining reveals that "sales increase on weekends for snacks and drinks," that's knowledge. It
combines information with patterns and trends to give meaning that can help in decision-making.
Knowledge allows businesses to plan better, such as increasing stock on weekends or offering
discounts on certain items.
Attribute Types in Data Mining
In data mining, attributes (also called features or variables) are the properties or characteristics of
the data that you're analyzing. Depending on the nature of the values they hold, attributes are
categorized into different types. The type of attribute determines what kind of operations and
analyses can be performed on it.
1. Nominal Attributes
Nominal attributes are categorical attributes that represent names or labels without any order or
ranking between them. They are used simply to identify or classify data into categories.
● Example: Gender (Male, Female), Color (Red, Blue, Green), City (Delhi, Mumbai,
Chennai)
● Key Point: You cannot say one value is greater than or less than another.
● Operations allowed: Equality (=), inequality (≠); no
mathematical operations.
In data mining, nominal data is often converted into numerical codes using
techniques like one-hot encoding so that algorithms can process it.
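As an illustration (not part of the original notes), here is a minimal one-hot encoding sketch using pandas; the column name "Color" and its values are invented examples:

# One-hot encoding of a nominal attribute (illustrative sketch using pandas).
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})
encoded = pd.get_dummies(df, columns=["Color"])   # creates Color_Blue, Color_Green, Color_Red
print(encoded)

Each category becomes its own 0/1 column, so algorithms that expect numbers can still use the nominal attribute without implying any order between the categories.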
2. Binary Attributes
Binary attributes are a special type of nominal attribute that can have only two possible values,
typically representing yes/no or true/false situations.
● Example:
○ Is student? (Yes or No)
○ Light switch (On or Off)
○ Has account? (1 or 0)
Binary attributes are of two types:
● Symmetric binary: Both outcomes are equally important (e.g., gender).
● Asymmetric binary: One outcome is more important (e.g., disease = present/absent).
Binary attributes are common in classification tasks like spam detection, fraud
detection, or medical diagnosis.
3. Ordinal Attributes
Ordinal attributes represent categories that have a meaningful order or ranking, but the
differences between the values are not measurable.
● Example:
○ Education level (High school < Bachelor < Master < PhD)
○ Customer satisfaction (Poor < Average < Good < Excellent)
○ Movie rating (1 star to 5 stars)
● Key Point: You can compare values using greater than or less than, but you can’t do
arithmetic on them (e.g., "Excellent" is not 2× "Good").
Ordinal attributes are useful when data has a natural ranking, but we don’t know the
exact difference between the ranks.
4. Numeric Attributes
Numeric attributes are quantitative and represent numbers that can be measured and calculated.
They allow for mathematical operations like addition, subtraction, average, etc. Numeric
attributes are further divided into:
a. Interval Attributes
● These attributes have values with equal intervals, but no true zero point.
● Example: Temperature in Celsius or Fahrenheit.
● Key Point: You can say the difference between 30°C and 40°C is 10°C, but you can’t say
40°C is twice as hot as 20°C.
b. Ratio Attributes
● These have equal intervals and a true zero point, which means you can perform all
mathematical operations including ratios.
● Example: Height, weight, age, salary, distance.
● Key Point: You can say "20 kg is twice as heavy as 10 kg" because 0 kg means no weight
at all.
Introduction to Data Preprocessing
Data preprocessing is one of the most important steps in the data mining process. It involves
preparing and cleaning the raw data so that it can be used effectively for analysis and mining.
Real-world data is rarely perfect — it often contains errors, missing values, duplicates,
inconsistencies, or irrelevant information. If this data is used directly, it can lead to poor results.
Data preprocessing helps solve these problems by converting raw data into a clean, consistent,
and usable format.
Data preprocessing is considered a data preparation phase before the actual data mining begins.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, incomplete, duplicate, or irrelevant
data from a dataset. It is one of the most important steps in data preprocessing because real-
world data is often messy. Without cleaning, inaccurate data can lead to wrong conclusions, poor
decision-making, or unreliable results in data mining or machine learning.
Common Problems
1. Missing Values
● Sometimes data is not recorded, which leads to blanks or null values.
● Example: A customer forgot to enter their phone number in a form.
● Solution: Fill missing values using:
○ Mean, median, or mode of the column
○ Predicting the value using other data
○ Removing the record (if it’s not important)
2. Noisy Data (Random or Incorrect Values)
● Noise refers to errors or random values in data that don’t make sense.
● Example: A person's age entered as 500 or a salary as -10000
● Solution: Use techniques like:
○ Binning (grouping similar values)
○ Smoothing (average out values)
○ Manual correction or using software
3. Duplicate Records
● When the same data appears more than once.
● Example: A customer appears twice in a database with slightly different names ("Jon"
and "John").
● Solution: Use deduplication tools or algorithms to merge or delete duplicates.
4. Inconsistent Data
● This happens when data formats vary across sources.
● Example: Dates written as "03/08/2025" and "August 3, 2025"
● Solution: Standardize data formats across the dataset.
5. Irrelevant Data
● Sometimes, extra data is collected that has no value for the analysis.
● Example: Including the "profile picture size" while analyzing customer purchase
behavior.
● Solution: Remove irrelevant columns or fields.
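The fixes listed above can be sketched in a few lines of pandas (an assumption; the tiny dataset and column names below are invented for illustration):

# Illustrative data-cleaning sketch: duplicates, noisy values, and missing values.
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Age": [23, 500, 500, None],      # 500 is noise, None is a missing value
    "City": ["Delhi", "Mumbai", "Mumbai", "Chennai"],
})

df = df.drop_duplicates()                         # remove exact duplicate records
df.loc[df["Age"] > 120, "Age"] = None             # treat impossible ages as missing
df["Age"] = df["Age"].fillna(df["Age"].median())  # fill missing values with the median
print(df)

Near-duplicates such as "Jon" and "John" need fuzzy matching rather than drop_duplicates; the sketch only covers the simple cases.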
Data Integration
Data integration is the process of combining data from multiple sources into a single, unified
view. It helps in building a complete dataset that can be used for effective data mining and
analysis. When data comes from different departments, systems, or files, it may be inconsistent,
redundant, or conflicting. Integration helps resolve these issues and creates a coherent data
structure.
One key step during data integration is correlation analysis. This is used to identify
relationships between different attributes (columns or variables) in the combined dataset. If
two attributes are strongly related, they may carry similar information. This can help reduce
redundancy and improve the quality of the integrated data.
What is Correlation Analysis?
Correlation analysis is a statistical method used to measure the strength and direction of a
relationship between two attributes. The result is called the correlation coefficient and
usually ranges from -1 to +1:
● +1 → Perfect positive correlation (as one increases, the other increases)
● -1 → Perfect negative correlation (as one increases, the other decreases)
● 0 → No correlation (no relationship between the attributes)
In data integration, correlation analysis helps in deciding whether two attributes
from different sources are closely related (and possibly duplicate) or completely
independent.
Why is Correlation Analysis Useful in Data Integration?
● To detect redundancy: If two attributes are highly correlated, one may be removed to
reduce complexity.
● To align similar attributes: Sometimes, two sources use different names for similar data
(e.g., "income" and "salary"). Correlation can help match them.
● To improve data consistency: Helps in resolving conflicts by understanding how
attributes interact.
What is a Correlation Coefficient?
The correlation coefficient is a statistical measure that describes the strength and direction
of a relationship between two variables. The most commonly used type is the Pearson
correlation coefficient, also called Pearson’s r.
● If r = +1 → perfect positive correlation
● If r = -1 → perfect negative correlation
● If r = 0 → no correlation
Pearson Correlation Coefficient Formula
r = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² × Σ(y − ȳ)² )
Let’s calculate the correlation coefficient for the following small dataset:
X (Hours Studied) Y (Marks Scored)
2 50
4 60
6 65
8 80
10 85
Step 1: Find the mean
x̄ = (2 + 4 + 6 + 8 + 10) / 5 = 6
ȳ = (50 + 60 + 65 + 80 + 85) / 5 = 68
Step 2: Apply the formula
First, create a table for calculations:
x y x−x̄ y−ȳ (x−x̄)(y−ȳ) (x−x̄)² (y−ȳ)²
2 50 -4 -18 72 16 324
4 60 -2 -8 16 4 64
6 65 0 -3 0 0 9
8 80 2 12 24 4 144
10 85 4 17 68 16 289
Now sum the columns:
Σ(x − x̄)(y − ȳ) = 72 + 16 + 0 + 24 + 68 = 180
Σ(x − x̄)² = 16 + 4 + 0 + 4 + 16 = 40
Σ(y − ȳ)² = 324 + 64 + 9 + 144 + 289 = 830
Step 3: Plug into the formula
r = 180 / √(40 × 830) = 180 / √33200 ≈ 180 / 182.21 ≈ 0.988
Result: r ≈ 0.988
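The same calculation can be reproduced as a small numpy sketch (an assumption, not part of the original notes), using the hours-studied / marks-scored data above:

# Recomputing Pearson's r for the example dataset.
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([50, 60, 65, 80, 85])

num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r = num / den
print(round(r, 3))   # ≈ 0.988, matching the hand calculation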
Data Transformation
Data transformation is a preprocessing step that converts data into a suitable format for mining.
It’s especially important when attributes have different scales, which can negatively affect
algorithms like K-Means, k-NN, or neural networks.
Definition:
Min-max normalization rescales the data to a fixed range, typically [0, 1]. It preserves the
relationships among the original data values.
x′ = (x − min(x)) / (max(x) − min(x))
Where:
● x = original value
● min(x) = minimum value in the column
● max(x) = maximum value in the column
● x′ = normalized value
📊 Example:
Original data: [50, 60, 70, 80, 90]
● Min = 50, Max = 90
Normalize value 70:
x′ = (70 − 50) / (90 − 50) = 20 / 40 = 0.5
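A minimal Python sketch of the same rescaling (illustrative only, not tied to any particular library):

# Min-max normalization of the example column to the range [0, 1].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([50, 60, 70, 80, 90]))   # [0.0, 0.25, 0.5, 0.75, 1.0]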
Z-score normalization, also known as standardization, is a crucial data preprocessing technique
in machine learning and statistics. It transforms the data so that it follows a standard normal
distribution, ensuring that all features are on the same scale. This prevents features with large
value ranges from dominating others, which can significantly impact the performance of machine
learning models.
v′ = (v − Ā) / σA
Here v′ and v are the new and old values of each entry, and σA and Ā are the standard deviation
and mean of attribute A, respectively.
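A short numpy sketch of z-score normalization (an assumption; it uses the population standard deviation, matching the formula above):

# Standardizing a numeric column: v' = (v - mean) / standard deviation.
import numpy as np

a = np.array([50, 60, 70, 80, 90])
z = (a - a.mean()) / a.std()   # np.std uses the population standard deviation by default
print(z.round(3))              # [-1.414 -0.707  0.     0.707  1.414]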
Decimal Scaling Method
This method normalizes by moving the decimal point of the data values. Each value is divided by a
power of ten determined from the maximum absolute value in the data. A data value vi is
normalized to vi′ using the formula below:
vi′ = vi / 10^j
where j is the smallest integer such that max(|vi′|) < 1.
Example:
Let the input data be: -10, 201, 301, -401, 501, 601, 701. To normalize the above data:
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide each value by 1000 (i.e., j = 3).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
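A small sketch of decimal scaling (illustrative only; the digit-count shortcut below assumes integer-valued data as in the example):

# Decimal-scaling normalization of the example data above.
def decimal_scaling(values):
    j = len(str(int(max(abs(v) for v in values))))   # digits in the max absolute value
    return [v / (10 ** j) for v in values]

print(decimal_scaling([-10, 201, 301, -401, 501, 601, 701]))
# [-0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701]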
Data Reduction:
● Data Reduction refers to techniques that reduce the volume of data while maintaining its
integrity and analytical value.
● It helps in improving efficiency of data processing and mining by summarizing or
compressing the data.
● Common methods include sampling, dimensionality reduction, discretization, and
aggregation.
Data Cube Aggregation:
Data Cube Aggregation is a multidimensional data reduction technique used primarily in
Online Analytical Processing (OLAP).
A data cube organizes data in a multi-dimensional structure, with each dimension representing a
different attribute (e.g., time, location, product).
Aggregation means summarizing data along one or more dimensions by applying aggregation
functions like SUM, COUNT, AVG, MIN, or MAX.
The result is a reduced dataset with fewer rows but more meaningful insights at higher
abstraction levels.
Steps to perform Data Cube Aggregation
Step 1. Identify Dimensions and Measures
● Dimensions: Day, Product, Region
● Measure: Sales (the numeric value)
Step 2. Define the Aggregation Function
● Choose which Aggregation function to apply on Measures.
Step 3. Decide the level of aggregation (granularity)
● The raw data is at the finest granularity.
● Aggregation means summarizing over one or more dimensions.
Possible aggregation levels:
● (Day, Product) → aggregated over Region
● (Day, Region) → aggregated over Product
● (Product, Region) → aggregated over Day
Step 4. Perform the group-by aggregations
Step 5. Store the aggregated result
● Create a separate data cube structure that stores the aggregated data for each grouping.
Step 6. Use the aggregated data to answer queries (see the sketch below).
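As an illustration of steps 4–5 (not part of the original notes), a minimal group-by aggregation sketch using pandas; the sales records are invented:

# Data-cube-style aggregation: summarize Sales over different dimension subsets.
import pandas as pd

sales = pd.DataFrame({
    "Day":     ["Mon", "Mon", "Tue", "Tue"],
    "Product": ["Bread", "Butter", "Bread", "Butter"],
    "Region":  ["North", "South", "North", "South"],
    "Sales":   [100, 80, 120, 90],
})

# (Day, Product) level: aggregated over Region
by_day_product = sales.groupby(["Day", "Product"], as_index=False)["Sales"].sum()

# (Product, Region) level: aggregated over Day
by_product_region = sales.groupby(["Product", "Region"], as_index=False)["Sales"].sum()

print(by_day_product)
print(by_product_region)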
Attribute Subset Selection:
Attribute Subset Selection is a data preprocessing technique used to reduce the number of input
variables (attributes/features) in a dataset by selecting only the most relevant ones.
This technique is used:
● To eliminate irrelevant, redundant, or noisy features
● To reduce computation time and complexity
● To improve model accuracy by removing features that may confuse the learning algorithm
● To avoid overfitting by simplifying the model
Steps in Attribute Subset Selection:
1. Start with the Full Set of Attributes
The original dataset may have many features (e.g., age, income, job type, city, etc.)
2. Evaluate Each Attribute's Relevance
Use statistical, information-theoretic, or machine learning-based methods to assess
how much each attribute contributes to the prediction.
3. Common metrics:
○ Information Gain
○ Chi-Square test
○ Correlation Coefficient
○ Mutual Information
○ Gini Index
4. Select the Best Subset of Attributes
Choose attributes that give the highest predictive power and remove others.
5. Selection techniques:
○ Filter Methods: Use ranking based on statistical scores
○ Wrapper Methods: Use a predictive model to test different subsets
○ Embedded Methods: Perform selection during model training (e.g., decision trees, LASSO)
6. Validate the Resulting Subset
Use cross-validation to check that the reduced feature set performs well on unseen data.
Suppose you’re building a model to predict whether a customer will buy a product.
Original Attributes:
● Age
● Gender
● Email Address
● Phone Number
● Annual Income
● Product View Count
● Number of Clicks
● Purchase History
After Attribute Subset Selection, you may find:
● Email and Phone Number are irrelevant to purchase prediction
● Age, Income, Click Count, and Purchase History are highly predictive
Final selected subset = {Age, Annual Income, Product View Count, Purchase History}
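A filter-method sketch of steps 2–4 using scikit-learn (an assumption; the feature matrix and labels below are random placeholders, not the customer data above):

# Keep the k attributes with the highest mutual information with the class label.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((100, 5))                 # 5 candidate attributes
y = (X[:, 0] + X[:, 3] > 1).astype(int)  # label actually depends on columns 0 and 3

selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print(selector.get_support())            # boolean mask of the selected attributes

Wrapper and embedded methods follow the same pattern but score candidate subsets with an actual predictive model instead of a statistic.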
Sampling:
Sampling is the process of selecting a subset of data from a large dataset to analyze and draw
conclusions about the entire data.
In data mining, sampling is especially useful when:
● The dataset is too large to process efficiently
● You need quick insights or want to build prototypes
● You want to improve performance without sacrificing much accuracy
Why Sampling is Used in Data Mining?
● Reduces computational cost
● Speeds up data analysis and model training
● Helps in testing and validating models
● Enables visualization and exploration of massive datasets
Types of Sampling Techniques:
1. Random Sampling
Every record has an equal chance of being selected.
Best for unbiased, general-purpose sampling.
Example: Randomly picking 1,000 records from a million.
2. Stratified Sampling
The dataset is divided into strata (groups) (e.g., by class label), and samples are drawn
from each stratum proportionally.
Useful when some classes are rare.
Example: Sampling 20% from each income bracket.
3. Systematic Sampling
Selects every k-th record from a sorted list.
Example: From 10,000 records, pick every 10th record → total 1,000 records.
4. Cluster Sampling
Divide data into clusters (groups), and then randomly select entire clusters.
Example: Choose a few cities and include all customers from those cities.
5. Reservoir Sampling
Useful for data streams or when data size is unknown in advance.
Maintains a sample of fixed size from a stream of unknown length.
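Reservoir sampling is simple enough to sketch directly (illustrative Python, not part of the original notes):

# Reservoir sampling: keep a fixed-size random sample from a stream
# whose total length is not known in advance.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)      # each later item is kept with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1, 10001), 5))   # 5 records from a 10,000-item "stream"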
Example:
Scenario: Online Retail Store
You work as a data analyst for an online retail store with a dataset containing 1 million customer
transactions. Each record includes:
CustomerID   Age   Gender   Country   PurchaseAmount   ProductCategory   PurchaseDate
You want to build a machine learning model to predict whether a customer will make a purchase
above ₹10,000. But training on all 1 million rows is computationally expensive.
Step-by-Step Sampling Example
Objective:
Use sampling to create a smaller, representative dataset (e.g., 10,000 records) for fast model
development and testing.
Step 1: Understand the Target Variable
Let’s define:
● HighValuePurchase = 1 if PurchaseAmount > 10,000
● HighValuePurchase = 0 otherwise
Assume class distribution in full dataset:
● 5% are high-value purchases
● 95% are not
If we randomly sample, we might miss the minority class (only 500 out of 10,000 will be high-
value).
Step 2: Choose Sampling Technique
We will use Stratified Sampling to maintain class balance in the sample.
Step 3: Implement Sampling
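One way to implement this step is with pandas (an assumption; the synthetic stand-in below uses 100,000 rows with the stated 5% / 95% class split instead of the full 1 million transactions):

# Stratified sampling: draw the same fraction from each class, preserving the ratio.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
transactions = pd.DataFrame({
    "HighValuePurchase": rng.choice([0, 1], size=100_000, p=[0.95, 0.05]),
})

frac = 10_000 / len(transactions)
sample = transactions.groupby("HighValuePurchase").sample(frac=frac, random_state=42)

print(len(sample))                                                # about 10,000 rows
print(sample["HighValuePurchase"].value_counts(normalize=True))   # class ratio preserved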
Step 4: Use Sampled Data for Further Processing
Data Discretization:
Data Discretization is the process of transforming continuous attributes (numeric data) into a
finite set of intervals or categorical labels.
It is a crucial data preprocessing step in data mining and machine learning, particularly for
algorithms that require categorical inputs.
It reduces computational complexity, improves model interpretability, handles noisy data, and
enhances the performance of classifiers.
Example:
You have a dataset with customer ages, and you want to group ages into categories (Young,
Middle-aged, Senior) for use in a classification model.
Original Data (Continuous):
CustomerID Age
1 23
2 37
3 45
4 59
5 63
6 72
We discretize Age into 3 categories:
● Young: 0–35
● Middle-aged: 36–60
● Senior: 61 and above
Step-by-Step Discretization:
CustomerID Age Age Category
1 23 Young
2 37 Middle-aged
3 45 Middle-aged
4 59 Middle-aged
5 63 Senior
6 72 Senior
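The same step-by-step mapping can be done with pandas.cut (an assumption, not part of the original notes), using the boundaries defined above:

# Discretizing the Age column into the three labelled intervals.
import pandas as pd

ages = pd.DataFrame({"CustomerID": [1, 2, 3, 4, 5, 6],
                     "Age": [23, 37, 45, 59, 63, 72]})
ages["AgeCategory"] = pd.cut(ages["Age"],
                             bins=[0, 35, 60, 120],
                             labels=["Young", "Middle-aged", "Senior"])
print(ages)   # matches the table above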
What is Binning in Data Mining?
Binning is a data discretization technique used to convert continuous numerical data into discrete
bins or intervals. It’s commonly used in data preprocessing to smooth noisy data, reduce the
effect of outliers, and prepare the data for categorical models like decision trees or Naive Bayes.
Purpose of Binning
● Reduce complexity of the data
● Handle noise and outliers
● Convert numerical features into categorical features
● Enhance model performance for certain algorithms
Example of Binning
Suppose we have a column of Age values:
[23, 27, 34, 38, 42, 45, 49, 53, 58, 60, 66, 72]
Let’s say we want to bin this into 3 age groups:
● Bin 1: 20–40 → Young
● Bin 2: 41–60 → Middle-aged
● Bin 3: 61–80 → Senior
The binned result:
Age Age Group
23 Young
27 Young
34 Young
38 Young
42 Middle-aged
45 Middle-aged
49 Middle-aged
53 Middle-aged
58 Middle-aged
60 Middle-aged
66 Senior
72 Senior
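A short pandas sketch of the binning above, plus an equal-frequency alternative (illustrative only; pd.qcut is an addition not mentioned in the original text):

# Binning the Age column into the three labelled groups, and into equal-frequency bins.
import pandas as pd

ages = pd.Series([23, 27, 34, 38, 42, 45, 49, 53, 58, 60, 66, 72])

labelled = pd.cut(ages, bins=[20, 40, 60, 80],
                  labels=["Young", "Middle-aged", "Senior"])   # fixed-boundary bins
equal_freq = pd.qcut(ages, q=3)                                # equal-frequency bins
print(labelled.value_counts())
print(equal_freq.value_counts())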
Histogram Analysis
A histogram is a type of bar chart that represents the frequency distribution of a continuous
variable. The data is divided into bins (intervals), and the height of each bar shows how many data
points fall within that bin.
Histogram Analysis in Data Mining:
Histogram analysis is a visual and statistical method used in data preprocessing and exploratory
data analysis (EDA) to understand the distribution of continuous numerical attributes. It helps
identify patterns such as skewness, modality, outliers, and spread of data.
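A minimal matplotlib sketch (an assumption, not part of the original notes) that draws a histogram of the Age values from the binning example:

# Frequency distribution of Age using 3 equal-width bins.
import matplotlib.pyplot as plt

ages = [23, 27, 34, 38, 42, 45, 49, 53, 58, 60, 66, 72]
plt.hist(ages, bins=3, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age distribution")
plt.show()

The bar heights show how many customers fall into each age interval, which makes skewness, spread, and possible outliers easy to spot at a glance.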