DM Unit I
Data mining is a process of extracting and discovering patterns in large data sets.
Very large collections of data, millions or even hundreds of millions of individual records, are now being
compiled into centralized data warehouses (a data warehouse is a place where data is stored so that
applications can access and share it easily; it is a database, but it contains summarized information) and
reorganized globally by topic, allowing analysts to make use of powerful statistical and machine learning
methods to examine data more comprehensively.
Data mining is the art and science of using algorithms more powerful than traditional query tools such as
SQL to extract more useful information.
Data mining is concerned with the analysis of data and the use of software techniques for finding patterns
and regularities in sets of data. It is the computer that is responsible for finding the patterns, by
identifying the underlying rules and features in the data.
Data mining is concerned with discovering knowledge: it is about uncovering relationships or patterns
hidden in data that can be used to predict behaviors or outcomes, or to provide some other useful function.
In data mining, the goal is to extract meaningful and actionable information from large datasets. Trivial
patterns or insights are ones that are simple and obvious, like finding that most customers purchase a
particular item with another item (e.g., bread and butter). Non-trivial patterns, on the other hand, might
involve identifying complex associations between items that aren't immediately apparent, such as the
correlation between the weather, local events, and sales of a specific product.
Data mining is widely used in diverse areas. There are a number of commercial data mining systems
available today, and yet there are many challenges in this field.
Difference between a DBMS and Data Mining:
DBMS: creates, stores, maintains, and modifies data in a database. Data Mining: extracts interesting and previously unknown information from raw data.
DBMS: can work alone, without data mining. Data Mining: may not work without a DBMS.
DBMS: basic elements are a query language and a data store. Data Mining: basic tasks are classification, regression, clustering, and association.
Kinds of Data in Data Mining :
In data mining, various types of data can be analyzed to extract meaningful insights and patterns. The types
of data can be broadly categorized as follows:
Structured Data: Structured data is highly organized and follows a specific format, typically stored in
relational databases or spreadsheets. It consists of rows and columns, where each column represents a
specific attribute or feature, and each row represents an individual data instance. Examples of structured data
include sales transactions, customer information, inventory records, and financial data.
Unstructured Data: Unstructured data lacks a predefined structure and format. It includes text, images,
audio, video, social media posts, emails, and more. Analyzing unstructured data requires specialized
techniques such as natural language processing (NLP), image recognition, and sentiment analysis.
Imagine you're working for a brand that wants to monitor its reputation and customer sentiment in real-time
on social media platforms. The data you're dealing with is unstructured text data from social media posts
(tweets, Facebook posts, Instagram comments, etc.). The unstructured data consists of users' opinions,
emotions, and comments about the brand.
Semi-structured Data: Semi-structured data falls between structured and unstructured data. It has some
level of organization but does not necessarily fit neatly into tables or rows. Examples include XML files,
JSON data, and certain types of documents with tags or annotations.
Imagine you're working for a news aggregation platform that collects and analyzes news articles and user
comments in real-time. The data you're dealing with is semi-structured, as it comes from various sources
and includes both structured elements (like titles and timestamps) and unstructured text (like article content
and user comments).
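As a minimal sketch (the field names here are hypothetical, not taken from any real feed), a semi-structured news record in JSON mixes structured fields with free text, and Python's json module can separate the two:

import json

# A hypothetical semi-structured record: structured fields (title, timestamp)
# alongside unstructured text (article content, user comments).
record = '''{
  "title": "Local Team Wins Championship",
  "timestamp": "2024-05-01T09:30:00Z",
  "content": "In a thrilling final last night, the local team ...",
  "comments": ["Amazing game!", "Well deserved win."]
}'''

article = json.loads(record)
print(article["title"], article["timestamp"])      # structured part
print(len(article["comments"]), "user comments")   # unstructured free text to be mined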
Temporal or Time Series Data: Temporal data represents values recorded at different points in time. It's
common in applications such as financial analysis, weather forecasting, stock market analysis, and IoT
sensor data. Time series data can reveal trends, patterns, and seasonal variations.
Example: Energy Consumption Forecasting
Imagine you're working for a utility company that generates and distributes electricity. Your task is to
develop a system that predicts energy consumption in real-time based on historical data and external factors.
The data you're dealing with is time series data, where each data point is associated with a specific timestamp.
Categorical Data: Categorical data represents variables that can take on specific categories or labels, but
there's no inherent order between them. Examples include gender (male/female), color (red/blue/green), and
product categories.
Example: Customer Segmentation for E-commerce
Imagine you work for an e-commerce company and you want to segment your customer base for targeted
marketing campaigns. The data you're working with includes various categorical attributes that describe
customer behavior and preferences.
Numerical Data: Numerical data consists of numerical values that can be further categorized into discrete
or continuous data. Discrete data has distinct and separate values (e.g., number of children), while continuous
data can take any value within a range (e.g., temperature, weight).
Example: Stock Price Prediction
Imagine you're working for a financial firm that specializes in stock trading, and you want to predict the
future prices of certain stocks to make informed investment decisions. The data you're working with includes
numerical attributes related to stock prices and trading volumes.
Binary Data: Binary data consists of only two possible values, often represented as 0 and 1. This type of
data is common in applications like classification tasks, where you're trying to categorize data into one of
two classes.
Imagine you're working for an email service provider, and your task is to develop a system that detects
whether incoming emails are spam or legitimate (ham) in real-time. The data you're working with is binary
data, where each email is represented by a set of binary features indicating the presence or absence of certain
keywords or patterns.
Textual Data: Textual data includes any form of written or typed text, such as articles, reviews, emails, and
social media posts. Analyzing textual data involves techniques like sentiment analysis, topic modeling, and
named entity recognition.
Example: Customer Sentiment Analysis in Social Media
Imagine you're working for a consumer electronics company, and you want to monitor how customers are
discussing your products on social media platforms. The data you're dealing with is textual data from social
media posts, comments, and reviews.
Spatial Data: Spatial data is associated with geographic locations or coordinates. Examples include GPS
data, maps, and satellite images. Spatial data mining involves identifying spatial patterns and relationships.
Example: Traffic Congestion Analysis and Routing
Imagine you work for a transportation company that provides real-time navigation services to drivers. You
want to analyze traffic congestion patterns and provide optimized routes to drivers in congested areas. The
data you're dealing with is spatial data, including information about road networks, traffic flow, and
geographic locations.
Network Data: Network data represents relationships between entities in a network or graph. This can
include social networks, communication networks, and transportation networks. Analyzing network data
involves understanding connections and centrality measures.
Example: Social Network Analysis
Imagine you're working for a social media platform, and your goal is to analyze the interactions and
relationships between users to uncover patterns and trends. The data you're dealing with is network data,
where users are nodes, and their connections (follows, likes, comments) form edges in the network.
Multi-dimensional Data: Multi-dimensional data involves multiple attributes or features that are not
necessarily ordered. It's often visualized using techniques like scatter plots and parallel coordinate plots.
Example: Retail Store Sales Analysis
Imagine you're working for a retail chain with multiple stores, and you want to analyze sales data to make
informed decisions about inventory management, marketing strategies, and store performance. The data
you're dealing with is multi-dimensional, as it includes various attributes that contribute to the sales
performance of each store.
Meta Data: Meta data provides information about other data. For example, it might include data source,
data format, creation date, and data owner. Meta data is crucial for data management and understanding the
context of the main data.
Example: Media Content Recommendation
Imagine you're working for a streaming platform that provides movies and TV shows to users. Your goal is
to recommend content to users based on their preferences and viewing history. The data you're working with
includes both the media content and associated metadata.
In data mining, different types of data require specific techniques and tools for analysis. Depending on the
nature of the data, appropriate preprocessing, feature extraction, and modeling methods need to be chosen
to uncover valuable insights.
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place. It needs to be integrated from various heterogeneous data sources. These factors
also create some issues. The major issues are –
2. Performance Issues :
Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms − The factors such as huge size of
databases, wide distribution of data, and complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms.
3. Diverse Data Types Issues :
Handling of relational and complex types of data − The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The
data is available from different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured.
Data mining helps us to extract useful information from large databases. It is a step within the
KDD (Knowledge Discovery in Databases) process: data mining is the use of algorithms to extract
patterns or models within the KDD process.
KDD Process
The KDD process can be divided into three parts. The first part is data preprocessing, which
includes steps 1-3. The second part is data mining, where the data mining algorithms are applied.
The last part is evaluation and presentation of the discovered knowledge. The following figure
illustrates the overall KDD process in more detail:
DATA PRE-PROCESSING
It describes any type of processing performed on raw data to prepare it for another processing
procedure. Data preprocessing transforms the data into a format that will be more easily and
effectively processed for the purpose of the user.
Data preprocessing is the most crucial step as the operational data is normally never captured
and prepared for data mining purpose. Mostly, the data is captured from several inconsistent,
poorly documented operational systems. Thus, data preprocessing requires substantial efforts
in purifying and organizing the data. This step ensures that the selected data available for
mining is in good quality.
For example, in neural networks, there are a number of different tools and methods used for
preprocessing, including:
i) Sampling, which selects a representative subset from a large population of data;
ii) Transformation, which manipulates raw data to produce a single input;
iii) Denoising, which removes noise from the data;
iv) Normalization, which organizes data for more efficient access; and
v) Feature extraction, which pulls out specified data that is significant in some particular
context.
1. Data Cleaning
Data cleaning is a process to clean dirty data. Real-world data is rarely clean: it can be incorrect for a
large number of reasons, such as hardware errors or failures, network errors, or human error. So it is
necessary to clean the data before mining.
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data from the
datasets, and it also replaces the missing values. Here are some techniques for data cleaning:
Noisy data generally means data containing random errors or unnecessary data points. Handling noisy data is
one of the most important steps, as it leads to better optimization of the model we are using. Here are some of
the methods to handle noisy data.
I. Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then
the sorted values are separated and stored in the form of bins or buckets. There are three methods
for smoothing the data in a bin.
Smoothing by bin mean: in this method, the values in the bin are replaced by the mean
value of the bin;
Smoothing by bin median: in this method, the values in the bin are replaced by the median value of the bin;
Smoothing by bin boundaries: in this method, the minimum and maximum values of the
bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
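A minimal Python sketch of the three smoothing approaches (the sorted values below are only illustrative, and the bin size of 3 is an arbitrary choice):

import statistics

# Illustrative sorted data split into equal-frequency bins of size 3
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

for b in bins:
    by_mean = [round(statistics.mean(b), 2)] * len(b)         # smoothing by bin mean
    by_median = [statistics.median(b)] * len(b)                # smoothing by bin median
    lo, hi = b[0], b[-1]
    by_boundary = [lo if v - lo <= hi - v else hi for v in b]  # smoothing by bin boundaries
    print(b, by_mean, by_median, by_boundary)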
II. Regression − Data can be smoothed by fitting the data to a function, such as with regression;
this helps to handle data when unnecessary variation is present. For analysis purposes, regression
also helps to decide which variables are suitable for our analysis.
Regression is the measure of the average relationship between two or more variables in terms of the original
units of the data. It comes in two types:
Linear regression involves finding the "best" line to fit two attributes (or variables) so that
one attribute can be used to forecast the other.
Multiple linear regression is an extension of linear regression, where more than two attributes
are involved and the data are fit to a multidimensional surface.
III. Clustering − Clustering helps in identifying outliers. Similar values are
organized into clusters, and those values which fall outside the clusters are known as
outliers.
Data Integration
Data integration is the phase of combining data from several disparate sources.
Data integration is a data pre-processing technique that involves merging data from numerous
heterogeneous data sources into coherent (understandable/meaningful) data, to retain and support a
consolidated perspective of the information.
It combines data from various sources into a coherent data store, as in data warehousing. These
sources can include multiple databases, data cubes, or flat files.
A data cube is a multidimensional representation of data that allows for efficient querying and analysis of
data along multiple dimensions. Each dimension represents a different attribute or characteristic; for
example, in sales data, dimensions could include time, product, and location. It is particularly useful for
analyzing large datasets with complex relationships, and data cubes are often used in Online Analytical
Processing (OLAP) and data warehousing scenarios.
A flat file is a simple, plain-text file format that stores tabular data. Each row in the file represents a record,
and the columns represent attributes or fields, similar to a spreadsheet: each column has a name, and rows
hold data values corresponding to those columns.
Approaches for efficient data integration:
1. Entity identification problem
2. Redundancy and correlation analysis
3. Tuple duplication
4. Data value conflict detection and resolution
1. Entity identification problem
Schema integration and object matching are very important issues in data integration.
Schema integration (mismatch in attribute names): it involves merging and reconciling the differences
between various data sources that have different structures, formats, and semantics. This process ensures that
data from different sources can be effectively and accurately used for analysis and decision-making.
Example: E-commerce Sales Analysis
Imagine a large e-commerce company that operates across multiple regions and platforms. They collect data
from various sources, such as online stores, mobile apps, and physical retail locations. Each source has its
own database schema and format for recording sales data. The company wants to perform data mining to
gain insights into their sales patterns, customer behavior, and product trends across all channels.
Data Sources:
Online Store Database: Contains information about online purchases, including customer IDs, product
IDs, purchase dates, prices, and payment methods.
Mobile App Database: Stores data about purchases made through the mobile app, with similar attributes
but possibly different naming conventions or additional fields.
Retail Store Database: Records sales data from physical retail locations, including store IDs, product IDs,
transaction timestamps, and payment details.
Schema Differences: Each data source has a different database schema and might use different attribute
names or structures for similar pieces of information.
The same concept (e.g., "customer ID" or "product ID") might be represented differently across sources.
Object matching (mismatch in the structure of the data):
Object matching in data integration within the context of data mining refers to the process of identifying
and linking records from different data sources that correspond to the same real-world entities. These
records can exist in various formats, databases, or datasets and might contain variations due to data entry
errors, inconsistencies, or different representation standards.
The primary goal of object matching is to recognize duplicate or matching records and consolidate them
into a single, unified representation. This process is essential for achieving accurate and meaningful results
in data mining and analysis, as well as for maintaining data quality and integrity in databases and data
warehouses.
Examples of data value conflicts that must also be detected and resolved during integration include discount
handling and currency type: the same price may be recorded with or without a discount, or in different
currencies, in different sources.
2. Redundancy and Correlation Analysis
Redundancy − an attribute may be redundant if it can be "derived" from another attribute or set of
attributes.
Redundancy refers to the presence of duplicated or highly similar information within a dataset. In data
mining, redundancy analysis aims to identify and eliminate redundant attributes or records in a table.
Redundant attributes can lead to increased storage requirements, computational overhead, and potentially
misleading results during analysis.
For example, in a sales dataset, if you have two attributes that convey essentially the same information, like
"Total Sales" and "Net Sales," you might want to perform redundancy analysis to determine which attribute
to keep and which to remove.
E.g., DOB and Age (age can be derived from the date of birth).
Correlation analysis − given two attributes, such analysis can measure how strongly one attribute
implies the other, based on the available data.
Correlation analysis focuses on understanding the statistical relationships between different attributes
(columns) within a dataset. Correlation measures how changes in one attribute relate to changes in another.
Correlation coefficients range from -1 to 1, where -1 indicates a strong negative correlation, 1 indicates a
strong positive correlation, and 0 indicates no correlation.
Correlation analysis helps in identifying patterns and dependencies between attributes. For instance, in a
sales dataset, you might be interested in understanding the correlation between "Total Sales" and
"Advertising Spend" to determine whether increased advertising leads to higher sales.
Relation between Redundancy and Correlation Analysis:
While redundancy analysis aims to identify and remove redundant attributes or records, correlation analysis
helps in understanding the relationships between attributes, whether they are redundant or not. It's important
to note that attributes can be correlated without being redundant, and attributes can be redundant without
being highly correlated.
Two datasets: Housing Price Data and Square Footage Data. You want to calculate the Pearson correlation
coefficient between the selling prices of houses and their square footage.
Let's assume you have two variables:
X: Selling prices of houses in different neighborhoods.
Y: Size of houses in terms of square footage.
To calculate the Pearson correlation coefficient, you would follow these steps:
Selling Prices (X): [200000, 250000, 180000, 300000, 220000]
Square Footage (Y): [1500, 1800, 1200, 2100, 1600]
Step 1: Calculate the means of X and Y:
Mean(X) = (200000 + 250000 + 180000 + 300000 + 220000) / 5 = 230000
Mean(Y) = (1500 + 1800 + 1200 + 2100 + 1600) / 5 = 1640
Step 2: Calculate the differences from the mean for each data point:
Differences from Mean(X): [-30000, 20000, -50000, 70000, -10000]
Differences from Mean(Y): [-140, 160, -440, 460, -40]
Step 3: Multiply the corresponding differences for each data point:
Product of Differences: [4200000, 3200000, 22000000, 32200000, 400000]
Step 4: Sum up the products from step 3:
Sum of Products = 4200000 + 3200000 + 22000000 + 32200000 + 400000 = 62000000
Step 5: Calculate the sum of squared differences for X and Y:
Sum of Squared Differences(X) = (-30000)^2 + 20000^2 + (-50000)^2 + 70000^2 + (-10000)^2 =
8800000000
Sum of Squared Differences(Y) = (-140)^2 + 160^2 + (-440)^2 + 460^2 + (-40)^2 = 452000
Step 6: Calculate the Pearson correlation coefficient (r):
r = Sum of Products / (sqrt(Sum of Squared Differences(X)) * sqrt(Sum of Squared Differences(Y)))
r = 62000000 / (sqrt(8800000000) * sqrt(452000)) ≈ 62000000 / (93808.3 * 672.3) ≈ 0.98
This indicates a strong positive correlation between selling price and square footage.
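The same coefficient can be checked with NumPy, which confirms the hand calculation:

import numpy as np

x = np.array([200000, 250000, 180000, 300000, 220000])   # selling prices
y = np.array([1500, 1800, 1200, 2100, 1600])              # square footage

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # approximately 0.98, a strong positive correlation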
3. Tuple Duplication
In addition to detecting redundancy between attributes, duplication should also be detected at the tuple
(record) level. For example, if there are duplicate records for CustomerID 101, the same customer's
information appears multiple times in the dataset.
The use of denormalized tables (often done to improve performance by avoiding joins) is another source of
data redundancy. Inconsistencies often arise between the various duplicates, due to inaccurate data entry or
updating some but not all occurrences of the data.
DATA REDUCTION
Data reduction techniques are applied to obtain a reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity of the original data.
Data reduction is a process that reduces the volume of the original data and represents it in a much smaller
form. It aims to increase storage efficiency and reduce data storage and analysis costs.
Data reduction techniques ensure the integrity of data while reducing the data. The time required for data
reduction should not overshadow the time saved by the data mining on the reduced data set.
When you collect data from different data warehouses for analysis, it results in a huge amount of data. It is
difficult for a data analyst to deal with this large volume of data.
It is even difficult to run the complex queries on the huge amount of data as it takes a long time and
sometimes it even becomes impossible to track the desired data.
This is why reducing data becomes important. Data reduction technique reduces the volume of data yet
maintains the integrity of the data.
Data reduction does not affect the result obtained from data mining; that means the result obtained from data
mining before and after data reduction is the same (or almost the same).
The only difference is in the efficiency of data mining: data reduction increases the efficiency of data
mining. In the following section, we will discuss the techniques of data reduction.
Techniques of data reduction include dimensionality reduction, numerosity reduction and data
compression.
Dimensionality Reduction
Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing
the volume of the original data. It removes redundant attributes that are only weakly relevant to the data.
METHODS:
Stepwise forward selection − The process starts with a null set of attributes as the reduced set. The
best of the original attributes is determined and added to the reduced set. At every subsequent iteration
or step, the best of the remaining original attributes is inserted into the set.
Forward stepwise selection (or forward selection) is a variable selection method which:
1. Begins with a model that contains no variables (called the Null Model)
2. Then starts adding the most significant variables one after the other
3. Until a pre-specified stopping rule is reached or until all the variables under consideration are
included in the model
Here’s an example of forward selection with 5 variables:
Backward stepwise
Stepwise backward elimination − The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
Backward stepwise selection (or backward elimination) is a variable selection method which:
1. Begins with a model that contains all variables under consideration (called the Full Model)
2. Then starts removing the least significant variables one after the other
3. Until a pre-specified stopping rule is reached or until no variable is left in the model.
Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node denotes
a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into
individual classes.
All the attributes that do not appear in the tree are assumed to be irrelevant, while the attributes that do
appear in the tree form the reduced data set.
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes
a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
The following decision tree is for the concept buy computer that indicates whether a customer at a company
is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node
represents a class.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Numerosity reduction
Another methodology for data reduction in data mining is numerosity reduction, in which the
volume of the data is reduced by representing it in a more compact form. There are two types of this
technique: parametric and non-parametric.
1. Parametric Reduction
The parametric numerosity reduction technique holds an assumption that the data fits into the model. Hence,
it estimates the model parameters, and stores only these estimated parameters, and not the original or the
actual data. The other data is discarded, leaving out the potential outliers.
For parametric methods, a model is used to estimate the data, so that only the data parameters need to be
stored, instead of the actual data, for example, Log-linear models, regression.
This enables us to store a model of the data instead of the whole data. For example, regression models
include:
Linear regression
Multiple regression
Log-linear models
Regression:
Regression refers to a data mining technique that is used to predict the numeric values in a given data set.
For example, regression might be used to predict the product or service cost or other variables. It is also used
in various industries for business and marketing behavior, trend analysis, and financial forecast.
Regression involves the technique of fitting a straight line or a curve to numerous data points. It happens in
such a way that the distance between the data points and the curve comes out to be the lowest.
Regression refers to a type of supervised machine learning technique that is used to predict any continuous-
valued attribute. Linear Regression analysis is used for studying or summarizing the relationship between
variables that are linearly related. The regression is also of two kinds:
FOR REFERENCE:
Simple Linear regression
You are collecting data on the relationship between the number of hours studied and the exam scores
achieved by a group of students (a simplified dataset is given below).
You want to use simple linear regression to understand the relationship between the number of hours
studied and the resulting exam score. The model is:
Exam Score = β0 + β1 * (Hours Studied) + ε
Where:
Exam Score is the dependent variable (what we're trying to predict).
Hours Studied is the independent variable (the input feature).
β0 is the intercept term.
β1 is the coefficient for the Hours Studied variable, representing the change in exam
score for a one-unit change in hours studied.
ε is the error term, representing the variability in exam scores that the model doesn't
explain.
By fitting this model to the data, you can estimate the coefficients β0 and β1. These
coefficients can give you insights into how much an additional hour of studying is
associated with changes in the exam score.
In this case, the goal of data mining with simple linear regression is to find the best-
fitting line that minimizes the difference between the predicted exam scores (based on
the model) and the actual exam scores in the dataset. This line can then be used for
prediction and to understand the relationship between the variables.
Keep in mind that simple linear regression assumes a linear relationship between the
variables and that the error term ε is normally distributed with a mean of 0.
HOURS STUDIED: [2, 3, 4, 5, 6]
EXAM SCORE: [65, 75, 85, 90, 95]
Calculate the mean of Hours Studied (x̄) and Exam Score (ȳ):
x̄ = (2 + 3 + 4 + 5 + 6) / 5 = 4
ȳ = (65 + 75 + 85 + 90 + 95) / 5 = 82
Calculate the slope and intercept:
β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 75 / 10 = 7.5
β0 = ȳ - β1 * x̄ = 82 - 7.5 * 4 = 52
So, the linear regression equation for predicting Exam Score based on the number of
hours studied is:
Exam Score = 52 + 7.5 * (Hours Studied)
Now, you can use this equation to predict the Exam Score for different values of Hours
Studied. For example, if a student studies for 7 hours, the predicted score is
52 + 7.5 * 7 = 104.5 (in practice such a prediction would be capped at the maximum attainable score).
Multiple Linear Regression
Now suppose that, in addition to hours studied, you also record the number of hours each
student slept before the exam. The model becomes:
Exam Score = β0 + β1 * (Hours Studied) + β2 * (Hours Slept) + ε
Where:
Exam Score is still the dependent variable (what we're trying to predict).
Hours Studied and Hours Slept are the independent variables (input features).
β0 is the intercept term.
β1 is the coefficient for the Hours Studied variable, representing the change in exam
score for a one-unit change in hours studied while holding Hours Slept constant.
β2 is the coefficient for the Hours Slept variable, representing the change in exam score
for a one-unit change in hours slept while holding Hours Studied constant.
ε is the error term.
By fitting this model to the data, you can estimate the coefficients β0, β1, and β2. These
coefficients can help you understand how both hours studied and hours slept collectively
contribute to the exam score.
In data mining, the goal of multiple linear regression is to find the best-fitting plane in
the multi-dimensional space that minimizes the difference between the predicted exam
scores (based on the model) and the actual exam scores in the dataset. This plane can
then be used for prediction and to understand the relationships between the variables.
We need to calculate the values of β0 (the intercept), β1 (the coefficient for Hours
Studied), and β2 (the coefficient for Hours Slept) using the provided data:
Calculate the means of Hours Studied (x̄1), Hours Slept (x̄2), and Exam Score (ȳ):
x̄1 = (2 + 3 + 4 + 5 + 6) / 5 = 4
x̄2 = (6 + 7 + 6 + 8 + 7) / 5 = 6.8
ȳ = (65 + 75 + 85 + 90 + 95) / 5 = 82
Calculate the sums of squared deviations and cross-products:
Σ(x1 - x̄1)² = 10, Σ(x2 - x̄2)² = 2.8, Σ(x1 - x̄1)(x2 - x̄2) = 3.0
Σ(x1 - x̄1)(y - ȳ) = 75, Σ(x2 - x̄2)(y - ȳ) = 22
Solving the normal equations 10β1 + 3β2 = 75 and 3β1 + 2.8β2 = 22 gives β1 ≈ 7.58 and
β2 ≈ -0.26, and β0 = ȳ - β1x̄1 - β2x̄2 ≈ 53.47.
So, the multiple linear regression equation for predicting Exam Score based on the
number of hours studied and the number of hours slept is approximately:
Exam Score = 53.47 + 7.58 * (Hours Studied) - 0.26 * (Hours Slept)
You can now use this equation to predict the Exam Score for different combinations of
Hours Studied and Hours Slept.
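As a sketch, the same coefficients can be estimated with NumPy's least-squares solver (the data are the five students above; printed values are rounded):

import numpy as np

hours_studied = np.array([2, 3, 4, 5, 6])
hours_slept = np.array([6, 7, 6, 8, 7])
exam_score = np.array([65, 75, 85, 90, 95])

# Design matrix with an intercept column: [1, x1, x2]
X = np.column_stack([np.ones_like(hours_studied), hours_studied, hours_slept])
beta, *_ = np.linalg.lstsq(X, exam_score, rcond=None)
b0, b1, b2 = beta
print(round(b0, 2), round(b1, 2), round(b2, 2))   # roughly 53.47, 7.58, -0.26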
Log-Linear:
Log-linear model discovers the relation between two or more discrete attributes in the database. Suppose,
we have a set of tuples presented in n-dimensional space. Then the log-linear model is used to study the
probability of each tuple in a multidimensional space.
a simple example of how a log-linear model could be applied to analyze the relationship between the
variables:
Suppose we're looking at a dataset that records whether customers made a purchase ("Purchase: Yes" or
"Purchase: No") based on their age group ("Age: Young", "Age: Middle-aged", "Age: Senior") and the
type of product ("Product: A", "Product: B", "Product: C").
We want to investigate if there's any relationship between age groups, product choice,
and purchase likelihood. We can build a log-linear model to analyze this relationship.
Here's a simplified example:
log(E(Yij)) = β0 + β1 * Age_i + β2 * Product_j + β3 * Age_i * Product_j
Where:
E(Yij) is the expected count (number of purchases) for age group i and product type j;
β0 is the overall intercept;
β1 and β2 are the main effects of age group and product type;
β3 is the interaction effect between age group and product type. A significant interaction term
indicates that purchase likelihood depends on the particular combination of age group and product.
2. Non-parametric Reduction
On the other hand, the non-parametric methods do not assume that the data fits a model.
Unlike the parametric methods, these methods may not give as large a reduction in data volume,
but they generate a homogeneous and systematic reduced representation regardless of the size of the
data. Non-parametric methods used for storing a reduced representation of the data include histograms,
clustering, sampling, and data cube aggregation.
i) Histogram
A histogram is a graph that represents a frequency distribution, which describes how often a value appears
in the data. A histogram uses the binning method to represent the data distribution of an attribute: it uses
disjoint subsets, which we call bins or buckets. For example, we have data for the AllElectronics data set,
which contains prices of regularly sold items:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
The diagram below shows a histogram of equal width that shows the frequency of price distribution.
A histogram is capable of representing dense, sparse, uniform or skewed data. Instead of only one attribute,
a histogram can be constructed for multiple attributes; in practice it is effective for up to around five attributes.
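The same frequencies can be computed directly with NumPy (the equal-width bin size of 10 is an assumption about how the buckets are drawn):

import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Three equal-width bins covering 0-30 (width 10 each)
counts, edges = np.histogram(prices, bins=3, range=(0, 30))
print(edges)    # bin boundaries: 0, 10, 20, 30
print(counts)   # number of prices falling into each bin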
ii) Clustering
Clustering techniques groups the similar objects from the data in such a way that the objects in a cluster
are similar to each other but they are dissimilar to objects in another cluster.
How similar the objects inside a cluster are can be measured using a distance function: the more similar
the objects in a cluster, the closer together they appear in the cluster.
The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance between any two
objects in the cluster.
The original data is replaced by the cluster representation. This technique is more effective if the
data can be grouped into distinct clusters.
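A hedged scikit-learn sketch of cluster-based reduction (the one-dimensional values are made up): each point is replaced by the centroid of its cluster, and a point that lands far from every other value tends to end up in its own small cluster, flagging it as a potential outlier.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative 1-D data with a few natural groups and one distant value
values = np.array([[2], [3], [4], [20], [21], [22], [40], [41], [100]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
centroids = kmeans.cluster_centers_
reduced = centroids[kmeans.labels_]   # each original value replaced by its cluster representative

print(centroids.ravel())
print(reduced.ravel())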
iii)Sampling
One of the methods used for data reduction is sampling, as it is capable of reducing a large data set into a
much smaller data sample. Below we discuss the different methods by which we can sample a large
data set D containing N tuples:
Simple random sample without replacement (SRSWOR) of size s: In this method, s tuples are drawn
from the N tuples in data set D (s < N). The probability of drawing any tuple from data set D is
1/N, which means all tuples have an equal probability of being sampled.
In an SRSWOR, each item in the population has an equal chance of being selected, and once an item is
selected, it is not put back into the population. Let's assume you have a population of 100 individuals, and
you want to draw a sample of size 5 using SRSWOR.
Population (100 individuals): [1, 2, 3, ..., 99, 100]
This sample represents a simple random sample without replacement of size 5 from the population of 100
individuals. Each individual in the sample was selected with an equal probability, and once selected, they
were not put back into the population for subsequent selections.
Simple random sample with replacement (SRSWR) of size s: It is similar to the SRSWOR but the tuple
is drawn from data set D, is recorded and then replaced back into the data set D so that it can be drawn again.
An example of a simple random sample with replacement (SRSWR) of size 5 from a population. In an
SRSWR, each item in the population has an equal chance of being selected, and after each selection, the
item is put back into the population, allowing it to be selected again.
Let's assume you have a population of 100 balls, each numbered from 1 to 100, and you want to draw an
SRSWR of size 5:
Select a random number between 1 and 100. Let's say the first random number is 42. Add ball #42 to the
sample.
Sample: [42]
Select another random number between 1 and 100. Let's say the second random number is 17. Add ball #17
to the sample.
Select a third random number between 1 and 100. Let's say the third random number is 88. Add ball #88 to
the sample.
Select a fourth random number between 1 and 100. Let's say the fourth random number is 5. Add ball #5 to
the sample.
Select a fifth random number between 1 and 100. Let's say the fifth random number is 67. Add ball #67 to
the sample.
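Both schemes map directly onto Python's standard library (the population is the 1-100 example above; the seed is only for repeatability):

import random

population = list(range(1, 101))   # individuals / balls numbered 1..100
random.seed(42)

srswor = random.sample(population, k=5)    # without replacement: no value can repeat
srswr = random.choices(population, k=5)    # with replacement: the same value may repeat

print(srswor)
print(srswr)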
Cluster sample: The tuples in data set D are clustered into M mutually disjoint subsets. From these
clusters, a simple random sample of size s could be generated where s<M. The data reduction can be
applied by implementing SRSWOR on these clusters.
Stratified sample: The large data set D is partitioned into mutually disjoint sets called ‘strata’. Now a
simple random sample is taken from each stratum to get stratified data. This method is effective for skewed
data.
Data Cube Aggregation
Data Cube Aggregation represents the data in cube format by performing aggregation on data. In this data
reduction technique, each cell of the cube is a placeholder holding the aggregated data point. This cube stores
the data in an aggregated form in a multidimensional space. The resultant value is lower in terms of volume
i.e. takes up less space without losing any information.
EXAMPLE 1:
Consider you have the data of All Electronics sales per quarter for the year 2008 to the year 2010. In case
you want to get the annual sale per year then you just have to aggregate the sales per quarter for each year.
In this way, aggregation provides you with the required data which is much smaller in size and thereby we
achieve data reduction even without losing any data.
Data cube aggregation is a multidimensional aggregation which eases multidimensional analysis. For
example, a data cube can represent the annual sales of each item for each branch. The data cube presents
precomputed and summarized data, which gives data mining fast access to the summarized information.
EXAMPLE 2:
Below we have the quarterly sales data of a company across four item types: entertainment, keyboard, mobile,
and locks, for one of its locations aka Delhi. The data is in 2-dimensional (2D) format and gives us information
about sales on a quarterly basis.
Viewing 2-dimensional data in the form of a table is indeed helpful. Let’s say we increase one
more dimension and add more locations in our data – Gurgaon, Mumbai, along with the
already available location, Delhi, as below:
However, to make an analysis of any sort, we typically work with annual sales. So, we can
depict the yearly sales data by summing the sales amounts over the other dimensions (the
quarters), as below:
Data cube aggregation is useful for storing and showing summarized data in a cube form.
Here, data reduction is achieved by aggregating data across different levels of the cube. It is
also seen as a multidimensional aggregation that enhances aggregation operations.
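A minimal pandas sketch of the quarterly-to-annual roll-up described above (the sales figures and item names are made up):

import pandas as pd

# Hypothetical quarterly sales for one location
sales = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "item":    ["mobile"] * 8,
    "amount":  [400, 350, 500, 600, 450, 420, 550, 700],
})

# Aggregate away the quarter dimension to obtain annual sales per item
annual = sales.groupby(["year", "item"], as_index=False)["amount"].sum()
print(annual)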
Data Compression
Data compression in data mining, as the name suggests, simply compresses the data. This
technique encapsulates the data or information into a condensed form by eliminating
duplicate and unneeded information. It changes the structure of the data so that it takes up
less space, and the compressed data can be represented in a binary form.
1. String data: In string compression, the data is modified in a limited manner without
complete expansion; hence string compression is mostly lossless, as the data can be retrieved
back in its original form. There are extensive theories and well-tuned algorithms used for
string compression.
2. Audio or video data: Unlike string data, audio or video data cannot be recreated in its
original shape, hence this is lossy data compression. At times, it may be possible to
reconstruct small bits or pieces of the signal data, but you cannot restore it to its whole
form.
3. Time-sequential data: Time-sequential data is not audio data. It consists, by and large, of
short data fragments that vary slowly with time, and such data can also be compressed.
1. Wavelet Transformation
The wavelet transform converts the pixels of an image (or, more generally, a data vector) into wavelet
coefficients, which are then used for wavelet-based compression and coding.
Let's say we have a data vector Y. By applying the wavelet transform to this vector Y, we
receive a different numerical data vector Y', where the lengths of the vectors Y and Y' are
the same. You may be wondering how transforming Y into Y' helps us to reduce the data:
the Y' data can be trimmed or truncated (small coefficients can be dropped), whereas the
actual vector Y cannot be compressed in this way.
The reason it is called ‘wavelet transform’ is that the information here is present in
the form of waves, like how a frequency is depicted graphically as signals. The wavelet
transform also has efficiency for data cubes, sparse or skewed data. It is mostly applicable for
image compression and for signal processing.
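A small illustration of the idea using a one-level Haar transform written in plain NumPy (a simplified sketch, not a full wavelet library): pairwise averages keep the overall shape of Y, pairwise differences capture the detail, and near-zero detail coefficients can be truncated to compress the vector.

import numpy as np

Y = np.array([2.0, 2.0, 4.0, 4.0, 9.0, 9.0, 7.0, 7.0])

# One-level Haar transform: approximation (averages) and detail (differences)
approx = (Y[0::2] + Y[1::2]) / 2
detail = (Y[0::2] - Y[1::2]) / 2
Y_prime = np.concatenate([approx, detail])    # same length as Y

# Truncate near-zero detail coefficients; this is where the compression comes from
Y_prime[np.abs(Y_prime) < 0.5] = 0.0

# Reconstruct Y from the (possibly truncated) coefficients
recon = np.empty_like(Y)
recon[0::2] = Y_prime[:4] + Y_prime[4:]
recon[1::2] = Y_prime[:4] - Y_prime[4:]
print(Y_prime)
print(recon)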
2. Principal Component Analysis (PCA)
It is used to reduce the data size using 'k' orthogonal unit vectors, each pointing in a direction
perpendicular to the others. Those vectors are referred to as the principal components.
Principal component analysis, a technique for data reduction in data mining, groups the
important variables into a component taking the maximum information present within the data
and discards the other, not important variables.
Now, let's say that out of the total n variables, k components are identified and retained. These
components are now representative of the data and are used for further analysis.
In short, PCA is applied to reducing multi-dimensional data into lower- dimensional data. This
is done by eliminating variables containing the same information as provided by other
variables and combining the relevant variables into components. The principal component
analysis is also useful for sparse, and skewed data.
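A hedged scikit-learn sketch (the two correlated features are synthetic): PCA finds the orthogonal components, and keeping only the first one reduces two attributes to a single one while retaining most of the variance.

import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated synthetic features
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=100)
weight = 0.9 * height + rng.normal(0, 5, size=100)
X = np.column_stack([height, weight])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)        # shape (100, 1) instead of (100, 2)
print(pca.explained_variance_ratio_)    # fraction of the variance kept by the first component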
DATA TRANSFORMATION
Data transformation is the process of converting data from one format, such as a database file, XML
document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into a cleansed, validated and ready-to-use
format. Data transformation is crucial to data management processes that include data integration, data
migration, data warehousing and data preparation.
The process of data transformation can also be referred to as extract/transform/load (ETL). The extraction
phase involves identifying and pulling data from the various source systems that create data and then moving
the data to a single repository. Next, the raw data is cleansed, if needed. It's then transformed into a target
format that can be fed into operational systems or into a data warehouse, a data lake or another repository
for use in business intelligence and analytics applications. The transformation may involve converting data
types, removing duplicate data and enriching the source data.
Raw data is difficult to trace or understand. That's why it needs to be preprocessed before retrieving any
information from it. Data transformation is a technique used to convert the raw data into a suitable format
that efficiently eases data mining and retrieves strategic information. Data transformation includes data
cleaning techniques and a data reduction technique to convert the data into the appropriate form.
There are several data transformation techniques that can help structure and clean up the data before analysis
or storage in a data warehouse. The data are transformed or consolidated so that the resulting mining process
may be more efficient, and the patterns found may be easier to understand.
Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms. It allows
for highlighting important features present in the dataset. It helps in predicting the patterns. When collecting
data, it can be manipulated to eliminate or reduce any variance or any other noise form.
The concept behind data smoothing is that it will be able to identify simple changes to help predict different
trends and patterns. This serves as a help to analysts or traders who need to look at a lot of data which can
often be difficult to digest for finding patterns that they wouldn't see otherwise.
We have seen how the noise is removed from the data using the techniques such as binning, regression,
clustering.
Binning: This method splits the sorted data into the number of bins and smoothens the data values in each
bin considering the neighborhood values around it.
Regression: This method identifies the relation among two dependent attributes so that if we have one
attribute, it can be used to predict the other attribute.
Clustering: This method groups similar data values and form a cluster. The values that lie outside a cluster
are known as outliers.
Attribute Construction
In attribute construction, new attributes are constructed from the given set of attributes and added to assist
the mining process. The new attributes are derived from the existing attributes to construct a new data
set that eases data mining. This simplifies the original data and makes the mining more efficient.
For example, suppose we have a data set referring to measurements of different plots, i.e., we may have the
height and width of each plot. So here, we can construct a new attribute 'area' from the attributes 'height'
and 'width'. This also helps in understanding the relations among the attributes in a data set.
Aggregation:
Summary and aggregation functions can be applied to the data to construct a data cube for data
analysis (data reduction).
Data collection or aggregation is the method of storing and presenting data in a summary format. The data
may be obtained from multiple data sources to integrate these data sources into a data analysis description.
This is a crucial step since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
Gathering accurate data of high quality and a large enough quantity is necessary to produce relevant results.
The collection of data is useful for everything from decisions concerning financing or business strategy of
the product, pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each year. We can
aggregate the data to get the enterprise's annual sales report.
Discretization
Numeric values are replaced by interval labels or conceptual labels. The labels, in turn, can be recursively
organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
This is a process of converting continuous data into a set of data intervals. Continuous attribute values are
substituted by small interval labels. This makes the data easier to study and analyze. If a data mining task
handles a continuous attribute, then its discrete values can be replaced by constant quality attributes. This
improves the efficiency of the task.
This method is also called a data reduction mechanism as it transforms a large dataset into a set of categorical
data. Discretization also uses decision tree-based algorithms to produce short, compact, and accurate results
when using discrete values.
Data discretization can be classified into two types: supervised discretization, where the class information is
used, and unsupervised discretization, where it is not. Discretization methods can also be categorized by the
direction in which the process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10, 11-20…)
or (kid, youth, adult, senior).
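For instance, the age example maps directly onto pandas.cut (the cut points below are one reasonable choice, not the only one):

import pandas as pd

ages = pd.Series([4, 15, 27, 45, 70])

# Replace numeric ages by interval labels
intervals = pd.cut(ages, bins=[0, 10, 20, 60, 100])

# Replace numeric ages by conceptual labels
concepts = pd.cut(ages, bins=[0, 12, 19, 59, 100],
                  labels=["kid", "youth", "adult", "senior"])

print(intervals.tolist())
print(concepts.tolist())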
Example: Concept Hierarchy for Product Categories
Suppose the products in a retail sales dataset are labeled with categories that can be organized into
different levels of a concept hierarchy.
Middle Level:
Electronics
  Phones
  Laptops
Clothing
  Men's Clothing
  Women's Clothing
Books
  Fiction
  Non-Fiction
Bottom Level:
Phones
Laptops
Men's Clothing
Women's Clothing
Fiction
Non-Fiction
Here, we're generalizing categories into higher-level groups and specializing into subcategories.
Step 5: Granularity
We've defined three levels of granularity in our hierarchy.
You can represent this hierarchy as a tree structure or a directed acyclic graph. Here's a simplified textual
representation:
- Electronics
- Phones
- Laptops
- Clothing
- Men's Clothing
- Women's Clothing
- Books
- Fiction
- Non-Fiction
Replace the original categorical values with the corresponding hierarchy levels in your dataset. For instance,
if a product was originally labeled as "Laptops," it would now be represented as "Electronics > Laptops."
Step 8: Benefits
With this concept hierarchy in place, you can now perform various data mining tasks more effectively. For
example, you can analyze sales patterns at different levels of the hierarchy (e.g., total sales for "Electronics,"
sales of "Phones" vs. "Laptops," etc.). The hierarchy also helps to maintain data consistency and enables
efficient querying.
Normalization:
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0, 1.0].
There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n observed values for attribute A:
V1, V2, V3, ..., Vn.
The measurement unit used can affect the data analysis. To help avoid dependence on the choice of
measurement units, the data should be normalized or standardized.
Normalizing the data attempts to give all attributes an equal weight.
Methods for data normalization:
1. Min-max normalization
2. Z-score normalization
3. Normalization by decimal scaling
Z-Score Normalization
Z-score normalization (zero-mean normalization) normalizes the values of an attribute A based on the
mean and standard deviation of A:
v' = (v - mean(A)) / std_dev(A)
If we normalize the data into this simpler form, it becomes very easy for us to understand.
Min-Max Normalization
Min-max normalization is a technique that helps to normalize the data. It scales the data between 0 and 1
(or, more generally, to a new minimum and maximum). This normalization helps us to understand the data easily.
For example, if I ask you the difference between 200 and 1000, it is a little more confusing than when I ask
you the difference between 0.2 and 1.
Min-max normalization formula:
v' = ((v - min(A)) / (max(A) - min(A))) * (newMax - newMin) + newMin
Example: marks = [8, 10, 15, 20]
Min: the minimum value of the given attribute. Here Min = 8.
Max: the maximum value of the given attribute. Here Max = 20.
V: the respective value of the attribute. Here V1 = 8, V2 = 10, V3 = 15, V4 = 20.
newMax = 1, newMin = 0.
marks    marks after min-max normalization
8        0
10       0.17
15       0.58
20       1
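The marks example, together with its z-score counterpart, as a short Python check of the formulas above:

marks = [8, 10, 15, 20]
mn, mx = min(marks), max(marks)
new_min, new_max = 0, 1

# Min-max normalization
minmax = [((v - mn) / (mx - mn)) * (new_max - new_min) + new_min for v in marks]
print([round(v, 2) for v in minmax])    # [0.0, 0.17, 0.58, 1.0]

# Z-score normalization (using the population standard deviation)
mean = sum(marks) / len(marks)
std = (sum((v - mean) ** 2 for v in marks) / len(marks)) ** 0.5
zscores = [(v - mean) / std for v in marks]
print([round(v, 2) for v in zscores])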
Another example is web analytics, where we gather statistics about website visitors. For example,
all visitors who visit the site from an IP address located in India are shown under the country level "India".
Discretization is done to replace the raw values of a numeric attribute by interval labels or conceptual labels:
Divide the range of a continuous attribute into intervals.
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization.
Reduce the number of values for a given continuous attribute by dividing the range of the
attribute into intervals.
Interval labels can then be used to replace actual data values.
Concept Hierarchy Generation : Attributes are converted from lower level to higher level in
hierarchy.
For Example- The attribute “city” can be converted to “country”.
The term hierarchy represents an organizational structure or mapping in which items are
ranked according to their levels of importance. In other words, we can say that a concept
hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to
higher-level, more general concepts. For example, in computer science there are different
types of hierarchical systems: a document placed in a folder in Windows, at a specific place
in the tree structure, is a good example of a hierarchical tree model. There are two types of
hierarchy mapping: top-down mapping and bottom-up mapping.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped
to India, and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts at the top with some general information and ends
at the bottom with the specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with some specialized information and
ends at the top with the generalized information.
DATA BINARIZATION
Data discretization is a method of converting the attribute values of continuous data into a finite
set of intervals with minimum data loss. In contrast, data binarization is used to transform
continuous and discrete attributes into binary attributes.
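A brief pandas sketch of binarization (the attribute values and the income threshold are illustrative): a continuous attribute is thresholded into 0/1, and a discrete attribute is one-hot encoded into binary columns.

import pandas as pd

df = pd.DataFrame({
    "income": [25000, 54000, 81000, 32000],
    "color":  ["red", "blue", "green", "red"],
})

# Binarize a continuous attribute using a threshold
df["high_income"] = (df["income"] > 50000).astype(int)

# Binarize a discrete attribute via one-hot encoding
one_hot = pd.get_dummies(df["color"], prefix="color").astype(int)
print(pd.concat([df, one_hot], axis=1))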
FOR REFERENCE:
The following outlines each preprocessing step (data cleaning, data integration, data reduction, and data
transformation) for a small example that combines customer information with purchase history, with a
brief explanation of what is done at each step.
2. Data Integration:
In this step, we merge the customer information and purchase history datasets using the common
"Customer ID" field.
3. Data Reduction:
In this step, we would perform dimensionality reduction, but for simplicity, we'll skip this step in our
example.
4. Data Transformation:
In this step, we'll perform data transformation including normalization, one-hot encoding, and
creating a new feature "Total Purchases."
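A compact, hypothetical pandas sketch of steps 2 and 4 (the column names and values are assumptions, since the example tables are not reproduced here):

import pandas as pd

customers = pd.DataFrame({
    "Customer ID": [101, 102, 103],
    "Age": [25, 34, 41],
    "Gender": ["F", "M", "F"],
})
purchases = pd.DataFrame({
    "Customer ID": [101, 101, 102, 103, 103, 103],
    "Amount": [20.0, 35.5, 12.0, 50.0, 22.5, 8.0],
})

# Step 2 - Data integration: merge on the common "Customer ID" field
merged = customers.merge(purchases, on="Customer ID", how="left")

# Step 4 - Data transformation: normalization, one-hot encoding, and a "Total Purchases" feature
merged["Age_norm"] = (merged["Age"] - merged["Age"].min()) / (merged["Age"].max() - merged["Age"].min())
merged = pd.concat([merged, pd.get_dummies(merged["Gender"], prefix="Gender").astype(int)], axis=1)
merged["Total Purchases"] = merged.groupby("Customer ID")["Amount"].transform("count")
print(merged)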