Data Mining Using R
UNIT:1
An idea on Data Warehouse, Data mining-KDD versus data mining, Stages of the Data Mining Process-Task
primitives., Data Mining Techniques - Data mining knowledge representation.
UNIT:2
Data mining query languages- Integration of Data Mining System with a Data Warehouse-Issues, Data pre-
processing - Data Cleaning, Data transformation - Feature selection Dimensionality Reduction.
UNIT:3
Concept Description: Characterization and comparison What is Concept Description, Data Generalization by
Attribute-Oriented Induction(AOI), AOI for Data Characterization, Efficient Implementation of AOI.
Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Frequent Itemset Mining Methods:
Apriori method, generating Association Rules, Improving the Efficiency of Apriori, Pattern-Growth Approach for
mining Frequent Itemsets.
UNIT:4
Classification Basic Concepts: Basic Concepts, Decision Tree Induction: Decision Tree Induction Algorithm,
Attribute Selection Measures, Tree Pruning. Bayes Classification Methods.
UNIT:5
Cluster Analysis: Cluster Analysis, Partitioning Methods, Hierarchical methods, Density based methods-DBSCAN.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a
particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may
have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a
product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often
only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should
never be altered.
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the technology is
mature and well known, and where the business problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business
modeling and technology development. It allows an organization to move forward at considerably less expense and
to evaluate the benefits of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach
while retaining the rapid implementation and opportunistic application of the bottom-up approach.
Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration,
sales, or the general ledger. If the business process is organizational and involves multiple complex object
collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the
fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer,
supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like
dollars sold and units sold.
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and
utilities are used to feed data into the bottom tier from operational databases or other external sources (such as
customer profile information provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load
and refresh functions to update the data warehouse. The data are extracted using application program interfaces
known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also
contains a metadata repository, which stores information about the data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a
multidimensional OLAP (MOLAP) model.
A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on multidimensional data
to standard relational operations.
A multidimensional OLAP (MOLAP) model is a special-purpose server that directly implements multidimensional
data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining
tools (e.g., trend analysis, prediction, and so on).
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire organization.
It provides corporate-wide data integration, usually from one or more operational systems or external information
providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer super servers, or parallel
architecture platforms. It requires extensive business modeling and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is
confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item,
and sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX- or Windows-based.
The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years.
However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent. Independent data
marts are sourced from data captured from one or more operational systems or external information providers, or
from data generated locally within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the
possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer.
Thus, data mining should have been more appropriately named knowledge mining, which emphasizes mining
knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform it into an
understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business information in a large
database — for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a
vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently
probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business opportunities by
providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding predictive
information in large databases. Questions that traditionally required extensive hands-on analysis can now be
answered directly from the data — quickly.
A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings
to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond
similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of
pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying
anomalous data that could represent data entry keying errors.
Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records, that might
be interesting or data errors that require further investigation.
Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar",
without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program
might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including Visualization and report
generation.
Knowledge Discovery in Databases (KDD) is a comprehensive, multi-step process for extracting useful knowledge
from large datasets, while Data Mining (DM) is a specific step within the KDD process that applies intelligent
methods to extract patterns from the data.
Aspect | Data Mining | KDD
Scope | In the KDD method, the fourth phase is called "data mining." | KDD is a broad method that includes data mining as one of its steps.
Example | Clustering groups of data elements based on how similar they are. | Data analysis to find patterns and links.
Data mining is often a synonym for Knowledge Discovery from Data (KDD). Some people see data mining as a
key part of KDD, where smart methods are used to find patterns in the data. The term "Knowledge Discovery in
Databases" (KDD) was first coined by Gregory Piatetsky-Shapiro in 1989. However, "data mining" became more
widely used in business and media. Today, both terms are often used interchangeably.
Knowledge discovery from data (KDD) is a multi-step process for extracting useful insights. The following are the
key steps involved:
Data Selection: Identify and select relevant data from various sources for analysis.
Data Preprocessing: Clean and transform the data to address errors and inconsistencies, making it suitable for
analysis.
Data Transformation: Convert the cleaned data into a form that is suitable for data mining algorithms.
Data Mining: Apply data mining techniques to identify patterns and relationships in the data, selecting appropriate
algorithms and models.
Pattern Evaluation: Evaluate the identified patterns to determine their usefulness in making predictions or
decisions.
Knowledge Refinement: Refine the knowledge obtained to improve accuracy and usefulness based on feedback.
Knowledge Dissemination: Share the results in an easily understandable format to aid decision-making.
We now discuss different types of data mining techniques, which are used to predict the desired output.
1. Association
Association analysis looks for patterns where certain items or conditions tend to appear together in a dataset. It's
commonly used in market basket analysis to see which products are often bought together. One method, called
associative classification, generates rules from the data and uses them to build a model for predictions.
2. Classification
Classification builds models to sort data into different categories. The model is trained on data with known labels and
is then used to predict labels for unknown data. Some examples of classification models are:
Decision Tree
Bayesian classification
Classification by Backpropagation
K-NN Classifier
Rule-Based Classification
Fuzzy Logic
3. Prediction
Prediction is similar to classification, but instead of predicting categories, it predicts continuous values (like
numbers). The goal is to build a model that can estimate the value of a specific attribute for new data.
4. Clustering
Clustering groups similar data points together without using predefined categories. It helps discover hidden patterns
in the data by organizing objects into clusters where items in each cluster are more similar to each other than to those
in other clusters.
5. Regression
Regression is used to predict continuous values, like prices or temperatures, based on past data. There are two main
types: linear regression, which looks for a straight-line relationship, and multiple linear regression, which uses more
variables to make predictions.
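As a minimal R sketch of this idea (the house-price data frame and its column names are illustrative assumptions, not part of the course material):
houses <- data.frame(size = c(800, 1000, 1200, 1500, 1800),      # hypothetical past data
                     price = c(60, 75, 88, 110, 130))
model <- lm(price ~ size, data = houses)        # linear regression: straight-line relationship
summary(model)                                  # fitted intercept and slope
predict(model, newdata = data.frame(size = 1400))   # predict a continuous value for new data
Multiple linear regression simply adds more predictors to the formula, e.g. price ~ size + rooms, where rooms is another assumed column.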
6. Outlier Detection
Outlier detection identifies data points that are very different from the rest of the data. These unusual points, called
outliers, can be spotted using statistical methods or by checking if they are far away from other data points.
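A minimal sketch of the statistical approach (the numeric vector and the 2-standard-deviation cut-off are assumptions chosen for illustration):
x <- c(10, 12, 11, 13, 12, 95, 11, 10)     # hypothetical values with one unusual point
z <- (x - mean(x)) / sd(x)                 # z-score: distance from the mean in standard deviations
x[abs(z) > 2]                              # points far away from the rest are flagged as outliers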
7. Genetic Algorithm
Genetic algorithms are inspired by natural selection. They solve problems by evolving solutions over several
generations. Each solution is like a "species," and the fittest solutions are kept and improved over time, simulating
"survival of the fittest" to find the best solution to a problem.
Data mining is a powerful tool that offers many benefits across a wide range of industries. Among the advantages of
data mining:
Better Decision Making: helps extract useful information from large datasets for informed decision making.
Histograms
It consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
Example: a histogram of an electricity bill generated over 4 months.
Patterns in the data are marked easily by using the data visualization technique.
In pixel based visualization techniques, there are separate sub-windows for the value of each attribute and it is
represented by one colored pixel.
It maximizes the amount of information represented at one time without any overlap.
A tuple with 'm' variables is represented by 'm' colored pixels, one per variable, and each variable has its own
sub-window.
The color mapping of the pixel is decided on the basis of data characteristics and visualization tasks.
i. Scatter-plot matrices
A scatter-plot matrix consists of scatter plots of all possible pairs of variables in a dataset.
ii. Parallel coordinates
The separated, parallel vertical lines define the axes, one per dimension, and each data record is drawn as a line
crossing the axes at its attribute values.
Chernoff faces
The faces in Chernoff faces are based on the facial expressions and features of a human being, so it becomes easy to
identify differences between the faces.
It includes the mapping of different data dimensions with different facial features.
For example: the face width, the length of the mouth, and the length of the nose.
Hierarchical visualization techniques are used for partitioning all dimensions into subsets.
i. Dimensional stacking
The most important attributes are marked and used on the outer levels.
Rectangles are used to represent the counts of categorical data, and at every stage the rectangles are split in parallel.
ii. Worlds-within-Worlds
The innermost world must have a function and the two most important parameters.
Through this, an N-vision view of the data is possible using devices like data gloves and stereo displays, including
rotation, scaling (inner), and translation (inner/outer).
Tree map visualization techniques are well suited for displaying large amounts of hierarchically structured data.
The visualization space is divided into the multiple rectangles that are ordered, according to a quantitative
variable.
The levels in the hierarchy are seen as rectangles containing other rectangles.
Each set of rectangles on the same level in the hierarchy represents a category, a column or an expression in a data
set.
A tag cloud is a visualization method which helps to understand the information of user generated tags.
It is also possible to arrange the tags alphabetically or according to the user preferences with different font sizes
and colors.
1. Purpose:
o Enable users to specify patterns to be discovered, data to be analyzed, and the conditions under which data mining
operations should be performed.
2. Examples:
o DMQL (Data Mining Query Language): Designed for specifying data mining tasks and providing access to data
mining models.
o SQL with Data Mining Extensions: Some database systems extend SQL to include data mining capabilities,
allowing users to integrate data mining tasks directly within their SQL queries.
3. Typical Functions:
o Specifying target variables and patterns (e.g., mining frequent itemsets, building classification models).
o Supporting the retrieval of discovered patterns and associations for further analysis.
Let’s work through examples that cover association rule mining (Apriori), classification (Decision Tree)
Example Code:
library(arules)   # assumed: the arules package provides apriori() and the transactions class
# Hypothetical baskets (only the first appears in the original notes), chosen so the sample rules below hold
transactions <- list(c("Bread", "Milk"), c("Bread", "Milk", "Diaper"),
                     c("Bread", "Milk", "Diaper"), c("Milk", "Diaper"), c("Diaper", "Beer"))
# Convert the list to transaction data format
trans <- as(transactions, "transactions")
# Apply Apriori algorithm to find frequent itemsets and association rules
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6))
inspect(rules)   # view the discovered rules
Explanation:
Apriori: We specify a minimum support of 20% and a confidence level of 60% to mine the association
rules.
Sample Output:
Rule 1: If Milk is bought, there is a 75% chance Bread will also be bought.
Rule 2: If Diaper is bought, there is a 75% chance Milk will also be bought.
Example Code:
library(rpart)
# Sample customer data (the Income and Churn values are assumed for illustration)
customer_data <- data.frame(Age = c(25, 45, 35, 50, 23),
                            Income = c(30000, 90000, 55000, 85000, 28000),
                            Churn = factor(c("Yes", "No", "No", "No", "Yes")))
# Build a decision tree using Age and Income to predict Churn
churn_model <- rpart(Churn ~ Age + Income, data = customer_data,
                     method = "class", control = rpart.control(minsplit = 2))
print(churn_model)
churn_model: Builds a decision tree using Age and Income to predict Churn.
The model divides the dataset based on attribute values to make classifications.
Sample Output:
n= 5
1. Flexibility: R offers a flexible and extensible environment for data mining, with many libraries available
for different tasks.
2. Wide Range of Algorithms: R supports a broad spectrum of data mining algorithms (e.g., decision trees,
k-means, apriori) through its packages.
3. Visualization Support: R integrates data mining with advanced visualization tools to help users
understand patterns.
4. Scalability: R can handle large datasets and complex mining tasks using parallel computing and big data
solutions.
Conclusion: While R does not have a formal Data Mining Query Language (DMQL), its extensive packages such as
rpart, arules, and dplyr provide a rich framework to perform data mining tasks. These packages offer functionalities
that mimic DMQL, enabling users to classify, cluster, and mine association rules effectively.
With its versatility and powerful statistical tools, R is a go-to solution for data mining tasks in academic and
professional environments.
1. No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system. It may
fetch data from a particular source (such as a file system), process data using some data mining algorithms, and
then store the mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching
data from a data repository managed by these systems, performing data mining, and then storing the mining results
either in a file or in a designated place in a database or data Warehouse. Loose coupling is better than no coupling
because it can fetch any portion of data stored in databases or data warehouses by using query processing,
indexing, and other system facilities.
3. Semi-tight coupling: Semi-tight coupling means that besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data
mining functions) can be provided in the DB/DW system. These primitives can include sorting, indexing,
aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as
sum, count, max, min, and standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data
mining subsystem is treated as one functional component of an information system. Data mining queries and functions
are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of
a DB or DW system.
When you integrate data in data mining, you may face many issues. Some of those issues are:
1. Entity Identification Problem
The records are obtained from heterogeneous sources, so the question is how to match the real-world entities
across the data. For example, suppose you are given customer data from two different sources: one source assigns
a customer ID to an entity, while the other assigns a customer number to it. Analyzing such metadata will prevent
you from making errors during schema integration.
2. Redundancy
One of the major issues in the course of data integration is redundancy. Unimportant data that are no longer required
are referred to as redundant data. Redundancy may also appear when an attribute can be derived from another
attribute inside the data set. For example, if one data set contains the customer's age and a different data set contains
the customer's date of birth, then age is a redundant attribute because it can be deduced from the date of birth.
3. Tuple Duplication and Data Value Conflicts
Combining records from several sources can also introduce duplicate tuples. In the same way that attribute values
can vary, whole records can be repeated. In addition, attribute values for the same real-world entity may disagree
because they are represented differently in the different data units. For example, in different cities the price of a
hotel room might be expressed in a different currency. This type of issue is recognized and fixed during the data
integration process.
DATA PREPROCESSING:
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to
contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing. Data preprocessing is used in database-driven applications such as customer
relationship management and in rule-based applications (like neural networks).
Data preprocessing is an important step in the data mining process that involves cleaning and transforming raw data
to make it suitable for analysis. Some common steps in data preprocessing include:
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with different formats, structures,
and semantics. Techniques such as record linkage and data fusion can be used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization is used to
transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into
discrete categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space while preserving the important
information.
5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and
clustering.
6. Data Normalization: This involves scaling the data to a common range, such as
between 0 and 1 or - 1 and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal scaling.
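The three normalization techniques just mentioned can be sketched in R as follows (the sample vector is an assumption used only for illustration):
x <- c(200, 300, 400, 600, 1000)                         # hypothetical attribute values
minmax  <- (x - min(x)) / (max(x) - min(x))              # min-max normalization to [0, 1]
zscore  <- (x - mean(x)) / sd(x)                         # z-score: zero mean, unit variance
j       <- floor(log10(max(abs(x)))) + 1                 # smallest j with all |x| / 10^j below 1
decimal <- x / 10^j                                      # decimal scaling normalization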
Data preprocessing:
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results. The
specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more accurate.
Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient
format.
1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data cleaning is done.
It involves handling of missing data, noisy data etc.
Missing Data: This situation arises when some values are missing in the data. It can be handled in various
ways. Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.
o Fill the Missing values: There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.
Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:
o Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and then various methods are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the
task.
o Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be
linear (having one independent variable) or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall
outside the clusters; a short R sketch of these cleaning steps follows this list.
o Utilize regex for pattern matching and string manipulation to clean and standardize text data.
Data Profiling:
o Perform data profiling to understand the data's structure, quality, and content, which aids in identifying cleaning
needs.
Data Integration:
o When merging data from multiple sources, ensure consistent formats and values across datasets.
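Putting the cleaning steps above together, here is a minimal R sketch (the data frame, column names, and number of bins are assumptions):
df <- data.frame(age = c(23, NA, 35, 41, NA, 29),          # hypothetical data with missing values
                 income = c(20, 22, 25, 210, 27, 24))      # ... and one noisy value
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)        # fill missing values with the attribute mean
bins <- cut(rank(df$income, ties.method = "first"),        # equal-frequency binning into 3 segments
            breaks = 3, labels = FALSE)
df$income_smooth <- ave(df$income, bins, FUN = mean)       # smooth each segment by its bin mean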
2. Data Transformation: This step is taken in order to transform the data into forms suitable for the mining
process. It involves the following:
Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
Discretization: This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
Concept Hierarchy Generation: Here attributes are converted from lower level to higher level in
hierarchy. For Example-The attribute “city” can be converted to “country”.
o The ETL process is crucial in data warehousing, where data is extracted from various sources, transformed into a
usable format, and then loaded into the data warehouse.
o Data transformation can be accomplished using programming languages such as Python (with libraries like
pandas and NumPy) or R, which offer extensive functionalities for data manipulation and transformation.
o Utilize data transformation tools (e.g., Talend, Apache Nifi, Alteryx) that provide user-friendly interfaces for
performing various transformation tasks.
o Use regex for pattern matching and text manipulation, particularly for cleaning and standardizing string data.
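A minimal sketch of the Discretization and Concept Hierarchy Generation steps listed above (the marks, cut points, and the city-to-country lookup are assumptions):
marks <- c(35, 47, 58, 62, 71, 88, 93)                       # hypothetical raw numeric attribute
grade <- cut(marks, breaks = c(0, 50, 75, 100),              # discretization: replace raw values
             labels = c("low", "medium", "high"))            # by interval (conceptual) levels
city    <- c("Delhi", "Paris", "Mumbai", "Lyon")             # lower-level attribute
lookup  <- c(Delhi = "India", Mumbai = "India",
             Paris = "France", Lyon = "France")
country <- lookup[city]                                      # concept hierarchy: city rolled up to country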
3. Data Integration: Data from heterogeneous sources are combined into a single dataset. There are
two types of data integration:
1. Tight coupling: the data are combined from the different sources into a single physical location.
2. Loose coupling: only an interface is created and the data are combined through the interface; the data remain
in the actual source databases.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important information. It can
be achieved in the following ways:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the
important information. Feature extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative
matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the
size of the dataset while preserving the important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering:
This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the
dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-
means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques
such as wavelet compression, JPEG compression, and gif compression
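A minimal sketch of the sampling idea (the data frame, strata, and sample sizes are assumptions for illustration):
set.seed(1)
df <- data.frame(id = 1:100, group = rep(c("A", "B"), each = 50))   # hypothetical dataset
srs <- df[sample(nrow(df), size = 10), ]                            # simple random sampling of 10 rows
strat <- do.call(rbind, lapply(split(df, df$group),                 # stratified: 5 rows per group
                               function(g) g[sample(nrow(g), 5), ]))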
The number of input features, variables, or columns present in a given dataset is known as dimensionality, and the
process to reduce these features is called dimensionality reduction. A dataset may contain a huge number of input
features in various cases, which makes the predictive modeling task more complicated. Because it is very difficult
to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.
Dimensionality reduction:
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it provides similar information."
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition, signal processing,
bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis, etc.
Handling the high-dimensional data is very difficult in practice, commonly known as the curse of dimensionality. If
the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex.
As the number of features increases, the number of samples needed to generalize well also increases, and the chance
of overfitting grows. If a machine learning model is trained on high-dimensional data, it can become overfitted
and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
There are also some disadvantages of applying the dimensionality reduction, which are given below:
o In the PCA dimensionality reduction technique, sometimes the principal components required to consider
are unknown.
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant features
present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the optimal features
from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common
techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
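A minimal sketch of two of these filter measures in R (the simulated data frame and its variables are assumptions):
set.seed(7)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))  # hypothetical numeric features
df$x3 <- df$x1 * 0.9 + rnorm(100, sd = 0.1)                         # a feature correlated with x1
sort(abs(cor(df[, c("x1", "x2", "x3")], df$y)[, 1]),                # correlation filter: rank features
     decreasing = TRUE)                                             # by |correlation| with the target
tbl <- table(sample(c("yes", "no"), 100, TRUE),                     # hypothetical categorical pair
             sample(c("low", "high"), 100, TRUE))
chisq.test(tbl)                                                     # chi-square test of independence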
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it takes a machine learning model for its evaluation.
In this method, some features are fed to the ML model and the performance is evaluated. The performance decides
whether to add or remove features in order to increase the accuracy of the model. This method is more accurate than
the filtering method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
o LASSO
o Elastic Net
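As a minimal sketch, forward selection and backward elimination can be run with R's built-in step() function (the mtcars regression is an illustrative choice, not from the notes; LASSO and Elastic Net would typically use the glmnet package instead):
full <- lm(mpg ~ ., data = mtcars)                     # model with all candidate features
back <- step(full, direction = "backward", trace = 0)  # drop features while AIC improves
fwd  <- step(lm(mpg ~ 1, data = mtcars),               # start from no features ...
             direction = "forward",
             scope = formula(full), trace = 0)         # ... and add one feature at a time
formula(back); formula(fwd)                            # selected feature subsets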
Feature extraction is the process of transforming the space containing many dimensions into space with fewer
dimensions. This approach is useful when we want to keep the whole information but use fewer resources while
processing the information.
Common dimensionality reduction techniques include:
1. Principal Component Analysis (PCA)
2. Backward Elimination
3. Forward Selection
4. Score Comparison
5. Missing Value Ratio
6. Random Forest
7. Factor Analysis
8. Auto-Encoder
9. Kernel PCA
Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of orthogonal transformation. These new transformed features are called
the Principal Components. It is one of the popular tools that is used for exploratory data analysis and predictive
modeling.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split
between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image
processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
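A minimal PCA sketch using R's built-in prcomp() function (the iris measurements are an illustrative dataset, not from the notes):
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # orthogonal transformation of 4 features
summary(pca)                   # proportion of variance captured by each principal component
reduced <- pca$x[, 1:2]        # keep the first two components as the lower-dimensional data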
Backward Feature Elimination
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o Now we will remove one feature each time and train the model on n-1 features for n times, and will
compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance of the model, and
then we will drop that variable or features; after that, we will be left with n-1 features.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can
define the optimal number of features required for the machine learning algorithm.
Forward feature selection follows the inverse process of the backward elimination process. It means, in this
technique, we don't eliminate the feature; instead, we will find the best features that can produce the highest increase
in the performance of the model. Below steps are performed in this technique:
o We start with a single feature only, and progressively we will add each feature at a time.
o The process will be repeated until we get a significant increase in the performance of the model.
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as they do not carry much useful information.
To perform this, we can set a threshold level, and if a variable has missing values more than that threshold, we will
drop that variable. The higher the threshold value, the more efficient the reduction.
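A minimal sketch of this missing value ratio check (the data frame and the 40% threshold are assumptions):
df <- data.frame(a = c(1, NA, 3, NA, NA),                # 60% missing
                 b = c(2, 4, 6, 8, 10),                  # complete
                 c = c(NA, NA, NA, NA, 5))               # 80% missing
missing_ratio <- colMeans(is.na(df))                     # fraction of missing values per variable
df_reduced <- df[, missing_ratio <= 0.4, drop = FALSE]   # drop variables above the threshold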
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. This algorithm contains
an in-built feature importance package, so we do not need to program it separately. In this technique, we need to
generate a large set of trees against the target variable, and with the help of usage statistics of each attribute, we need
to find the subset of features.
The random forest algorithm takes only numerical variables, so categorical input data needs to be converted into
numeric form, e.g., using one-hot encoding.
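A minimal sketch with the randomForest package (the iris data, the seed, and 100 trees are illustrative assumptions):
library(randomForest)                      # assumed to be installed
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
importance(rf)                             # in-built feature importance scores per attribute
varImpPlot(rf)                             # rank the features; keep the most important subset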
1. Principal Component Analysis (PCA): Transforms the original features into a new set of orthogonal features
(principal components) that capture the maximum variance in the data.
2. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique that is particularly useful for
visualizing high-dimensional data in two or three dimensions.
3. Linear Discriminant Analysis (LDA): A supervised dimensionality reduction technique that aims to project data
in a way that maximizes class separability.
4. Autoencoders: Neural network-based techniques that learn to compress data into a lower- dimensional space and
then reconstruct it back.
Key Differences
Objective: feature selection selects a subset of the original features, while feature extraction transforms the data into
a new feature space.
When to Use
Feature selection:
o You have a large number of features and want to identify the most relevant ones.
o You aim to improve model performance and interpretability without altering the feature space.
Feature extraction:
o You want to reduce noise and redundancy while preserving the underlying structure of the data.
o You are dealing with data that has multicollinearity issues (highly correlated features).
Data mining can be classified into two categories: descriptive data mining and predictive data mining.
Descriptive data mining describes the data set in a concise and summative manner and presents interesting
general properties of the data. Predictive data mining analyzes the data in order to construct one or a set of
models, and attempts to predict the behavior of new data sets. Databases usually store large
amounts of data in great detail. However, users often like to view sets of summarized data in concise,
descriptive terms. Such data descriptions may provide an overall picture of a class of data or distinguish it
from a set of comparative classes. Moreover, users appreciate the ease and flexibility of having data sets described
at different levels of granularity and from different angles. Such descriptive data mining is called concept
description and forms an important component of data mining.
The simplest kind of descriptive data mining is concept description. A concept usually refers to a collection
of data such as frequent_buyers, graduate_students, and so on. As a data mining task, concept description is
not a simple enumeration of the data. Instead, concept description generates descriptions for
characterization and comparison of the data. It is sometimes called class description when the concept to
be described refers to a class of objects. Characterization provides a concise and succinct summarization of
the given collection of data, while concept or class comparison (also known as discrimination) provides
descriptions comparing two or more collections of data. Since concept description involves both
characterization and comparison, techniques for accomplishing each of these tasks will be studied.
Concept description has close ties with data generalization. Given the large amount of data stored in a database, it
is useful to be able to describe concepts in concise and succinct terms at multiple levels of abstraction; this
facilitates users in examining the general behavior of the data. Given the ABCompany database, for example,
instead of examining individual customer transactions, sales managers may prefer to view the data generalized to
higher levels, such as summarized by customer groups according to geographic regions, frequency of purchases
per group, and customer income. Such multiple-dimensional, multilevel data generalization is similar to
multidimensional data analysis in data warehouses. The fundamental differences between concept description in
large databases and online analytical processing involve the following.
Concept description:
➢ Characterization: provides a concise and succinct summarization of the given collection of data
❖ As a data mining task, concept description is not a simple enumeration (number of things done one by one) of the
data.
❖ Concept description generates descriptions for characterization and comparison of the data it is also called class
description.
While concept or class comparison (also known as discrimination) provides descriptions that compare two or more
collections of data.
Example:
➢ Given the ABC Company database, for example, instead of examining individual customer transactions,
➢ Sales managers may prefer to view the data generalized to higher level, such as summarized by customer groups
according to geographic regions, frequency of purchases per group and customer income.
Data Generalization and Summarization-Based Characterization: Data and objects in databases often contain detailed
information at primitive concept levels. For example, the item relation in a sales database may contain attributes
describing low-level item information such as item_ID, name, brand, category, supplier, place_made, and price. It is
useful to be able to summarize a large set of data and present it at a high conceptual level. For example,
summarizing a large set of items relating to Christmas season sales provides a general description of such data,
which can be very helpful for sales and marketing managers.
This requires an important functionality in data mining: data generalization. Data generalization is a process that
abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual
levels. Methods for the efficient and flexible generalization of large data sets can be categorized according to two
approaches: (1) the data cube (or OLAP) approach and (2) the attribute-oriented induction approach.
Data generalization
A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
Approaches: (1) the data cube (OLAP) approach and (2) the attribute-oriented induction (AOI) approach.
The generalized results can be presented in several forms:
Generalized relation:
Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
Cross tabulation: presenting the generalized data in cross-tabular form.
Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms.
Mapping the generalized result into characteristic rules with quantitative information associated with them.
AOI:
The attribute-oriented induction (AOI) approach to data generalization and summarization-based characterization
was first proposed in 1989, a few years prior to the introduction of the data cube approach. The data cube approach
can be considered as a data warehouse-based, pre-computation oriented, materialized-view approach. It performs
off-line aggregation before an OLAP or data mining query is submitted for processing. On the other hand, the
attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented,
generalization-based, on-line data analysis technique. However, there is no inherent barrier distinguishing the two
approaches based on on-line aggregation versus off-line pre-computation. Some aggregations in the data cube can be
computed on-line, while off-line pre-computation of multidimensional space can speed up attribute-oriented
induction as well.
Attribute relevance analysis addresses three questions: why it is needed, what it is, and how it is performed. Its basic
steps are:
1. Data Collection
2. Preliminary relevance analysis
3. Relevance Analysis: use a relevance analysis measure, e.g., information gain, to identify highly relevant
dimensions and levels.
A set of basic principles for attribute-oriented induction in relational databases is summarized as follows:
1. Data focusing: analyze the task-relevant data, including dimensions; the result is the initial relation.
2. Attribute removal: remove attribute A if there is a large set of distinct values for A but either (1) there is no
generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
3. Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A.
AOI stands for Attribute-Oriented Induction. The attribute-oriented induction approach to concept description was
first proposed in 1989, a few years before the introduction of the data cube approach. The data cube approach is
essentially based on materialized views of the data, which typically have been pre-computed in a data warehouse.
In general, it implements off-line aggregation before an OLAP or data mining query is submitted for processing. In
contrast, the attribute-oriented induction approach is generally a query-oriented, generalization-based, on-line
data analysis method.
The general idea of attribute-oriented induction is to first collect the task-relevant data using a database query and
then perform generalization based on the examination of the number of distinct values of each attribute in the
relevant collection of data.
Algorithm:
First, data focusing must be performed before attribute-oriented induction. This step corresponds to the specification
of the task-relevant data (i.e., data for analysis). The data are collected based on the information provided in the data
mining query.
Because a data mining query is usually relevant to only a portion of the database, selecting the relevant set of data
not only makes mining more efficient, but also yields more meaningful results than mining the whole database.
Specifying the set of relevant attributes (i.e., attributes for mining, as indicated in DMQL with the in relevance to
clause) may be difficult for the user. A user may select only a few attributes that he or she feels are important,
while missing others that could also play a role in the description.
For example, suppose that the dimension birth_place is defined by the attributes city, province_or_state, and country.
To allow generalization on the birth_place dimension, the other attributes defining this dimension should also be
included.
In other words, having the system automatically include province_or_state and country as relevant attributes enables
city to be generalized to these higher conceptual levels during the induction process.
At the other extreme, suppose that the user may have introduced too many attributes by specifying all of the possible
attributes with the clause “in relevance to *”. In this case, all of the attributes in the relation specified by the from
clause would be included in the analysis.
Procedure and Output: as an illustration, running an attribute selection evaluator such as Weka's
weka.attributeSelection.CfsSubsetEval -P 1 -E 1 reports the subset of relevant attributes.
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a
frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs
frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured
pattern. Finding frequent patterns plays an essential role in mining associations, correlations, and many other
interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks.
Frequent item set mining leads to the discovery of associations and correlations among items in large transactional or
relational data sets. With massive amounts of data continuously being collected and stored, many industries are
becoming interested in mining such patterns from their databases. The discovery of interesting correlation
relationships among huge amounts of business transaction records can help in many business decision-making
processes such as catalog design, cross-marketing, and customer shopping behavior analysis.
This process analyzes customer buying habits by finding associations between the different items that customers
place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by customers. For instance, if customers are
buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket?
This information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Let's look at an example of how market basket analysis can be useful. Suppose, as manager of an AllElectronics
branch, you would like to learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?”
To answer your question, market basket analysis may be performed on the retail data of customer
Transactions at your store. You can then use the results to plan marketing or advertising strategies, or in the design of
a new catalog. For instance, market basket analysis may help you design different store layouts. In one strategy,
items that are frequently purchased together can be placed in proximity to further encourage the combined sale of
such items. If customers who purchase computers also tend to buy antivirus software at the same time, then placing
the hardware display close to the software display may help increase the sales of both items.
If we think of the universe as the set of items available at the store, then each item has a Boolean variable
representing its presence or absence, and each basket can be represented by a Boolean vector of values assigned to
these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently
associated or purchased together. These patterns can be represented in the form of association rules. For example,
the information
that customers who purchase computers also tend to buy antivirus software at the same time is represented in the
following association rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
A support of 2% for the rule means that 2% of all the transactions under analysis show that computer and antivirus
software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. These thresholds can be set by users or domain experts.
Each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to
contain A if A ⊆ T.
An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅. The rule A
⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪B
(i.e., the union of sets A and B say, or, both A and B). This is taken to be the probability, P(A ∪B).
The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A
that also contain B. This is taken to be the conditional probability, P(B|A). That is
confidence(A⇒B) =P(B|A).
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are
called strong. By convention, we write support and confidence values so that they occur between 0% and 100%,
rather than 0 to 1.0.
If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of
I satisfies the corresponding minimum support count threshold), then I is a frequent itemset.
An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset. The occurrence
frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the
frequency, support count, or count of the itemset
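To make the support and confidence formulas concrete, here is a minimal R sketch (the five small transactions are assumptions, not the AllElectronics data):
D <- list(c("computer", "antivirus"), c("computer"), c("computer", "antivirus"),
          c("printer"), c("computer", "printer"))
A <- "computer"; B <- "antivirus"
both  <- sum(sapply(D, function(t) all(c(A, B) %in% t)))   # transactions containing both A and B
onlyA <- sum(sapply(D, function(t) A %in% t))              # transactions containing A
support    <- both / length(D)                             # P(A ∪ B) = 2/5 = 40%
confidence <- both / onlyA                                 # P(B|A)   = 2/4 = 50%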
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined
minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum
support and minimum confidence.
Apriori Algorithm: Finding Frequent Itemsets by Candidate Generation (the pattern-growth approach, FP-growth,
finds frequent itemsets without candidate generation).
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-
itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The resulting set is denoted by L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent
k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori
property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the
minimum support threshold, min sup, then I is not frequent, that is, P(I) < min sup. If an item A is added to the
itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent
either, that is, P(I ∪ A) < min sup.
This property belongs to a special category of properties called antimonotonicity in the sense that if a set cannot
pass a test, all of its supersets will fail the same test as well.
1. The join step: This step generates candidate (k+1)-itemsets from frequent k-itemsets. To find Lk, a set of
candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be
itemsets in Lk−1.
2. The prune step: If any (k−1)-subset of a candidate k-itemset does not meet minimum support, then the candidate is
regarded as infrequent and thus it is removed. This step is performed to reduce the size of the candidate itemsets.
Example:
There are nine transactions in this database, that is, |D| = 9 (transactional data for an AllElectronics branch). We
illustrate the Apriori algorithm for finding frequent itemsets in D, with minimum support count min sup = 2 and a
minimum confidence threshold of, say, 70%.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm
simply scans all of the transactions to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. (The corresponding relative support is
2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property: From the join step, we first get
C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent:
(a) The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}. All 2-item subsets of {I1, I2, I3} are members of L2. Therefore, keep {I1, I2, I3} in C3.
(b) The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}. All 2-item subsets of {I1, I2, I5} are members of L2. Therefore, keep {I1, I2, I5} in C3. Each of the remaining candidates contains a 2-item subset that is not in L2 and is pruned.
(c) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.
7. The transactions in D are scanned again to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support count:
confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)
Here support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows: for each frequent itemset l, generate all nonempty subsets of l; then, for every nonempty subset s of l, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
Example:
Generating association rules. Let’s try an example based on the transactional data for AllElectronics shown before in
Table 6.1. The data contain frequent itemset X = {I1, I2, I5}. What are the association rules that can be generated
from X? The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
The resulting association rules are as shown below, each listed with its confidence:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong.
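The frequent itemsets and strong rules above can be reproduced in R. The following is a minimal sketch using the arules package (an assumption: the notes do not prescribe a particular R library); the nine transactions mirror the AllElectronics data used in this example.

# Minimal Apriori sketch with the 'arules' package (assumed library)
library(arules)

# The nine transactions of the AllElectronics example, as a list of itemsets
trans_list <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)
trans <- as(trans_list, "transactions")

# Mine strong rules: supp = 0.2 is just below 2/9 so a support count of 2 qualifies;
# conf = 0.7 is the 70% minimum confidence threshold
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.7, minlen = 2))
inspect(sort(rules, by = "confidence"))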
“How can we further improve the efficiency of Apriori-based mining?” Many variations of the Apriori algorithm
have been proposed that focus on improving the efficiency of the original algorithm. Several of these variations are
summarized as follows
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck , for k > 1.
• For example, when scanning each transaction in the database to generate the frequent 1- itemsets, L1, we
can generate all the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash
table structure, and increase the corresponding bucket counts (Figure 6.5).
• A 2-itemset with a corresponding bucket count in the hash table that is below the support threshold cannot
be frequent and thus should be removed from the candidate set.
• Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined
(especially when k = 2).
Transaction reduction (reducing the number of transactions scanned in future iterations):
• A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets.
• Therefore, such a transaction can be marked or removed from further consideration, because subsequent database scans for j-itemsets, where j > k, will not need to consider it.
Partitioning (partitioning the data to find candidate itemsets):
• A partitioning technique can be used that requires just two database scans to mine the frequent itemsets.
• In phase I, the algorithm divides the transactions of D into n non-overlapping partitions and finds the itemsets that are frequent locally within each partition.
• In phase II, a second scan of D is conducted in which the actual support of each candidate is assessed to determine the global frequent itemsets.
Sampling (mining on a subset of the given data):
• The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
• The sample size of S is chosen so that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.
• Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets.
Dynamic itemset counting (adding candidate itemsets at different points during a scan):
• A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked
by start points.
• In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines
new candidate itemsets only immediately before each complete database scan.
• The technique uses the count-so-far as the lower bound of the actual count.
• If the count-so-far passes the minimum support, the itemset is added into the frequent itemset collection and
can be used to generate longer candidates. This leads to fewer database scans than with Apriori for finding
all the frequent itemsets.
However, the Apriori candidate generate-and-test approach can suffer from two nontrivial costs:
• It may still need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
• It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching. It is costly to go over each transaction in the database to determine the support of the candidate itemsets.
• The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance.
• The core of this method is the usage of a special data structure named frequent-pattern tree (FP- tree), which
retains the item set association information.
• First, it compresses the input database creating an FP-tree instance to represent frequent items.
• After this first step, it divides the compressed database into a set of conditional databases, each associated
with one frequent pattern.
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about frequent
patterns in a database. Each transaction is read and then mapped onto a path in the FP- tree.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP tree is to mine the
most frequent pattern. Each node of the FP tree represents an item of the item set.
The root node represents null, while the lower nodes represent the item sets. The associations of the nodes with the
lower nodes, that is, the item sets with the other item sets, are maintained while forming the tree.
We reexamine the mining of transaction database, D, of Table 6.1 in Example 6.3 using the frequent pattern growth
approach.
The first scan of the database is the same as Apriori, which derives the set of frequent items (1- itemsets) and their
support counts (frequencies).
The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted by L.
Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
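The first FP-growth scan can be sketched in base R (no packages assumed): count each item's support, sort the items in descending order of support to obtain L, and then reorder each transaction by L before it is inserted into the FP-tree. The nine transactions mirror the AllElectronics data used above.

# First scan of FP-growth in base R: derive the ordered frequent-item list L
trans_list <- list(
  c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
  c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
  c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3")
)

support_counts <- sort(table(unlist(trans_list)), decreasing = TRUE)
print(support_counts)   # I2:7, I1:6, I3:6, I4:2, I5:2 (ties may appear in either order)

# Re-order the items of each transaction according to L before FP-tree insertion
L_order <- names(support_counts)
ordered_trans <- lapply(trans_list, function(t) L_order[L_order %in% t])
ordered_trans[[1]]      # "I2" "I1" "I5"  (the branch built for T100)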
An FP-tree has one root labelled "null", a set of item-prefix subtrees as its children, and a frequent-item-header table. Each node in an item-prefix subtree has three fields:
– Item-name: the item this node represents;
– Count: the number of transactions represented by the portion of the path reaching the node;
– Node-link: a link to the next node in the FP-tree carrying the same item name, or null if there is none.
First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each
transaction are processed in L order (i.e., sorted according to descending support count), and a branch is
created for each transaction.
For example, the scan of the first transaction, "T100: I1, I2, I5," which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child to the root, I1 is linked to I2, and I5 is linked to I1.
The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch shares the common prefix I2 with the existing path for T100, so the count of the existing I2 node is incremented by 1 and a new node ⟨I4: 1⟩ is created and linked as a child of ⟨I2: 2⟩.
In this way, the problem of mining frequent patterns in databases is transformed into that of mining the FP-tree. The FP-tree is mined as follows:
Start from each frequent length-1 pattern (as an initial suffix pattern), construct its conditional pattern base (a “sub-
database,” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct
its (conditional) FP-tree, and perform mining recursively on the tree. The pattern growth is achieved by the
concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
In the realm of machine learning, classification is a fundamental tool that enables us to categorise data into distinct
groups. Understanding its significance and nuances is crucial for making informed decisions based on data patterns.
Machine Learning
Machine learning is the process of teaching a computer system certain algorithms that can improve themselves with
experience. A very technical definition would be:
"A computer program is said to learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience."
- Tom Mitchell, 1997
Just like humans, the system will be able to perform simple classification tasks and complex mathematical computations like regression. It involves building mathematical models that are used in classification or regression. To 'train' these mathematical models, you need a set of training data: the dataset over which the system builds the model. This section covers the basics of classification in machine learning. Coming to machine learning algorithms, the mathematical models are divided into two categories, depending on their training data: supervised and unsupervised learning models.
Supervised Learning:
When building supervised learning models, the training data used contains the required answers. These required
answers are called labels. For example, you show a picture of a dog and also label it as a dog.
So, with enough pictures of a dog, the algorithm will be able to classify an image of a dog correctly. Supervised learning models can also be used to predict continuous numeric values, such as the price of a certain company's stock. These models are known as regression models. In this case, the labels would be the past prices of the stock, and the algorithm would follow that trend. Examples of supervised learning algorithms include:
Decision Trees
Random Forests
Unsupervised Learning
In unsupervised learning, as the name suggests, the dataset used for training does not contain the required answers. Instead, the algorithm uses techniques such as clustering and dimensionality reduction to train. A major application of unsupervised learning is anomaly detection, which uses a clustering algorithm to find major outliers in a graph; this is used in credit card fraud detection. Since classification is a part of supervised learning, let us find out more about it.
Supervised models are trained on a labelled dataset; the label can be either continuous or categorical. The following are the types of supervised models:
Regression
Classification
Regression models
Regression is used when one is dealing with continuous values such as the cost of a house when you are given
features such as location, the area covered, historic prices etc. Popular regression models are:
Linear Regression
Lasso Regression
Ridge Regression
Classification models
Classification is used for data that is separated into categories with each category represented by a label. The
training data must contain the labels and must have sufficient observations of each label so that the accuracy of the
model is respectable. Some popular classification models include:
Decision Trees
Let us now see some general examples of classification below to learn about this concept properly.
Email Spam Detection
In this example of email spam detection, the categories can be "Spam" and "Not Spam".
If an incoming email contains phrases like "win a prize", "free offer" and "urgent money transfer", your spam
filter might classify it as "Spam".
Disease Diagnosis
In disease diagnosis, let us assume that two categories are "Pneumonia" and "Common Cold".
If a patient's medical data includes symptoms like high fever, severe cough, and chest pain, a diagnostic
model might classify it as "Pneumonia".
If the patient's data indicates mild fatigue and occasional headaches, the model might classify it as "Common
Cold".
Sentiment Analysis
In this example of sentiment analysis, we can have two categories, namely "Positive" and "Negative".
If a tweet contains positive phrases like "amazing", "great experience", and "highly recommend", a sentiment
analysis model might classify it as "Positive".
If a tweet includes negative terms such as "terrible", "disappointed" and "waste of money", it might classify
it as "Negative".
Conceptually, the "best" splitting criterion is the one that most closely results in pure partitions. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. The attribute selection measure provides a ranking for each attribute describing the given training tuples; the attribute with the best score for the measure is chosen as the splitting attribute for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to binary trees, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. There are three popular attribute selection measures: information gain, gain ratio, and Gini index.
Information gain − Information gain is used for deciding the best feature/attribute, i.e., the one that provides the most information about a class. It follows the concept of entropy, aiming to reduce the level of entropy from the root node to the leaf nodes. Let node N hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in those partitions.
Gain ratio − The information gain measure is biased toward tests with many outcomes; that is, it tends to prefer attributes with a large number of distinct values. For instance, consider an attribute that acts as a unique identifier, such as product_ID. A split on product_ID results in a large number of partitions, each containing only one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0. Such a split is useless for classification; the gain ratio corrects this bias by normalizing the information gain.
Gini index − The Gini index can be used in CART. The Gini index calculates the impurity of D, a data partition or collection of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{m} p_i²
where p_i is the probability that a tuple in D belongs to class C_i and is calculated by |C_i,D| / |D|.
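A base-R sketch of this formula follows; the 9 "yes" / 5 "no" class split is a hypothetical example (in the spirit of the buy_computer data) and gives Gini(D) = 1 − (9/14)² − (5/14)² ≈ 0.459.

# Gini(D) = 1 - sum(p_i^2), where p_i is the fraction of tuples in class C_i
gini_index <- function(class_labels) {
  p <- table(class_labels) / length(class_labels)   # class proportions p_i
  1 - sum(p^2)
}

gini_index(c(rep("yes", 9), rep("no", 5)))   # approximately 0.459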
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on
an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in
the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a company is likely
to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows:
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. It works like a flowchart, helping to make decisions step by step: each internal node tests an attribute, each branch corresponds to an outcome of the test, and each leaf node gives the final prediction (a class label or a value).
Decision trees are widely used due to their interpretability, flexibility, and low preprocessing needs.
DECISION TREE:
Let’s consider a decision tree for predicting whether a customer will buy a product based on age, income and
previous purchases: Here's how the decision tree works:
If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"
If the person is above 30 and has not made previous purchases, predict "No Purchase" (leaf node).
Example: Predicting Whether a Customer Will Buy a Product Using Two Decision Trees
Suppose two decision trees are built for the same task, and for a given customer the "Yes" branch of each tree leads to a "Purchase" prediction.
Once we have predictions from both trees, we can combine the results to make a final prediction. If Tree 1 predicts
"Purchase" and Tree 2 predicts "No Purchase", the final prediction might be "Purchase" or "No Purchase" depending
on the weight or confidence assigned to each tree. This can be decided based on the problem context.
Till now we have discussed the basic intuition and approach of how a decision tree works, so let's move on to the attribute selection measures of decision trees. Two popular attribute selection measures are used: information gain and the Gini index (described above).
1. Information Gain
Information gain measures how much uncertainty is removed by splitting on an attribute. For example, if we split a dataset of people into "Young" and "Old" based on age, and all young people bought the product while all old people did not, the information gain would be high because the split perfectly separates the two groups with no uncertainty left.
Suppose S is a set of instances, A is an attribute, S_v is the subset of S for which attribute A takes the value v, and Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Entropy: the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content. For a set S whose instances belong to classes with proportions p_1, ..., p_m,
Entropy(S) = − Σ_{i=1}^{m} p_i · log2(p_i)
For example, if a dataset has an equal number of "Yes" and "No" outcomes (like 3 people who bought a product and 3 who didn't), the entropy is high (equal to 1) because it is uncertain which outcome to predict. But if all the outcomes are the same (all "Yes" or all "No"), the entropy is 0, meaning there is no uncertainty left in predicting the outcome.
Using this entropy, the information gain of an attribute A, Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v), compares the entropy of S before and after the split.
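A base-R sketch of these two formulas, applied to the young/old example just described (six people, three buyers and three non-buyers, perfectly separated by age group):

# Entropy(S) = -sum p_i log2 p_i over the class proportions
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
info_gain <- function(labels, attribute) {
  weighted <- tapply(labels, attribute,
                     function(sv) length(sv) / length(labels) * entropy(sv))
  entropy(labels) - sum(weighted)
}

age    <- c("Young", "Young", "Young", "Old", "Old", "Old")
bought <- c("Yes",   "Yes",   "Yes",   "No",  "No",  "No")
entropy(bought)          # 1: maximum uncertainty for a 50/50 split
info_gain(bought, age)   # 1: the split removes all uncertainty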
Example: building a tree with information gain (a minimal R sketch follows this list):
Start with all training instances associated with the root node.
Use information gain to choose which attribute to label each node with.
Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree.
If all positive or all negative training instances remain, label that node "yes" or "no" accordingly.
If no attributes remain, label with a majority vote of the training instances left at that node.
If no instances remain, label with a majority vote of the parent's training instances.
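As a sketch of the steps above, the rpart package (an assumption; these notes do not name a specific R implementation) can grow a classification tree on hypothetical customer data; parms = list(split = "information") asks rpart to use the entropy-based criterion instead of its default Gini index.

library(rpart)

set.seed(1)
# Hypothetical customer data: age, income, and number of previous purchases
customers <- data.frame(
  age       = sample(18:70, 200, replace = TRUE),
  income    = sample(20000:90000, 200, replace = TRUE),
  purchases = sample(0:5, 200, replace = TRUE)
)
# Hypothetical label: higher-income customers who are older or repeat buyers purchase
customers$buys <- factor(ifelse(customers$income > 50000 &
                                  (customers$age > 30 | customers$purchases > 0),
                                "Purchase", "No Purchase"))

# Grow the tree; split = "information" selects the information-gain criterion
fit <- rpart(buys ~ age + income + purchases, data = customers,
             method = "class", parms = list(split = "information"))
print(fit)   # shows the attribute test chosen at each node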
TREE PRUNING:
Tree pruning, a technique to prevent over fitting in decision trees, can be combined with Bayes classification in
methods like Naïve Bayesian Tree (NBTree) and Bayes Minimum Risk (PBMR) pruning to build more accurate
and generalizable models. These approaches use Bayesian concepts, such as estimated risk rates or local accuracy, to
guide the pruning process, either by stopping tree growth early (pre-pruning) or by simplifying a fully grown tree
(post-pruning).
Reduces Overfitting: Decision trees can become too complex, memorizing training data's noise and
anomalies. Pruning simplifies the tree by removing unnecessary branches, improving its ability to generalize
to new, unseen data.
Enhances Generalization: By removing less reliable branches, pruning creates smaller, less complex trees
that perform better on independent test data.
Improves Interpretability and Efficiency: Simpler trees are easier to comprehend and faster to execute.
A pruning strategy is introduced based on estimating local accuracy, using the accuracy of local
classifiers (like those at leaf nodes) to guide the decision-making process, rather than directly using
the most specific classifier.
This is a novel post-pruning method that converts a parent node of a subtree into a leaf node if the
estimated risk-rate for the parent node is less than the risk-rates of its children.
It uses the Bayes minimum risk criterion to estimate the risk-rate, aiming to find the decision tree
with the lowest error rate on unobserved instances.
Bayesian Classification: In data mining, Bayesian classification uses Bayes' Theorem to predict class
membership probabilities.
Pruning with Bayes Minimum Risk: Some methods apply a bottom-up approach where a subtree is
pruned and replaced by a leaf node if the estimated risk-rate (based on misclassification costs) of the
parent node is lower than that of the subtree.
Pre-Pruning (Early Stopping): Tree construction is halted early; a node is not split further if the goodness of the split falls below a pre-specified threshold.
Post-Pruning (Subtree Removal): A fully grown decision tree has subtrees removed and replaced by leaf nodes, with each leaf labeled with the most frequent class among the replaced subtree's instances (a minimal R sketch of post-pruning follows).
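The sketch below uses rpart's cost-complexity pruning (an assumption: this is not the Bayes-minimum-risk criterion described above, but it illustrates growing a full tree and then collapsing unreliable subtrees into leaves).

library(rpart)

# Grow a deliberately large tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.0, minsplit = 2))

printcp(fit)                                                 # cross-validated error per subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                          # collapse unreliable branches into leaves
print(pruned)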
Association rule learning discovers relationships between items that frequently occur together in a dataset. For instance, in a retail dataset, an association rule might identify that "if a customer buys bread, they are likely to buy butter". Such insights help businesses improve cross-selling strategies, inventory management, and customer satisfaction.
Association rules are derived through algorithms that evaluate the frequency and strength of these relationships. They
use metrics like support, confidence, and lift to measure the relevance and reliability of discovered patterns. These
rules have applications in various fields, such as retail, healthcare, and marketing, where analyzing customer
behavior or trends is critical for success.
Association rules are evaluated using key metrics that determine their relevance, strength, and reliability. These
metrics include support, confidence, and lift, which quantify the frequency and strength of relationships between data
items.
1. Support
Support measures how frequently an itemset (both antecedent and consequent) appears in the dataset. It provides an indication of how common a particular association is.
Formula: Support(A ⇒ B) = (transactions containing both A and B) ÷ (total transactions)
Example: If bread and butter appear together in 100 out of 1,000 transactions, the support is 100 ÷ 1,000 = 0.10, i.e., 10%.
A higher support value indicates a more frequently occurring pattern in the dataset.
2. Confidence
Confidence measures the likelihood of the consequent occurring given that the antecedent has already occurred. It evaluates the reliability of the rule.
Formula: Confidence(A ⇒ B) = Support(A ∪ B) ÷ Support(A)
Example: If 70% of customers who buy bread also buy butter, the confidence of "bread ⇒ butter" is 0.70, i.e., 70%.
Higher confidence suggests a stronger relationship between the antecedent and consequent.
3. Lift
Lift measures the strength of an association compared to its random occurrence in the dataset. It identifies how much more likely the antecedent and consequent are to appear together than if they were independent.
Formula: Lift(A ⇒ B) = Confidence(A ⇒ B) ÷ Support(B)
Example: A lift value greater than 1 indicates a strong positive association, while a value equal to 1 suggests no association. For instance, if the lift is 1.5, the antecedent makes the consequent 1.5 times more likely.
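A minimal base-R sketch of the three metrics for the hypothetical rule "bread ⇒ butter", using counts consistent with the examples above (the 143 and 200 counts are made-up values chosen so that confidence comes out near 70%):

n_total        <- 1000   # total transactions
n_bread        <- 143    # hypothetical: transactions containing bread
n_butter       <- 200    # hypothetical: transactions containing butter
n_bread_butter <- 100    # transactions containing both bread and butter

support    <- n_bread_butter / n_total            # 0.10 (10%)
confidence <- n_bread_butter / n_bread            # about 0.70 (70%)
lift       <- confidence / (n_butter / n_total)   # about 3.5 (> 1: positive association)
c(support = support, confidence = confidence, lift = lift)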
Association rule learning is a multi-step process designed to identify meaningful patterns and relationships in large
datasets. It involves two main stages:
1. Identifying Frequent Itemsets: The process begins by identifying frequent itemsets—combinations of items
that appear together in transactions with a frequency above a predefined threshold. Metrics like support are
used to measure how often these itemsets occur in the dataset. For example, a frequent itemset might reveal
that bread and butter are purchased together in 10% of transactions.
2. Generating Association Rules: Once frequent itemsets are identified, association rules are generated. These
rules take the form of if-then statements that describe relationships between items (e.g., “If a customer buys
bread, they are likely to buy butter”). Metrics such as confidence and lift are applied to evaluate the strength
and reliability of these rules.
Iterative Refinement
The process is iterative, with thresholds for support and confidence adjusted to refine the rules. This ensures that only the most significant and actionable rules are selected; for instance, a rule with low confidence may be excluded from further analysis. Through this systematic approach, association rule learning uncovers valuable insights from raw data, enabling organizations to make data-driven decisions.
Several algorithms are used for association rule learning, each with unique strengths and applications. The three most
commonly used algorithms are:
1. Apriori Algorithm
The Apriori algorithm employs a breadth-first search approach to identify frequent itemsets. It relies on the principle
that all subsets of a frequent itemset must also be frequent, reducing the search space.
Advantage: Simple to implement and effective for small datasets with low dimensionality.
Limitation: Performance degrades significantly with large or dense datasets due to repeated scanning of the
database.
2. Eclat Algorithm
The Eclat algorithm uses a depth-first search strategy to discover frequent itemsets. Instead of scanning the database
multiple times, it represents transactions as vertical itemsets and directly computes intersections.
Advantage: Efficient for datasets with sparse data or where there are fewer frequent itemsets.
3. FP-Growth Algorithm
The FP-Growth algorithm compresses the dataset into a frequent-pattern (FP) tree and mines frequent itemsets directly from it, without candidate generation.
Advantage: Significantly faster and more memory-efficient than Apriori, especially for large datasets.
Association rules are widely applied across various industries to uncover patterns and relationships in data, enabling
better decision-making and operational efficiency.
1. Retail and Market Basket Analysis: Retailers use association rules to identify frequently purchased product
combinations, helping them optimize store layouts or create product bundles to increase sales.
Example: A supermarket discovers that customers who buy bread often purchase butter and jam, leading to
strategic placement of these items together.
2. Healthcare: In healthcare, association rules help discover co-occurrence patterns in symptoms, aiding in
diagnostic processes and treatment plans.
Example: Identifying that patients with high blood pressure often have a higher risk of developing diabetes can
guide preventative care strategies.
3. E-Commerce and Recommendation Systems: E-commerce platforms leverage association rules to build
recommendation systems that enhance user experiences and drive sales.
Example: Amazon’s “Customers who bought this also bought” feature suggests complementary products,
boosting cross-selling opportunities.
4. Fraud Detection: Association rules are used in financial services to identify unusual patterns in transaction data,
which can help detect fraudulent activities.
Example: Flagging transactions that deviate significantly from established spending patterns for further
investigation.
Consider a small transaction dataset where customers purchase items like bread, butter, and milk.
Dataset Example (transaction ID: items purchased):
1: Bread, Butter
2: Bread, Milk
4: Milk
5: Bread, Butter
1. Support Calculation:
Support = Transactions containing both bread and butter ÷ Total transactions
2. Confidence Calculation:
Confidence = Support of bread and butter ÷ Support of bread
3. Lift Calculation:
Lift = Confidence ÷ Support of butter
A lift value greater than 1 indicates a positive association between bread and butter.
This example demonstrates how association rules are derived and evaluated, providing actionable insights from
transactional data.
Conclusion
Association Rules are a vital tool in data mining, enabling the discovery of valuable patterns and relationships within
large datasets. Their applications span industries such as retail, healthcare, and finance, driving smarter decision-
making processes. By leveraging advanced algorithms and exploring innovative applications, Association Rules
continue to empower businesses to solve complex problems and unlock new opportunities.
ASSOCIATION RULES:
In data mining, association rules discover relationships in data as "if-then" statements, where the antecedent is the
"if" part (e.g., buying bread) and the consequent is the "then" part (e.g., also buying butter). Multi-relational
association rules extend this by finding associations across different dimensions (like customer age and purchase
category) or databases, providing a more comprehensive understanding of complex relationships.
Definition: Association rules aim to find correlations between items in large datasets. An example is in
market basket analysis, where a rule might be: "If a customer buys bread, then they will also buy milk".
Antecedent: This is the "if" part of the rule, representing a condition or the presence of certain items. In the
example, the antecedent is "a customer buys bread".
Consequent: This is the "then" part of the rule, indicating the likelihood of other items or conditions
occurring with the antecedent. In the example, the consequent is "they will also buy milk".
Multi-Relational Association Rules (MRAR):
Definition: These rules go beyond simple associations within a single dataset or dimension. Instead, they find patterns across different data dimensions or even from multiple related databases.
How they work: MRAR considers interactions between various aspects, such as a customer's demographic
information (age), their buying behavior (items purchased), and other contextual details.
Example: Instead of just "If bread, then milk," a multi-relational rule might look like: "If a student of
age 20 buys diapers, then they will also buy beer". Here, "student" and "age 20" are additional dimensions
that provide context to the association.
ECLAT (Equivalence Class Transformation) is a market basket analysis algorithm used in data mining to find frequently co-occurring items. It converts horizontal data (transactions) into a vertical format (item and transaction-ID lists) and efficiently mines frequent itemsets to discover association rules, as seen in studies applied to e-commerce, supermarkets, and retail settings to improve product placement and promotions. Case studies highlight ECLAT's speed advantage over algorithms like Apriori due to its efficient vertical-based scanning, its ability to uncover purchase patterns, and its role in strategic decision-making for businesses.
1.Vertical Data Conversion: ECLAT transforms the dataset from a horizontal format (showing items within each
transaction) to a vertical format, where each item is associated with a list of transaction IDs (TIDs) that contain it.
2.Frequent Itemset Mining: It uses a depth-first search strategy and set intersections to efficiently identify frequent
itemsets (groups of items purchased together) that meet a user-defined minimum support threshold.
3.Association Rule Generation: Once frequent itemsets are found, association rules are generated in the form of
"if-then" statements (e.g., "if customers buy A, then they are likely to buy B") with measures
like confidence and lift to evaluate their strength.
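The steps above can be sketched in R with the arules package's eclat() function (an assumption: the case studies do not name an implementation); the five grocery transactions are hypothetical.

library(arules)

# Hypothetical grocery transactions, coerced to the vertical 'transactions' format
trans <- as(list(
  c("bread", "butter"), c("bread", "milk"), c("bread", "butter", "milk"),
  c("milk"), c("bread", "butter")
), "transactions")

# Steps 1-2: mine frequent itemsets with a 40% minimum support
itemsets <- eclat(trans, parameter = list(supp = 0.4, maxlen = 3))
inspect(itemsets)

# Step 3: derive "if-then" rules from the frequent itemsets
rules <- ruleInduction(itemsets, trans, confidence = 0.7)
inspect(rules)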
Retail Strategy: In a supermarket scenario, the ECLAT algorithm was used to analyze transactions, identify
frequently bought item combinations (like Indomie goreng special and Indomie ayam bawang), and generate
association rules to help strategize promotions and bundle offers for customers.
E-commerce: Studies on e-commerce book retailers used ECLAT to find relationships between book purchases,
demonstrating its ability to generate the same high-quality association rules as other algorithms, but often faster
due to its vertical scanning approach.
Inventory Management: A case study on a retail store applying ECLAT to identify frequently purchased product
combinations helped with inventory management, allowing the store to optimize ordering quantities for popular
items to prevent stockouts.
Benefits of ECLAT
Efficiency: ECLAT is known for its speed and efficiency, especially compared to the Apriori algorithm, because its
vertical scanning method avoids multiple passes over the entire dataset.
Scalability: It can handle large databases by converting them into a vertical format, allowing for more efficient
computation of itemsets.
Actionable Insights: The association rules derived from ECLAT analysis provide valuable insights into customer
purchasing patterns, which businesses can use for cross-selling, product placement, and personalized
recommendations to drive sales.
CLUSTER ANALYSIS:
Cluster analysis groups similar data points together without requiring predefined labels. Among its strengths:
3. Adaptability to Different Data Types: It can work with numerical data like age and salary as well as categorical data like gender and occupation.
4. Handling Noisy and Missing Data: Datasets usually contain missing values or inconsistencies, and clustering can manage them easily.
Distance Metrics
Distance metrics are simple mathematical formulas that quantify how similar or different two data points are. The type of distance metric we choose plays a big role in deciding clustering results. Some of the common metrics are listed below (a short R sketch follows the list):
Euclidean Distance: It is the most widely used distance metric and finds the straight-line distance between two
points.
Manhattan Distance: It measures the distance between two points based on grid-like path. It adds the absolute
differences between the values.
Cosine Similarity: This method checks the angle between two points instead of looking at the distance. It’s used in
text data to see how similar two documents are.
Jaccard Index: A statistical tool used for comparing the similarity of sample sets. It’s mostly used for yes/no type
data or categories.
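A base-R sketch of the four metrics above for two small example points and two small example sets (the vectors are arbitrary):

a <- c(2, 4, 6)
b <- c(5, 1, 7)

sqrt(sum((a - b)^2))                              # Euclidean (straight-line) distance
sum(abs(a - b))                                   # Manhattan (grid-path) distance
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))    # cosine similarity (angle-based)

# Jaccard index for two yes/no (set-valued) records
x <- c("bread", "butter", "milk")
y <- c("bread", "milk", "jam")
length(intersect(x, y)) / length(union(x, y))     # 2 / 4 = 0.5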
1. Partitioning Methods
Partitioning Methods divide the data into k groups (clusters) where each data point belongs to only one
group. These methods are used when you already know how many clusters you want to create. A common
example is K-means clustering.
In K-means, the algorithm assigns each data point to the nearest centre and then updates the centre based on the average of all points in that group. This process repeats until the centres stop changing. It is used in real-life applications, such as streaming platforms like Spotify grouping users based on their listening habits.
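A minimal sketch of K-means with base R's kmeans() on hypothetical listening-habit features (the two columns and the choice of k = 2 are assumptions for illustration):

set.seed(42)
# Hypothetical listeners: daily listening hours and number of genres played
listeners <- data.frame(
  hours  = c(rnorm(50, 1, 0.3), rnorm(50, 4, 0.5)),
  genres = c(rnorm(50, 2, 0.5), rnorm(50, 8, 1.0))
)

km <- kmeans(listeners, centers = 2, nstart = 10)   # k = 2 clusters
km$centers                                          # the updated cluster centres
table(km$cluster)                                   # cluster sizes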
2. Hierarchical Methods
Hierarchical clustering builds a tree-like structure of clusters known as a dendrogram that represents the merging or splitting of clusters. It can be divided into the following approaches (a minimal R sketch follows):
Agglomerative Approach (Bottom-up): It starts with each data point as its own cluster and repeatedly merges the closest clusters until one cluster, or a chosen number of clusters, remains.
Divisive Approach (Top-down): It starts with one big cluster and splits it repeatedly into smaller clusters. For example, classifying animals into broad categories like mammals, reptiles, etc., and further refining them.
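A minimal sketch of hierarchical clustering in base R using hclust(), which follows the agglomerative (bottom-up) approach, on the built-in iris measurements; cutree() then extracts a chosen number of clusters from the dendrogram.

data <- iris[, 1:4]                        # numeric measurements only
hc <- hclust(dist(data), method = "complete")
plot(hc, labels = FALSE)                   # dendrogram of merges
groups <- cutree(hc, k = 3)                # cut the tree into 3 clusters
table(groups, iris$Species)                # compare clusters with the known species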
3. Density-Based Methods
Density-based clustering groups data points that are densely packed together and treats regions with fewer data points as noise or outliers. This method is particularly useful when clusters are irregular in shape.
For example, it can be used in fraud detection as it identifies unusual patterns of activity by grouping
similar behaviors together.
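A minimal sketch using the dbscan package (an assumption) on simulated two-dimensional data: densely packed points form clusters, while scattered points are labelled 0, i.e., treated as noise/outliers.

library(dbscan)

set.seed(7)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),   # dense blob 1
             matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),   # dense blob 2
             matrix(runif(10, min = -2, max = 5), ncol = 2))     # scattered noise

db <- dbscan(pts, eps = 0.5, minPts = 5)
table(db$cluster)          # cluster 0 holds the points treated as outliers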
4. Grid-Based Methods
Grid-Based Methods divide data space into grids making clustering efficient. This makes the clustering
process faster because it reduces the complexity by limiting the number of calculations needed and is useful
for large datasets.
Climate researchers often use grid-based methods to analyze temperature variations across different
geographical regions. By dividing the area into grids they can more easily identify temperature patterns and
trends.
5. Model-Based Methods
Model-based clustering groups data by assuming it comes from a mix of distributions. Gaussian Mixture
Models (GMM) are commonly used and assume the data is formed by several overlapping normal
distributions.
GMM is commonly used in voice recognition systems as it helps to distinguish different speakers by
modeling each speaker’s voice as a Gaussian distribution.
6. Constraint-Based Methods
It uses User-defined constraints to guide the clustering process. These constraints may specify certain
relationships between data points such as which points should or should not be in the same cluster.
In healthcare, clustering patient data might take into account both genetic factors and lifestyle choices.
Constraints specify that patients with similar genetic backgrounds should be grouped together while also
considering their lifestyle choices to refine the clusters.
1. Numerical Data: Numerical data consists of measurable quantities like age, income, or temperature. Algorithms like k-means and DBSCAN work well with numerical data because they depend on distance metrics. For example, a fitness app can cluster users based on their average daily step count and heart rate to identify different fitness levels.
2. Categorical Data: Categorical data contains non-numerical values like gender, product categories, or answers to survey questions. Algorithms like k-modes or hierarchical clustering are better suited for this, for example, grouping customers based on preferred shopping categories like "electronics", "fashion", and "home appliances".
3. Mixed Data: Some datasets contain both numerical and categorical features and require hybrid approaches. For example, clustering a customer database based on income (numerical) and shopping preferences (categorical) can use the k-prototypes method.
Market Segmentation: Clustering is used to segment customers based on purchasing behaviour, allowing businesses to send the right offers to the right people.
Image Segmentation: In computer vision it can be used to group pixels in an image to detect objects like
faces, cars or animals.
Biological Classification: Scientists use clustering to group genes with similar behaviors to understand
diseases and treatments.
Document Classification: It is used by search engines to categorize web pages for better search results.
Anomaly Detection: Cluster Analysis is used for outlier detection to identify rare data points that do not
belong to any cluster.
Choosing the Number of Clusters: Methods like K-means require the user to specify the number of clusters before starting, which can be difficult to guess correctly.
Scalability: Some algorithms, like hierarchical clustering, do not scale well with large datasets.
Cluster Shape: Many algorithms assume clusters are round or evenly shaped, which doesn't always match real-world data.
Handling Noise and Outliers: Many algorithms are sensitive to noise and outliers, which can affect the results.
Cluster analysis is like organising a messy room: sorting items into meaningful groups makes everything easier to understand. Choosing the right clustering method depends on the dataset and the goal of the analysis.