Data Mining Using R
UNIT:1
An idea on Data Warehouse, Data mining-KDD versus data mining, Stages of the Data Mining Process-Task
primitives., Data Mining Techniques - Data mining knowledge representation.
UNIT:2
Data mining query languages- Integration of Data Mining System with a Data Warehouse-Issues, Data pre-
processing - Data Cleaning, Data transformation - Feature selection Dimensionality Reduction.
UNIT:3
Concept Description: Characterization and comparison What is Concept Description, Data Generalization by
Attribute-Oriented Induction(AOI), AOI for Data Characterization, Efficient Implementation of AOI.
Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Frequent Itemset Mining Methods:
Apriori method, generating Association Rules, Improving the Efficiency of Apriori, Pattern-Growth Approach for
mining Frequent Itemsets.
UNIT:4
Classification Basic Concepts: Basic Concepts, Decision Tree Induction: Decision Tree Induction Algorithm,
Attribute Selection Measures, Tree Pruning. Bayes Classification Methods.
UNIT:5
Cluster Analysis: Cluster Analysis, Partitioning Methods, Hierarchical methods, Density based methods-DBSCAN.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of
management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a
particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may
have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a
product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often
only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should
never be altered.
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases where the technology is
mature and well known, and where the business problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business
modeling and technology development. It allows an organization to move forward at considerably less expense and
to evaluate the benefits of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach
while retaining the rapid implementation and opportunistic application of the bottom-up approach.
Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration,
sales, or the general ledger. If the business process is organizational and involves multiple complex object
collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the
fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer,
supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like
dollars sold and units sold.
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and
utilities are used to feed data into the bottom tier from operational databases or other external sources (such as
customer profile information provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load
and refresh functions to update the data warehouse. The data are extracted using application program interfaces
known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also
contains a metadata repository, which stores information about the data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a
multidimensional OLAP (MOLAP) model.
A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on multidimensional data
to standard relational operations.
A multidimensional OLAP (MOLAP) model is a special-purpose server that directly implements multidimensional
data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining
tools (e.g., trend analysis, prediction, and so on).
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire organization.
It provides corporate-wide data integration, usually from one or more operational systems or external information
providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer super servers, or parallel
architecture platforms. It requires extensive business modeling and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is
confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item,
and sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX- or Windows-based.
The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years.
However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent. Independent data
marts are sourced from data captured from one or more operational systems or external information providers, or
from data generated locally within a particular department or geographic area. Dependent data marts are sourced
directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the
possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer.
Thus, data mining should have been more appropriately named knowledge mining, which emphasizes mining
knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform it into an
understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business information in a large
database — for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a
vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently
probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business opportunities by
providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding predictive
information in large databases. Questions that traditionally required extensive hands-on analysis can now be
answered directly from the data — quickly.
A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings
to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond
similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify
previously hidden patterns in one step. An example of
pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying
anomalous data that could represent data entry keying errors.
Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records, that might
be interesting or data errors that require further investigation.
Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar",
without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program
might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including Visualization and report
generation.
Knowledge Discovery in Databases (KDD) is a comprehensive, multi-step process for extracting useful knowledge
from large datasets, while Data Mining (DM) is a specific step within the KDD process that applies intelligent
methods to extract patterns from the data.
Aspect | Data Mining | KDD
Scope | In the KDD method, the fourth phase is called "data mining." | KDD is a broad method that includes data mining as one of its steps.
Example | Clustering groups of data elements based on how similar they are. | Data analysis to find patterns and links.
Data mining is often a synonym for Knowledge Discovery from Data (KDD). Some people see data mining as a
key part of KDD, where smart methods are used to find patterns in the data. The term "Knowledge Discovery in
Databases" (KDD) was first coined by Gregory Piatetsky-Shapiro in 1989. However, "data mining" became more
widely used in business and media. Today, both terms are often used interchangeably.
Knowledge discovery from data (KDD) is a multi-step process for extracting useful insights. The following are the
key steps involved:
Data Selection: Identify and select relevant data from various sources for analysis.
Data Preprocessing: Clean and transform the data to address errors and inconsistencies, making it suitable for
analysis.
Data Transformation: Convert the cleaned data into a form that is suitable for data mining algorithms.
Data Mining: Apply data mining techniques to identify patterns and relationships in the data, selecting appropriate
algorithms and models.
Pattern Evaluation: Evaluate the identified patterns to determine their usefulness in making predictions or
decisions.
Knowledge Refinement: Refine the knowledge obtained to improve accuracy and usefulness based on feedback.
Knowledge Dissemination: Share the results in an easily understandable format to aid decision-making.
We now discuss different types of data mining techniques, which are used to predict the desired output.
1. Association
Association analysis looks for patterns where certain items or conditions tend to appear together in a dataset. It's
commonly used in market basket analysis to see which products are often bought together. One method, called
associative classification, generates rules from the data and uses them to build a model for predictions.
2. Classification
Classification builds models to sort data into different categories. The model is trained on data with known labels and
is then used to predict labels for unknown data. Some examples of classification models are:
Decision Tree
Bayesian classification
Classification by Backpropagation
K-NN Classifier
Rule-Based Classification
Fuzzy Logic
3. Prediction
Prediction is similar to classification, but instead of predicting categories, it predicts continuous values (like
numbers). The goal is to build a model that can estimate the value of a specific attribute for new data.
4. Clustering
Clustering groups similar data points together without using predefined categories. It helps discover hidden patterns
in the data by organizing objects into clusters where items in each cluster are more similar to each other than to those
in other clusters.
5. Regression
Regression is used to predict continuous values, like prices or temperatures, based on past data. There are two main
types: linear regression, which looks for a straight-line relationship, and multiple linear regression, which uses more
variables to make predictions.
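As a minimal R sketch of this idea (the house-price data frame and its column names are illustrative assumptions, not part of the course material):
houses <- data.frame(size = c(800, 1000, 1200, 1500, 1800),      # hypothetical past data
                     price = c(60, 75, 88, 110, 130))
model <- lm(price ~ size, data = houses)        # linear regression: straight-line relationship
summary(model)                                  # fitted intercept and slope
predict(model, newdata = data.frame(size = 1400))   # predict a continuous value for new data
Multiple linear regression simply adds more predictors to the formula, e.g. price ~ size + rooms, where rooms is another assumed column.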
6. Outlier Detection
Outlier detection identifies data points that are very different from the rest of the data. These unusual points, called
outliers, can be spotted using statistical methods or by checking if they are far away from other data points.
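A minimal sketch of the statistical approach (the numeric vector and the 2-standard-deviation cut-off are assumptions chosen for illustration):
x <- c(10, 12, 11, 13, 12, 95, 11, 10)     # hypothetical values with one unusual point
z <- (x - mean(x)) / sd(x)                 # z-score: distance from the mean in standard deviations
x[abs(z) > 2]                              # points far away from the rest are flagged as outliers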
7. Genetic Algorithm
Genetic algorithms are inspired by natural selection. They solve problems by evolving solutions over several
generations. Each solution is like a "species," and the fittest solutions are kept and improved over time, simulating
"survival of the fittest" to find the best solution to a problem.
Data mining is a powerful tool that offers many benefits across a wide range of industries. Among the advantages of
data mining:
Better Decision Making: helps extract useful information from large datasets for informed decision making.
Histograms
It consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.
Example: a histogram of an electricity bill generated over 4 months.
Patterns in the data are marked easily by using the data visualization technique.
In pixel based visualization techniques, there are separate sub-windows for the value of each attribute and it is
represented by one colored pixel.
It maximizes the amount of information represented at one time without any overlap.
A tuple with 'm' variables is represented by 'm' colored pixels, one per variable, and each variable has its own
sub-window.
The color mapping of the pixel is decided on the basis of data characteristics and visualization tasks.
i. Scatter-plot matrices
A scatter-plot matrix consists of scatter plots of all possible pairs of variables in a dataset.
ii. Parallel coordinates
The separated, parallel vertical lines define the axes, one per dimension, and each data record is drawn as a line
crossing the axes at its attribute values.
Chernoff faces
The faces in Chernoff faces are based on the facial expressions and features of a human being, so it becomes easy to
identify differences between the faces.
It includes the mapping of different data dimensions with different facial features.
For example: the face width, the length of the mouth, and the length of the nose.
Hierarchical visualization techniques are used for partitioning all dimensions into subsets.
i. Dimensional stacking
The most important attributes are marked and used on the outer levels.
Rectangles are used to represent the counts of categorical data, and at every stage the rectangles are split in parallel.
ii. Worlds-within-Worlds
The innermost world must have a function and the two most important parameters.
Through this, an N-vision view of the data is possible using devices like data gloves and stereo displays, including
rotation, scaling (inner), and translation (inner/outer).
Tree map visualization techniques are well suited for displaying large amounts of hierarchically structured data.
The visualization space is divided into the multiple rectangles that are ordered, according to a quantitative
variable.
The levels in the hierarchy are seen as rectangles containing other rectangles.
Each set of rectangles on the same level in the hierarchy represents a category, a column or an expression in a data
set.
A tag cloud is a visualization method which helps to understand the information of user generated tags.
It is also possible to arrange the tags alphabetically or according to the user preferences with different font sizes
and colors.
1. Purpose:
o Enable users to specify patterns to be discovered, data to be analyzed, and the conditions under which data mining
operations should be performed.
2. Examples:
o DMQL (Data Mining Query Language): Designed for specifying data mining tasks and providing access to data
mining models.
o SQL with Data Mining Extensions: Some database systems extend SQL to include data mining capabilities,
allowing users to integrate data mining tasks directly within their SQL queries.
3. Typical Functions:
o Specifying target variables and patterns (e.g., mining frequent itemsets, building classification models).
o Supporting the retrieval of discovered patterns and associations for further analysis.
Let’s work through examples that cover association rule mining (Apriori), classification (Decision Tree)
Example Code:
library(arules)   # assumed: the arules package provides apriori() and the transactions class
# Hypothetical baskets (only the first appears in the original notes), chosen so the sample rules below hold
transactions <- list(c("Bread", "Milk"), c("Bread", "Milk", "Diaper"),
                     c("Bread", "Milk", "Diaper"), c("Milk", "Diaper"), c("Diaper", "Beer"))
# Convert the list to transaction data format
trans <- as(transactions, "transactions")
# Apply Apriori algorithm to find frequent itemsets and association rules
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6))
inspect(rules)   # view the discovered rules
Explanation:
Apriori: We specify a minimum support of 20% and a confidence level of 60% to mine the association
rules.
Sample Output:
Rule 1: If Milk is bought, there is a 75% chance Bread will also be bought.
Rule 2: If Diaper is bought, there is a 75% chance Milk will also be bought.
Example Code:
library(rpart)
# Sample customer data (the Income and Churn values are assumed for illustration)
customer_data <- data.frame(Age = c(25, 45, 35, 50, 23),
                            Income = c(30000, 90000, 55000, 85000, 28000),
                            Churn = factor(c("Yes", "No", "No", "No", "Yes")))
# Build a decision tree using Age and Income to predict Churn
churn_model <- rpart(Churn ~ Age + Income, data = customer_data,
                     method = "class", control = rpart.control(minsplit = 2))
print(churn_model)
churn_model: Builds a decision tree using Age and Income to predict Churn.
The model divides the dataset based on attribute values to make classifications.
Sample Output:
n= 5
1. Flexibility: R offers a flexible and extensible environment for data mining, with many libraries available
for different tasks.
2. Wide Range of Algorithms: R supports a broad spectrum of data mining algorithms (e.g., decision trees,
k-means, apriori) through its packages.
3. Visualization Support: R integrates data mining with advanced visualization tools to help users
understand patterns.
4. Scalability: R can handle large datasets and complex mining tasks using parallel computing and big data
solutions.
Conclusion: While R does not have a formal Data Mining Query Language (DMQL), its extensive packages such as
rpart, arules, and dplyr provide a rich framework to perform data mining tasks. These packages offer functionalities
that mimic DMQL, enabling users to classify, cluster, and mine association rules effectively.
With its versatility and powerful statistical tools, R is a go-to solution for data mining tasks in academic and
professional environments.
1. No coupling: No coupling means that a DM system will not utilize any function of a DB or DW system. It may
fetch data from a particular source (such as a file system), process data using some data mining algorithms, and
then store the mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching
data from a data repository managed by these systems, performing data mining, and then storing the mining results
either in a file or in a designated place in a database or data Warehouse. Loose coupling is better than no coupling
because it can fetch any portion of data stored in databases or data warehouses by using query processing,
indexing, and other system facilities.
3. Semi-tight coupling: Semi-tight coupling means that besides linking a DM system to a DB/DW system, efficient
implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data
mining functions) can be provided in the DB/DW system. These primitives can include sorting, indexing,
aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as
sum, count, max, min, and standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data
mining subsystem is treated as one functional component of an information system. Data mining queries and functions
are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of
a DB or DW system.
When you integrate data in data mining, you may face many issues. Some of those issues are:
1. Entity Identification Problem
The records are obtained from heterogeneous sources, so the question is how to match the real-world entities
across the data. For example, suppose you are given customer data from two different sources: one source assigns
a customer ID to an entity, while the other assigns a customer number to it. Analyzing such metadata will prevent
you from making errors during schema integration.
2. Redundancy
One of the major issues in the course of data integration is redundancy. Unimportant data that are no longer required
are referred to as redundant data. Redundancy may also appear when an attribute can be derived from another
attribute inside the data set. For example, if one data set contains the customer's age and a different data set contains
the customer's date of birth, then age is a redundant attribute because it can be deduced from the date of birth.
3. Tuple Duplication and Data Value Conflicts
Combining records from several sources can also introduce duplicate tuples. In the same way that attribute values
can vary, whole records can be repeated. In addition, attribute values for the same real-world entity may disagree
because they are represented differently in the different data units. For example, in different cities the price of a
hotel room might be expressed in a different currency. This type of issue is recognized and fixed during the data
integration process.
DATA PREPROCESSING:
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.
Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to
contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares
raw data for further processing. Data preprocessing is used in database-driven applications such as customer
relationship management and in rule-based applications (like neural networks).
Data preprocessing is an important step in the data mining process that involves cleaning and transforming raw data
to make it suitable for analysis. Some common steps in data preprocessing include:
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with different formats, structures,
and semantics. Techniques such as record linkage and data fusion can be used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization is used to
transform the data to have zero mean and unit variance. Discretization is used to convert continuous data into
discrete categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature selection and feature
extraction. Feature selection involves selecting a subset of relevant features from the dataset, while feature
extraction involves transforming the data into a lower-dimensional space while preserving the important
information.
5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and
clustering.
6. Data Normalization: This involves scaling the data to a common range, such as
between 0 and 1 or - 1 and 1. Normalization is often used to handle data with different units and scales. Common
normalization techniques include min-max normalization, z-score normalization, and decimal scaling.
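The three normalization techniques just mentioned can be sketched in R as follows (the sample vector is an assumption used only for illustration):
x <- c(200, 300, 400, 600, 1000)                         # hypothetical attribute values
minmax  <- (x - min(x)) / (max(x) - min(x))              # min-max normalization to [0, 1]
zscore  <- (x - mean(x)) / sd(x)                         # z-score: zero mean, unit variance
j       <- floor(log10(max(abs(x)))) + 1                 # smallest j with all |x| / 10^j below 1
decimal <- x / 10^j                                      # decimal scaling normalization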
Data preprocessing:
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results. The
specific steps involved in data preprocessing may vary depending on the nature of the data and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more accurate.
Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient
format.
1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data cleaning is done.
It involves handling of missing data, noisy data etc.
Missing Data: This situation arises when some values are missing in the data. It can be handled in various
ways. Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.
o Fill the Missing values: There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.
Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to
faulty data collection, data entry errors, etc. It can be handled in the following ways:
o Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and then various methods are performed to complete the task. Each segment is handled
separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the
task.
o Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be
linear (having one independent variable) or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall
outside the clusters; a short R sketch of these cleaning steps follows this list.
o Utilize regex for pattern matching and string manipulation to clean and standardize text data.
Data Profiling:
o Perform data profiling to understand the data's structure, quality, and content, which aids in identifying cleaning
needs.
Data Integration:
o When merging data from multiple sources, ensure consistent formats and values across datasets.
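Putting the cleaning steps above together, here is a minimal R sketch (the data frame, column names, and number of bins are assumptions):
df <- data.frame(age = c(23, NA, 35, 41, NA, 29),          # hypothetical data with missing values
                 income = c(20, 22, 25, 210, 27, 24))      # ... and one noisy value
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)        # fill missing values with the attribute mean
bins <- cut(rank(df$income, ties.method = "first"),        # equal-frequency binning into 3 segments
            breaks = 3, labels = FALSE)
df$income_smooth <- ave(df$income, bins, FUN = mean)       # smooth each segment by its bin mean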
2. Data Transformation: This step is taken in order to transform the data into forms suitable for the mining
process. It involves the following:
Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
Attribute Selection: In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
Discretization: This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
Concept Hierarchy Generation: Here attributes are converted from lower level to higher level in
hierarchy. For Example-The attribute “city” can be converted to “country”.
o The ETL process is crucial in data warehousing, where data is extracted from various sources, transformed into a
usable format, and then loaded into the data warehouse.
o Data transformation can be accomplished using programming languages such as Python (with libraries like
pandas and NumPy) or R, which offer extensive functionalities for data manipulation and transformation.
o Utilize data transformation tools (e.g., Talend, Apache Nifi, Alteryx) that provide user-friendly interfaces for
performing various transformation tasks.
o Use regex for pattern matching and text manipulation, particularly for cleaning and standardizing string data.
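A minimal sketch of the Discretization and Concept Hierarchy Generation steps listed above (the marks, cut points, and the city-to-country lookup are assumptions):
marks <- c(35, 47, 58, 62, 71, 88, 93)                       # hypothetical raw numeric attribute
grade <- cut(marks, breaks = c(0, 50, 75, 100),              # discretization: replace raw values
             labels = c("low", "medium", "high"))            # by interval (conceptual) levels
city    <- c("Delhi", "Paris", "Mumbai", "Lyon")             # lower-level attribute
lookup  <- c(Delhi = "India", Mumbai = "India",
             Paris = "France", Lyon = "France")
country <- lookup[city]                                      # concept hierarchy: city rolled up to country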
3. Data Integration: Data from heterogeneous sources are combined into a single dataset. There are
two types of data integration:
1. Tight coupling: the data are combined from the different sources into a single physical location.
2. Loose coupling: only an interface is created and the data are combined through the interface; the data remain
in the actual source databases.
4. Data Reduction: This involves reducing the size of the dataset while preserving the important information. It can
be achieved in the following ways:
Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature selection is often
performed to remove irrelevant or redundant features from the dataset. It can be done using various techniques
such as correlation analysis, mutual information, and principal component analysis (PCA).
Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the
important information. Feature extraction is often used when the original features are high-dimensional and
complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative
matrix factorization (NMF).
Sampling: This involves selecting a subset of data points from the dataset. Sampling is often used to reduce the
size of the dataset while preserving the important information. It can be done using techniques such as random
sampling, stratified sampling, and systematic sampling.
Clustering:
This involves grouping similar data points together into clusters. Clustering is often used to reduce the size of the
dataset by replacing similar data points with a representative centroid. It can be done using techniques such as k-
means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important information. Compression is
often used to reduce the size of the dataset for storage and transmission purposes. It can be done using techniques
such as wavelet compression, JPEG compression, and gif compression
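A minimal sketch of the sampling idea (the data frame, strata, and sample sizes are assumptions for illustration):
set.seed(1)
df <- data.frame(id = 1:100, group = rep(c("A", "B"), each = 50))   # hypothetical dataset
srs <- df[sample(nrow(df), size = 10), ]                            # simple random sampling of 10 rows
strat <- do.call(rbind, lapply(split(df, df$group),                 # stratified: 5 rows per group
                               function(g) g[sample(nrow(g), 5), ]))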
The number of input features, variables, or columns present in a given dataset is known as dimensionality, and the
process to reduce these features is called dimensionality reduction. A dataset may contain a huge number of input
features in various cases, which makes the predictive modeling task more complicated. Because it is very difficult
to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.
Dimensionality reduction:
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it provides similar information."
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition, signal processing,
bioinformatics, etc. It can also be used for data visualization, noise reduction, cluster analysis, etc.
Handling the high-dimensional data is very difficult in practice, commonly known as the curse of dimensionality. If
the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex.
As the number of features increases, the number of samples needed to generalize well also increases, and the chance
of overfitting grows. If a machine learning model is trained on high-dimensional data, it can become overfitted
and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
There are also some disadvantages of applying the dimensionality reduction, which are given below:
o In the PCA dimensionality reduction technique, sometimes the principal components required to consider
are unknown.
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the irrelevant features
present in a dataset to build a model of high accuracy. In other words, it is a way of selecting the optimal features
from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common
techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
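A minimal sketch of two of these filter measures in R (the simulated data frame and its variables are assumptions):
set.seed(7)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))  # hypothetical numeric features
df$x3 <- df$x1 * 0.9 + rnorm(100, sd = 0.1)                         # a feature correlated with x1
sort(abs(cor(df[, c("x1", "x2", "x3")], df$y)[, 1]),                # correlation filter: rank features
     decreasing = TRUE)                                             # by |correlation| with the target
tbl <- table(sample(c("yes", "no"), 100, TRUE),                     # hypothetical categorical pair
             sample(c("low", "high"), 100, TRUE))
chisq.test(tbl)                                                     # chi-square test of independence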
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it takes a machine learning model for its evaluation.
In this method, some features are fed to the ML model and the performance is evaluated. The performance decides
whether to add or remove features in order to increase the accuracy of the model. This method is more accurate than
the filtering method but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
o LASSO
o Elastic Net
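As a minimal sketch, forward selection and backward elimination can be run with R's built-in step() function (the mtcars regression is an illustrative choice, not from the notes; LASSO and Elastic Net would typically use the glmnet package instead):
full <- lm(mpg ~ ., data = mtcars)                     # model with all candidate features
back <- step(full, direction = "backward", trace = 0)  # drop features while AIC improves
fwd  <- step(lm(mpg ~ 1, data = mtcars),               # start from no features ...
             direction = "forward",
             scope = formula(full), trace = 0)         # ... and add one feature at a time
formula(back); formula(fwd)                            # selected feature subsets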
Feature extraction is the process of transforming the space containing many dimensions into space with fewer
dimensions. This approach is useful when we want to keep the whole information but use fewer resources while
processing the information.
Common dimensionality reduction techniques include:
1. Principal Component Analysis (PCA)
2. Backward Elimination
3. Forward Selection
4. Score Comparison
5. Missing Value Ratio
6. Random Forest
7. Factor Analysis
8. Auto-Encoder
9. Kernel PCA
Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of
linearly uncorrelated features with the help of orthogonal transformation. These new transformed features are called
the Principal Components. It is one of the popular tools that is used for exploratory data analysis and predictive
modeling.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split
between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image
processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
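A minimal PCA sketch using R's built-in prcomp() function (the iris measurements are an illustrative dataset, not from the notes):
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # orthogonal transformation of 4 features
summary(pca)                   # proportion of variance captured by each principal component
reduced <- pca$x[, 1:2]        # keep the first two components as the lower-dimensional data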
Backward Feature Elimination
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o Now we will remove one feature each time and train the model on n-1 features for n times, and will
compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance of the model, and
then we will drop that variable or features; after that, we will be left with n-1 features.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can
define the optimal number of features required for the machine learning algorithm.
Forward feature selection follows the inverse process of the backward elimination process. It means, in this
technique, we don't eliminate the feature; instead, we will find the best features that can produce the highest increase
in the performance of the model. Below steps are performed in this technique:
o We start with a single feature only, and progressively we will add each feature at a time.
o The process will be repeated until we get a significant increase in the performance of the model.
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as they do not carry much useful information.
To perform this, we can set a threshold level, and if a variable has missing values more than that threshold, we will
drop that variable. The higher the threshold value, the more efficient the reduction.
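A minimal sketch of this missing value ratio check (the data frame and the 40% threshold are assumptions):
df <- data.frame(a = c(1, NA, 3, NA, NA),                # 60% missing
                 b = c(2, 4, 6, 8, 10),                  # complete
                 c = c(NA, NA, NA, NA, 5))               # 80% missing
missing_ratio <- colMeans(is.na(df))                     # fraction of missing values per variable
df_reduced <- df[, missing_ratio <= 0.4, drop = FALSE]   # drop variables above the threshold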
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. This algorithm contains
an in-built feature importance package, so we do not need to program it separately. In this technique, we need to
generate a large set of trees against the target variable, and with the help of usage statistics of each attribute, we need
to find the subset of features.
The random forest algorithm takes only numerical variables, so categorical input data needs to be converted into
numeric form, e.g., using one-hot encoding.
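A minimal sketch with the randomForest package (the iris data, the seed, and 100 trees are illustrative assumptions):
library(randomForest)                      # assumed to be installed
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
importance(rf)                             # in-built feature importance scores per attribute
varImpPlot(rf)                             # rank the features; keep the most important subset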
1. Principal Component Analysis (PCA): Transforms the original features into a new set of orthogonal features
(principal components) that capture the maximum variance in the data.
2. t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique that is particularly useful for
visualizing high-dimensional data in two or three dimensions.
3. Linear Discriminant Analysis (LDA): A supervised dimensionality reduction technique that aims to project data
in a way that maximizes class separability.
4. Autoencoders: Neural network-based techniques that learn to compress data into a lower- dimensional space and
then reconstruct it back.
Key Differences
Objective: feature selection selects a subset of the original features, while feature extraction transforms the data into
a new feature space.
When to Use
Feature selection:
o You have a large number of features and want to identify the most relevant ones.
o You aim to improve model performance and interpretability without altering the feature space.
Feature extraction:
o You want to reduce noise and redundancy while preserving the underlying structure of the data.
o You are dealing with data that has multicollinearity issues (highly correlated features).
Data mining can be classified into two categories: descriptive data mining and predictive data mining.
Descriptive data mining describes the data set in a concise and summative manner and presents interesting
general properties of the data. Predictive data mining analyzes the data in order to construct one or a set of
models, and attempts to predict the behavior of new data sets. Databases usually store large
amounts of data in great detail. However, users often like to view sets of summarized data in concise,
descriptive terms. Such data descriptions may provide an overall picture of a class of data or distinguish it
from a set of comparative classes. Moreover, users appreciate the ease and flexibility of having data sets described
at different levels of granularity and from different angles. Such descriptive data mining is called concept
description and forms an important component of data mining.
The simplest kind of descriptive data mining is concept description. A concept usually refers to a collection
of data such as frequent_buyers, graduate_students, and so on. As a data mining task, concept description is
not a simple enumeration of the data. Instead, concept description generates descriptions for
characterization and comparison of the data. It is sometimes called class description when the concept to
be described refers to a class of objects. Characterization provides a concise and succinct summarization of
the given collection of data, while concept or class comparison (also known as discrimination) provides
descriptions comparing two or more collections of data. Since concept description involves both
characterization and comparison, techniques for accomplishing each of these tasks will be studied.
Concept description has close ties with data generalization. Given the large amount of data stored in a database, it
is useful to be able to describe concepts in concise and succinct terms at multiple levels of abstraction; this
facilitates users in examining the general behavior of the data. Given the ABCompany database, for example,
instead of examining individual customer transactions, sales managers may prefer to view the data generalized to
higher levels, such as summarized by customer groups according to geographic regions, frequency of purchases
per group, and customer income. Such multiple-dimensional, multilevel data generalization is similar to
multidimensional data analysis in data warehouses. The fundamental differences between concept description in
large databases and online analytical processing involve the following.
Concept description:
➢ Characterization: provides a concise and succinct summarization of the given collection of data
❖ As a data mining task, concept description is not a simple enumeration (number of things done one by one) of the
data.
❖ Concept description generates descriptions for characterization and comparison of the data it is also called class
description.
While concept or class comparison (also known as discrimination) provides descriptions that compare two or more
collections of data.
Example:
➢ Given the ABC Company database, for example, instead of examining individual customer transactions,
➢ Sales managers may prefer to view the data generalized to higher level, such as summarized by customer groups
according to geographic regions, frequency of purchases per group and customer income.
Data Generalization and Summarization-Based Characterization: Data and objects in databases often contain detailed
information at primitive concept levels. For example, the item relation in a sales database may contain attributes
describing low-level item information such as item_ID, name, brand, category, supplier, place_made, and price. It is
useful to be able to summarize a large set of data and present it at a high conceptual level. For example,
summarizing a large set of items relating to Christmas season sales provides a general description of such data,
which can be very helpful for sales and marketing managers.
This requires an important functionality in data mining: data generalization. Data generalization is a process that
abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual
levels. Methods for the efficient and flexible generalization of large data sets can be categorized according to two
approaches: (1) the data cube (or OLAP) approach and (2) the attribute-oriented induction approach.
Data generalization
A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
Approaches: (1) the data cube (OLAP) approach and (2) the attribute-oriented induction (AOI) approach.
The generalized results can be presented in several forms:
Generalized relation:
Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
Cross tabulation: presenting the generalized data in cross-tabular form.
Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms.
Mapping the generalized result into characteristic rules with quantitative information associated with them.
AOI:
The attribute-oriented induction (AOI) approach to data generalization and summarization-based characterization
was first proposed in 1989, a few years prior to the introduction of the data cube approach. The data cube approach
can be considered as a data warehouse-based, pre-computation oriented, materialized-view approach. It performs
off-line aggregation before an OLAP or data mining query is submitted for processing. On the other hand, the
attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented,
generalization-based, on-line data analysis technique. However, there is no inherent barrier distinguishing the two
approaches based on on-line aggregation versus off-line pre-computation. Some aggregations in the data cube can be
computed on-line, while off-line pre-computation of multidimensional space can speed up attribute-oriented
induction as well.
Attribute relevance analysis addresses three questions: why it is needed, what it is, and how it is performed. Its basic
steps are:
1. Data Collection
2. Preliminary relevance analysis
3. Relevance Analysis: use a relevance analysis measure, e.g., information gain, to identify highly relevant
dimensions and levels.
A set of basic principles for attribute-oriented induction in relational databases is summarized as follows:
1. Data focusing: analyze the task-relevant data, including dimensions; the result is the initial relation.
2. Attribute removal: remove attribute A if there is a large set of distinct values for A but either (1) there is no
generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
3. Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization
operators on A, then select an operator and generalize A.
AOI stands for Attribute-Oriented Induction. The attribute-oriented induction approach to concept description was
first proposed in 1989, a few years before the introduction of the data cube approach. The data cube approach is
essentially based on materialized views of the data, which typically have been pre-computed in a data warehouse.
In general, it implements off-line aggregation before an OLAP or data mining query is submitted for processing. In
contrast, the attribute-oriented induction approach is generally a query-oriented, generalization-based, on-line
data analysis method.
The general idea of attribute-oriented induction is to first collect the task-relevant data using a database query and
then perform generalization based on the examination of the number of distinct values of each attribute in the
relevant collection of data.
Algorithm:
First, data focusing must be performed before attribute-oriented induction. This step corresponds to the specification
of the task-relevant data (i.e., data for analysis). The data are collected based on the information provided in the data
mining query.
Because a data mining query is usually relevant to only a portion of the database, selecting the relevant set of data
not only makes mining more efficient, but also yields more meaningful results than mining the whole database.
Specifying the set of relevant attributes (i.e., attributes for mining, as indicated in DMQL with the in relevance to
clause) may be difficult for the user. A user may select only a few attributes that he or she feels are important,
while missing others that could also play a role in the description.
For example, suppose that the dimension birth_place is defined by the attributes city, province_or_state, and country.
To allow generalization on the birth_place dimension, the other attributes defining this dimension should also be
included.
In other words, having the system automatically include province_or_state and country as relevant attributes enables
city to be generalized to these higher conceptual levels during the induction process.
At the other extreme, suppose that the user may have introduced too many attributes by specifying all of the possible
attributes with the clause “in relevance to *”. In this case, all of the attributes in the relation specified by the from
clause would be included in the analysis.
Procedure and Output: as an illustration, running an attribute selection evaluator such as Weka's
weka.attributeSelection.CfsSubsetEval -P 1 -E 1 reports the subset of relevant attributes.
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set.
For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a
frequent itemset. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs
frequently in a shopping history database, is a (frequent) sequential pattern.
A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured
pattern. Finding frequent patterns plays an essential role in mining associations, correlations, and many other
interesting relationships among data. Moreover, it helps in data classification, clustering, and other data mining tasks.
Frequent item set mining leads to the discovery of associations and correlations among items in large transactional or
relational data sets. With massive amounts of data continuously being collected and stored, many industries are
becoming interested in mining such patterns from their databases. The discovery of interesting correlation
relationships among huge amounts of business transaction records can help in many business decision-making
processes such as catalog design, cross-marketing, and customer shopping behavior analysis.
This process analyzes customer buying habits by finding associations between the different items that customers
place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by customers. For instance, if customers are
buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket?
This information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Let's look at an example of how market basket analysis can be useful. Suppose, as manager of an AllElectronics
branch, you would like to learn more about the buying habits of your customers. Specifically, you wonder, “Which
groups or sets of items are customers likely to purchase on a given trip to the store?”
To answer your question, market basket analysis may be performed on the retail data of customer
Transactions at your store. You can then use the results to plan marketing or advertising strategies, or in the design of
a new catalog. For instance, market basket analysis may help you design different store layouts. In one strategy,
items that are frequently purchased together can be placed in proximity to further encourage the combined sale of
such items. If customers who purchase computers also tend to buy antivirus software at the same time, then placing
the hardware display close to the software display may help increase the sales of both items.
If we think of the universe as the set of items available at the store, then each item has a Boolean variable
representing its presence or absence, and each basket can be represented by a Boolean vector of values assigned to
these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently
associated or purchased together. These patterns can be represented in the form of association rules. For example,
the information
that customers who purchase computers also tend to buy antivirus software at the same time is represented in the
following association rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]
A support of 2% for the rule means that 2% of all the transactions under analysis show that computer and antivirus
software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. These thresholds can be set by users or domain experts.
Each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to
contain A if A ⊆ T.
An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅. The rule A
⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪B
(i.e., the union of sets A and B say, or, both A and B). This is taken to be the probability, P(A ∪B).
The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A
that also contain B. This is taken to be the conditional probability, P(B|A). That is
confidence(A⇒B) =P(B|A).
Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are
called strong. By convention, we write support and confidence values so that they occur between 0% and 100%,
rather than 0 to 1.0.
If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of
I satisfies the corresponding minimum support count threshold), then I is a frequent itemset.
An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset. The occurrence
frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the
frequency, support count, or count of the itemset
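To make the support and confidence formulas concrete, here is a minimal R sketch (the five small transactions are assumptions, not the AllElectronics data):
D <- list(c("computer", "antivirus"), c("computer"), c("computer", "antivirus"),
          c("printer"), c("computer", "printer"))
A <- "computer"; B <- "antivirus"
both  <- sum(sapply(D, function(t) all(c(A, B) %in% t)))   # transactions containing both A and B
onlyA <- sum(sapply(D, function(t) A %in% t))              # transactions containing A
support    <- both / length(D)                             # P(A ∪ B) = 2/5 = 40%
confidence <- both / onlyA                                 # P(B|A)   = 2/4 = 50%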
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined
minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum
support and minimum confidence.
Apriori Algorithm: Finding Frequent Itemsets by Candidate Generation (the pattern-growth approach, FP-growth,
finds frequent itemsets without candidate generation).
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-
itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The resulting set is denoted by L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent
k-itemsets can be found. The finding of each Lk requires one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori
property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the
minimum support threshold, min sup, then I is not frequent, that is, P(I) < min sup. If an item A is added to the
itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent
either, that is, P(I ∪ A) < min sup.
This property belongs to a special category of properties called antimonotonicity in the sense that if a set cannot
pass a test, all of its supersets will fail the same test as well.
1. The join step: This step generates candidate (k+1)-itemsets from frequent k-itemsets. To find Lk, a set of
candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be
itemsets in Lk−1.
2. The prune step: If any (k−1)-subset of a candidate k-itemset does not meet minimum support, then the candidate is
regarded as infrequent and thus it is removed. This step is performed to reduce the size of the candidate itemsets.
Example:
There are nine transactions in this database, that is, |D| = 9 (transactional data for an AllElectronics branch). We
illustrate the Apriori algorithm for finding frequent itemsets in D, with minimum support count min sup = 2 and a
minimum confidence threshold of, say, 70%.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm
simply scans all of the transactions to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. (The corresponding relative support is
2/9 = 22%.) The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property: From the join step, we first get
C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent:
(a) The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}. All 2-item subsets of {I1, I2, I3} are members of L2. Therefore, keep {I1, I2, I3} in C3.
(b) The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}. All 2-item subsets of {I1, I2, I5} are members of L2. Therefore, keep {I1, I2, I5} in C3. Each of the remaining candidates contains a 2-item subset that is not in L2 and is pruned.
(c) Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning.
7. The transactions in D are scanned again to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, itemset {I1, I2, I3, I5} is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support count:
confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)
Here support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows: for each frequent itemset l, generate all nonempty subsets of l; then, for every nonempty subset s of l, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
Example:
Generating association rules. Let’s try an example based on the transactional data for AllElectronics shown before in
Table 6.1. The data contain frequent itemset X = {I1, I2, I5}. What are the association rules that can be generated
from X? The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
The resulting association rules are as shown below, each listed with its confidence:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong.
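The frequent itemsets and strong rules above can be reproduced in R. The following is a minimal sketch using the arules package (an assumption: the notes do not prescribe a particular R library); the nine transactions mirror the AllElectronics data used in this example.

# Minimal Apriori sketch with the 'arules' package (assumed library)
library(arules)

# The nine transactions of the AllElectronics example, as a list of itemsets
trans_list <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)
trans <- as(trans_list, "transactions")

# Mine strong rules: supp = 0.2 is just below 2/9 so a support count of 2 qualifies;
# conf = 0.7 is the 70% minimum confidence threshold
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.7, minlen = 2))
inspect(sort(rules, by = "confidence"))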
“How can we further improve the efficiency of Apriori-based mining?” Many variations of the Apriori algorithm
have been proposed that focus on improving the efficiency of the original algorithm. Several of these variations are
summarized as follows
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck , for k > 1.
• For example, when scanning each transaction in the database to generate the frequent 1- itemsets, L1, we
can generate all the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash
table structure, and increase the corresponding bucket counts (Figure 6.5).
• A 2-itemset with a corresponding bucket count in the hash table that is below the support threshold cannot
be frequent and thus should be removed from the candidate set.
• Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined
(especially when k = 2).
Transaction reduction (reducing the number of transactions scanned in future iterations):
• A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets.
• Therefore, such a transaction can be marked or removed from further consideration, because subsequent database scans for j-itemsets, where j > k, will not need to consider it.
Partitioning (partitioning the data to find candidate itemsets):
• A partitioning technique can be used that requires just two database scans to mine the frequent itemsets.
• In phase I, the algorithm divides the transactions of D into n non-overlapping partitions and finds the itemsets that are frequent locally within each partition.
• In phase II, a second scan of D is conducted in which the actual support of each candidate is assessed to determine the global frequent itemsets.
Sampling (mining on a subset of the given data):
• The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
• The sample size of S is chosen so that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.
• Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets.
Dynamic itemset counting (adding candidate itemsets at different points during a scan):
• A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked
by start points.
• In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines
new candidate itemsets only immediately before each complete database scan.
• The technique uses the count-so-far as the lower bound of the actual count.
• If the count-so-far passes the minimum support, the itemset is added into the frequent itemset collection and
can be used to generate longer candidates. This leads to fewer database scans than with Apriori for finding
all the frequent itemsets.
However, the Apriori candidate generate-and-test approach can suffer from two nontrivial costs:
• It may still need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
• It may need to repeatedly scan the whole database and check a large set of candidates by pattern matching. It is costly to go over each transaction in the database to determine the support of the candidate itemsets.
• The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance.
• The core of this method is the usage of a special data structure named frequent-pattern tree (FP- tree), which
retains the item set association information.
• First, it compresses the input database creating an FP-tree instance to represent frequent items.
• After this first step, it divides the compressed database into a set of conditional databases, each associated
with one frequent pattern.
The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative information about frequent
patterns in a database. Each transaction is read and then mapped onto a path in the FP- tree.
A frequent Pattern Tree is made with the initial item sets of the database. The purpose of the FP tree is to mine the
most frequent pattern. Each node of the FP tree represents an item of the item set.
The root node represents null, while the lower nodes represent the item sets. The associations of the nodes with the
lower nodes, that is, the item sets with the other item sets, are maintained while forming the tree.
We reexamine the mining of transaction database, D, of Table 6.1 in Example 6.3 using the frequent pattern growth
approach.
The first scan of the database is the same as Apriori, which derives the set of frequent items (1- itemsets) and their
support counts (frequencies).
The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted by L.
Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6}, {I4: 2}, {I5: 2}}.
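The first FP-growth scan can be sketched in base R (no packages assumed): count each item's support, sort the items in descending order of support to obtain L, and then reorder each transaction by L before it is inserted into the FP-tree. The nine transactions mirror the AllElectronics data used above.

# First scan of FP-growth in base R: derive the ordered frequent-item list L
trans_list <- list(
  c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
  c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
  c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3")
)

support_counts <- sort(table(unlist(trans_list)), decreasing = TRUE)
print(support_counts)   # I2:7, I1:6, I3:6, I4:2, I5:2 (ties may appear in either order)

# Re-order the items of each transaction according to L before FP-tree insertion
L_order <- names(support_counts)
ordered_trans <- lapply(trans_list, function(t) L_order[L_order %in% t])
ordered_trans[[1]]      # "I2" "I1" "I5"  (the branch built for T100)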
An FP-tree has one root labelled "null", a set of item-prefix subtrees as its children, and a frequent-item-header table. Each node in an item-prefix subtree has three fields:
– Item-name: the item this node represents;
– Count: the number of transactions represented by the portion of the path reaching the node;
– Node-link: a link to the next node in the FP-tree carrying the same item name, or null if there is none.
First, create the root of the tree, labeled with “null.” Scan database D a second time. The items in each
transaction are processed in L order (i.e., sorted according to descending support count), and a branch is
created for each transaction.
For example, the scan of the first transaction, "T100: I1, I2, I5," which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, ⟨I2: 1⟩, ⟨I1: 1⟩, and ⟨I5: 1⟩, where I2 is linked as a child to the root, I1 is linked to I2, and I5 is linked to I1.
The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch shares the common prefix I2 with the existing path for T100, so the count of the existing I2 node is incremented by 1 and a new node ⟨I4: 1⟩ is created and linked as a child of ⟨I2: 2⟩.
In this way, the problem of mining frequent patterns in databases is transformed into that of mining the FP-tree. The FP-tree is mined as follows:
Start from each frequent length-1 pattern (as an initial suffix pattern), construct its conditional pattern base (a “sub-
database,” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct
its (conditional) FP-tree, and perform mining recursively on the tree. The pattern growth is achieved by the
concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
In the realm of machine learning, classification is a fundamental tool that enables us to categorise data into distinct
groups. Understanding its significance and nuances is crucial for making informed decisions based on data patterns.
Machine Learning
Machine learning is the process of teaching a computer system certain algorithms that can improve themselves with
experience. A very technical definition would be:
"A computer program is said to learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience."
- Tom Mitchell, 1997
Just like humans, the system will be able to perform simple classification tasks and complex mathematical computations like regression. It involves building mathematical models that are used in classification or regression. To 'train' these mathematical models, you need a set of training data: the dataset over which the system builds the model. This section covers the basics of classification in machine learning. Coming to machine learning algorithms, the mathematical models are divided into two categories, depending on their training data: supervised and unsupervised learning models.
Supervised Learning:
When building supervised learning models, the training data used contains the required answers. These required
answers are called labels. For example, you show a picture of a dog and also label it as a dog.
So, with enough pictures of a dog, the algorithm will be able to classify an image of a dog correctly. Supervised learning models can also be used to predict continuous numeric values, such as the price of a certain company's stock. These models are known as regression models. In this case, the labels would be the past prices of the stock, and the algorithm would follow that trend. Examples of supervised learning algorithms include:
Decision Trees
Random Forests
Unsupervised Learning
In unsupervised learning, as the name suggests, the dataset used for training does not contain the required answers. Instead, the algorithm uses techniques such as clustering and dimensionality reduction to train. A major application of unsupervised learning is anomaly detection, which uses a clustering algorithm to find major outliers in a graph; this is used in credit card fraud detection. Since classification is a part of supervised learning, let us find out more about it.
Supervised models are trained on a labelled dataset; the label can be either continuous or categorical. The following are the types of supervised models:
Regression
Classification
Regression models
Regression is used when one is dealing with continuous values such as the cost of a house when you are given
features such as location, the area covered, historic prices etc. Popular regression models are:
Linear Regression
Lasso Regression
Ridge Regression
Classification models
Classification is used for data that is separated into categories with each category represented by a label. The
training data must contain the labels and must have sufficient observations of each label so that the accuracy of the
model is respectable. Some popular classification models include:
Decision Trees
Let us now see some general examples of classification below to learn about this concept properly.
Email Spam Detection
In this example of email spam detection, the categories can be "Spam" and "Not Spam".
If an incoming email contains phrases like "win a prize", "free offer" and "urgent money transfer", your spam
filter might classify it as "Spam".
Disease Diagnosis
In disease diagnosis, let us assume that two categories are "Pneumonia" and "Common Cold".
If a patient's medical data includes symptoms like high fever, severe cough, and chest pain, a diagnostic
model might classify it as "Pneumonia".
If the patient's data indicates mild fatigue and occasional headaches, the model might classify it as "Common
Cold".
Sentiment Analysis
In this example of sentiment analysis, we can have two categories, namely "Positive" and "Negative".
If a tweet contains positive phrases like "amazing", "great experience", and "highly recommend", a sentiment
analysis model might classify it as "Positive".
If a tweet includes negative terms such as "terrible", "disappointed" and "waste of money", it might classify
it as "Negative".
Conceptually, the "best" splitting criterion is the one that most closely results in pure partitions. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split. The attribute selection measure provides a ranking for each attribute describing the given training tuples; the attribute with the best score for the measure is chosen as the splitting attribute for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to binary trees, then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. There are three popular attribute selection measures: information gain, gain ratio, and Gini index.
Information gain − Information gain is used for deciding the best feature/attribute, i.e., the one that provides the most information about a class. It follows the concept of entropy, aiming to reduce the level of entropy from the root node to the leaf nodes. Let node N hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in those partitions.
Gain ratio − The information gain measure is biased toward tests with many outcomes; that is, it tends to prefer attributes with a large number of distinct values. For instance, consider an attribute that acts as a unique identifier, such as product_ID. A split on product_ID results in a large number of partitions, each containing only one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0. Such a split is useless for classification; the gain ratio corrects this bias by normalizing the information gain.
Gini index − The Gini index can be used in CART. The Gini index calculates the impurity of D, a data partition or collection of training tuples, as
Gini(D) = 1 − Σ_{i=1}^{m} p_i²
where p_i is the probability that a tuple in D belongs to class C_i and is calculated by |C_i,D| / |D|.
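A base-R sketch of this formula follows; the 9 "yes" / 5 "no" class split is a hypothetical example (in the spirit of the buy_computer data) and gives Gini(D) = 1 − (9/14)² − (5/14)² ≈ 0.459.

# Gini(D) = 1 - sum(p_i^2), where p_i is the fraction of tuples in class C_i
gini_index <- function(class_labels) {
  p <- table(class_labels) / length(class_labels)   # class proportions p_i
  1 - sum(p^2)
}

gini_index(c(rep("yes", 9), rep("no", 5)))   # approximately 0.459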
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on
an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in
the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a company is likely
to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows:
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes. It works like a flowchart, helping to make decisions step by step: each internal node tests an attribute, each branch corresponds to an outcome of the test, and each leaf node gives the final prediction (a class label or a value).
Decision trees are widely used due to their interpretability, flexibility, and low preprocessing needs.
DECISION TREE:
Let’s consider a decision tree for predicting whether a customer will buy a product based on age, income and
previous purchases: Here's how the decision tree works:
If the person’s income is greater than $50,000, ask: "Is the person’s age above 30?"
If the person is above 30 and has not made previous purchases, predict "No Purchase" (leaf node).
Example: Predicting Whether a Customer Will Buy a Product Using Two Decision Trees
Suppose two decision trees are built for the same task, and for a given customer the "Yes" branch of each tree leads to a "Purchase" prediction.
Once we have predictions from both trees, we can combine the results to make a final prediction. If Tree 1 predicts
"Purchase" and Tree 2 predicts "No Purchase", the final prediction might be "Purchase" or "No Purchase" depending
on the weight or confidence assigned to each tree. This can be decided based on the problem context.
Till now we have discussed the basic intuition and approach of how a decision tree works, so let's move on to the attribute selection measures of decision trees. Two popular attribute selection measures are used: information gain and the Gini index (described above).
1. Information Gain
Information gain measures how much uncertainty is removed by splitting on an attribute. For example, if we split a dataset of people into "Young" and "Old" based on age, and all young people bought the product while all old people did not, the information gain would be high because the split perfectly separates the two groups with no uncertainty left.
Suppose S is a set of instances, A is an attribute, S_v is the subset of S for which attribute A takes the value v, and Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Entropy: the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content. For a set S whose instances belong to classes with proportions p_1, ..., p_m,
Entropy(S) = − Σ_{i=1}^{m} p_i · log2(p_i)
For example, if a dataset has an equal number of "Yes" and "No" outcomes (like 3 people who bought a product and 3 who didn't), the entropy is high (equal to 1) because it is uncertain which outcome to predict. But if all the outcomes are the same (all "Yes" or all "No"), the entropy is 0, meaning there is no uncertainty left in predicting the outcome.
Using this entropy, the information gain of an attribute A, Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v), compares the entropy of S before and after the split.
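A base-R sketch of these two formulas, applied to the young/old example just described (six people, three buyers and three non-buyers, perfectly separated by age group):

# Entropy(S) = -sum p_i log2 p_i over the class proportions
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
info_gain <- function(labels, attribute) {
  weighted <- tapply(labels, attribute,
                     function(sv) length(sv) / length(labels) * entropy(sv))
  entropy(labels) - sum(weighted)
}

age    <- c("Young", "Young", "Young", "Old", "Old", "Old")
bought <- c("Yes",   "Yes",   "Yes",   "No",  "No",  "No")
entropy(bought)          # 1: maximum uncertainty for a 50/50 split
info_gain(bought, age)   # 1: the split removes all uncertainty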
Example: building a tree with information gain (a minimal R sketch follows this list):
Start with all training instances associated with the root node.
Use information gain to choose which attribute to label each node with.
Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree.
If all positive or all negative training instances remain, label that node "yes" or "no" accordingly.
If no attributes remain, label with a majority vote of the training instances left at that node.
If no instances remain, label with a majority vote of the parent's training instances.
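As a sketch of the steps above, the rpart package (an assumption; these notes do not name a specific R implementation) can grow a classification tree on hypothetical customer data; parms = list(split = "information") asks rpart to use the entropy-based criterion instead of its default Gini index.

library(rpart)

set.seed(1)
# Hypothetical customer data: age, income, and number of previous purchases
customers <- data.frame(
  age       = sample(18:70, 200, replace = TRUE),
  income    = sample(20000:90000, 200, replace = TRUE),
  purchases = sample(0:5, 200, replace = TRUE)
)
# Hypothetical label: higher-income customers who are older or repeat buyers purchase
customers$buys <- factor(ifelse(customers$income > 50000 &
                                  (customers$age > 30 | customers$purchases > 0),
                                "Purchase", "No Purchase"))

# Grow the tree; split = "information" selects the information-gain criterion
fit <- rpart(buys ~ age + income + purchases, data = customers,
             method = "class", parms = list(split = "information"))
print(fit)   # shows the attribute test chosen at each node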
TREE PRUNING:
Tree pruning, a technique to prevent over fitting in decision trees, can be combined with Bayes classification in
methods like Naïve Bayesian Tree (NBTree) and Bayes Minimum Risk (PBMR) pruning to build more accurate
and generalizable models. These approaches use Bayesian concepts, such as estimated risk rates or local accuracy, to
guide the pruning process, either by stopping tree growth early (pre-pruning) or by simplifying a fully grown tree
(post-pruning).
Reduces Overfitting: Decision trees can become too complex, memorizing training data's noise and
anomalies. Pruning simplifies the tree by removing unnecessary branches, improving its ability to generalize
to new, unseen data.
Enhances Generalization: By removing less reliable branches, pruning creates smaller, less complex trees
that perform better on independent test data.
Improves Interpretability and Efficiency: Simpler trees are easier to comprehend and faster to execute.
A pruning strategy is introduced based on estimating local accuracy, using the accuracy of local
classifiers (like those at leaf nodes) to guide the decision-making process, rather than directly using
the most specific classifier.
This is a novel post-pruning method that converts a parent node of a subtree into a leaf node if the
estimated risk-rate for the parent node is less than the risk-rates of its children.
It uses the Bayes minimum risk criterion to estimate the risk-rate, aiming to find the decision tree
with the lowest error rate on unobserved instances.
Bayesian Classification: In data mining, Bayesian classification uses Bayes' Theorem to predict class
membership probabilities.
Pruning with Bayes Minimum Risk: Some methods apply a bottom-up approach where a subtree is
pruned and replaced by a leaf node if the estimated risk-rate (based on misclassification costs) of the
parent node is lower than that of the subtree.
Pre-Pruning (Early Stopping): Tree construction is halted early; a node is not split further if the goodness of the split falls below a pre-specified threshold.
Post-Pruning (Subtree Removal): A fully grown decision tree has subtrees removed and replaced by leaf nodes, with each leaf labeled with the most frequent class among the replaced subtree's instances (a minimal R sketch of post-pruning follows).
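The sketch below uses rpart's cost-complexity pruning (an assumption: this is not the Bayes-minimum-risk criterion described above, but it illustrates growing a full tree and then collapsing unreliable subtrees into leaves).

library(rpart)

# Grow a deliberately large tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.0, minsplit = 2))

printcp(fit)                                                 # cross-validated error per subtree
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                          # collapse unreliable branches into leaves
print(pruned)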
Association rule learning discovers relationships between items that frequently occur together in a dataset. For instance, in a retail dataset, an association rule might identify that "if a customer buys bread, they are likely to buy butter". Such insights help businesses improve cross-selling strategies, inventory management, and customer satisfaction.
Association rules are derived through algorithms that evaluate the frequency and strength of these relationships. They
use metrics like support, confidence, and lift to measure the relevance and reliability of discovered patterns. These
rules have applications in various fields, such as retail, healthcare, and marketing, where analyzing customer
behavior or trends is critical for success.
Association rules are evaluated using key metrics that determine their relevance, strength, and reliability. These
metrics include support, confidence, and lift, which quantify the frequency and strength of relationships between data
items.
1. Support
Support measures how frequently an itemset (both antecedent and consequent) appears in the dataset. It provides an indication of how common a particular association is.
Formula: Support(A ⇒ B) = (transactions containing both A and B) ÷ (total transactions)
Example: If bread and butter appear together in 100 out of 1,000 transactions, the support is 100 ÷ 1,000 = 0.10, i.e., 10%.
A higher support value indicates a more frequently occurring pattern in the dataset.
2. Confidence
Confidence measures the likelihood of the consequent occurring given that the antecedent has already occurred. It evaluates the reliability of the rule.
Formula: Confidence(A ⇒ B) = Support(A ∪ B) ÷ Support(A)
Example: If 70% of customers who buy bread also buy butter, the confidence of "bread ⇒ butter" is 0.70, i.e., 70%.
Higher confidence suggests a stronger relationship between the antecedent and consequent.
3. Lift
Lift measures the strength of an association compared to its random occurrence in the dataset. It identifies how much more likely the antecedent and consequent are to appear together than if they were independent.
Formula: Lift(A ⇒ B) = Confidence(A ⇒ B) ÷ Support(B)
Example: A lift value greater than 1 indicates a strong positive association, while a value equal to 1 suggests no association. For instance, if the lift is 1.5, the antecedent makes the consequent 1.5 times more likely.
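A minimal base-R sketch of the three metrics for the hypothetical rule "bread ⇒ butter", using counts consistent with the examples above (the 143 and 200 counts are made-up values chosen so that confidence comes out near 70%):

n_total        <- 1000   # total transactions
n_bread        <- 143    # hypothetical: transactions containing bread
n_butter       <- 200    # hypothetical: transactions containing butter
n_bread_butter <- 100    # transactions containing both bread and butter

support    <- n_bread_butter / n_total            # 0.10 (10%)
confidence <- n_bread_butter / n_bread            # about 0.70 (70%)
lift       <- confidence / (n_butter / n_total)   # about 3.5 (> 1: positive association)
c(support = support, confidence = confidence, lift = lift)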
Association rule learning is a multi-step process designed to identify meaningful patterns and relationships in large
datasets. It involves two main stages:
1. Identifying Frequent Itemsets: The process begins by identifying frequent itemsets—combinations of items
that appear together in transactions with a frequency above a predefined threshold. Metrics like support are
used to measure how often these itemsets occur in the dataset. For example, a frequent itemset might reveal
that bread and butter are purchased together in 10% of transactions.
2. Generating Association Rules: Once frequent itemsets are identified, association rules are generated. These
rules take the form of if-then statements that describe relationships between items (e.g., “If a customer buys
bread, they are likely to buy butter”). Metrics such as confidence and lift are applied to evaluate the strength
and reliability of these rules.
Iterative Refinement
The process is iterative, with thresholds for support and confidence adjusted to refine the rules. This ensures that only the most significant and actionable rules are selected; for instance, a rule with low confidence may be excluded from further analysis. Through this systematic approach, association rule learning uncovers valuable insights from raw data, enabling organizations to make data-driven decisions.
Several algorithms are used for association rule learning, each with unique strengths and applications. The three most
commonly used algorithms are:
1. Apriori Algorithm
The Apriori algorithm employs a breadth-first search approach to identify frequent itemsets. It relies on the principle
that all subsets of a frequent itemset must also be frequent, reducing the search space.
Advantage: Simple to implement and effective for small datasets with low dimensionality.
Limitation: Performance degrades significantly with large or dense datasets due to repeated scanning of the
database.
2. Eclat Algorithm
The Eclat algorithm uses a depth-first search strategy to discover frequent itemsets. Instead of scanning the database
multiple times, it represents transactions as vertical itemsets and directly computes intersections.
Advantage: Efficient for datasets with sparse data or where there are fewer frequent itemsets.
3. FP-Growth Algorithm
The FP-Growth algorithm compresses the dataset into a frequent-pattern (FP) tree and mines frequent itemsets directly from it, without candidate generation.
Advantage: Significantly faster and more memory-efficient than Apriori, especially for large datasets.
Association rules are widely applied across various industries to uncover patterns and relationships in data, enabling
better decision-making and operational efficiency.
1. Retail and Market Basket Analysis: Retailers use association rules to identify frequently purchased product
combinations, helping them optimize store layouts or create product bundles to increase sales.
Example: A supermarket discovers that customers who buy bread often purchase butter and jam, leading to
strategic placement of these items together.
2. Healthcare: In healthcare, association rules help discover co-occurrence patterns in symptoms, aiding in
diagnostic processes and treatment plans.
Example: Identifying that patients with high blood pressure often have a higher risk of developing diabetes can
guide preventative care strategies.
3. E-Commerce and Recommendation Systems: E-commerce platforms leverage association rules to build
recommendation systems that enhance user experiences and drive sales.
Example: Amazon’s “Customers who bought this also bought” feature suggests complementary products,
boosting cross-selling opportunities.
4. Fraud Detection: Association rules are used in financial services to identify unusual patterns in transaction data,
which can help detect fraudulent activities.
Example: Flagging transactions that deviate significantly from established spending patterns for further
investigation.
Consider a small transaction dataset where customers purchase items like bread, butter, and milk.
Dataset Example (transaction ID: items purchased):
1: Bread, Butter
2: Bread, Milk
4: Milk
5: Bread, Butter
1. Support Calculation:
Support = Transactions containing both bread and butter ÷ Total transactions
2. Confidence Calculation:
Confidence = Support of bread and butter ÷ Support of bread
3. Lift Calculation:
Lift = Confidence ÷ Support of butter
A lift value greater than 1 indicates a positive association between bread and butter.
This example demonstrates how association rules are derived and evaluated, providing actionable insights from
transactional data.
Conclusion
Association Rules are a vital tool in data mining, enabling the discovery of valuable patterns and relationships within
large datasets. Their applications span industries such as retail, healthcare, and finance, driving smarter decision-
making processes. By leveraging advanced algorithms and exploring innovative applications, Association Rules
continue to empower businesses to solve complex problems and unlock new opportunities.
ASSOCIATION RULES:
In data mining, association rules discover relationships in data as "if-then" statements, where the antecedent is the
"if" part (e.g., buying bread) and the consequent is the "then" part (e.g., also buying butter). Multi-relational
association rules extend this by finding associations across different dimensions (like customer age and purchase
category) or databases, providing a more comprehensive understanding of complex relationships.
Definition: Association rules aim to find correlations between items in large datasets. An example is in
market basket analysis, where a rule might be: "If a customer buys bread, then they will also buy milk".
Antecedent: This is the "if" part of the rule, representing a condition or the presence of certain items. In the
example, the antecedent is "a customer buys bread".
Consequent: This is the "then" part of the rule, indicating the likelihood of other items or conditions
occurring with the antecedent. In the example, the consequent is "they will also buy milk".
Multi-Relational Association Rules (MRAR):
Definition: These rules go beyond simple associations within a single dataset or dimension. Instead, they find patterns across different data dimensions or even from multiple related databases.
How they work: MRAR considers interactions between various aspects, such as a customer's demographic
information (age), their buying behavior (items purchased), and other contextual details.
Example: Instead of just "If bread, then milk," a multi-relational rule might look like: "If a student of
age 20 buys diapers, then they will also buy beer". Here, "student" and "age 20" are additional dimensions
that provide context to the association.
ECLAT (Equivalence Class Transformation) is a market basket analysis algorithm used in data mining to find frequently co-occurring items. It converts horizontal data (transactions) into a vertical format (item and transaction-ID lists) and efficiently mines frequent itemsets to discover association rules, as seen in studies applied to e-commerce, supermarkets, and retail settings to improve product placement and promotions. Case studies highlight ECLAT's speed advantage over algorithms like Apriori due to its efficient vertical-based scanning, its ability to uncover purchase patterns, and its role in strategic decision-making for businesses.
1.Vertical Data Conversion: ECLAT transforms the dataset from a horizontal format (showing items within each
transaction) to a vertical format, where each item is associated with a list of transaction IDs (TIDs) that contain it.
2.Frequent Itemset Mining: It uses a depth-first search strategy and set intersections to efficiently identify frequent
itemsets (groups of items purchased together) that meet a user-defined minimum support threshold.
3.Association Rule Generation: Once frequent itemsets are found, association rules are generated in the form of
"if-then" statements (e.g., "if customers buy A, then they are likely to buy B") with measures
like confidence and lift to evaluate their strength.
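The steps above can be sketched in R with the arules package's eclat() function (an assumption: the case studies do not name an implementation); the five grocery transactions are hypothetical.

library(arules)

# Hypothetical grocery transactions, coerced to the vertical 'transactions' format
trans <- as(list(
  c("bread", "butter"), c("bread", "milk"), c("bread", "butter", "milk"),
  c("milk"), c("bread", "butter")
), "transactions")

# Steps 1-2: mine frequent itemsets with a 40% minimum support
itemsets <- eclat(trans, parameter = list(supp = 0.4, maxlen = 3))
inspect(itemsets)

# Step 3: derive "if-then" rules from the frequent itemsets
rules <- ruleInduction(itemsets, trans, confidence = 0.7)
inspect(rules)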
Retail Strategy: In a supermarket scenario, the ECLAT algorithm was used to analyze transactions, identify
frequently bought item combinations (like Indomie goreng special and Indomie ayam bawang), and generate
association rules to help strategize promotions and bundle offers for customers.
E-commerce: Studies on e-commerce book retailers used ECLAT to find relationships between book purchases,
demonstrating its ability to generate the same high-quality association rules as other algorithms, but often faster
due to its vertical scanning approach.
Inventory Management: A case study on a retail store applying ECLAT to identify frequently purchased product
combinations helped with inventory management, allowing the store to optimize ordering quantities for popular
items to prevent stockouts.
Benefits of ECLAT
Efficiency: ECLAT is known for its speed and efficiency, especially compared to the Apriori algorithm, because its
vertical scanning method avoids multiple passes over the entire dataset.
Scalability: It can handle large databases by converting them into a vertical format, allowing for more efficient
computation of itemsets.
Actionable Insights: The association rules derived from ECLAT analysis provide valuable insights into customer
purchasing patterns, which businesses can use for cross-selling, product placement, and personalized
recommendations to drive sales.
CLUSTER ANALYSIS:
Cluster analysis groups similar data points together without requiring predefined labels. Among its strengths:
3. Adaptability to Different Data Types: It can work with numerical data like age and salary as well as categorical data like gender and occupation.
4. Handling Noisy and Missing Data: Datasets usually contain missing values or inconsistencies, and clustering can manage them easily.
Distance Metrics
Distance metrics are simple mathematical formulas that quantify how similar or different two data points are. The type of distance metric we choose plays a big role in deciding clustering results. Some of the common metrics are listed below (a short R sketch follows the list):
Euclidean Distance: It is the most widely used distance metric and finds the straight-line distance between two
points.
Manhattan Distance: It measures the distance between two points based on grid-like path. It adds the absolute
differences between the values.
Cosine Similarity: This method checks the angle between two points instead of looking at the distance. It’s used in
text data to see how similar two documents are.
Jaccard Index: A statistical tool used for comparing the similarity of sample sets. It’s mostly used for yes/no type
data or categories.
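A base-R sketch of the four metrics above for two small example points and two small example sets (the vectors are arbitrary):

a <- c(2, 4, 6)
b <- c(5, 1, 7)

sqrt(sum((a - b)^2))                              # Euclidean (straight-line) distance
sum(abs(a - b))                                   # Manhattan (grid-path) distance
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))    # cosine similarity (angle-based)

# Jaccard index for two yes/no (set-valued) records
x <- c("bread", "butter", "milk")
y <- c("bread", "milk", "jam")
length(intersect(x, y)) / length(union(x, y))     # 2 / 4 = 0.5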
1. Partitioning Methods
Partitioning Methods divide the data into k groups (clusters) where each data point belongs to only one
group. These methods are used when you already know how many clusters you want to create. A common
example is K-means clustering.
In K-means, the algorithm assigns each data point to the nearest centre and then updates the centre based on the average of all points in that group. This process repeats until the centres stop changing. It is used in real-life applications, such as streaming platforms like Spotify grouping users based on their listening habits.
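A minimal sketch of K-means with base R's kmeans() on hypothetical listening-habit features (the two columns and the choice of k = 2 are assumptions for illustration):

set.seed(42)
# Hypothetical listeners: daily listening hours and number of genres played
listeners <- data.frame(
  hours  = c(rnorm(50, 1, 0.3), rnorm(50, 4, 0.5)),
  genres = c(rnorm(50, 2, 0.5), rnorm(50, 8, 1.0))
)

km <- kmeans(listeners, centers = 2, nstart = 10)   # k = 2 clusters
km$centers                                          # the updated cluster centres
table(km$cluster)                                   # cluster sizes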
2. Hierarchical Methods
Hierarchical clustering builds a tree-like structure of clusters known as a dendrogram that represents the merging or splitting of clusters. It can be divided into the following approaches (a minimal R sketch follows):
Agglomerative Approach (Bottom-up): It starts with each data point as its own cluster and repeatedly merges the closest clusters until one cluster, or a chosen number of clusters, remains.
Divisive Approach (Top-down): It starts with one big cluster and splits it repeatedly into smaller clusters. For example, classifying animals into broad categories like mammals, reptiles, etc., and further refining them.
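A minimal sketch of hierarchical clustering in base R using hclust(), which follows the agglomerative (bottom-up) approach, on the built-in iris measurements; cutree() then extracts a chosen number of clusters from the dendrogram.

data <- iris[, 1:4]                        # numeric measurements only
hc <- hclust(dist(data), method = "complete")
plot(hc, labels = FALSE)                   # dendrogram of merges
groups <- cutree(hc, k = 3)                # cut the tree into 3 clusters
table(groups, iris$Species)                # compare clusters with the known species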
3. Density-Based Methods
Density-based clustering groups data points that are densely packed together and treats regions with fewer data points as noise or outliers. This method is particularly useful when clusters are irregular in shape.
For example, it can be used in fraud detection as it identifies unusual patterns of activity by grouping
similar behaviors together.
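A minimal sketch using the dbscan package (an assumption) on simulated two-dimensional data: densely packed points form clusters, while scattered points are labelled 0, i.e., treated as noise/outliers.

library(dbscan)

set.seed(7)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),   # dense blob 1
             matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),   # dense blob 2
             matrix(runif(10, min = -2, max = 5), ncol = 2))     # scattered noise

db <- dbscan(pts, eps = 0.5, minPts = 5)
table(db$cluster)          # cluster 0 holds the points treated as outliers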
4. Grid-Based Methods
Grid-Based Methods divide data space into grids making clustering efficient. This makes the clustering
process faster because it reduces the complexity by limiting the number of calculations needed and is useful
for large datasets.
Climate researchers often use grid-based methods to analyze temperature variations across different
geographical regions. By dividing the area into grids they can more easily identify temperature patterns and
trends.
5. Model-Based Methods
Model-based clustering groups data by assuming it comes from a mix of distributions. Gaussian Mixture
Models (GMM) are commonly used and assume the data is formed by several overlapping normal
distributions.
GMM is commonly used in voice recognition systems as it helps to distinguish different speakers by
modeling each speaker’s voice as a Gaussian distribution.
6. Constraint-Based Methods
It uses User-defined constraints to guide the clustering process. These constraints may specify certain
relationships between data points such as which points should or should not be in the same cluster.
In healthcare, clustering patient data might take into account both genetic factors and lifestyle choices.
Constraints specify that patients with similar genetic backgrounds should be grouped together while also
considering their lifestyle choices to refine the clusters.
1. Numerical Data: Numerical data consists of measurable quantities like age, income, or temperature. Algorithms like k-means and DBSCAN work well with numerical data because they depend on distance metrics. For example, a fitness app can cluster users based on their average daily step count and heart rate to identify different fitness levels.
2. Categorical Data: Categorical data contains non-numerical values like gender, product categories, or answers to survey questions. Algorithms like k-modes or hierarchical clustering are better suited for this, for example, grouping customers based on preferred shopping categories like "electronics", "fashion", and "home appliances".
3. Mixed Data: Some datasets contain both numerical and categorical features and require hybrid approaches. For example, clustering a customer database based on income (numerical) and shopping preferences (categorical) can use the k-prototypes method.
Market Segmentation: Clustering is used to segment customers based on purchasing behaviour, allowing businesses to send the right offers to the right people.
Image Segmentation: In computer vision it can be used to group pixels in an image to detect objects like
faces, cars or animals.
Biological Classification: Scientists use clustering to group genes with similar behaviors to understand
diseases and treatments.
Document Classification: It is used by search engines to categorize web pages for better search results.
Anomaly Detection: Cluster Analysis is used for outlier detection to identify rare data points that do not
belong to any cluster.
Choosing the Number of Clusters: Methods like K-means require the user to specify the number of clusters before starting, which can be difficult to guess correctly.
Scalability: Some algorithms, like hierarchical clustering, do not scale well with large datasets.
Cluster Shape: Many algorithms assume clusters are round or evenly shaped, which doesn't always match real-world data.
Handling Noise and Outliers: Many algorithms are sensitive to noise and outliers, which can affect the results.
Cluster analysis is like organising a messy room: sorting items into meaningful groups makes everything easier to understand. Choosing the right clustering method depends on the dataset and the goal of the analysis.