Introduction to data mining
There is a huge amount of data available in the
Information Industry.
This data is of no use until it is converted into useful
information.
It is necessary to analyze this huge amount of data and
extract useful information from it.
Extraction of information is not the only process we need
to perform; data mining also involves other processes
such as Data Cleaning, Data Integration, Data
Transformation, Pattern Evaluation and Data
Presentation.
03/29/2025 1
Once all these processes are over, we would be able to
use this information in many applications such as Fraud
Detection, Market Analysis, Production Control, Science
Exploration, etc.
Data Mining is defined as extracting information
from huge sets of data. In other words, we can say
that data mining is the procedure of mining knowledge
from data.
The information or knowledge so extracted can be
used for any of the following applications −
1.Market Analysis
2.Fraud Detection
3.Customer Retention
4.Production Control
5.Science Exploration
• Data Mining Applications
Data mining is highly useful in the following
domains −
1.Market Analysis and Management
2.Corporate Analysis & Risk Management
3.Fraud Detection
Listed below are the various areas of market analysis where
data mining is used −
1. Customer Profiling − Data mining helps determine what kind of people
buy what kind of products.
2. Identifying Customer Requirements − Data mining helps in
identifying the best products for different customers. It uses prediction to
find the factors that may attract new customers.
3. Cross Market Analysis − Data mining finds
associations/correlations between product sales.
4. Determining Customer Purchasing Patterns − Data
mining helps in determining customer purchasing patterns.
5. Providing Summary Information − Data mining provides us various
multidimensional summary reports.
Major applications of data
mining
1.Healthcare.
Data mining has a lot of promise for improving
healthcare systems.
It identifies best practices for improving treatment
and lowering costs using data and analytics.
Multi-dimensional databases, machine learning, soft
computing, data visualization, and statistics are
among the data mining techniques used by
researchers.
2.Insurance
• Data mining helps insurance companies
understand customer purchase behaviour and
predict the insurance policy they are likely to
purchase in the future.
• They track fraudulent claim practices and
strengthen their systems to avoid them.
• Insurance companies can trace policies that get
claimed together and bundle them to provide
better services.
3.Market basket analysis
Market basket analysis looks at which products customers
buy together: if a customer buys a particular product, they
may purchase it again or look for related products.
Understanding this data helps retailers identify the
frequency of purchase and manage their stock
accordingly.
It also helps improve sales and manage customer
relations.
4.Financial analysis
Banks have detailed information about their
customers, their transactions and loans.
Understanding this bulk of data allows banks to
classify customers and customise services such as
loans, credit card spending limits, rewards and
discounts on purchases.
Identifying unusual activity in a transaction
helps track fraudulent activities and security
breaches.
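As a rough illustration of spotting unusual activity, the sketch below flags transaction amounts that lie far from the mean. The amounts, the function name `flag_unusual` and the z-score threshold are assumptions made for this example; real fraud detection uses far richer features and models.

```python
import statistics

# Hypothetical transaction amounts for one customer (illustrative data only).
amounts = [42.0, 38.5, 51.0, 45.2, 40.1, 47.3, 980.0, 43.8]

def flag_unusual(values, z_threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

unusual = flag_unusual(amounts)
print(unusual)
```

With this data only the 980.0 transaction is far enough from the rest to be flagged.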
5.Intrusion detection
Data mining techniques help classify information for
intrusion detection systems.
The system then generates an alarm on detecting any
foreign elements that do not fit the classification rule.
This process helps detect security breaches, attacks,
misuse and anomalies.
Data mining techniques are crucial for any business
and help protect essential information.
6.Energy
Data mining helps track energy consumption
patterns and devise systems to increase
efficiency.
It aids in predicting power consumption in
different geographical locations.
Elaborate data mining systems provide details
on operations patterns too.
This insight later helps to optimise operations
and invest in equipment that improves
production efficiency.
7.Retail and E-commerce
The retail and e-commerce sector collects and tracks
customer details, transactions and product sales.
It helps them identify customer purchase behaviour,
product preferences and seasonal product sales.
This data helps organizations forecast sales and
customize their offerings.
Efficiently using past data to make business decisions
helps retailers and e-commerce owners reduce risk and
increase profitability.
8.Spatial data mining
Data mining facilitates the study of past data,
discovery and analysis of spatial and
geographical information.
It helps identify hotspots and unusual locations,
taking the spatial relations between objects into
account.
Additionally, attributes of these objects, such as latitude,
area, perimeter and coordinates, help discover previously
unknown but potentially useful information.
9.Biological data analysis
Using complex computational analysis, data
mining facilitates the study and interpretation
of biological datasets.
It helps predict protein structure, classify
genes and analyse cell mutations.
It advances biological studies and improves
the healthcare system.
10. Criminal investigation
The main objective of using data mining in criminal
investigation is to speed up the solving of crimes.
Clustering data helps group crime characteristics and
devise ways to prevent them.
Data from multiple sources is analyzed to simplify
complex relations between crime and criminal.
This helps identify patterns in crime over a time-period
or geographical location.
11.Media
Media channels like radio, television and over-the-
top (OTT) platforms keep track of their audience to
understand consumption patterns.
Using this information, media providers make
content recommendations, change program
schedules and produce content of the preferred
genre.
Data mining helps media providers improve the
viewer experience.
12.Advertising and marketing
• With the advent of digital marketing and data
mining technologies, marketers refine their
strategies for better engagement and track the live
results of their campaigns.
• Advertisers also use this data to profile users and
show them content or product that might interest
them.
• Data mining is widely used in digital marketing to
improve targeting and user experience.
13.Education
Data mining is used in education to understand student
productivity and development.
It helps understand how a student is performing,
predict their future scores, identify relevant
placement opportunities and track teacher
performance.
Data mining may help derive associations between
the teaching methodologies and student
performance and identify areas of improvement.
Data Mining Architecture
• The significant components of data mining systems are a
data source, data mining engine, data warehouse
server, the pattern evaluation module, graphical
user interface, and knowledge base.
Data Source:
The actual source of data is the Database, data
warehouse, World Wide Web (WWW), text files, and other
documents.
We need a huge amount of historical data for data mining
to be successful.
Organizations typically store data in databases or data
warehouses.
Data warehouses may comprise one or more databases,
text files, spreadsheets, or other repositories of data.
Sometimes, even plain text files or spreadsheets may
contain useful information.
Another primary source of data is the World Wide Web or
the internet.
Before passing the data to the database or data
warehouse server, the data must be cleaned,
integrated, and selected.
As the information comes from various sources and in
different formats, it can't be used directly for the data
mining procedure because the data may not be
complete and accurate.
So, the data first needs to be cleaned and unified.
More information than needed will be collected
from various data sources, and only the data of
interest will have to be selected and passed to
the server.
These procedures are not as easy as we think.
Several methods may be performed on the
data as part of selection, integration, and
cleaning.
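The preparation steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two sources, the field names (`customer_id`, `age`, `amount`) and the helper functions are all hypothetical, and real pipelines handle many more kinds of inconsistency.

```python
# Hypothetical records from two sources; field names are illustrative only.
crm_rows = [
    {"customer_id": 1, "name": " Alice ", "age": "34"},
    {"customer_id": 2, "name": "Bob", "age": None},      # incomplete record
    {"customer_id": 1, "name": " Alice ", "age": "34"},  # duplicate record
]
sales_rows = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 80.0},
    {"customer_id": 1, "amount": 30.0},
]

def clean(rows):
    """Cleaning: drop incomplete and duplicate rows, normalise values."""
    seen, cleaned = set(), []
    for row in rows:
        if row["age"] is None or row["customer_id"] in seen:
            continue
        seen.add(row["customer_id"])
        cleaned.append({
            "customer_id": row["customer_id"],
            "name": row["name"].strip(),
            "age": int(row["age"]),
        })
    return cleaned

def integrate(customers, sales):
    """Integration: attach each customer's total sales from the second source."""
    totals = {}
    for sale in sales:
        cid = sale["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + sale["amount"]
    return [dict(c, total_spent=totals.get(c["customer_id"], 0.0))
            for c in customers]

def select(rows, fields):
    """Selection: keep only the fields of interest."""
    return [{f: row[f] for f in fields} for row in rows]

prepared = select(integrate(clean(crm_rows), sales_rows), ["age", "total_spent"])
print(prepared)
```

Only the complete, de-duplicated record survives, reduced to the two fields passed on to the server.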
Database or Data Warehouse
Server:
The database or data warehouse server
consists of the original data that is
ready to be processed.
Hence, the server is responsible for
retrieving the relevant data based on
the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data
mining system.
It contains several modules for operating data mining
tasks, including association, characterization,
classification, clustering, prediction, time-series analysis,
etc.
In other words, we can say the data mining engine is the core
of our data mining architecture.
It comprises instruments and software used to obtain
insights and knowledge from data collected from various
data sources and stored within the data warehouse.
Pattern Evaluation Module:
• The pattern evaluation module is primarily responsible for
measuring how interesting a pattern is, using a threshold
value.
• It collaborates with the data mining engine to focus the
search on interesting patterns.
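A threshold-based evaluation step might look like the sketch below. The candidate patterns, their scores and the function name `evaluate_patterns` are made up for illustration; the score stands in for whatever interestingness measure a real system uses.

```python
# Hypothetical candidate patterns with an interestingness score in [0, 1].
candidate_patterns = [
    {"pattern": "coffee -> cookies", "score": 0.82},
    {"pattern": "bread -> batteries", "score": 0.11},
    {"pattern": "milk -> cereal", "score": 0.64},
]

def evaluate_patterns(patterns, threshold):
    """Keep patterns scoring at or above the threshold, best first."""
    kept = [p for p in patterns if p["score"] >= threshold]
    return sorted(kept, key=lambda p: p["score"], reverse=True)

interesting = evaluate_patterns(candidate_patterns, threshold=0.5)
for p in interesting:
    print(p["pattern"], p["score"])
```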
Graphical User Interface:
• The graphical user interface (GUI) module communicates
between the data mining system and the user.
• This module helps the user to easily and efficiently use the
system without knowing the complexity of the process.
• This module cooperates with the data mining system when
the user specifies a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data
mining.
It might be helpful to guide the search or evaluate the
interestingness of the resulting patterns.
The knowledge base may even contain user views and data
from user experiences that might be helpful in the data
mining process.
The data mining engine may receive inputs from the
knowledge base to make the result more accurate and
reliable.
The pattern evaluation module regularly interacts with the
knowledge base to get inputs, and also to update it.
KDD- Knowledge Discovery in
Databases
The term KDD stands for Knowledge Discovery in Databases.
It refers to the broad procedure of discovering knowledge in
data and emphasizes the high-level applications of specific
Data Mining techniques.
It is a field of interest to researchers in various fields, including
artificial intelligence, machine learning, pattern recognition,
databases, statistics, knowledge acquisition for expert systems,
and data visualization.
The main objective of the KDD process is to
extract information from data in the
context of large databases.
It does this by using Data Mining algorithms to
identify what is deemed knowledge.
Knowledge Discovery in Databases is
considered an automated, exploratory
analysis and modeling of vast data repositories.
KDD is the organized procedure of
recognizing valid, useful, and
understandable patterns from huge and
complex data sets.
Data Mining is the core of the KDD procedure. It involves applying
algorithms that investigate the data, develop the model, and find
previously unknown patterns.
The model is used to extract knowledge from the data,
analyze the data, and make predictions.
The availability and abundance of data today make knowledge
discovery and Data Mining a matter of impressive significance and
need.
Given the recent development of the field, it isn't surprising that a wide
variety of techniques is presently accessible to specialists and experts.
Classification of Data Mining Systems
• Data mining systems can be classified into different
categories based on various criteria. Some of the common
classification schemes include:
1. Classification based on the type of
knowledge mined:
• Association rule mining:
This type of mining aims to discover relationships between
different items in a large dataset.
For example, a supermarket might use association rule
mining to identify products that are frequently purchased
together, so that they can be placed close to each other on
the shelves.
• Classification:
This type of mining aims to predict the class label of a
new data point based on its attributes.
For example, a bank might use classification to predict
whether a loan applicant is likely to default on their loan.
• Clustering:
This type of mining aims to group data points into
clusters based on their similarity.
For example, a marketing company might use clustering
to group customers into different segments based on
their demographics and purchasing behavior.
• Regression:
This type of mining aims to model the
relationship between a dependent
variable and one or more independent
variables.
For example, a company might use
regression to predict sales based on
marketing spend and economic factors.
2. Classification based on the type of
database mined:
• Relational database mining:
This type of mining focuses on data stored in relational databases, which
are tables with rows and columns.
• Data warehouse mining:
This type of mining focuses on data stored in data warehouses, which are
large repositories of historical data.
• Web mining:
This type of mining focuses on data extracted from the World Wide
Web, such as web pages, web logs, and social media data.
• Text mining:
This type of mining focuses on extracting knowledge from text
documents, such as emails, news articles, and books.
3. Classification based on the techniques
used:
• Machine learning:
This is a broad field of computer science that deals with algorithms
that can learn from data without being explicitly programmed.
Machine learning techniques are widely used in data mining for
tasks such as classification, clustering, and regression.
• Statistics:
Statistical techniques are used in data mining for tasks such as data
exploration, hypothesis testing, and model evaluation.
• Data visualization:
Data visualization techniques are used in data mining to
communicate insights from the data to a human audience.
4. Classification based on the application
domain:
• Finance:
Data mining is used in finance for tasks such as fraud detection and credit risk
assessment.
• Retail:
Data mining is used in retail for tasks such as customer
segmentation, targeted marketing, and sales forecasting.
• Healthcare:
Data mining is used in healthcare for tasks such as disease
diagnosis, patient risk prediction, and drug discovery.
• Telecommunications:
Data mining is used in telecommunications for tasks such as network fraud
detection and traffic optimization.
Data mining major issues and
challenges.
Data mining, the process of discovering patterns and extracting
useful information from large datasets, is a powerful tool for
businesses, researchers, and various industries.
However, it comes with its own set of issues and challenges.
Some of the major issues in data mining include:
1.Data Quality:
Poor data quality can significantly impact the results of data mining.
Inaccurate, incomplete, or inconsistent data can lead to unreliable
patterns and conclusions.
2.Data Privacy and Security:
With the increasing amount of personal and sensitive data being
collected, privacy concerns arise. Ensuring that data is handled
securely and adheres to privacy regulations is a major challenge.
3.Data Complexity:
Datasets are becoming increasingly complex in terms of
size and dimensionality. Handling high-dimensional data
and extracting meaningful patterns from large datasets
can be computationally intensive and challenging.
4.Scalability:
As datasets grow in size, algorithms must be able to scale
to handle the increased computational demands.
Scalability is a key concern, especially for organizations
dealing with big data.
5.Lack of Domain Knowledge:
Understanding the specific domain and context of the
data is crucial for effective data mining. Lack of domain
knowledge can lead to misinterpretation of results and
the discovery of irrelevant patterns.
6.Data Integration:
Data mining often requires combining information from various
sources. Integrating diverse datasets with different structures and
formats can be challenging, and inconsistencies may arise.
7.Bias and Fairness:
Bias in data, whether due to historical trends or sampling issues,
can lead to biased models. Ensuring fairness in the outcomes and
addressing biases is an ongoing concern in data mining, especially
in applications like predictive policing or hiring.
8.Interpretability:
Many advanced machine learning algorithms, such as deep neural
networks, are often viewed as "black boxes" because of their
complex structures. Understanding and interpreting the results of
these models can be challenging, especially in critical applications
where decision-making transparency is important.
9.Overfitting:
Overfitting occurs when a model captures noise
in the training data rather than the underlying
patterns. Balancing model complexity and
generalizability is a common challenge in data
mining.
10.Ethical Concerns:
The ethical use of data, especially when dealing
with sensitive information, is a growing concern.
Ensuring that data mining practices adhere to
ethical standards and guidelines is crucial.
KDD and DBMS vs data mining
KDD (Knowledge Discovery in Databases) and DBMS (Database
Management System) are related concepts, and data mining is
a crucial component of the knowledge discovery process.
Database Management System (DBMS):
A DBMS is a software system designed to manage and
organize data in databases.
It provides a structured way to store, retrieve, and manage
data efficiently.
Examples of DBMS include MySQL, Oracle, SQL Server, and
PostgreSQL.
Knowledge Discovery in Databases (KDD):
KDD is the overall process of discovering
useful knowledge from large amounts of data
stored in databases.
It involves several stages, including data
selection, preprocessing, transformation, data
mining, pattern evaluation, and knowledge
presentation.
KDD aims to extract valuable, previously
unknown information from data.
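The stages listed above can be pictured as a chain of small functions, one per stage. The sketch below is a toy illustration only: the records, the function names and the use of a frequency count as the "mining" step are all assumptions chosen to keep the example short.

```python
from collections import Counter

# Hypothetical raw records (illustrative data only).
raw = [
    {"item": "Coffee ", "qty": 2}, {"item": "cookies", "qty": 1},
    {"item": "coffee", "qty": 1}, {"item": None, "qty": 3},
    {"item": "tea", "qty": 1}, {"item": "COFFEE", "qty": 4},
]

def select(rows):
    """Data selection: keep only the field of interest."""
    return [row["item"] for row in rows]

def preprocess(items):
    """Preprocessing: drop missing values."""
    return [item for item in items if item is not None]

def transform(items):
    """Transformation: normalise values to a common form."""
    return [item.strip().lower() for item in items]

def mine(items):
    """Data mining: here, a simple frequency count stands in for an algorithm."""
    return Counter(items)

def evaluate(counts, min_count):
    """Pattern evaluation: keep only patterns that meet a threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: produce a human-readable summary."""
    return [f"{item}: seen {n} times" for item, n in sorted(patterns.items())]

report = present(evaluate(mine(transform(preprocess(select(raw)))), min_count=2))
print(report)
```

In practice each stage is far more involved, and the process is usually iterative rather than a single pass.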
Data Mining:
• Data mining is a specific step within the KDD
process.
• It involves the extraction of patterns or knowledge
from large sets of data.
• Techniques used in data mining include machine
learning, statistical analysis, and other advanced
analytical methods.
• Data mining helps uncover hidden patterns, trends,
and relationships in the data.
• DBMS is the foundation for storing and
managing data, KDD is the broader process
of discovering knowledge from data, and
data mining is a specific technique within
KDD used to extract patterns and knowledge.
• Data mining relies on the data stored in
databases managed by DBMS.
The relationship between these concepts can
be illustrated in the following sequence:
DBMS provides the data infrastructure, KDD
encompasses the entire process of knowledge
discovery, and data mining is a key technique
employed within the KDD process for pattern
extraction from the data stored in DBMS.
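A small sketch of this division of labour, using Python's built-in `sqlite3` module as the DBMS. The table name, columns and rows are hypothetical, and counting co-purchased product pairs is just one simple stand-in for a mining step.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# DBMS role: store and manage the data (in-memory database, made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", "coffee"), ("ann", "cookies"), ("bob", "coffee"),
     ("bob", "cookies"), ("cat", "tea")],
)

# DBMS role: retrieve the relevant data with a query.
rows = conn.execute(
    "SELECT customer, product FROM purchases ORDER BY customer, product"
).fetchall()
conn.close()

# Data mining role: count product pairs bought by the same customer.
baskets = {}
for customer, product in rows:
    baskets.setdefault(customer, []).append(product)

pair_counts = Counter()
for products in baskets.values():
    for pair in combinations(products, 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))
```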
Data Mining Techniques
The following are some of the fundamental data mining techniques
commonly used across industry verticals.
1. Association rule
The association rule refers to the if-then statements that
establish correlations and relationships between two or more
data items.
The correlations are evaluated using support and confidence
metrics. Support measures how frequently the data items
occur together within the dataset.
Confidence measures how often the "then" part holds in the
cases where the "if" part occurs.
For example, while tracking a customer’s behavior
when purchasing online items, an observation is
made that the customer generally buys cookies
when purchasing a coffee pack.
In such a case, the association rule establishes the
relation between the two items, cookies and coffee
packs, thereby forecasting future buys whenever
the customer adds the coffee pack to the shopping
cart.
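Support and confidence for the coffee and cookies rule can be computed directly from a list of transactions. The five baskets below are invented for illustration, and the function names are this example's own.

```python
# Hypothetical shopping baskets (illustrative data only).
transactions = [
    {"coffee", "cookies"},
    {"coffee", "cookies", "milk"},
    {"coffee", "sugar"},
    {"cookies", "milk"},
    {"coffee", "cookies", "sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

rule_support = support({"coffee", "cookies"}, transactions)
rule_confidence = confidence({"coffee"}, {"cookies"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
# prints: support=0.60 confidence=0.75
```

Here coffee and cookies appear together in 3 of 5 baskets, and 3 of the 4 baskets that contain coffee also contain cookies.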
2. Classification
The classification data mining technique classifies data items
within a dataset into different categories.
For example, we can classify vehicles into different categories,
such as sedan, hatchback, petrol, diesel, electric vehicle, etc.,
based on attributes such as the vehicle’s shape, wheel type, or
even number of seats.
When a new vehicle arrives, we can categorize it into various
classes depending on the identified vehicle attributes.
One can apply the same classification strategy to classify
customers based on their age, address, purchase history, and
social group.
Some of the examples of classification methods include
decision trees, Naive Bayes classifiers, logistic regression, and
so on.
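As one concrete instance, the sketch below implements a very small Naive Bayes classifier for categorical attributes, with add-one (Laplace) smoothing. The training vehicles, the attribute names `shape` and `seats`, and the function names are assumptions for illustration; a library implementation would normally be used in practice.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled vehicles: (attributes, class label).
training = [
    ({"shape": "three-box", "seats": 5}, "sedan"),
    ({"shape": "three-box", "seats": 4}, "sedan"),
    ({"shape": "two-box", "seats": 5}, "hatchback"),
    ({"shape": "two-box", "seats": 4}, "hatchback"),
    ({"shape": "two-box", "seats": 5}, "hatchback"),
]

def train(samples):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for attrs, label in samples:
        for name, value in attrs.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen

def classify(attrs, model):
    """Return the class with the highest log-probability for the attributes."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for name, value in attrs.items():
            # Add-one smoothing so unseen values do not zero out the score.
            numerator = value_counts[(label, name)][value] + 1
            denominator = count + len(values_seen[name])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
print(classify({"shape": "three-box", "seats": 5}, model))
print(classify({"shape": "two-box", "seats": 4}, model))
```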
3. Clustering
Clustering data mining techniques group
data elements into clusters that share
common characteristics.
We can cluster data pieces into categories
by simply identifying one or more attributes.
Some of the well-known clustering
techniques are k-means clustering,
hierarchical clustering, and Gaussian
mixture models.
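A bare-bones k-means loop on two-dimensional points is sketched below. The points and the fixed starting centroids are made up so that the run is deterministic; real implementations choose starting centroids more carefully and stop when assignments no longer change.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]  # keep an empty cluster's centroid in place
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(clusters[0])
print(clusters[1])
```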
4. Regression
• Regression is a statistical modeling technique using
previous observations to predict new data values.
• In other words, it models the relationship between a set of
defined variables and the data value to be predicted.
• Because the predicted value is continuous, this kind of model
is sometimes called a ‘Continuous Value Classifier’.
• Linear regression, multivariate regression, and
decision trees are key examples of this type.
5. Sequence & path analysis
• One can also mine sequential data to determine
patterns, wherein specific events or data values
lead to other events in the future.
• This technique is applied for long-term data as
sequential analysis is key to identifying trends or
regular occurrences of certain events.
• For example, when a customer buys a grocery
item, you can use a sequential pattern to suggest
or add another item to the basket based on the
customer’s purchase pattern.
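One very simple way to mine such sequences is to count which item most often follows another in past purchase sequences. The purchase histories and function names below are hypothetical, and dedicated sequential pattern mining algorithms go well beyond this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical ordered purchase sequences (illustrative data only).
purchase_history = [
    ["bread", "butter", "jam"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["butter", "jam"],
]

def next_item_counts(sequences):
    """Count how often each item is immediately followed by another."""
    follows = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            follows[current][following] += 1
    return follows

def suggest(item, follows):
    """Suggest the item that most often follows the given one, if any."""
    if item not in follows:
        return None
    return follows[item].most_common(1)[0][0]

follows = next_item_counts(purchase_history)
print(suggest("bread", follows))
print(suggest("butter", follows))
```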
6. Neural networks
Neural networks are algorithms loosely
modeled on the human brain that try to
replicate its activity to accomplish a
desired goal or task.
These are used for several pattern
recognition applications that typically involve
deep learning techniques.
Neural networks are a consequence of
advanced machine learning research.
7. Prediction
The prediction data mining technique is typically
used for predicting the occurrence of an event, such
as the failure of machinery or a fault in an industrial
component, a fraudulent event, or company profits
crossing a certain threshold.
Prediction techniques can help analyze trends,
establish correlations, and do pattern matching when
combined with other mining methods.
Using such a mining technique, data miners can
analyze past instances to forecast future events.