Data Mining Intro, Functionalities, Issues

Data mining is the automated analysis of massive data sets to extract useful patterns and knowledge, addressing the challenge of overwhelming data growth. The knowledge discovery process includes data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Various data types and mining functionalities, such as classification, regression, and clustering, are utilized to derive insights and support decision-making across multiple domains.

Uploaded by Sakthivel.A

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
    • Data collection and data availability
        • Automated data collection tools, database systems, the Web, computerized society
    • Major sources of abundant data
        • Business: Web, e-commerce, transactions, stocks, …
        • Science: remote sensing, bioinformatics, scientific simulation, …
        • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining is the automated analysis of massive data sets

July 31, 2025 Data Mining: Concepts and Techniques 2
What Is Data Mining?

• Data mining (knowledge discovery from data)
    • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
    • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data mining as a step in the process of knowledge discovery


Knowledge Discovery Process

• Data cleaning: to remove noise and inconsistent data
• Data integration: where multiple data sources may be combined
• Data selection: where data relevant to the analysis task are retrieved from the database
• Data transformation: where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations

Knowledge Discovery Process (continued)

• Data mining: an essential process where intelligent methods are applied to extract data patterns
• Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on interestingness measures
• Knowledge presentation: where visualization and knowledge representation techniques are used to present mined knowledge to users
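The seven steps of the knowledge discovery process can be sketched as a small pipeline. This is only an illustration in plain Python, not a reference implementation: the two toy sources, the field names (id, region, amount), and the "total above a threshold" pattern are all assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical raw sources; field names and values are invented for illustration.
store_a = [{"id": 1, "region": "north", "amount": 120.0},
           {"id": 2, "region": "north", "amount": None},    # missing value
           {"id": 3, "region": "south", "amount": 40.0}]
store_b = [{"id": 4, "region": "south", "amount": 75.0},
           {"id": 5, "region": "east", "amount": 10.0}]

def clean(rows):
    """Data cleaning: drop records with missing amounts."""
    return [r for r in rows if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine multiple sources into one list."""
    return [r for src in sources for r in src]

def select(rows, regions):
    """Data selection: keep only records relevant to the task."""
    return [r for r in rows if r["region"] in regions]

def transform(rows):
    """Data transformation: aggregate amounts per region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)

def mine(totals, threshold):
    """Data mining (toy stand-in): regions whose total exceeds a threshold."""
    return {region: t for region, t in totals.items() if t > threshold}

def evaluate(patterns, top_n):
    """Pattern evaluation: rank patterns by total and keep the top ones."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

def present(ranked):
    """Knowledge presentation: format results for a user."""
    return [f"{region}: total sales {total:.2f}" for region, total in ranked]

rows = integrate(clean(store_a), clean(store_b))
rows = select(rows, regions={"north", "south"})
totals = transform(rows)
report = present(evaluate(mine(totals, threshold=100.0), top_n=5))
```

Each function maps to one step, so a real system could swap any stage (for example, a proper mining algorithm in place of the threshold test) without touching the others.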


Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications
    • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data (incl. bio-sequences)
    • Structured data, graphs, social networks, and multi-linked data
    • Object-relational databases
    • Heterogeneous databases and legacy databases
    • Spatial data and spatiotemporal data
    • Multimedia databases
    • Text databases
    • The World Wide Web

What Kinds of Patterns Can Be Mined?

Data mining functionalities are:
• Characterization and discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and regression for predictive analysis
• Cluster analysis
• Outlier analysis

Two categories:
• Descriptive
• Predictive


Class/Concept Description: Characterization and Discrimination

• Data characterization: summarization of the general characteristics or features of a target class of data
• Example: summarize the characteristics of customers who spend more than $5000 a year
• The result is a general profile of these customers, such as that they are 40 to 50 years old, employed, and have excellent credit ratings
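A minimal sketch of data characterization, assuming a handful of made-up customer records: select the target class (customers who spend more than $5000 a year) and summarize its general features. The field names and values are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; all values are invented for illustration.
customers = [
    {"age": 45, "employed": True,  "credit": "excellent", "yearly_spend": 7200},
    {"age": 48, "employed": True,  "credit": "excellent", "yearly_spend": 5600},
    {"age": 41, "employed": True,  "credit": "good",      "yearly_spend": 6100},
    {"age": 23, "employed": False, "credit": "fair",      "yearly_spend": 900},
    {"age": 35, "employed": True,  "credit": "good",      "yearly_spend": 3000},
]

def characterize(rows):
    """Summarize the general features of a target class of records."""
    return {
        "count": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        "share_employed": sum(r["employed"] for r in rows) / len(rows),
        "most_common_credit": Counter(r["credit"] for r in rows).most_common(1)[0][0],
    }

# Target class: customers who spend more than $5000 a year.
big_spenders = [c for c in customers if c["yearly_spend"] > 5000]
profile = characterize(big_spenders)
```

With this toy data the profile describes three customers in their forties, all employed, most often with an excellent credit rating.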


Data Discrimination

• A comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes
• Example: compare two groups of customers, those who shop for computer products regularly and those who rarely shop for such products
• The resulting description provides a general comparative profile of these customers
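Data discrimination can be sketched as profiling two classes with the same features and reporting the differences. The records, the feature names, and the cutoff of 5 purchases a year that separates "regular" from "rare" shoppers are all assumptions made for this example.

```python
from statistics import mean

# Hypothetical customers; computer_purchases counts purchases of computer products per year.
customers = [
    {"age": 24, "has_degree": True,  "computer_purchases": 9},
    {"age": 29, "has_degree": True,  "computer_purchases": 7},
    {"age": 33, "has_degree": False, "computer_purchases": 6},
    {"age": 58, "has_degree": False, "computer_purchases": 0},
    {"age": 64, "has_degree": True,  "computer_purchases": 1},
    {"age": 61, "has_degree": False, "computer_purchases": 0},
]

def profile(rows):
    """General features of one class."""
    return {
        "mean_age": mean(r["age"] for r in rows),
        "share_with_degree": sum(r["has_degree"] for r in rows) / len(rows),
    }

def discriminate(target, contrast):
    """Comparative profile: target value, contrast value, and difference per feature."""
    t, c = profile(target), profile(contrast)
    return {k: {"target": t[k], "contrast": c[k], "difference": t[k] - c[k]} for k in t}

# The cutoff of 5 purchases a year is an arbitrary assumption.
regular = [c for c in customers if c["computer_purchases"] >= 5]
rare = [c for c in customers if c["computer_purchases"] < 5]
comparison = discriminate(regular, rare)
```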


Mining Frequent Patterns, Associations and Correlations

• Frequent patterns: patterns that occur frequently in data
• Frequent itemset: a set of items that often appear together in a transactional data set
• Frequent sequential pattern: a frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card
• Frequent structured patterns
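To make "frequent itemset" concrete, the sketch below counts every pair of items across a few hypothetical transactions and keeps the pairs that reach a minimum count. It uses brute-force enumeration, which is fine for toy data but is not one of the scalable algorithms used in practice; the transactions and the minimum count of 3 are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; items are invented for illustration.
transactions = [
    {"laptop", "camera", "memory card"},
    {"laptop", "camera"},
    {"laptop", "memory card"},
    {"camera"},
    {"laptop", "camera", "memory card"},
]

def frequent_itemsets(transactions, size, min_count):
    """Count every itemset of the given size; keep those seen at least min_count times."""
    counts = Counter()
    for t in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

pairs = frequent_itemsets(transactions, size=2, min_count=3)
```

Here the pair (camera, memory card) appears in only two transactions and is dropped, while the two pairs involving the laptop each appear three times and are kept.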


Mining Frequent Patterns, Associations and Correlations (continued)

• Example rule: computer ⇒ antivirus software [support = 2%, confidence = 60%]
• The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (the union of sets A and B, i.e., every item in A and every item in B). This is taken to be the probability P(A ∪ B). Note that here P(A ∪ B) means the probability that a transaction contains all items of both A and B, not the probability that it contains A or B.
    Support(A ⇒ B) = P(A ∪ B)
• The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B|A).
    Confidence(A ⇒ B) = P(B|A)
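The two definitions translate directly into code. The sketch below computes support and confidence for the rule computer ⇒ antivirus software over five hypothetical transactions; the data is invented, so the resulting numbers differ from the 2% and 60% quoted on the slide.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "antivirus software"},
    {"computer", "antivirus software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "printer"},
]

def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain every item of A and of B."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions that contain A, the fraction that also contain B: P(B|A)."""
    with_a = [t for t in transactions if antecedent <= t]
    if not with_a:
        raise ValueError("no transaction contains the antecedent")
    return sum(consequent <= t for t in with_a) / len(with_a)

A, B = {"computer"}, {"antivirus software"}
s = support(A, B, transactions)      # 2 of the 5 transactions contain both items
c = confidence(A, B, transactions)   # 2 of the 4 transactions with a computer
```

Confidence divides by the number of transactions containing A, not B, which is why it corresponds to P(B|A).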
Mining Frequent Patterns, Associations and Correlations (continued)

• Single-dimensional association rule:
    buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
• Multidimensional association rule:
    age(X, "20..29") ∧ income(X, "40K..49K") ⇒ buys(X, "laptop") [support = 2%, confidence = 60%]


Classification and Regression for Predictive Analysis

• Classification is the process of finding a model that describes and distinguishes data classes or concepts
• The model is derived from the analysis of a set of training data
• The model is used to predict the class label of objects for which the class label is unknown
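The train-then-predict cycle described above can be sketched with a nearest-centroid classifier. This is just one simple model chosen for brevity, not a method the slides prescribe; the features (age, income in $1000s) and the class labels are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical training data: (age, income in $1000s) -> class label.
training = [
    ((25, 30), "low risk"),
    ((30, 45), "low risk"),
    ((52, 20), "high risk"),
    ((60, 15), "high risk"),
]

def train(examples):
    """Derive a model from training data: the mean feature vector (centroid) per class."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in grouped.items()}

def predict(model, features):
    """Predict the class whose centroid is closest to the unlabeled object."""
    return min(model, key=lambda label: math.dist(model[label], features))

model = train(training)
label = predict(model, (28, 40))   # an object whose class label is unknown
```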


Classification and Regression for Predictive Analysis (continued)

• The derived model may be represented in various forms:
    • Decision trees
    • IF-THEN rules
    • Neural networks


Decision Tree

IF-THEN Rules

Neural Networks
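The figures for these three slides did not survive extraction. As a stand-in for the first two forms, the sketch below expresses one hypothetical model both as a decision tree and as an equivalent list of IF-THEN rules. The attributes (age, income) and class names are invented for illustration and are not taken from the original figures; neural networks are not covered here.

```python
# Hypothetical model in two equivalent forms; attribute names and classes are invented.

# Form 1: decision tree as nested dictionaries. An internal node tests one
# attribute; a leaf is a plain string holding the class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "income",
            "branches": {"high": "class A", "low": "class B"},
        },
        "middle_aged": "class C",
        "senior": "class C",
    },
}

def classify_with_tree(node, record):
    """Follow branches until a leaf (a string) is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node

# Form 2: the same model as an ordered list of IF-THEN rules.
rules = [
    ({"age": "youth", "income": "high"}, "class A"),
    ({"age": "youth", "income": "low"}, "class B"),
    ({"age": "middle_aged"}, "class C"),
    ({"age": "senior"}, "class C"),
]

def classify_with_rules(rules, record):
    """Return the conclusion of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(record.get(k) == v for k, v in conditions.items()):
            return label
    return None  # no rule fired

record = {"age": "youth", "income": "low"}
tree_label = classify_with_tree(tree, record)
rule_label = classify_with_rules(rules, record)
```

Each root-to-leaf path in the tree corresponds to exactly one rule, which is why both forms give the same answer.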


Regression Models

• Regression is used to predict missing or unavailable numerical data values rather than discrete class labels
• Regression analysis is a statistical methodology that is most often used for numerical prediction
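As a small illustration of numerical prediction, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. Simple linear regression is only one of many regression models, and the data (years of experience against salary) is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of experience -> salary in $1000s.
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]
slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept   # numeric prediction for an unseen value
```

The output is a number on a continuous scale, in contrast to classification, which returns a discrete class label.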


Cluster Analysis



Cluster Analysis (continued)

• Unsupervised learning (i.e., the class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: maximize intra-class similarity and minimize inter-class similarity
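A minimal sketch of clustering without class labels, using plain k-means. k-means is one common algorithm chosen here as an assumption; the slides do not name a method. The house coordinates and the naive choice of the first k points as starting centers are also assumptions.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = list(points[:k])  # naive deterministic start: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centers, clusters = k_means(houses, k=2)
```

Points that end up in the same cluster are close to each other (high intra-class similarity), while the two cluster centers are far apart (low inter-class similarity).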
Outlier Analysis



Outlier Analysis (continued)

• Outlier analysis
    • Outlier: a data object that does not comply with the general behavior of the data
    • Noise or exception? One person's garbage could be another person's treasure
    • Useful in fraud detection and rare-event analysis

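One simple way to flag objects that deviate from the general behavior of the data is a z-score test, sketched below. The method, the threshold of 2 standard deviations, and the transaction amounts are all assumptions for illustration; real fraud detection uses far richer models.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; 500 deviates from the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = z_score_outliers(amounts)
```

Whether the flagged value is noise to discard or an exception worth investigating depends on the application, which is the point of the "garbage or treasure" remark.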
Are All Patterns Interesting?

• Data mining may generate thousands of patterns, but not all of them are interesting
• A pattern is interesting if it is
    • Easily understood by humans
    • Valid on new or test data with some degree of certainty
    • Potentially useful
    • Novel
• A pattern is also interesting if it validates a hypothesis that the user seeks to confirm
• An interesting pattern represents knowledge

Are All Patterns Interesting? (continued)

• Objective measures:
    • Based on the structure of discovered patterns and the statistics underlying them
    • E.g., support, confidence, etc.
    • Rules that do not satisfy a threshold are considered uninteresting
• Subjective measures:
    • Reflect the needs and interests of a particular user
    • Based on user beliefs in the data
• Objective and subjective measures need to be combined

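Combining objective and subjective measures can be sketched as a two-part filter. The objective part applies minimum support and confidence thresholds; the subjective part is a deliberately crude stand-in that keeps only rules mentioning items the user cares about. The rules, thresholds, and user interests are all hypothetical.

```python
# Hypothetical mined rules with their objective measures (support, confidence).
rules = [
    {"items": {"computer", "antivirus software"}, "support": 0.02, "confidence": 0.60},
    {"items": {"printer", "paper"}, "support": 0.05, "confidence": 0.35},
    {"items": {"camera", "tripod"}, "support": 0.001, "confidence": 0.90},
    {"items": {"bread", "milk"}, "support": 0.10, "confidence": 0.70},
]

def is_interesting(rule, min_support, min_confidence, user_items):
    # Objective test: both measures must reach their thresholds.
    objective = rule["support"] >= min_support and rule["confidence"] >= min_confidence
    # Subjective test (crude stand-in): the rule mentions an item the user cares about.
    subjective = bool(rule["items"] & user_items)
    return objective and subjective

kept = [r for r in rules
        if is_interesting(r, min_support=0.01, min_confidence=0.50,
                          user_items={"computer", "camera"})]
```

Only the computer rule survives: the printer and camera rules fail a threshold, and the bread rule passes the thresholds but is of no interest to this user.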
Are All Patterns Interesting? (continued)

• Find all the interesting patterns: completeness
    • Unrealistic and inefficient
    • User-provided constraints and interestingness measures should be used
• Search for only interesting patterns: an optimization problem
    • Highly desirable
    • No need to search through the generated patterns to identify the truly interesting ones
    • Measures can be used to rank the discovered patterns according to their interestingness

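Ranking discovered patterns by a measure is a one-line sort. The sketch below orders hypothetical rules by confidence and breaks ties by support; which measure to rank by is an assumption, since the slide does not fix one.

```python
# Hypothetical patterns with objective measures.
patterns = [
    {"rule": "A => B", "support": 0.02, "confidence": 0.60},
    {"rule": "C => D", "support": 0.05, "confidence": 0.80},
    {"rule": "E => F", "support": 0.03, "confidence": 0.60},
]

def rank(patterns):
    """Order by confidence, breaking ties by support, highest first."""
    return sorted(patterns, key=lambda p: (p["confidence"], p["support"]), reverse=True)

ranked = [p["rule"] for p in rank(patterns)]
```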
Which Technologies Are Used?

What Kinds of Applications Are Targeted?

• Business intelligence
• Web search engines

Major Issues in Data Mining (1)

• Mining methodology
    • Mining various and new kinds of knowledge
    • Mining knowledge in multi-dimensional space
    • Data mining: an interdisciplinary effort
    • Boosting the power of discovery in a networked environment
    • Handling noise, uncertainty, and incompleteness of data
    • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
    • Interactive mining
    • Incorporation of background knowledge
    • Presentation and visualization of data mining results

Major Issues in Data Mining (2)

• Efficiency and scalability
    • Efficiency and scalability of data mining algorithms
    • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
    • Handling complex types of data
    • Mining dynamic, networked, and global data repositories
• Data mining and society
    • Social impacts of data mining
    • Privacy-preserving data mining
    • Invisible data mining
