Data Mining (Introduction)

Data mining is the automated analysis of massive data sets to extract useful patterns and knowledge, addressing the challenge of overwhelming data growth. It involves processes such as data cleaning, integration, selection, transformation, mining, evaluation, and presentation, forming the core of the knowledge discovery process (KDD). Applications span various fields including business intelligence, web analysis, and medical data analysis, with ongoing challenges related to methodology, efficiency, and societal impacts.

Uploaded by

vishalmishra622

Data Mining

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes
  • Data collection and data availability
    • Automated data collection tools, database systems, the Web, computerized society
  • Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
What Is Data Mining?
Data mining is the process of sorting through large amounts of data and picking out relevant information.

In other words,
• Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Other names
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business intelligence, etc.
Some Definitions
• Data: Data are any facts, numbers, or text that can be processed by a computer.
  • Operational or transactional data, such as sales, cost, inventory, payroll, and accounting
  • Nonoperational data, such as industry sales, forecast data, and macroeconomic data
  • Metadata: data about the data itself, such as logical database design or data dictionary definitions

• Information: The patterns, associations, or relationships among all this data can provide information.
Definitions (continued)
• Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in terms of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

• Data warehouses: Data warehousing is defined as a process of centralized data management and retrieval.
Data Warehouse Example
Data Rich, Information Poor
A Web Mining Framework
• Web mining usually involves
  • Data cleaning
  • Data integration from multiple sources
  • Warehousing the data
  • Data cube construction
  • Data selection for data mining
  • Data mining
  • Presentation of the mining results
  • Patterns and knowledge to be used or stored in a knowledge base
Data Mining in Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom, with increasing potential to support business decisions toward the top]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining Process: Knowledge Discovery from Data
The KDD process includes

• Data cleaning (to remove noise and inconsistent data)
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the database)
• Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
KDD Process (continued)
• Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
• Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Data mining is the core of the knowledge discovery process.
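
The steps listed above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the records, field names, and the choice of "co-occurring item pairs" as the mining step are hypothetical and do not come from the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_sales = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c2", "item": "software", "amount": None},  # noisy record
    {"customer": "c1", "item": "software", "amount": 120.0},
]
web_sales = [
    {"customer": "c3", "item": "computer", "amount": 950.0},
    {"customer": "c3", "item": "software", "amount": 80.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate items into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how often each item pair shares a basket."""
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)] += 1
    return pairs

def evaluate(pairs, min_count):
    """Pattern evaluation: keep pairs that meet a minimum count."""
    return {p: n for p, n in pairs.items() if n >= min_count}

records = integrate(clean(store_sales), clean(web_sales))
baskets = transform(select(records, {"computer", "software"}))
patterns = evaluate(mine(baskets), min_count=2)

# Knowledge presentation: print the surviving patterns.
for pair, count in patterns.items():
    print(pair, count)
```

Each function corresponds to one KDD step, so a step can be swapped out (for example, a different interestingness measure in `evaluate`) without touching the rest.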

Knowledge Discovery (KDD) Process
• Data mining: core of the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing    Data Mining                  Post-Processing
Data integration       Pattern discovery            Pattern evaluation
Normalization          Association & correlation    Pattern selection
Feature selection      Classification               Pattern interpretation
Dimension reduction    Clustering                   Pattern visualization
                       Outlier analysis
                       …

• This is a view from typical machine learning and statistics communities

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
  • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data (incl. bio-sequences)
  • Structured data, graphs, social networks and multi-linked data
  • Object-relational databases
  • Heterogeneous databases and legacy databases
  • Spatial data and spatiotemporal data
  • Multimedia databases
  • Text databases
  • The World Wide Web
Why Confluence of Multiple Disciplines?
• Tremendous amount of data
  • Algorithms must be highly scalable to handle terabytes of data
• High dimensionality of data
  • A microarray may have tens of thousands of dimensions
• High complexity of data
  • Data streams and sensor data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and multi-linked data
  • Heterogeneous databases and legacy databases
  • Spatial, spatiotemporal, multimedia, text and Web data
  • Software programs, scientific simulations
• New and sophisticated applications
Functionalities/Techniques:

• Concept/Class Description: Characterization and Discrimination
• Mining Frequent Patterns, Associations and Correlations
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Concept/Class Description: Characterization and Discrimination
• Data Characterization: A data mining system should be able to produce a description summarizing the characteristics of a target class of data, such as a group of customers.

• Example: The characteristics of customers who spend more than $1000 a year at a store called All Electronics. The result can be a general profile of these customers, such as age, employment status, or credit rating.
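
A minimal sketch of such a characterization, assuming a small in-memory list of customer records. The field names (`age`, `employed`, `credit`, `yearly_spend`) and all values are hypothetical, chosen only to mirror the profile attributes mentioned above.

```python
from collections import Counter

# Hypothetical customer records; field names and values are illustrative.
customers = [
    {"age": 34, "employed": True,  "credit": "good",      "yearly_spend": 1500},
    {"age": 45, "employed": True,  "credit": "excellent", "yearly_spend": 2200},
    {"age": 23, "employed": False, "credit": "fair",      "yearly_spend": 300},
    {"age": 38, "employed": True,  "credit": "good",      "yearly_spend": 1100},
]

def characterize(records, min_spend):
    """Summarize the customers whose yearly spend exceeds min_spend."""
    target = [r for r in records if r["yearly_spend"] > min_spend]
    if not target:
        return None  # nothing to summarize
    credit_counts = Counter(r["credit"] for r in target)
    return {
        "count": len(target),
        "mean_age": sum(r["age"] for r in target) / len(target),
        "employed_share": sum(r["employed"] for r in target) / len(target),
        "most_common_credit": credit_counts.most_common(1)[0][0],
    }

profile = characterize(customers, min_spend=1000)
print(profile)
```

The result is one summary record for the target class rather than a per-customer prediction, which is what separates characterization from classification.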
Characterization and Discrimination (continued)
• Data Discrimination: A comparison of the general features of the target class data objects with the general features of objects from one or a set of contrasting classes. The user can specify the target and contrasting classes.
• Example: The user may like to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by about 30% in the same period.
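
The comparison in this example might look like the following sketch. The product records, the features compared (`price`, `has_trial`), and the exact thresholds used to form the two classes are assumptions made for illustration.

```python
# Hypothetical software products with last year's sales change in percent.
products = [
    {"name": "A", "price": 40,  "has_trial": True,  "sales_change": 12},
    {"name": "B", "price": 55,  "has_trial": True,  "sales_change": 15},
    {"name": "C", "price": 90,  "has_trial": False, "sales_change": -30},
    {"name": "D", "price": 110, "has_trial": False, "sales_change": -35},
]

def profile(group):
    """General features of one class: mean price and share offering a trial."""
    n = len(group)
    return {
        "mean_price": sum(p["price"] for p in group) / n,
        "trial_share": sum(p["has_trial"] for p in group) / n,
    }

# Target class: sales rose by at least 10%. Contrasting class: sales fell by 30% or more.
target = [p for p in products if p["sales_change"] >= 10]
contrast = [p for p in products if p["sales_change"] <= -30]

target_profile = profile(target)
contrast_profile = profile(contrast)

# Discrimination: show each feature side by side for the two classes.
for feature in target_profile:
    print(feature, target_profile[feature], contrast_profile[feature])
```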
Mining Frequent Patterns, Associations and Correlations
Frequent patterns: as the name suggests, patterns that occur frequently in data.
Association analysis: from a marketing perspective, determining which items are frequently purchased together within the same transaction.

Example: a rule mined from the transactional database of a store called All Electronics.

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
• X represents a customer.
• Confidence = 50% means that if a customer buys a computer, there is a 50% chance that they will buy software as well.
• Support = 1% means that 1% of all the transactions under analysis showed that a computer and software were purchased together.
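
As a rough illustration of how these two numbers are computed, the sketch below measures support and confidence for the computer/software rule on a handful of made-up transactions. The transaction data and function names are hypothetical, so the values differ from the 1% and 50% quoted above.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0
    return count(transactions, antecedent | consequent) / base

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here two of the five transactions contain both items, and two of the three computer purchases also include software.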
Mining Frequent Patterns, Associations and Correlations (continued)
• Another example:
  age(X, “20…29”) ∧ income(X, “20K-29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]
• For customers between 20 and 29 years of age with an income of $20,000 to $29,000, there is a 60% chance that they will purchase a CD player, and 2% of all the transactions under analysis showed customers in this age group and income range buying a CD player.
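
The two measures attached to such rules are commonly defined as follows. This is the standard formulation, which the slides do not spell out. A is the left-hand side of the rule, B is the right-hand side, and P(A ∪ B) denotes the fraction of transactions that satisfy both A and B.

```latex
\[
\operatorname{support}(A \Rightarrow B) = P(A \cup B), \qquad
\operatorname{confidence}(A \Rightarrow B) = P(B \mid A)
  = \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)}
\]
```

Under these definitions, a rule with support 2% and confidence 60% implies that about 0.02 / 0.60 ≈ 3.3% of the transactions under analysis come from customers in that age and income range.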
Classification and Prediction
• Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

• A classification model can be represented in various forms, such as
  • IF-THEN rules
  • A decision tree
  • A neural network
Classification Model
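
To make "finding a model" concrete, here is a small sketch that learns IF-THEN rules from labeled examples and then predicts the class of an unlabeled object. It uses the simple one-attribute rule learner often called 1R (OneR), chosen here only for brevity; the slides do not name a learning algorithm, and the training data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical labeled training data (attribute values and labels are made up).
training = [
    ({"age": "youth",  "income": "high"}, "buys"),
    ({"age": "youth",  "income": "low"},  "no"),
    ({"age": "senior", "income": "high"}, "buys"),
    ({"age": "senior", "income": "low"},  "no"),
    ({"age": "youth",  "income": "high"}, "buys"),
]

def learn_rules(examples):
    """1R: pick the attribute whose value -> majority-class rules make the fewest errors."""
    best = None
    for attr in examples[0][0]:
        by_value = defaultdict(Counter)
        for features, label in examples:
            by_value[features[attr]][label] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(label != rules[features[attr]] for features, label in examples)
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best[1], best[2]

attribute, rules = learn_rules(training)

def predict(features):
    """Apply the learned rules: IF attribute == value THEN class."""
    return rules.get(features[attribute])

for value, label in rules.items():
    print(f"IF {attribute} = {value} THEN class = {label}")
print(predict({"age": "senior", "income": "low"}))
```

A decision tree generalizes this idea by testing further attributes inside each branch instead of stopping after one.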
Cluster Analysis

• Clustering analyses data objects without consulting a known class label.

• Example: Cluster analysis can be performed on All Electronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. The figure on the next slide shows a 2-D plot of customers with respect to customer locations in a city.
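
One common way to find such groups is k-means clustering, sketched below on hypothetical 2-D customer locations. The slides do not say which clustering algorithm is meant, so k-means, the coordinates, and the starting centroids are all assumptions.

```python
import math

# Hypothetical customer locations as (x, y) coordinates on a city grid.
locations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
             (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Start from two of the data points; no class labels are used anywhere.
centroids, clusters = kmeans(locations, centroids=[locations[0], locations[3]])
for center, members in zip(centroids, clusters):
    print(center, members)
```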
Cluster Analysis
Outlier Analysis
• Outlier analysis: A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.

• Example: Detecting fraudulent usage of credit cards. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.
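
A minimal sketch of the "extremely large amount" check, assuming the charges for one account are available as a list. It flags values far from the median, measured in median absolute deviations (MAD). The rule, the cutoff `k=5.0`, and the charge amounts are illustrative assumptions rather than a method taken from the slides.

```python
from statistics import median

# Hypothetical charges for a single account; the last one is unusually large.
charges = [42.0, 55.5, 38.2, 61.0, 47.3, 2950.0]

def find_outliers(values, k=5.0):
    """Flag values more than k median absolute deviations (MAD) from the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [v for v in values if abs(v - center) > k * mad]

outliers = find_outliers(charges)
print(outliers)
```

The median is used instead of the mean so that the one extreme charge does not distort the baseline it is compared against.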
Evolution Analysis
• Evolution analysis: Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

• Example: Time-series data. Suppose stock market data (a time series) for the last several years are available from the New York Stock Exchange, and one would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to one’s decision-making regarding stock investments.
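
As a very simple illustration of finding a trend in time-series data, the sketch below smooths a hypothetical price series with a moving average and compares the first and last smoothed values. The prices and the window size are made up, and real evolution analysis would use far richer models.

```python
# Hypothetical closing prices for one stock, oldest first.
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0, 108.0, 112.0]

def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

smoothed = moving_average(prices, window=3)

# A crude trend signal: compare the first and last smoothed values.
trend = "up" if smoothed[-1] > smoothed[0] else "down or flat"
print(smoothed, trend)
```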
Applications of Data Mining
• Web page analysis: from web page classification and clustering to the PageRank and HITS algorithms
• Collaborative analysis and recommender systems
• Basket data analysis to targeted marketing (used by retailers to increase sales by better understanding customer purchasing patterns)
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining
• Mining methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multi-dimensional space
  • Data mining: an interdisciplinary effort
  • Boosting the power of discovery in a networked environment
  • Handling noise, uncertainty, and incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining
• User interaction
  • Interactive mining
  • Incorporation of background knowledge
  • Presentation and visualization of data mining results
Major Issues in Data Mining (continued)
• Efficiency and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
  • Handling complex types of data
  • Mining dynamic, networked, and global data repositories
• Data mining and society
  • Social impacts of data mining
  • Privacy-preserving data mining
  • Invisible data mining
