Data Mining
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
What Is Data Mining?
Data mining is the process of sorting through large
amounts of data and picking out relevant
information.
In other words,
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge from
huge amounts of data
Other names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
Some Definitions
Data: Data are any facts, numbers, or text that
can be processed by a computer. This includes:
operational or transactional data, such as sales,
cost, inventory, payroll, and accounting
nonoperational data, such as industry sales, forecast
data, and macroeconomic data
metadata: data about the data itself, such as
logical database design or data dictionary definitions
Information: The patterns, associations, or
relationships among all this data can provide
information.
Some Definitions (continued)
Knowledge: Information can be converted into knowledge about
historical patterns and future trends. For example, summary
information on retail supermarket sales can be analyzed in
terms of promotional efforts to provide knowledge of consumer
buying behavior. Thus, a manufacturer or retailer could
determine which items are most susceptible to promotional
efforts.
Data Warehouses: Data warehousing is defined as a process of
centralized data management and retrieval.
Data Warehouse Example
Data Rich, Information Poor
A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored in a
knowledge base
Data Mining in Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom. The potential to support business decisions increases toward the top. The role associated with a layer is given in parentheses.]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining Process: Knowledge Discovery from Data
The KDD process includes
Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be combined)
Data selection (where data relevant to the analysis task are retrieved
from the database)
Data transformation (where data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations)
KDD Process (continued)
Data mining (an essential process where intelligent
methods are applied in order to extract data
patterns)
Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
Knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)
Data mining is the core of the knowledge discovery process
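The KDD steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch: the two data sources, the field names, and the choice of "count item pairs per customer" as the mining step are all assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "c1", "item": "computer", "amount": 900.0},
           {"customer": "c2", "item": "software", "amount": None}]
store_b = [{"customer": "c1", "item": "software", "amount": 120.0},
           {"customer": "c3", "item": "computer", "amount": 950.0},
           {"customer": "c3", "item": "software", "amount": 80.0}]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate records into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how many customers bought each pair of items."""
    counts = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                counts[(ordered[i], ordered[j])] += 1
    return counts

def evaluate(counts, min_count):
    """Pattern evaluation: keep pairs that meet a minimum count."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} and {b}: bought together by {n} customers"
            for (a, b), n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, {"computer", "software"}))
patterns = evaluate(mine(baskets), min_count=2)
report = present(patterns)
```

Each function stands in for one KDD step, so a step can be replaced (for example, a different interestingness measure in `evaluate`) without touching the others.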
Knowledge Discovery (KDD) Process
[Figure: the KDD process, with data mining as the core of the knowledge discovery process. Stages from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Data Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics
Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from typical machine learning and statistics communities
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as
terabytes of data
High-dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Functionalities/Techniques:
Concept/Class Description: Characterization and Discrimination
Mining Frequent Patterns, Associations, and Correlations
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
Concept/Class Description: Characterization and Discrimination
Data Characterization: A data mining system should be
able to produce a description summarizing the
characteristics of a target class of data, such as a group of customers.
Example: The characteristics of customers who spend
more than $1000 a year at a store called All
Electronics. The result can be a general profile of these customers, such as
age, employment status, or credit rating.
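A minimal sketch of the characterization example, assuming customer records are available as dictionaries. The field names (`age`, `employment`, `credit`, `yearly_spend`) and all values are hypothetical; the profile here is just a mean and the most common categories, which is one simple way to summarize a target class.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; all field names and values are illustrative.
customers = [
    {"age": 45, "employment": "employed", "credit": "excellent", "yearly_spend": 1500},
    {"age": 41, "employment": "employed", "credit": "good", "yearly_spend": 1200},
    {"age": 23, "employment": "student", "credit": "fair", "yearly_spend": 300},
    {"age": 49, "employment": "employed", "credit": "excellent", "yearly_spend": 2100},
]

def most_common(values):
    """Return the most frequent value in an iterable."""
    return Counter(values).most_common(1)[0][0]

def characterize(records, min_spend):
    """Summarize the target class: customers spending more than min_spend."""
    target = [r for r in records if r["yearly_spend"] > min_spend]
    if not target:
        return None  # nothing to summarize
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "most_common_employment": most_common(r["employment"] for r in target),
        "most_common_credit": most_common(r["credit"] for r in target),
    }

profile = characterize(customers, min_spend=1000)
```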
Characterization and Discrimination (continued)
Data Discrimination: A comparison of the general
features of the target class data objects with the general
features of objects from one or a set of contrasting classes.
The user can specify the target and contrasting classes.
Example: The user may want to compare the general
features of software products whose sales increased by
10% in the last year with those whose sales decreased by
about 30% in the same period.
Mining Frequent Patterns, Associations, and Correlations
Frequent Patterns: As the name suggests, patterns that occur
frequently in data.
Association Analysis: From a marketing perspective, determining
which items are frequently purchased together within the same
transaction.
Example: A rule mined from the All
Electronics transactional database:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%,
confidence = 50%]
X represents a customer.
Confidence = 50% means that if a customer buys a computer, there is a
50% chance that he/she will buy software as well.
Support = 1% means that 1% of all the transactions under
analysis showed that computer and software were purchased
together.
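Support and confidence can be computed directly from a list of transactions. The sketch below uses four made-up transactions (so the numbers differ from the 1% and 50% in the example); the function names are illustrative.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never applies
    return support(transactions, antecedent | consequent) / antecedent_support

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
```

Here one of four transactions contains both items (support 0.25), and one of the two transactions containing a computer also contains software (confidence 0.5).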
Mining Frequent Patterns, Associations, and Correlations (continued)
Another example:
age(X, “20…29”) ∧ income(X, “20K…29K”) ⇒ buys(X, “CD
player”) [support = 2%, confidence = 60%]
For customers between 20 and 29 years of age with an income of
$20,000 to $29,000, there is a 60% chance that they will purchase
a CD player, and 2% of all the transactions under analysis
showed customers in this age group and income range
buying a CD player.
Classification and Prediction
Classification is the process of finding a model that describes
and distinguishes data classes or concepts, so that
the model can be used to predict the class of objects
whose class label is unknown.
A classification model can be represented in various forms, such
as
IF-THEN rules
A decision tree
A neural network
[Figure: Classification Model]
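As a rough illustration of the IF-THEN form, a rule-based classifier can be written as an ordered list of (condition, label) pairs. The rules, attributes, and labels below are hypothetical and hand-written; in practice the rules would be derived from labeled training data rather than typed in.

```python
# Hypothetical IF-THEN rules; attributes, thresholds, and labels are illustrative.
RULES = [
    # IF age < 30 AND income is high THEN likely buyer
    (lambda c: c["age"] < 30 and c["income"] == "high", "likely buyer"),
    # IF age >= 30 AND credit is excellent THEN likely buyer
    (lambda c: c["age"] >= 30 and c["credit"] == "excellent", "likely buyer"),
]
DEFAULT_LABEL = "unlikely buyer"

def classify(customer):
    """Return the label of the first rule whose condition holds, else the default."""
    for condition, label in RULES:
        if condition(customer):
            return label
    return DEFAULT_LABEL

# Objects whose class label is unknown.
unlabeled = [
    {"age": 25, "income": "high", "credit": "fair"},
    {"age": 42, "income": "medium", "credit": "excellent"},
    {"age": 35, "income": "low", "credit": "fair"},
]
predictions = [classify(c) for c in unlabeled]
```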
Cluster Analysis
Clustering analyzes data objects without consulting a known
class label.
Example: Cluster analysis can be performed on All Electronics
customer data in order to identify homogeneous
subpopulations of customers. These clusters may represent
individual target groups for marketing. The figure on the next
slide shows a 2-D plot of customers with respect to customer
locations in a city.
[Figure: Cluster Analysis, a 2-D plot of customer locations in a city]
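One common way to group 2-D customer locations is k-means, sketched below. The slides do not name a specific clustering algorithm, so k-means is an assumption here, as are the coordinates and the choice of starting centroids.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignment, centroids

# Hypothetical customer locations (x, y) in a city.
locations = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = k_means(locations, centroids=[locations[0], locations[3]])
```

No class labels are supplied: the two groups emerge only from the distances between points.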
Outlier Analysis
Outlier Analysis: A database may contain data objects that
do not comply with the general behavior or model of the
data. These data objects are outliers.
Example: Detecting fraudulent usage of credit cards.
Outlier analysis may uncover fraudulent usage of credit
cards by detecting purchases of extremely large amounts for
a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be
detected with respect to the location and type of purchase or
the purchase frequency.
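A small sketch of flagging unusually large charges on one account. It uses a modified z-score based on the median absolute deviation, which is one common robust choice and an assumption here (the slides do not prescribe a method); the 0.6745 constant and 3.5 cutoff follow the usual convention for that score, and the charge amounts are made up.

```python
from statistics import median

def flag_outliers(amounts, threshold=3.5):
    """Flag amounts whose modified z-score (based on the median absolute
    deviation) exceeds the threshold."""
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        return []  # no spread to measure against
    return [a for a in amounts if 0.6745 * abs(a - center) / mad > threshold]

# Hypothetical charges for one account.
charges = [25.0, 40.0, 32.0, 28.0, 35.0, 30.0, 2500.0]
suspicious = flag_outliers(charges)
```

The median is used instead of the mean because a single very large charge would otherwise inflate the mean and standard deviation and mask itself.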
Evolution Analysis
Evolution Analysis: Data evolution analysis describes and models
regularities or trends for objects whose behavior changes over
time.
Example: Time-series data. Suppose stock market data (time series) for
the last several years are available from the New York Stock Exchange,
and one would like to invest in shares of high-tech industrial
companies. A data mining study of stock exchange data may
identify stock evolution regularities for overall stocks and for the
stocks of particular companies. Such regularities may help predict
future trends in stock market prices, contributing to one’s
decision-making regarding stock investments.
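A simple moving average is one basic way to expose a trend in a time series. The sketch below is illustrative only: the prices are made up, and comparing the first and last smoothed values is a deliberately crude trend test, not a forecasting method.

```python
def moving_average(values, window):
    """Simple moving average; returns one value per full window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def trend(values, window):
    """Label the series 'up', 'down', or 'flat' by comparing the first
    and last moving averages."""
    smoothed = moving_average(values, window)
    if len(smoothed) < 2 or smoothed[-1] == smoothed[0]:
        return "flat"
    return "up" if smoothed[-1] > smoothed[0] else "down"

# Hypothetical closing prices, not real market data.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 15.0]
smoothed = moving_average(prices, window=3)
direction = trend(prices, window=3)
```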
Applications of Data Mining
Web page analysis: from web page classification, clustering
to PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing (used by
retailers to increase sales by better understanding customer
purchasing patterns)
Biological and medical data analysis: classification, cluster
analysis (microarray data analysis), biological sequence
analysis, biological network analysis
Data mining and software engineering
From major dedicated data mining systems/tools (e.g., SAS,
MS SQL-Server Analysis Manager, Oracle Data Mining Tools)
to invisible data mining
Major Issues in Data Mining
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (continued)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining