Data Mining
Md Tabrez Nafis
Department of Computer Science & Engineering
JAMIA HAMDARD, New Delhi
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
Evolution of Sciences
Before 1600, empirical science
1600-1950s, theoretical science
Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
1950s-1990s, computational science
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
1990-now, data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing Grid that makes all these archives universally accessible
Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Knowledge Discovery (KDD) Process
[Figure: the KDD process, from bottom to top: Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation]
Data mining is the core of the knowledge discovery process.
Data Mining and Business Intelligence
Layers, from the top (highest potential to support business decisions) to the bottom:
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as
terabytes
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, social networks
Heterogeneous databases
Spatial, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Database Processing vs. Data Mining
          Database Processing             Data Mining
Query     Well defined; SQL               Poorly defined; no precise query language
Data      Operational data                Not operational data
Output    Precise; subset of database     Fuzzy; not a subset of database
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than Rs. 10,000 in the last month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
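The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the customer names and baskets are hypothetical toy data, not taken from any real system. The first query has a precise condition and returns a subset of the stored rows; the second summarizes what tends to be bought with milk, which is a pattern rather than a row.

```python
from collections import Counter

# Hypothetical toy data: each transaction is (customer, set of items bought).
transactions = [
    ("asha", {"milk", "bread", "butter"}),
    ("ravi", {"milk", "bread"}),
    ("meena", {"tea", "sugar"}),
    ("john", {"milk", "bread", "eggs"}),
]

# Database-style query: a precise condition; the answer is a subset of the rows.
milk_buyers = [customer for customer, items in transactions if "milk" in items]

# Data-mining-style query: count what co-occurs with milk; the answer is a
# summary pattern that is not itself a row of the data.
bought_with_milk = Counter()
for _, items in transactions:
    if "milk" in items:
        bought_with_milk.update(items - {"milk"})

print(milk_buyers)
print(bought_with_milk.most_common(1))
```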
Architecture of Data Mining System
[Figure: architecture of a data mining system, with a callout for each component]
User interface: communicates between users and the data mining system. Visualizes results or performs exploration on data and schemas.
Pattern evaluation: tests for interestingness of a pattern.
Mining engine: performs functionalities like characterization, association, classification, prediction, etc.
Data server: is responsible for fetching relevant data based on user request.
Data sources: this is usually the source of data. The data may require cleaning and integration.
Domain knowledge: this is the information of the domain we are mining, like concept hierarchies, used to organize attributes onto various levels of abstraction. Also contains user beliefs, which can be used to assess the interestingness of a pattern, or thresholds.
Basic Data Mining Tasks
Classification maps data into predefined
groups or classes
Supervised learning
Prediction
Regression
Clustering groups similar data together into
clusters.
Unsupervised learning
Segmentation
Partitioning
Basic Data Mining Tasks (cont’d)
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining Functionalities
Multidimensional concept description: Characterization and
discrimination
Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Bread => Butter [0.5%, 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
Predict some unknown or missing numerical values
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera --> large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
Why Data Mining?—Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and Management
Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis—Find associations/co-relations between product sales,
& predict based on such association
Customer profiling—What types of customers buy what products (clustering
or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
Ex. 2: Corporate Analysis & Risk Management
Finance planning and asset evaluation
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend
analysis, etc.)
Resource planning
summarize and compare the resources and spending
Competition
monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market
Ex. 3: Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Auto insurance: ring of collisions
Money laundering: suspicious monetary transactions
Medical insurance
Professional patients, ring of doctors, and ring of references
Unnecessary or correlated screening tests
Telecommunications: phone-call fraud
Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest
employees
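The telecommunications item above talks about patterns that deviate from an expected norm. One simple way to make that concrete, offered here only as a sketch, is a z-score test on call durations: flag any call that lies more than a chosen number of standard deviations from the mean. The durations and the threshold of 2.0 are hypothetical; real fraud models would use more features (destination, time of day or week) and a more robust method.

```python
from statistics import mean, pstdev

def deviating_calls(durations, threshold=2.0):
    """Return durations more than `threshold` standard deviations from the mean."""
    mu = mean(durations)
    sigma = pstdev(durations)
    if sigma == 0:
        # All calls are identical, so nothing deviates.
        return []
    return [d for d in durations if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one subscriber.
calls = [3, 4, 5, 4, 3, 5, 4, 4, 3, 60]
suspicious = deviating_calls(calls)
print(suspicious)
```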
Mining for Knowledge
Knowledge in the form of rules
If <condition_1>&<condition_2>& …&<condition_n> Then
<conclusion>
Types of knowledge
Association
Presence of one set of items/attributes implies presence of
another set.
Classification
Given examples of objects belonging to different groups,
develop profile of each group in terms of attributes of the
objects.
Clustering.
Unsupervised grouping of similar records based on attributes.
Prediction (temporal and spatial).
Historical records collected at fixed period of time.
Mining Association Rules
The presence of one set of items in a transaction
implies the presence of another set of items
30% of people who buy bread also buy butter.
The presence of an attribute value in a record
implies the presence of another
60% of patients with these symptoms also have that
symptom.
Data Mining Functionalities:
Mining Frequent Patterns
Frequent patterns are the patterns that occur
frequently in the data. Patterns can include
itemsets, sequences and subsequences.
A frequent itemset refers to a set of items that
often appear together in a transactional data set.
ex: bread and milk
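A frequent itemset can be found by counting, for each candidate set of items, the fraction of transactions that contain it. The sketch below does this by brute force for itemsets of size 1 and 2. The baskets and the 50% minimum support are hypothetical, and a practical miner would use an algorithm such as Apriori or FP-growth instead of enumerating every candidate.

```python
from itertools import combinations

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support >= min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            # Support = fraction of baskets containing every item of the candidate.
            count = sum(1 for b in baskets if set(candidate) <= b)
            support = count / len(baskets)
            if support >= min_support:
                result[frozenset(candidate)] = support
    return result

frequent = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in frequent.items():
    print(sorted(itemset), support)
```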
Data Mining Functionalities:
Mining Frequent Patterns
Association Rules
Single-Dimension Association Rule
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
Confidence of 50%: if a customer buys a computer, there is a 50% chance that he will buy software as well.
Support of 1%: 1% of all the transactions under analysis show that computer and software are purchased together.
Multi-Dimension Association Rule
age(X, “20..29”) ^ income(X, “40K..49K”) => buys(X, “laptop”) [support = 2%, confidence = 60%]
Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
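Support and confidence for a rule A => C can be computed directly from transaction counts: support is the fraction of transactions containing both A and C, and confidence is that support divided by the support of A. The sketch below uses 100 hypothetical transactions chosen so that the computer/software rule comes out at 1% support and 50% confidence; the threshold values are also assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical data: 100 transactions, 2 with a computer, 1 of those with software.
transactions = (
    [{"computer", "software"}] * 1
    + [{"computer"}] * 1
    + [{"printer"}] * 98
)

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})

# Hypothetical thresholds: a rule is kept only if it meets both.
MIN_SUPPORT, MIN_CONFIDENCE = 0.01, 0.4
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, is_interesting)
```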
An Example Association Rule
Mobile Telecom Data
Provided by a telecom company.
Over 200 relational tables and transactional data
of over 30,000 records.
Examples of discovered association rules
60% of those who call from New Delhi call Mumbai.
77% of those whose average call duration is greater
than 5 minutes make an average of over 80
phone calls per month.
Data Mining Functionalities:
Classification and Prediction
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts. The model is derived based on the
analysis of a set of training data and is used to predict the class label of objects.
Representation of Derived model
IF-THEN Rules
Decision Tree
Neural Network
Data Mining Functionalities:
Classification and Prediction
Prediction models continuous-valued functions, i.e., it is used to predict missing or
unavailable numeric data values rather than class labels.
The term prediction can refer to both numeric prediction and class label prediction.
Regression analysis is a statistical method used for numeric prediction.
Classification and regression may need to be preceded by relevance analysis,
which attempts to identify attributes that are significantly relevant to the
classification and regression process. Such attributes will be selected for the
classification and regression process. Other attributes, which are irrelevant, can
then be excluded from consideration
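Numeric prediction by regression can be illustrated with a least-squares straight line. This is a minimal sketch, not the only form of regression: it fits y = slope * x + intercept to a handful of hypothetical points (years of experience against salary in thousands) and then predicts a value that is missing from the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: years of experience -> salary (in thousands).
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 5 + intercept  # predict the unknown value for 5 years
print(slope, intercept, predicted)
```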
Mining Classification Rules
[Figure: patient records (symptoms, diseases) labeled Recovered or Never Recovered; for a new record the question is: Recover? Not recover?]
An Example of Classification
Credit card data
Each transaction contains transaction date, amount, and a set of
items purchased, etc.
Each customer record contains gender, age, education
background, etc.
Example of rules discovered:
IF use of card >= 9 months continuously & no. of transaction <= 2
THEN Cash Advance = Yes.
Actionable item:
Promote credit services to potential customers who require cash
advances.
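Once a rule like the one above has been discovered, applying it to new records is straightforward. The sketch below encodes the rule as a function; the field names and the three sample customers are hypothetical, since the slide names only the two conditions.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    # Hypothetical field names for the two attributes used by the rule.
    months_of_continuous_use: int
    num_transactions: int

def needs_cash_advance(c: Customer) -> bool:
    """IF use of card >= 9 months AND no. of transactions <= 2 THEN Cash Advance = Yes."""
    return c.months_of_continuous_use >= 9 and c.num_transactions <= 2

customers = [Customer(12, 1), Customer(12, 8), Customer(3, 1)]
labels = ["Yes" if needs_cash_advance(c) else "No" for c in customers]
print(labels)
```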
Data Mining Functionalities:
Cluster Analysis
Clustering analyzes data objects without consulting
class labels.
Clustering can be used to generate class labels for
a group of data that had none at the
beginning.
The objects are clustered or grouped based on the
principle of maximizing the intra-class similarity and
minimizing the inter-class similarity.
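One common way to group objects so that members of a cluster are close to each other and far from other clusters is k-means. The slides do not name a specific algorithm, so k-means is an assumption here, and the sketch is deliberately small: one-dimensional values, fixed starting centroids, and hypothetical monthly spending figures.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on numbers: assign to nearest centroid, then recompute means."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spending of six customers (no class labels given).
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(centroids)
print(clusters)
```

The cluster index each value lands in can then serve as a generated class label.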
Discovering Clusters
Dividing them up into groups according to similarity
Classification ≠ Clustering
Classification
What is the difference between Good & Bad customers?
[Figure: Good Customers vs. Bad Customers]
Clustering
How can I group the customers?
Discovering Sequential Patterns
People who have purchased a VCR are three
times more likely to purchase a camcorder
two to four months after the purchase.
If the price of Stock A increases by more than
10% and the price of Stock B decreases by
less than 2% today, then the price of Stock C
will increase by 5% two days later.
An Example of Sequential Pattern
Mining
Electricity consumption data:
A set of time series each associated with an
industrial user.
Each time series represents an electricity load
profile of a user at a certain premise.
Reading of electricity load taken every 30 min.
The Goal
Identify companies with similar electricity load
profiles using data mining.
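Finding companies with similar load profiles needs some measure of how far apart two time series are. The slides do not say which measure was used; the sketch below assumes plain Euclidean distance between profiles of equal length. The user names and readings are hypothetical, and a real day of 30-minute readings would have 48 values per profile rather than 4.

```python
import math

def profile_distance(a, b):
    """Euclidean distance between two equally long load profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of readings")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(name, profiles):
    """Name of the other user whose profile is closest to `name`'s profile."""
    others = (other for other in profiles if other != name)
    return min(others, key=lambda other: profile_distance(profiles[name], profiles[other]))

# Hypothetical load profiles (one reading per period).
load_profiles = {
    "factory_a": [10, 50, 50, 10],
    "factory_b": [12, 48, 52, 9],
    "bakery": [60, 10, 5, 40],
}

closest = most_similar("factory_a", load_profiles)
print(closest)
```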
Web Log Mining
Web Servers register a log entry for every single
access they get.
A huge number of accesses (hits) are registered and
collected in an ever-growing web log.
Web log mining:
Understand general access patterns and trends.
Better structure and grouping of resource providers.
Adaptive Sites -- Web site restructures itself automatically.
Personalization.
Target customers for electronic commerce
Identify potential prime advertisement locations
An Example of Web Log Mining
Given a web access log file
Provided by an airline company.
The Goal
Analyze user access patterns
e.g. Page A --> Page B --> Page C --> …
Predict which page the viewer will arrive at after accessing certain URLs.
Results:
IF Page = Destination Information & Next Page = Flight
Schedules THEN Next Page = XxxAir Travel Packages
IF Day of week = Wed. & Time = Non-office hour
THEN duration = long
Actionable Items
Golden time for advertisements is on Wed. during non-office
hours.
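A next-page rule of the kind shown in the results can be derived by counting which page follows each pair of consecutive pages in the log. The sketch below is an illustration under assumptions: the sessions are hypothetical, already split per visitor, and the page names only loosely follow the slide.

```python
from collections import Counter, defaultdict

# Hypothetical sessions reconstructed from a web access log.
sessions = [
    ["Destination Information", "Flight Schedules", "Travel Packages"],
    ["Destination Information", "Flight Schedules", "Travel Packages"],
    ["Destination Information", "Flight Schedules", "Contact Us"],
    ["Home", "Flight Schedules", "Booking"],
]

def next_page_counts(sessions, context_length=2):
    """Count which page follows each run of `context_length` consecutive pages."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for i in range(len(pages) - context_length):
            context = tuple(pages[i:i + context_length])
            counts[context][pages[i + context_length]] += 1
    return counts

counts = next_page_counts(sessions)
context = ("Destination Information", "Flight Schedules")
best_next, hits = counts[context].most_common(1)[0]
rule_confidence = hits / sum(counts[context].values())
print(best_next, rule_confidence)
```

Here the most frequent follower of the two-page context becomes the THEN part of the rule, and the fraction of sessions that agree with it acts as the rule's confidence.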
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Find useful features, dimensionality/variable reduction, invariant
representation
Choosing functions of data mining
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: Not all of them
are interesting
Suggested approach: Human-centered, query-based, focused mining
Requirements and Challenges
Variety of data types.
Noisy and incomplete data
The interestingness problem.
Different kinds of knowledge.
Different levels of abstraction.
Expression and visualization of data mining
results.
Efficiency and scalability of data mining
algorithms.
Thank You