
Data Mining (Introduction)

Data mining is the automated analysis of massive data sets to extract useful patterns and knowledge, addressing the challenge of overwhelming data growth. It involves processes such as data cleaning, integration, selection, transformation, mining, evaluation, and presentation, forming the core of the knowledge discovery process (KDD). Applications span various fields including business intelligence, web analysis, and medical data analysis, with ongoing challenges related to methodology, efficiency, and societal impacts.

Uploaded by

vishalmishra622

Data Mining

Why Data Mining?


 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems,
Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific
simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
What Is Data Mining?
Data mining is the process of sorting through large
amounts of data and picking out relevant
information.

In other words,
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge from
huge amounts of data
 Other names
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
Some Definitions
 Data: Data are any facts, numbers, or text that
can be processed by a computer.
 operational or transactional data, such as sales,
cost, inventory, payroll, and accounting
 nonoperational data, such as industry sales, forecast
data, and macroeconomic data
 metadata: data about the data itself, such as
logical database design or data dictionary definitions

 Information: The patterns, associations, or
relationships among all this data can provide
information.
Definitions Continued..
 Knowledge: Information can be converted into knowledge about
historical patterns and future trends. For example, summary
information on retail supermarket sales can be analyzed in
terms of promotional efforts to provide knowledge of consumer
buying behavior. Thus, a manufacturer or retailer could
determine which items are most susceptible to promotional
efforts.

 Data Warehouses: Data warehousing is defined as a process of
centralized data management and retrieval.
Data Warehouse Example
[Figure: example of a data warehouse]
Data Rich, Information Poor
A Web Mining Framework
 Web mining usually involves
 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored in a
knowledge base
Data Mining in Business Intelligence
[Figure: layered pyramid. The potential to support business decisions
increases from the bottom layer to the top. Layers from top to bottom,
with the typical role where the slide names one:]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments,
Database Systems
Data Mining Process: Knowledge Discovery from Data
The KDD process includes:

 Data cleaning (to remove noise and inconsistent data)

 Data integration (where multiple data sources may be combined)

 Data selection (where data relevant to the analysis task are retrieved
from the database)

 Data transformation (where data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations)
KDD continued….
 Data mining (an essential process where intelligent
methods are applied in order to extract data
patterns)

 Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)

 Knowledge presentation (where visualization and
knowledge representation techniques are used to
present the mined knowledge to the user)

Data mining is the core of the knowledge discovery process.
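The KDD steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch, not a method from the slides: the record layout, the two sources (`store_sales`, `web_sales`), and every function name are assumptions made up for the example, and the "mining" step is a simple pair count.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_sales = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c2", "item": "software", "amount": None},  # noisy record
    {"customer": "c1", "item": "software", "amount": 120.0},
]
web_sales = [
    {"customer": "c3", "item": "computer", "amount": 950.0},
    {"customer": "c3", "item": "software", "amount": 80.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Data transformation: aggregate into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)] += 1
    return pairs

def evaluate(pairs, min_count):
    """Pattern evaluation: keep pairs that meet a minimum count."""
    return {pair: n for pair, n in pairs.items() if n >= min_count}

records = select(
    integrate(clean(store_sales), clean(web_sales)),
    {"computer", "software"},
)
patterns = evaluate(mine(transform(records)), min_count=2)
print(patterns)  # knowledge presentation, in its simplest form
```

Each function corresponds to one step in the list, so a step can be swapped out (for example, a different interestingness measure in `evaluate`) without touching the others.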


Knowledge Discovery (KDD) Process
 Data mining: the core of the knowledge discovery process
[Figure: KDD flow, from bottom to top: Databases, Data Cleaning and
Data Integration, Data Warehouse, Data Selection, Task-relevant Data,
Data Mining, Pattern Evaluation]
KDD Process: A Typical View from ML and Statistics

Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing

 Data Pre-Processing: data integration, normalization, feature
selection, dimension reduction
 Data Mining: pattern discovery, association & correlation,
classification, clustering, outlier analysis, …
 Post-Processing: pattern evaluation, pattern selection, pattern
interpretation, pattern visualization

 This is a view from typical machine learning and statistics communities


Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology,
Statistics, Machine Learning, Visualization, Pattern Recognition,
Algorithm, and Other Disciplines]
Data Mining: On What Kinds of Data?
 Database-oriented data sets and applications

 Relational database, data warehouse, transactional database


 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl.
bio-sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data
 Multimedia database
 Text databases
 The World-Wide Web
Why Confluence of Multiple Disciplines?
 Tremendous amount of data
 Algorithms must be highly scalable to handle data volumes
such as terabytes
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Functionalities/Techniques:

 Concept/Class Description: Characterization and
Discrimination
 Mining Frequent Patterns, Associations and
Correlations
 Classification and Prediction
 Cluster Analysis
 Outlier Analysis
 Evolution Analysis
Concept/Class Description: Characterization and Discrimination
 Data Characterization: A summary of the general characteristics
of a target class of data. For example, a data mining system should
be able to produce a description summarizing the characteristics of
a chosen group of customers.

 Example: The characteristics of customers who spend
more than $1000 a year at a store called All
Electronics. The result can be a general profile covering
attributes such as age, employment status, or credit rating.
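As a rough sketch of what such a profile could look like in code, the snippet below selects the target class (customers spending more than $1000 a year) and summarizes it. The customer records, field names, and the choice of summary statistics are all hypothetical and only illustrate the idea.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; every value here is made up.
customers = [
    {"age": 34, "employment": "employed", "credit": "good", "annual_spend": 1500},
    {"age": 46, "employment": "employed", "credit": "excellent", "annual_spend": 2200},
    {"age": 40, "employment": "self-employed", "credit": "good", "annual_spend": 1100},
    {"age": 22, "employment": "student", "credit": "fair", "annual_spend": 300},
]

# Target class: customers who spend more than $1000 a year.
target = [c for c in customers if c["annual_spend"] > 1000]

# Characterization: a general profile of the target class.
profile = {
    "count": len(target),
    "mean_age": mean(c["age"] for c in target),
    "most_common_employment": Counter(c["employment"] for c in target).most_common(1)[0][0],
    "most_common_credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```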
Characterization and Discrimination (continued)
 Data Discrimination: A comparison of the general
features of the target class data objects with the general
features of objects from one or a set of contrasting classes.
The user can specify the target and contrasting classes.
 Example: The user may like to compare the general
features of software products whose sales increased by
10% in the last year with those whose sales decreased by
about 30% over the same period.
Mining Frequent Patterns, Associations and Correlations
Frequent Patterns: as the name suggests, patterns that occur
frequently in data.
Association Analysis: from a marketing perspective, determining
which items are frequently purchased together within the same
transaction.

Example: A rule mined from the transactional database of a store
called All Electronics:

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%,
confidence = 50%]
 X represents a customer.
 Confidence = 50% means that if a customer buys a computer, there
is a 50% chance that he/she will buy software as well.
 Support = 1% means that 1% of all the transactions under
analysis showed that computer and software were purchased
together.
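The two measures can be computed directly from a list of transactions. The sketch below uses four made-up transactions (not the store's data, so the resulting numbers differ from the 1% and 50% in the rule above); the function names `support` and `confidence` are chosen for this example.

```python
# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least
    once; otherwise this divides by zero."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

Here one of the four transactions contains both items, and one of the two transactions containing a computer also contains software.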
Mining Frequent Patterns, Associations and Correlations
 Another example:

age(X, “20…29”) ∧ income(X, “20K-29K”) ⇒ buys(X, “CD
player”) [support = 2%, confidence = 60%]

 For customers between 20 and 29 years of age with an income of
$20,000 to $29,000, there is a 60% chance that they will purchase
a CD player, and 2% of all the transactions under analysis
showed customers in this age group and income range
buying a CD player.
Classification and Prediction
 Classification is the process of finding a model that describes
and distinguishes data classes or concepts, so that the model
can be used to predict the class of objects
whose class label is unknown.

 A classification model can be represented in various forms, such
as
 IF-THEN rules
 A decision tree
 A neural network
Classification Model
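To make the IF-THEN form concrete, here is a minimal sketch of a rule-based classifier. The rules, attribute names, and class labels are hypothetical and hand-written for illustration; in practice a classification model is learned from data whose class labels are known.

```python
# Each rule is (condition, class label); rules are tried in order and
# the first matching rule decides the class. All rules are made up.
RULES = [
    (lambda c: c["age"] <= 29 and c["income"] >= 20000, "buys_cd_player"),
    (lambda c: c["age"] > 60, "does_not_buy"),
]
DEFAULT_LABEL = "unknown"  # used when no rule fires

def classify(customer):
    """Predict a class label for a customer whose label is unknown."""
    for condition, label in RULES:
        if condition(customer):
            return label
    return DEFAULT_LABEL

print(classify({"age": 25, "income": 25000}))
print(classify({"age": 70, "income": 10000}))
print(classify({"age": 40, "income": 50000}))
```

A decision tree expresses the same kind of logic as nested tests, with one path from the root to a leaf per rule.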
Cluster Analysis

 Clustering analyzes data objects without consulting a known
class label.

 Example: Cluster analysis can be performed on All Electronics
customer data in order to identify homogeneous
subpopulations of customers. These clusters may represent
individual target groups for marketing. The figure on the next
slide shows a 2-D plot of customers with respect to customer
locations in a city.
Cluster Analysis
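One common clustering technique is k-means, shown here as a minimal sketch on 2-D points standing in for customer locations. The slides do not name a specific algorithm, so k-means is an assumption, as are the coordinates and the simplification of starting from the first k points (real implementations usually pick starting centroids more carefully).

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points, starting from the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customer locations in a city (x, y).
locations = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(locations, k=2)
print(clusters)
```

No class labels are supplied; the two groups emerge only from the distances between points.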
Outlier Analysis
 Outlier Analysis: A database may contain data objects that
do not comply with the general behavior or model of the
data. These data objects are outliers.

 Example: Detecting fraudulent usage of credit cards.
Outlier analysis may uncover fraudulent usage of credit
cards by detecting purchases of extremely large amounts for
a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be
detected with respect to the location and type of purchase or
the purchase frequency.
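A minimal sketch of the amount-based check described above, assuming one account's charges are available as a list. It uses the modified z-score (based on the median and the median absolute deviation) with the conventional 3.5 cutoff; this is one possible technique chosen for the example, not one prescribed by the slides, and the charges are made up.

```python
from statistics import median

def find_outliers(amounts, threshold=3.5):
    """Flag amounts far from the account's median charge, using the
    modified z-score: 0.6745 * |x - median| / MAD."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        return []  # charges are too uniform to score this way
    return [a for a in amounts if 0.6745 * abs(a - center) / mad > threshold]

# Hypothetical charges for a single account number.
charges = [42.0, 55.0, 38.0, 61.0, 47.0, 2500.0]
print(find_outliers(charges))
```

The median is used instead of the mean so that a single extreme charge does not shift the baseline it is compared against.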
Evolution Analysis
 Evolution Analysis: Data evolution analysis describes and models
regularities or trends for objects whose behavior changes over
time.

 Example: Time-series data. Suppose stock market data (time series)
for the last several years are available from the New York Stock
Exchange and one would like to invest in shares of high-tech industrial
companies. A data mining study of stock exchange data may
identify stock evolution regularities for overall stocks and for the
stocks of particular companies. Such regularities may help predict
future trends in stock market prices, contributing to one’s
decision-making regarding stock investments.
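As a very small illustration of describing a trend in time-series data, the sketch below smooths a price series with a simple moving average. The prices are made up, the window size is arbitrary, and a moving average only summarizes past behavior; it is not a forecasting model.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical closing prices for one stock on consecutive days.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
smoothed = moving_average(prices, window=3)

# A crude trend description: compare the last smoothed value with the first.
trend = "upward" if smoothed[-1] > smoothed[0] else "flat or downward"
print(smoothed, trend)
```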
Applications of Data Mining
 Web page analysis: from web page classification, clustering
to PageRank & HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis for targeted marketing (used by
retailers to increase sales by better understanding customer
purchasing patterns)
 Biological and medical data analysis: classification, cluster
analysis (microarray data analysis), biological sequence
analysis, biological network analysis
 Data mining and software engineering
 From major dedicated data mining systems/tools (e.g., SAS,
MS SQL-Server Analysis Manager, Oracle Data Mining Tools)
to invisible data mining
Major Issues in Data Mining
 Mining Methodology
 Mining various and new kinds of knowledge
 Mining knowledge in multi-dimensional space
 Data mining: An interdisciplinary effort
 Boosting the power of discovery in a networked environment
 Handling noise, uncertainty, and incompleteness of data
 Pattern evaluation and pattern- or constraint-guided mining
 User Interaction
 Interactive mining
 Incorporation of background knowledge
 Presentation and visualization of data mining results
Continued……
 Efficiency and Scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Diversity of data types
 Handling complex types of data
 Mining dynamic, networked, and global data repositories
 Data mining and society
 Social impacts of data mining
 Privacy-preserving data mining
 Invisible data mining
