Datamining & Warehousing
Unit 1 – Part 2
Dr.VIDHYA B
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Agenda
■ Technologies Used
■ Kinds of Applications
■ Major Issues in Data Mining
■ Summary
Sri Ramakrishna College of Arts & Science
Data Mining - Technologies Used
[Figure: Data mining at the center, drawing on statistics, machine learning,
pattern recognition, visualization, algorithms, high-performance computing,
applications, information retrieval, data warehouses, and database systems.]
Data Mining - Technologies Used
1. Statistics: the collection, analysis, interpretation or
explanation, and presentation of data.
■ A statistical model is a set of mathematical functions that describe
the behavior of the objects in a target class in terms of random
variables and their associated probability distributions.
■ Statistics research develops tools for prediction and forecasting
using data and statistical models. Statistical methods can be used
to summarize or describe a collection of data.
■ A statistical hypothesis test (sometimes called confirmatory data
analysis) makes statistical decisions using experimental data.
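As a small illustration of summarizing data and testing a hypothesis, the Python sketch below describes two samples and runs a permutation test on the difference of their means. The sample values, the function name `permutation_test`, and the choice of a permutation test are assumptions made for illustration; the slides do not prescribe a particular test.

```python
import random
import statistics

def permutation_test(a, b, n_rounds=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_rounds):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            extreme += 1
    return extreme / n_rounds

# Hypothetical daily sales figures for two store layouts.
layout_a = [12, 15, 14, 10, 13, 16, 11, 14]
layout_b = [18, 21, 19, 17, 22, 20, 18, 19]

# Descriptive summary: (mean, sample standard deviation) per group.
summary_a = (statistics.mean(layout_a), statistics.stdev(layout_a))
summary_b = (statistics.mean(layout_b), statistics.stdev(layout_b))
p_value = permutation_test(layout_a, layout_b)
print(summary_a, summary_b, p_value)
```

A small p-value suggests that the observed difference is unlikely to arise by chance alone under the assumption that both layouts perform the same.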
Data Mining - Technologies Used
2. Machine learning is a technique for computer programs to
automatically learn to recognize complex patterns and make
intelligent decisions based on data.
Data Mining - Technologies Used
■ Supervised learning is essentially a synonym for classification. The
supervision comes from the labeled examples in the training data set.
■ Unsupervised learning is essentially a synonym for clustering. The learning
process is unsupervised since the input examples are not class labeled;
clustering is used to discover classes within the data.
■ Semi-supervised learning is a class of machine learning techniques that
make use of both labeled and unlabeled examples when learning a model.
■ Active learning is a machine learning approach that lets users play an
active role in the learning process. The goal is to optimize the model quality
by actively acquiring knowledge from human users, given a constraint on
how many examples they can be asked to label.
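The contrast between supervised and unsupervised learning can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the nearest-centroid classifier, the naive k-means initialization, and all data values are chosen for illustration and are not taken from the slides.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def train_classifier(labeled):
    """Supervised: compute one centroid per class from labeled examples."""
    by_label = {}
    for features, label in labeled:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def classify(model, point):
    """Predict the label whose centroid is closest to point."""
    labels = list(model)
    return labels[nearest(point, [model[l] for l in labels])]

def k_means(points, k, n_iter=10):
    """Unsupervised: group unlabeled points into k clusters."""
    centers = list(points[:k])  # naive initialization: the first k points
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return [nearest(p, centers) for p in points]

# Hypothetical labeled training data for the supervised case.
labeled = [((1.0, 1.2), "low"), ((0.8, 1.0), "low"),
           ((5.0, 5.1), "high"), ((5.2, 4.9), "high")]
model = train_classifier(labeled)
prediction = classify(model, (4.8, 5.0))

# The same kind of data without labels for the unsupervised case.
unlabeled = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (5.2, 4.9)]
clusters = k_means(unlabeled, k=2)
print(prediction, clusters)
```

The classifier needs the class labels to learn; k-means receives only the feature values and discovers the two groups on its own.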
Data Mining - Technologies Used
■ For classification and clustering tasks, machine learning research
often focuses on the accuracy of the model.
■ In addition to accuracy, data mining research places strong emphasis
on the efficiency and scalability of mining methods on large data sets.
■ Data mining research also emphasizes ways to handle complex types of
data and explores new, alternative methods.
Data Mining - Technologies Used
3. Database Systems and Data Warehouses:
■ Database systems research focuses on the creation, maintenance, and use of
databases for organizations and end-users.
■ A data warehouse integrates data originating from multiple sources and various
timeframes. It consolidates data in multidimensional space to form partially
materialized data cubes.
■ The data cube model not only facilitates OLAP in multidimensional databases
but also promotes multidimensional data mining.
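To make the data cube idea concrete, the sketch below aggregates a tiny fact table over every subset of its dimensions. The table contents, the dimension names, and the helper names `cuboid` and `full_cube` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (city, item, quarter, sales amount).
FACTS = [
    ("Coimbatore", "phone", "Q1", 100),
    ("Coimbatore", "laptop", "Q1", 250),
    ("Chennai", "phone", "Q1", 120),
    ("Chennai", "phone", "Q2", 80),
    ("Coimbatore", "phone", "Q2", 60),
]
DIMENSIONS = ("city", "item", "quarter")

def cuboid(facts, dims):
    """Aggregate the sales measure over the chosen subset of dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)

def full_cube(facts):
    """Materialize every cuboid, from the apex (no dimensions) to the base."""
    return {
        dims: cuboid(facts, dims)
        for r in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, r)
    }

cube = full_cube(FACTS)
print(cube[()])                   # apex cuboid: total sales
print(cube[("city",)])            # roll-up to the city level
print(cube[("city", "quarter")])  # sales by city and quarter
```

This sketch materializes all cuboids because the table is tiny. A real warehouse would typically materialize only some of them, which is what "partially materialized" refers to.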
Data Mining - Technologies Used
4. Information retrieval (IR):
■ It is the science of searching for documents or for information in documents.
■ Documents can be text or multimedia, and may reside on the Web.
■ Differences between traditional information retrieval and database systems:
(1) the data under search are unstructured;
(2) the queries are formed mainly by keywords, which do not have complex structures.
■ Digital libraries, digital governments, and health care information systems hold
huge amounts of data; their effective search and analysis have raised many
challenging issues in data mining.
■ Hence text mining and multimedia data mining, integrated with information
retrieval methods, have become increasingly important.
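Keyword search over unstructured text can be sketched with an inverted index and TF-IDF scoring. The documents, the whitespace tokenizer, and the scoring formula (term frequency times the logarithm of inverse document frequency) are assumptions chosen for illustration; they are one common textbook variant, not something the slides specify.

```python
import math
from collections import Counter, defaultdict

# Hypothetical unstructured documents.
DOCS = {
    "d1": "data mining finds patterns in large data sets",
    "d2": "information retrieval searches documents by keywords",
    "d3": "text mining combines data mining with information retrieval",
}

def build_index(docs):
    """Inverted index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Rank documents by the sum of TF-IDF weights of the query keywords."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # keyword does not occur in any document
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

INDEX = build_index(DOCS)
results = search(INDEX, len(DOCS), "information retrieval keywords")
print(results)
```

The query is just a bag of keywords, in contrast to a structured database query, and rarer keywords contribute more to a document's score.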
Applications of Data Mining
■ Data mining has seen great successes in many applications.
■ To demonstrate the importance of applications as a major dimension in data
mining research and development, two highly successful and popular
application examples of data mining are discussed.
Applications of Data Mining
1. Business Intelligence:
■ Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations.
Examples:
■ Reporting
■ Online analytical processing
■ Business performance management
■ Competitive intelligence
■ Benchmarking
■ Data mining helps businesses perform effective market analysis, compare
customer feedback on similar products, discover the strengths and weaknesses
of their competitors, retain highly valuable customers, and make smart
business decisions.
■ Online analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining.
Applications of Data Mining
■ Classification and prediction techniques are the core of
predictive analytics in business intelligence.
■ Clustering plays a central role in customer relationship
management: it groups customers based on their similarities.
■ Characterization mining techniques help to understand the
features of each customer group and to develop
customized customer reward programs.
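Characterizing customer groups can be sketched as a per-group summary of feature values. The customer records, the group labels (assumed to come from an earlier clustering step), and the threshold rule in `reward_program` are all hypothetical.

```python
from statistics import mean

# Hypothetical customers: (id, group from clustering, orders per year, average spend).
CUSTOMERS = [
    ("c1", "A", 24, 80.0),
    ("c2", "A", 30, 95.0),
    ("c3", "B", 3, 20.0),
    ("c4", "B", 5, 25.0),
    ("c5", "A", 27, 110.0),
]

def characterize(customers):
    """Summarize each customer group by size and average feature values."""
    groups = {}
    for _, group, orders, spend in customers:
        groups.setdefault(group, []).append((orders, spend))
    return {
        g: {
            "size": len(rows),
            "avg_orders": mean(r[0] for r in rows),
            "avg_spend": mean(r[1] for r in rows),
        }
        for g, rows in groups.items()
    }

def reward_program(profile, spend_threshold=50.0):
    """Hypothetical rule that maps a group profile to a reward program."""
    if profile["avg_spend"] >= spend_threshold:
        return "loyalty tier"
    return "welcome-back coupon"

profiles = characterize(CUSTOMERS)
for group, profile in profiles.items():
    print(group, profile, reward_program(profile))
```

The summary describes what distinguishes each group, and the reward rule shows how such a description could feed a business decision.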
Applications of Data Mining
2. Web Search Engines:
■ A web search engine is a specialized computer server that searches for
information on the Web. The search results may contain web pages, images,
and other types of files.
■ Search engines operate algorithmically or by a mixture of algorithmic and
human input.
■ Web search engines use data mining techniques in:
■ crawling (e.g., deciding which pages should be crawled and the crawling frequencies),
■ indexing (e.g., selecting pages to be indexed and deciding to what extent the index
should be constructed), and
■ searching (e.g., deciding how pages should be ranked).
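As one possible illustration of deciding how pages should be ranked, the sketch below runs a simplified PageRank by power iteration. PageRank is a well-known link-based ranking method chosen here as an example; the slides do not name a specific ranking algorithm, and the link graph and parameter values are hypothetical.

```python
def pagerank(links, damping=0.85, n_iter=50):
    """Simplified PageRank over a dict that maps a page to its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(n_iter):
        # Every page receives a small base share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
LINKS = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
ranks = pagerank(LINKS)
ordering = sorted(ranks, key=ranks.get, reverse=True)
print(ordering)
```

Pages that receive many links from well-ranked pages end up near the top, while a page that nobody links to keeps only the base share.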
Applications of Data Mining
Challenges of Web Search Engines:
1. Web search engines have to handle a huge and ever-growing amount of data.
■ They often need computer clouds, which consist of thousands or even hundreds of
thousands of computers that collaboratively mine the huge amount of data.
2. Web search engines often have to deal with online data.
■ A search engine may be able to afford constructing a model offline on huge data sets,
for example a query classifier that assigns a search query to predefined categories
based on the query topic (e.g., whether the query "apple" refers to a fruit or to a
brand of computers).
■ Maintaining and incrementally updating a model on fast-growing data streams is a
further challenge.
3. Web search engines deal with queries that are asked only a very small
number of times.
■ Although the total number of queries asked can be huge, most of the queries may be
asked only once or a few times. Such severely skewed data are challenging for many
data mining and machine learning methods.
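The last two challenges can be illustrated together: a statistic that is updated one query at a time, and a measure of how skewed the query distribution is. The class name `QueryStats`, the normalization step, and the query stream are hypothetical.

```python
from collections import Counter

class QueryStats:
    """Keeps query counts that are updated one query at a time."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, query):
        """Incremental update: earlier queries are never rescanned."""
        self.counts[query.strip().lower()] += 1
        self.total += 1

    def singleton_fraction(self):
        """Fraction of distinct queries that were asked exactly once."""
        if not self.counts:
            return 0.0
        once = sum(1 for c in self.counts.values() if c == 1)
        return once / len(self.counts)

# Hypothetical query stream.
STREAM = ["weather", "apple", "weather", "apple pie recipe", "weather",
          "fix bike chain", "apple", "train to coimbatore"]
stats = QueryStats()
for q in STREAM:
    stats.update(q)
print(stats.total, stats.singleton_fraction())
```

Even in this tiny stream most distinct queries occur only once, which leaves very little evidence per query for a learning method to work with.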
Major issues of Data Mining
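The major issues in data mining research are partitioned into five groups:
mining methodology, user interaction, efficiency and scalability, diversity of
data types, and data mining and society.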
Major issues of Data Mining
1. Mining Methodology
■ Mining various and new kinds of knowledge:
Due to the diversity of applications, new mining tasks continue to emerge,
making data mining a dynamic and fast-growing field.
■ Mining knowledge in multidimensional space:
Interesting patterns can be searched among combinations of dimensions
(attributes) at varying levels of abstraction. Such mining is known as
(exploratory) multidimensional data mining.
■ Data mining as an interdisciplinary effort:
■ To mine data with natural language text, it makes sense to fuse data mining
methods with methods of information retrieval and natural language processing.
■ The mining of software bugs in large programs, known as bug mining,
benefits from the incorporation of software engineering knowledge into the
data mining process.
■ Boosting the power of discovery in a networked environment:
Semantic links across multiple data objects can be used to advantage. Knowledge
derived in one set of objects can be used to boost the discovery of knowledge
in a “related” or semantically linked set of objects.
■ Handling uncertainty, noise, or incompleteness of data:
Errors and noise may confuse the data mining process, leading to the derivation
of erroneous patterns.
■ Pattern evaluation and pattern- or constraint-guided mining:
Techniques are needed to assess the interestingness of discovered patterns
based on subjective measures.
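For the point above on handling noise and incompleteness, a minimal cleaning step might look as follows. Median imputation and clipping to a plausible range are assumptions chosen for illustration, as are the sample values and bounds; the slides do not prescribe a specific cleaning method.

```python
from statistics import median

def clean(values, lower, upper):
    """Fill missing values with the median and clip values outside [lower, upper]."""
    known = [v for v in values if v is not None]
    fill = median(known)
    cleaned = []
    for v in values:
        if v is None:
            v = fill  # incompleteness: replace the missing entry
        cleaned.append(min(max(v, lower), upper))  # noise: limit extreme entries
    return cleaned

# Hypothetical ages with one missing value and one obvious entry error.
ages = [23, None, 31, 27, 250, 29]
cleaned_ages = clean(ages, lower=0, upper=120)
print(cleaned_ages)
```

The median is used instead of the mean so that the erroneous value 250 does not distort the value filled in for the missing entry.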
Major issues of Data Mining
2. User Interaction
■ Flexible user interfaces and an exploratory mining environment:
sample, explore, estimate, dynamic change.
■ Constraints and rules, pattern evaluation: guide the search toward
interesting patterns.
■ Query languages: allow users to pose ad hoc queries; optimization of the
processing.
■ Adopt expressive knowledge representations, user-friendly interfaces,
and visualization techniques.
Major issues of Data Mining
3. Efficiency and Scalability
■ Extract information from huge amounts of data: efficiency, scalability,
performance, optimization.
■ First partition the data into “pieces.” Each piece is processed, in parallel,
by searching for patterns.
■ Work in a distributed and collaborative way; promote incremental data mining.
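The partition, process-in-parallel, and merge idea above can be sketched with Python's standard library. The transactions, the round-robin partitioning, and the use of item counting as the "mining" step are assumptions for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_pieces):
    """Split records into n_pieces roughly equal slices (round-robin)."""
    return [records[i::n_pieces] for i in range(n_pieces)]

def count_items(piece):
    """Local mining step: count in how many transactions each item occurs."""
    counts = Counter()
    for transaction in piece:
        counts.update(set(transaction))
    return counts

def parallel_counts(records, n_pieces=3):
    """Process every piece concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(count_items, partition(records, n_pieces)))
    total = Counter()
    for part in partials:
        total.update(part)
    return total

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    ["milk", "bread"], ["milk", "eggs"], ["bread", "butter"],
    ["milk", "bread", "eggs"], ["eggs"], ["milk"],
]
counts = parallel_counts(TRANSACTIONS)

# Incremental mining: fold in new data without recounting the old data.
counts.update(count_items([["bread", "jam"]]))
print(counts)
```

Threads keep the sketch simple; for CPU-bound mining in CPython, a process pool or a distributed framework would be the more realistic choice. The merge step works because counts from separate pieces can simply be added.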
Major issues of Data Mining
4. Diversity of Data Types
Major issues of Data Mining
5. Data Mining and Society
■ The improper disclosure or use of data and the potential violation of
individual privacy and data protection rights are areas of concern that
need to be addressed.
■ Data mining poses the risk of disclosing an individual’s personal
information. Studies on privacy-preserving data publishing and
data mining are ongoing.
■ Data mining results can be obtained simply through mouse clicking.
Intelligent search engines and Internet-based stores perform such
invisible data mining.
Summary
■ Data mining: Discovering interesting patterns and knowledge from
massive amounts of data
■ A natural evolution of database technology, in great demand, with
wide applications
■ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
■ Mining can be performed on a variety of data
■ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
■ Data mining technologies and applications
■ Major issues in data mining