PAPER PRESENTATION
ON
DATA MINING & DATA WAREHOUSE
Presented by:
P. V. Surendranath Reddy
4th B.TECH (IT)
[email protected]
CELL NO: 9866857915

Y. Anilkumar Reddy
4th B.TECH (IT)
[email protected]
CELL NO: 9703779377
VAAGDEVI INSTITUTE OF TECHNOLOGY AND SCIENCES
PEDDASETTIPALLI (VILL), PRODDATUR,
KADAPA (DT), ANDHRA PRADESH.
ABSTRACT:

In today's fast-paced, information-based economy, companies must be able to integrate vast amounts of heterogeneous data and applications from disparate sources in order to support strategic IT initiatives such as Business Intelligence, Business Process Management, Business Process Reengineering, Business Activity Monitoring and Business Performance Management. Since its inception, it has continued to build on its unique software architecture to make the integration process easier to learn and use, faster to implement and maintain, and able to operate at the best performance possible; in other words, Simply Faster Integration.

Relational database management systems (RDBMSs) are designed to store data according to the most efficient method of data cataloging, which is that defined by mathematical set theory as expressed in the relational paradigm. In many cases, however, the most efficient method for cataloging data is not the most efficient method for storing and retrieving such data. Where relational databases do well is where the data is most appropriately managed as flat lists having simple data types, involving few associations with data in other lists. When dealing with data that must be kept in complex interdependent structures, or when data must be rapidly retrieved by following paths of associations rather than by simply walking down simple lists, the relational database begins to show characteristics such as multiple-index management and traversal and complex normalized schema structures. These impediments, along with limits in row length or table size, can, in some cases, represent such profound encumbrances that an RDBMS must be regarded as impractical for certain data management tasks. Although leading RDBMS vendors have been introducing features that enable their products to support data outside the relational paradigm, the fundamental means of management and access of such data remains relational and, for the most part, SQL based. This fact will continue to make RDBMS products unnecessarily difficult to set up and manage, and too inefficient, for some kinds of databases.
An Introduction to Data Mining

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

This paper provides an introduction to the basic technologies of data mining. It gives examples of profitable applications that illustrate the relevance of data mining to today's business environment, as well as a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.

The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate.

Figure 1 shows the data explosion and the growing base of data.

Data storage became easier as large amounts of computing power became available at low cost; that is, the falling cost of processing power and storage made data cheap.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on.

The term data mining has been stretched beyond its limits to apply to any form of data analysis. Some of the numerous definitions of Data Mining, or Knowledge Discovery in Databases, are:

Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies.

Data mining is the search for relationships and global patterns that exist in large databases but are `hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database.
The following diagram summarizes some of the stages/processes identified in data mining and knowledge discovery.

The phases depicted start with the raw data and finish with the extracted knowledge, which is acquired as a result of the following stages:

Selection: Selecting or segmenting the data according to some criteria, e.g. all those people who own a car. In this way subsets of the data can be determined.

Preprocessing: This is the data cleansing stage, where information that is deemed unnecessary and may slow down queries is removed; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, as there is a possibility of inconsistent formats because the data is drawn from several sources, e.g. sex may be recorded as f or m and also as 1 or 0.

Transformation: The data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made useable and navigable.

Data mining: This stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.

Applications of Data Mining

Data mining has many and varied fields of application, some of which are listed below.

1. Retail/Marketing
   - Identify buying patterns from customers
   - Find associations among customer demographic characteristics
   - Market basket analysis

2. Banking
   - Detect patterns of fraudulent credit card use
   - Identify `loyal' customers
   - Predict customers likely to change their credit card affiliation
   - Determine credit card spending by customer groups

3. Insurance and Health Care
   - Claims analysis, i.e. which medical procedures are claimed together
   - Predict which customers will buy new policies
   - Identify behaviour patterns of risky customers

4. Medicine
   - Characterise patient behaviour to predict office visits
   - Identify successful medical therapies for different illnesses

Data Mining Functions

Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described below.

1. Classification

Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

2. Associations

Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.

Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. The data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

Data Warehousing

Introduction

When your strategy is deep and far reaching, then what you gain by your calculations is much, so you can win before you even fight. When your strategic thinking is shallow and near-sighted, then what you gain by your calculations is little, so you lose before you do battle. Much strategy prevails over little strategy, so those with no strategy can only be defeated. So it is said that victorious warriors win first and then go to war, while defeated warriors go to war first and then seek to win.

It is obvious to anyone who culls through the voluminous information technology (I/T) literature, attends industry seminars, user group meetings or expositions, reads the ever accelerating new product announcements of I/T vendors, or listens to the advice of industry gurus and analysts, that there are four subjects that overwhelmingly dominate I/T industry attention as we move into the late 1990s:

Why we need Data Warehousing

Data mining potential can be enhanced if the appropriate data has been collected and stored in a data warehouse. A data warehouse is a relational database management system (RDBMS) designed specifically to support analysis and decision making, rather than to meet the needs of transaction processing systems. It can be loosely defined as any centralized data repository which can be queried for business benefit, but this will be more clearly defined later.
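Before turning to the warehouse itself, the confidence factor described under Associations can be made concrete. The following is a minimal Python sketch, not a full association-mining algorithm: it only measures the confidence of one rule that the caller already has in mind. The function name `confidence` and the sample records are illustrative assumptions and do not come from any particular tool.

```python
from typing import FrozenSet, Iterable, List


def confidence(records: List[FrozenSet[str]],
               lhs: Iterable[str],
               rhs: Iterable[str]) -> float:
    """Fraction of the records containing every lhs item
    that also contain every rhs item."""
    lhs_set, rhs_set = frozenset(lhs), frozenset(rhs)
    # Records that satisfy the left-hand side of the rule.
    with_lhs = [r for r in records if lhs_set <= r]
    if not with_lhs:
        raise ValueError("no record contains the left-hand side items")
    # Of those, the records that also satisfy the right-hand side.
    with_both = [r for r in with_lhs if rhs_set <= r]
    return len(with_both) / len(with_lhs)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        frozenset("ABCDE"),
        frozenset("ABCDE"),
        frozenset("ABCDE"),
        frozenset("ABC"),
        frozenset("ABD"),
    ]
    # Four records contain A, B and C; three of them also contain D and E.
    print(f"{confidence(records, 'ABC', 'DE'):.0%}")  # prints 75%
```

With these sample records the rule "records that contain A, B and C also contain D and E" has a confidence factor of 75, in the same sense as the 72 in the example rule quoted in the text.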
Data warehousing is a new, powerful technique making it possible to extract archived operational data and overcome inconsistencies between different legacy data formats. As well as integrating data throughout an enterprise, regardless of location, format, or communication requirements, it is possible to incorporate additional or expert information. It is the logical link between what the managers see in their decision support EIS applications and the company's operational activities.

In other words, the data warehouse provides data that is already transformed and summarized, therefore making it an appropriate environment for more efficient DSS and EIS applications.

Characteristics of A Data Warehouse

According to Bill Inmon, author of Building the Data Warehouse and the guru who is widely considered to be the originator of the data warehousing concept, there are generally four characteristics that describe a data warehouse:

Subject-Oriented: Data are organized according to subject instead of application, e.g. an insurance company using a data warehouse would organize their data by customer, premium, and claim, instead of by different products (auto, life, etc.). The data organized by subject contain only the information necessary for decision support processing.

Integrated: When data resides in many separate applications in the operational environment, encoding of data is often inconsistent. For instance, in one application gender might be coded as "m" and "f", in another by 0 and 1. When data are moved from the operational environment into the data warehouse, they assume a consistent coding convention, e.g. gender data is transformed to "m" and "f".

Time-Variant: The data warehouse contains a place for storing data that are five to 10 years old, or older, to be used for comparisons, trends, and forecasting. These data are not updated.

Non-Volatile: Data are not updated or changed in any way once they enter the data warehouse, but are only loaded and accessed.

Processes In Data Warehousing

The first phase in data warehousing is to "insulate" your current operational information, i.e. to preserve the security and integrity of mission-critical OLTP applications, while giving you access to the broadest possible base of data. The resulting database or data warehouse may consume hundreds of gigabytes - or even terabytes - of disk space. What is required, then, are efficient techniques for storing and retrieving massive amounts of information. Increasingly, large organizations have found that only parallel processing systems offer sufficient bandwidth.

The data warehouse thus retrieves data from a variety of heterogeneous operational databases. The data is then transformed and delivered to the data warehouse/store based on a selected model (or mapping definition). The data transformation and movement processes are executed whenever an update to the warehouse data is required, so there should be some form of automation to manage and execute these functions. The information that describes the model and definition of the source data elements is called "metadata". The metadata is the means by which the end-user finds and understands the data in the warehouse and is an important part of the warehouse. The metadata should at the very least contain:
   - The structure of the data;
   - The algorithm used for summarization;
   - The mapping from the operational environment to the data warehouse.

Data cleansing is an important aspect of creating an efficient data warehouse in that it is the removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times. The cleansing stage has to be as dynamic as possible to accommodate all types of queries, even those which may require low-level information. Data should be extracted from production sources at regular intervals and pooled centrally, but the cleansing process has to remove duplication and reconcile differences between various styles of data collection.
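The two cleansing duties just named, removing duplication and reconciling different styles of data collection, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the record layout (`customer_id`, `gender`) is hypothetical, and the mapping of 1 to "m" and 0 to "f" is an assumption, since the text does not say which way round the numeric codes run.

```python
from typing import Dict, Iterable, List

# ASSUMPTION: the numeric source system uses 1 for male and 0 for female.
# The paper only says one source uses m/f and another uses 1/0.
GENDER_CODES = {"m": "m", "f": "f", "1": "m", "0": "f"}


def normalise_gender(raw) -> str:
    """Map a source-specific gender code onto the warehouse convention."""
    key = str(raw).strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unrecognised gender code: {raw!r}")
    return GENDER_CODES[key]


def cleanse(records: Iterable[Dict]) -> List[Dict]:
    """Reconcile coding styles, then drop duplicate customers.

    Records are hypothetical dicts with 'customer_id' and 'gender' keys.
    The first occurrence of each customer_id is kept.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Copy the record with the gender rewritten in the agreed convention.
        row = dict(record, gender=normalise_gender(record["gender"]))
        if row["customer_id"] in seen:
            continue  # duplicate pooled from a second source
        seen.add(row["customer_id"])
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    pooled = [
        {"customer_id": 1, "gender": "F"},  # source A records letters
        {"customer_id": 2, "gender": 1},    # source B records 1/0
        {"customer_id": 1, "gender": 0},    # customer 1 again, from source B
    ]
    print(cleanse(pooled))
```

Keeping the first occurrence of a customer is the simplest possible duplicate policy; a real warehouse load would more likely compare timestamps or source priority before choosing which copy survives.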
The current detail data is central in importance as it:
   - Reflects the most recent happenings, which are usually the most interesting;
   - Is voluminous, as it is stored at the lowest level of granularity;
   - Is almost always stored on disk storage, which is fast to access but expensive and complex to manage.

Uses of Data Warehousing

Retail: Analysis of scanner check-out data; tracking, analysis, and tuning of sales promotions; and so on.

Telecommunications: Analysis of call volumes, equipment, sales, customer, and profitability costs; inventory.

Criteria for a Data Warehouse

The criteria for a data warehouse RDBMS are as follows:

Load Performance: Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour and must not artificially constrain the volume of data required by the business.

Load Processing: Many steps must be taken to load new or updated data into the data warehouse, including data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update. These steps must be executed as a single, seamless unit of work.

Data Quality Management: The shift to fact-based management demands the highest data quality. The warehouse must ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive database size. While loading and preparation are necessary steps, they are not sufficient. Query throughput is the measure of success for a data warehouse application. As more questions are answered, analysts are catalysed to ask more creative and insightful questions.

Query Performance: Fact-based management and ad-hoc analysis must not be slowed or inhibited by the performance of the data warehouse RDBMS; large, complex queries for key business operations must complete in seconds, not days.

CONCLUSION

Our strategic analysis of data warehousing is as follows: Strategy is about, and only about, building advantage. The business need to build, compound, and sustain advantage is the most fundamental and dominant business need and it is insatiable. Advantage is built through deep and far-reaching strategic thinking. The strategic ideas that support data warehousing as a strategic initiative are learning, maneuverability, prescience, and foreknowledge. Data warehousing meets the fundamental business needs to compete in a superior manner across the elementary strategic dimension of time. Data warehousing is a rare instance of a rising tide strategy. A rising tide strategy occurs when an action yields tremendous leverage. Data warehousing raises the ability of all employees to serve their customers and out-think their competitors.

REFERENCES

1. Ralph Kimball, The Data Warehouse Toolkit (New York, NY: John Wiley & Sons, Inc., 1996), pp. 15-16.
2. W. H. Inmon, Claudia Imhoff, and Ryan Sousa, Corporate Information Factory (New York, NY: John Wiley & Sons, Inc., 1998), pp. 87-100.
3. Len Silverston, W. H. Inmon, and Kent Graziano, The Data Model Resource Book (New York, NY: John Wiley & Sons, Inc., 1997).
4. Douglas Hackney, Understanding and Implementing Successful Data Marts (Reading, MA: Addison-Wesley, 1997), pp. 52-54, 183-84, 257, 307-309.
5. White Paper, available at http://www.informatica.com.
6. Hackney, op. cit.
7. Informatica, op. cit.