GANADIPATHY TULSI’S JAIN ENGINEERING COLLEGE
BA5021 DATA MINING FOR BUSINESS INTELLIGENCE
UNIT I
INTRODUCTION
Syllabus:
Data mining, Text mining, Web mining, Spatial mining, Process mining, BI
process – Private and public intelligence, Strategic assessment of
implementing BI
1.1. DATA MINING
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining is the automated
analysis of massive data sets
What is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
Watch out: Is everything "data mining"?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery in DB (KDD) Process
This is a view from typical database systems and data
warehousing communities
Data mining plays an essential role in the knowledge discovery
process
Knowledge discovery as a process involves the following steps:
1. Data cleaning
to remove noise and inconsistent data
2. Data integration
where multiple data sources may be combined
3. Data selection
where data relevant to the analysis task are retrieved from
the database
4. Data transformation
where data are transformed or consolidated into forms
appropriate for mining by performing summary or
aggregation operations, for instance
5. Data mining
an essential process where intelligent methods are applied in
order to extract data patterns
6. Pattern evaluation
to identify the truly interesting patterns representing
knowledge based on some interestingness measures
7. Knowledge presentation
where visualization and knowledge representation
techniques are used to present the mined knowledge to the
user
Figure: KDD Process – A Typical View from ML and Statistics
Figure: Data Mining in Business Intelligence
Architecture of typical DM Systems
Based on the KDD view, the architecture of a typical data mining system
has the following major components:
Database, data warehouse or other information repository:
This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be
performed on the data.
Database or data warehouse server:
The database or data warehouse server is responsible for
fetching the relevant data, based on the user’s data mining
request.
Knowledge base:
This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of
abstraction.
Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata
Data mining engine:
This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.
Pattern evaluation module:
This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the
search toward interesting patterns.
It may use interestingness thresholds to filter out discovered
patterns.
User interface:
This module communicates between users and the data
mining system, allowing the user to interact with the system
by specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on the intermediate data
mining results
Data Mining: On What Kinds of Data?
There are a number of data stores on which data mining can be performed:
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Text databases & WWW
Relational database
A relational database is a collection of tables, each of which
is assigned a unique name.
Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values
Data warehouse
A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
that usually resides at a single site.
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading,
and periodic data refreshing.
Figure: Data Warehouse
To facilitate decision making, the data in a data warehouse are
organized around major subjects, such as customer, item,
supplier, and activity.
A data warehouse is usually modeled by a multidimensional
database structure, where each dimension corresponds to an
attribute or a set of attributes in the schema.
Each cell stores the value of some aggregate measure, such as
count or sales amount.
The actual physical structure of a data warehouse may be a
relational data store or a multidimensional data cube.
A data cube provides a multidimensional view of data and
allows the precomputation and fast accessing of summarized
data.
As an example, consider a data cube for the summarized sales data
of AllElectronics (a minimal code sketch follows the dimension list below).
The cube has three dimensions:
address (with city values Chicago, New York, Toronto,
Vancouver),
time (with quarter values Q1, Q2, Q3, Q4), and
item(with item type values home entertainment,
computer, phone, security).
The aggregate value stored in each cell of the cube is sales
amount (in thousands).
A data warehouse collects information about subjects that
span an entire organization, and thus its scope is enterprise-
wide.
A data mart, on the other hand, is a departmental subset of a
data warehouse. It focuses on selected subjects, and thus its
scope is department-wide.
By providing multidimensional data views and the
precomputation of summarized data, data warehouse systems are
well suited for on-line analytical processing, or OLAP.
Transactional database
In general, a transactional database consists of a file where each
record represents a transaction.
A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the transaction
(such as items purchased in a store).
Advanced database and information repository:
Object-Relational Databases
Constructed based on an object-relational data model.
This model extends the relational model by providing a rich data
type for handling complex objects and object orientation.
Object-relational data model inherits the essential concepts of
object-oriented databases, where, in general terms, each entity
is considered as an object.
Temporal Databases, Sequence Databases, and Time-Series
Databases
Temporal Databases
A temporal database typically stores relational data that
include time-related attributes.
These attributes may involve several timestamps, each
having different semantics
Sequence Databases
A sequence database stores sequences of ordered events,
with or without a concrete notion of time.
Examples include customer shopping sequences, Web click
streams, and biological sequences
Time-Series Databases
A time-series database stores sequences of values or events
obtained over repeated measurements of time (e.g., hourly,
daily, weekly).
Examples include data collected from the stock exchange,
inventory control, and the observation of natural phenomena
(like temperature and wind).
Spatial Databases
Spatial databases contain spatial-related information.
Examples include geographic (map) databases, very large-
scale integration (VLSI) or computer-aided design databases,
and medical and satellite image databases.
Text Databases and Multimedia Databases
Text databases are databases that contain word descriptions
for objects.
These word descriptions are usually not simple keywords but
rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages,
summary reports, notes, or other documents.
Multimedia databases store image, audio, and video data. They
are used in applications such as picture content-based retrieval,
voice-mail systems, video-on-demand systems, the World Wide
Web, and speech-based user interfaces that recognize spoken
commands
Data Streams
Data flow in and out of an observation platform (or window)
dynamically.
Such data streams have the following unique features: huge
or possibly infinite volume, dynamically changing, flowing in
and out in a fixed order, allowing only one or a small number
of scans, and demanding fast (often real-time) response
time.
Examples: scientific and engineering data, time-series data,
and data produced in other dynamic environments.
DATA MINING CONCEPTS AND APPLICATIONS – Data Mining
Definitions, Characteristics, and Benefits
Data mining is a term used to describe discovering or "mining"
knowledge from large amounts of data.
Technically speaking, data mining is a process that uses
statistical, mathematical, and artificial intelligence techniques
to extract and identify useful information and subsequent
knowledge (or patterns) from large sets of data.
These patterns can be in the form of business rules, affinities,
correlations, trends, or prediction models
Most literature defines data mining as "the nontrivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data stored in structured databases,"
where the data are organized in records structured by
categorical, ordinal, and continuous variables.
The meanings of the key terms are as follows:
Nontrivial means that some experimentation-type search or
inference is involved;
Valid means that the discovered patterns should hold true on new
data with a sufficient degree of certainty.
Novel means that the patterns are not previously known.
Potentially useful means that the discovered patterns should lead
to some benefit to the user or task.
Ultimately understandable means that the pattern should make
business sense that leads to the user saying "mmm!"
Data mining is not a new discipline, but rather a new definition for
the use of many disciplines.
Data mining is tightly positioned at the intersection of many
disciplines, including statistics, artificial intelligence, machine
learning, management science, information systems, and
databases.
Major characteristics and objectives of data mining:
Data are often buried deep within very large databases
The data are cleansed and consolidated into a data warehouse.
Data may be presented in a variety of formats.
The data mining environment is usually a client/server architecture
or a Web-based information systems architecture.
Sophisticated new tools, including advanced visualization tools,
help to uncover the information buried in corporate files or archival
public records.
Data miners are exploring the usefulness of soft data.
The miner is often an end user, empowered by data drills and
other power query tools to ask ad hoc questions and obtain
answers quickly
Data mining tools are readily combined with spreadsheets and
other software development tools.
It is sometimes necessary to use parallel processing for DM
A Simple Taxonomy of Data
Data refers to a collection of facts usually obtained as the result
of experiences, observations, or experiments.
Data may consist of numbers, letters, words, images, voice
recordings, and so on as measurements of a set of variables.
Data are often viewed as the lowest level of abstraction from
which information and then knowledge is derived.
At the highest level of abstraction, one can classify data as
structured and unstructured
Structured data is what data mining algorithms use, and can be
classified as categorical or numeric.
The categorical data can be subdivided into nominal or ordinal
data, whereas numeric data can be subdivided into interval or ratio
data.
Categorical data
represent the labels of multiple classes used to divide a variable
into specific groups.
Examples of categorical variables include sex, age group, and
educational level.
Nominal data
contain simple codes assigned to objects as labels; these codes
are not measurements.
For example, the variable marital status can be generally
categorized as (1) single, (2) married, and (3) divorced.
Nominal data can be represented with binomial values having
two possible values (e.g., yes/no, true/false, good/bad), or
multinomial values having three or more possible values.
Ordinal data
contain codes assigned to objects or events as labels that also
represent the rank order among them.
For example, the variable credit score can be generally
categorized as (1) low, (2) medium, or (3) high.
Similar ordered relationships can be seen in variables such as
age group (i.e., child, young, middle-aged, elderly)
Numeric data
represent the numeric values of specific variables.
Examples of numerically valued variables include age, number
of children, total household income
Interval data
are variables that can be measured on interval scales.
A common example of interval scale measurement is
temperature on the Celsius scale.
Ratio data
include measurement variables commonly found in the physical
sciences and engineering. Examples include mass, length, time,
plane angle, energy, and electric charge.
How Does Data Mining Work?
Using existing and relevant data, data mining builds models to
identify patterns among the attributes presented in the data set.
Models are the mathematical representations that identify the
patterns among the attributes of the objects described in the data
set.
Some of these patterns are explanatory (explaining the
interrelationships and affinities among the attributes), whereas
others are predictive (foretelling future values of certain
attributes).
In general, data mining seeks to identify four major types of
patterns:
1. Associations
find the commonly co-occurring groupings of things, such as
beer and diapers going together in market-basket analysis.
2. Predictions
tell the nature of future occurrences of certain events based
on what has happened in the past, such as predicting the winner of a
game or forecasting the temperature on a particular day.
3. Clusters
identify natural groupings of things based on their known
characteristics, such as assigning customers to different segments
based on their demographics and past purchase behaviors.
4. Sequential relationships
discover time-ordered events, such as predicting that an
existing banking customer who already has a checking account will open
a savings account followed by an investment account within a year.
Data mining tasks can be classified into three main categories:
prediction,
association, and
clustering.
Based on the way in which the patterns are extracted from the
historical data, the learning algorithms of data mining methods
can be classified as either
supervised or
unsupervised.
Supervised learning algorithms - the training data
include both the descriptive attributes (i.e.,
independent variables or decision variables) and
the class attribute (i.e., output variable or result
variable).
Unsupervised learning algorithm - the training data
includes only the descriptive attributes.
PREDICTION
Prediction is commonly referred to as the act of telling about the
future.
It differs from simple guessing by taking into account the
experiences, opinions, and other relevant information in
conducting the task of foretelling.
A term that is commonly associated with prediction is forecasting.
Whereas prediction is largely experience and opinion based,
forecasting is data and model based.
That is, in order of increasing reliability, one might list the relevant
terms as guessing, predicting, and forecasting.
CLASSIFICATION (supervised induction)
The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can
predict future behavior.
This induced model consists of generalizations over the records of
a training dataset, which help distinguish predefined classes.
The hope is that the model can then be used to predict the classes
of other unclassified records and, more important, to accurately
predict actual future events.
Common classification tools include neural networks and decision
trees, as well as statistical techniques such as logistic regression
and discriminant analysis.
Emerging tools include rough sets, support vector machines, and
genetic algorithms.
Neural networks
Involve the development of mathematical structures
(somewhat resembling the biological neural networks
in the human brain) that have the capability to learn
from past experiences presented in the form of well-
structured datasets
Decision trees
Classify data into a finite number of classes based on
the values of the input variables.
Decision trees are essentially a hierarchy of if-then
statements
Faster than neural networks.
They are most appropriate for categorical and interval
data.
Therefore, incorporating continuous variables into a
decision tree framework requires discretization:
converting continuous valued numerical variables into
ranges and categories (a minimal sketch follows below).
CLUSTERING
Clustering partitions a collection of things (e.g., objects and
events presented in a structured dataset) into segments (or
natural groupings) whose members share similar characteristics.
Unlike in classification, in clustering, the class labels are unknown.
As the selected algorithms go through the dataset, identifying the
commonalities of things based on their characteristics, the clusters
are established.
Because the clusters are determined using a heuristic-type
algorithm, and because different algorithms may end up with
different sets of clusters for the same dataset, it may be
necessary for an expert to interpret, and potentially
modify, the suggested clusters before the results of clustering
techniques are put to actual use.
After reasonable clusters have been identified, they can be used to
classify and interpret new data.
The goal of clustering is to create groups so that the members
within each group have maximum similarity and the members
across groups have minimum similarity.
The most commonly used clustering techniques include k-means
(from statistics) and self-organizing maps (from machine
learning), which is a unique neural network architecture developed
by Kohonen.
ASSOCIATIONS
Associations, or association rule learning in data mining, is a
popular and well-researched technique for discovering interesting
relationships among variables in large databases.
In the context of the retail industry, association rule mining is often
called market-basket analysis.
Two commonly used derivatives of association rule mining are link
analysis and sequence mining.
Link analysis - the linkage among many objects of interest
is discovered automatically, such as the link between Web
pages and referential relationships among groups of
academic publication authors.
Sequence mining - relationships are examined in terms of
their order of occurrence to identify associations over time
HYPOTHESIS- OR DISCOVERY-DRIVEN DATA MINING
Data mining can be hypothesis driven or discovery driven.
Hypothesis-driven data mining begins with a proposition by the
user, who then seeks to validate the truthfulness of the proposition.
For example, a marketing manager may begin with the following
proposition: "Are DVD player sales related to sales of television
sets?"
Discovery-driven data mining finds patterns, associations, and
other relationships hidden within datasets. It can uncover facts that
an organization had not previously known or even contemplated
DATA MINING APPLICATIONS
• Customer relationship management.
• Banking.
• Retailing and logistics.
• Manufacturing and production.
• Brokerage and securities trading.
• Insurance.
• Computer hardware and software.
• Government and defense.
• Travel industry (airlines, hotels/resorts, rental car companies).
• Health care.
• Medicine.
• Entertainment industry.
• Homeland security and law enforcement.
• Sports.
Data mining has become a popular tool in addressing many complex
business issues.
1. Customer relationship management.
Customer relationship management (CRM) is the new and
emerging extension of traditional marketing.
The goal of CRM is to create one-on-one relationships with
customers by developing an intimate understanding of their needs
and wants.
As businesses build relationships with their customers over time
through a variety of transactions (e.g., product inquiries, sales,
service requests, warranty calls), they accumulate a wealth of
customer data.
When combined with demographic and socioeconomic attributes,
this information-rich data can be used to
(1) identify most likely responders / buyers of new
products/services (i.e., customer profiling);
(2) understand the root causes of customer attrition in order
to improve customer retention (i.e., churn analysis);
(3) discover time-variant associations between products and
services to maximize sales and customer value;
(4) identify the most profitable customers and their
preferential needs to strengthen relationships and to
maximize sales.
2. Banking.
Data mining can help banks with the following:
(1) automating the loan application process by accurately
predicting the most probable defaulters;
(2) detecting fraudulent credit card and online-banking
transactions;
(3) identifying ways to maximize customer value by selling them
products and services that they are most likely to buy;
(4) optimizing the cash return by accurately forecasting the cash
flow of banking entities (e.g., ATMs, banking branches).
3. Retailing and logistics.
In the retailing industry, data mining can be used to
(1) predict accurate sales volumes at specific retail locations in
order to determine correct inventory levels;
(2) identify sales relationships between different products (with
market-basket analysis) to improve the store layout and optimize
sales promotions;
(3) forecast consumption levels of different product types (based
on seasonal and environmental conditions) to optimize logistics
and hence maximize sales;
(4) discover interesting patterns in the movement of products
(especially for the products that have a limited shelf life because
they are prone to expiration, perishability, and contamination) in a
supply chain by analyzing sensory and RFID data.
4. Manufacturing and production.
Manufacturers can use data mining to
(1) predict machinery failures before they occur through the use of
sensory data (enabling what is called condition-based
maintenance);
(2) identify anomalies and commonalities in production systems to
optimize manufacturing capacity; and
(3) discover novel patterns to identify and improve product quality
5. Brokerage and securities trading.
Brokers and traders use data mining to
(1) predict when and how much certain bond prices will change;
(2) forecast the range and direction of stock fluctuations;
(3) assess the effect of particular issues and events on overall
market movements; and
(4) identify and prevent fraudulent activities in securities trading
6. Insurance.
The insurance industry uses data mining techniques to
(1) forecast claim amounts for property and medical coverage
costs for better business planning;
(2) determine optimal rate plans based on the analysis of claims
and customer data;
(3) predict which customers are more likely to buy new policies
with special features; and
(4) identify and prevent incorrect claim payments and fraudulent
activities.
7. Computer hardware and software.
Data mining can be used to
(1) predict disk drive failures well before they actually occur;
(2) identify and filter unwanted Web content and e-mail messages;
(3) detect and prevent computer network security breaches; and
(4) identify potentially insecure software products.
8. Government and defense.
Data mining also has a number of military applications. It can be
used to
(1) forecast the cost of moving military personnel and equipment;
(2) predict an adversary’s moves to develop more successful
strategies for military engagements;
(3) predict resource consumption for better planning and
budgeting; and
(4) identify classes of unique experiences, strategies, and lessons
learned from military operations for better knowledge sharing
9. Travel industry (airlines, hotels / resorts, rental car companies).
Data mining has a variety of uses in the travel industry. It is
successfully used to
(1) predict sales of different services (seat types in airplanes, room
types in hotels/resorts, car types in rental car companies) in order
to optimally price services to maximize revenues as a function of
time-varying transactions (commonly referred to as yield
management);
(2) forecast demand at different locations to better allocate limited
organizational resources;
(3) identify the most profitable customers and provide them with
personalized services to maintain their repeat business; and
(4) retain valuable employees by identifying and acting on the root
causes for attrition
10. Health care.
Data mining has a number of health care applications. It can be
used to
(1) identify people without health insurance and the factors
underlying this undesired phenomenon;
(2) identify novel cost-benefit relationships between different
treatments to develop more effective strategies;
(3) forecast the level and the time of demand at different service
locations to optimally allocate organizational resources; and
(4) understand the underlying reasons for customer and employee
attrition.
11. Medicine.
Data mining can be used to
(1) identify novel patterns to improve survivability of patients with
cancer;
(2) predict success rates of organ transplantation patients to
develop better donor-organ matching policies;
(3) identify the functions of different genes in the human
chromosome (known as genomics);
(4) discover the relationships between symptoms and illnesses to
help medical professionals make informed and correct decisions in
a timely manner.
12. Entertainment industry.
Data mining is successfully used by the entertainment industry to
(1) analyze viewer data to decide what programs to show during
prime time and how to maximize returns by knowing where to
insert advertisements;
(2) predict the financial success of movies before they are
produced to make investment decisions and to optimize the
returns;
(3) forecast the demand at different locations and different times to
better schedule entertainment events and to optimally allocate
resources; and
(4) develop optimal pricing policies to maximize revenues.
13. Homeland security and law enforcement.
Data mining has a number of homeland security and law
enforcement applications. Data mining is often used to
(1) identify patterns of terrorist behaviors
(2) discover crime patterns (e.g., locations, timings, criminal
behaviors, and other related attributes) to help solve criminal
cases in a timely manner;
(3) predict and eliminate potential biological and chemical attacks
to a nation’s critical infrastructure by analyzing special-purpose
sensory data; and
(4) identify and stop malicious attacks on critical information
infrastructures (often called information warfare).
14. Sports.
Data mining was used to improve the performance of National
Basketball Association (NBA) teams in the United States. The NBA
developed Advanced Scout, a PC-based data mining application that
coaching staff use to discover interesting patterns in basketball game
data. The pattern interpretation is facilitated by allowing the user to
relate patterns to videotape. See Bhandari et al. (1997) for details.
DATA MINING PROCESS
In order to systematically carry out data mining projects, a general
process is usually followed.
Based on best practices, data mining researchers and practitioners
have proposed several processes (workflows or simple step-by-
step approaches)
One such standardized process, arguably the most popular one, is the
Cross-Industry Standard Process for Data Mining (CRISP-DM).
CRISP-DM is a sequence of six steps.
It starts with a good understanding of the business and ends with
the deployment of the solution that satisfies the specific business
need.
Even though these steps are sequential in nature, there is usually
a great deal of backtracking.
Because data mining is driven by experience and
experimentation, depending on the problem situation and the
knowledge/experience of the analyst, the whole process can be
very iterative.
Step 1: Business Understanding
A thorough understanding of the managerial need for new
knowledge and an explicit specification of the business objective
Specific questions such as
"What are the common characteristics of the customers we
have lost to our competitors recently?" or
"What are typical profiles of our customers, and how much
value does each of them provide to us?"
need to be addressed.
Then a project plan for finding such knowledge is developed that
specifies the people responsible for collecting the data, analyzing
the data, and reporting the findings.
At this early stage, a budget to support the study should also be
established
Step 2: Data Understanding
Different business tasks require different sets of data.
The main activity of the data mining process is to identify the
relevant data from many available databases.
Some key points must be considered in the data identification
and selection phase.
First and foremost, the analyst should be clear and concise about
the description of the data mining task so that the most relevant
data can be identified.
For example, a retail data mining project may seek to identify
spending behaviors of female shoppers, who purchase seasonal
clothes, based on their demographics, credit card transactions,
and socioeconomic attributes.
Furthermore, the analyst should build an intimate understanding of
the data sources
Example:
where the relevant data are stored and in what form;
what the process of collecting the data is—automated versus
manual;
who the collectors of it are;
and how often it is updated, etc.
In order to better understand the data, the analyst often uses a
variety of statistical and graphical techniques
such as simple statistical summaries of each variable (e.g., for
numeric variables: the average, minimum/maximum, median, and
standard deviation).
Data sources for data selection can vary.
Normally, data sources for business applications include
demographic data (such as income, education, number of
households, and age),
sociographic data (such as hobby, club membership, and
entertainment),
transactional data (sales record, credit card spending, and issued
checks), and so on.
Data can be categorized as quantitative and qualitative.
Step 3: Data Preparation
The purpose of data preparation (more commonly called data
preprocessing) is to take the data identified in the previous step
and prepare them for analysis by data mining methods.
Data are generally
Incomplete (lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data),
Noisy (containing errors or outliers), and
Inconsistent (containing discrepancies in codes or names).
Four main steps are needed to convert the raw, real-world data
into minable datasets:
Data Collection / Selection
Data Cleaning
Data Transformation
Data Reduction
1. Data Collection / Selection
The relevant data are collected from the identified
sources
The necessary records and variables are selected, and
The records coming from multiple data sources are
integrated
2. Data Cleaning
The data are cleaned (this step is also known as
data scrubbing).
The problematic values in the dataset are identified and dealt
with:
Missing values need to be imputed (filled in with the
most probable value) or ignored.
Noisy values (i.e., outliers) need to be identified and
smoothed out.
Inconsistencies (unusual values within a
variable) should be handled using
domain knowledge and/or expert opinion.
3. Data Transformation
Data are transformed for better processing.
For instance, in many cases, the data are
normalized in order to mitigate the potential bias of
one variable with large numeric values (such as
household income) over others with smaller values.
Another transformation that takes place is
discretization and/or aggregation.
In some cases, the numeric variables are converted
to categorical values (e.g., low, medium, and high);
In other cases, a nominal variable’s unique value
range is reduced to a smaller set using concept
hierarchies.
4. Data Reduction
Even though data miners like to have large
datasets, too much data can also be a problem.
One can visualize the data commonly used in data
mining projects as a flat file consisting of two
dimensions: variables (columns) and cases/records
(rows).
In some cases (e.g., image processing and genome
projects with complex microarray data), the number
of variables can be rather large, and the analyst
must reduce the number down to a manageable
size.
Because the variables are treated as different dimensions
that describe the phenomenon from different
perspectives, this process is commonly called
dimensionality reduction in data mining.
Step 4: Model Building
In this step, various modeling techniques are selected and applied
to an already prepared dataset in order to address the specific
business need.
Depending on the business need, the data mining task can be of a
prediction (either classification or regression), an association, or a
clustering type.
Each of these data mining tasks can use a variety of data mining
methods and algorithms.
Among the most popular of these data mining methods and
algorithms are decision trees for classification, k-means for
clustering, and the Apriori algorithm for association rule mining.
Step 5: Testing and Evaluation
The developed models are assessed and evaluated for their
accuracy and generality.
This step assesses the degree to which the selected model (or
models) meets the business objectives and, if so, to what extent.
Another option is to test the developed model(s) in a real-world
scenario if time and budget constraints permit.
Even though the outcomes of the developed models are expected
to relate to the original business objectives, the testing and
evaluation step is a critical and challenging task.
No value is added by the data mining task until the business value
obtained from discovered knowledge patterns is identified and
recognized.
Determining the business value from discovered knowledge
patterns is somewhat similar to playing with puzzles.
The success of this identification operation depends on the
interaction among data analysts, business analysts, and
decision makers
Step 6: Deployment
Development and assessment of the models is not the end of the
data mining project.
The knowledge gained from such exploration will need to be
organized and presented in a way that the end user can
understand and benefit from it.
Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a
repeatable data mining process across the enterprise.
In many cases, it is the customer, not the data analyst, who carries
out the deployment steps.
The deployment step may also include maintenance activities for
the deployed models
Other Data Mining Standardized Processes and Methodologies
Figure: Ranking of Data Mining Processes and Methodologies
DATA MINING METHODS
A variety of methods are available for performing data mining
studies, including classification, regression, clustering, and
association.
Most data mining software tools employ more than one technique
(or algorithm) for each of these methods.
1. Classification
Classification is perhaps the most frequently used data mining
method
A popular member of the machine-learning family of techniques,
classification learns patterns from past data (a set of information—
traits, variables, features—on characteristics of previously
labeled items, objects, or events) in order to place new instances
(with unknown labels) into their respective groups or classes.
For example, one could use classification to predict whether the
weather on a particular day will be "sunny," "rainy," or "cloudy."
Popular classification tasks include credit approval (i.e., good or
bad credit risk),
store location (e.g., good, moderate, bad), target marketing (e.g.,
likely customer, no hope),
fraud detection (i.e., yes, no), and telecommunication (e.g., likely
to turn to another phone company, yes/no).
If what is being predicted is a class label (e.g., "sunny," "rainy," or
"cloudy"), the prediction problem is called a classification,
whereas if it is a numeric value (e.g., a temperature such as 68°F),
the prediction problem is called a regression.
The most common two-step methodology of classification-type
prediction involves
model development/training and
model testing/deployment.
In the model development phase, a collection of input data,
including the actual class labels, is used.
After a model has been trained, it is tested against the
holdout sample for accuracy assessment and can then be used to
predict classes of new data instances (where the class label is unknown).
Several factors are considered in assessing the model, including the
following:
1. Predictive accuracy.
The model’s ability to correctly predict the class label of new or
previously unseen data.
Prediction accuracy is the most commonly used assessment
factor
To compute this measure, actual class labels of a test dataset
are matched against the class labels predicted by the model.
The accuracy can then be computed as the accuracy rate,
which is the percentage of test dataset samples correctly
classified by the model
2. Speed.
The computational costs involved in generating and using the
model, where faster is deemed to be better.
3. Robustness.
The model’s ability to make reasonably accurate predictions,
given noisy data or data with missing and erroneous values.
4. Scalability.
The ability to construct a prediction model efficiently given a
rather large amount of data.
5. Interpretability.
The level of understanding and insight provided by the model
(e.g., how and/or what the model concludes on certain
predictions).
Estimating the True Accuracy of Classification Models
In classification problems, the primary source for accuracy
estimation is the confusion matrix (also called a classification
matrix or a contingency table).
The numbers along the diagonal (from upper left to lower right)
represent correct decisions, and the numbers outside this diagonal
represent the errors.
SIMPLE SPLIT
The simple split partitions the data into two mutually exclusive
subsets called a training set and a test set (or holdout set).
Typically, two-thirds of the data are used as the training set and the
remaining one-third as the test set.
The training set is used by the inducer (model builder), and the built
classifier is then tested on the test set.
K-FOLD CROSS-VALIDATION (rotation estimation)
The complete dataset is randomly split into k mutually exclusive
subsets of approximately equal size.
The classification model is trained and tested k times.
Each time, it is trained on all but one fold and then tested on the
remaining single fold.
ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES
1. Leave-one-out.
Similar to k-fold cross-validation, except that each fold contains
a single record, so every data point is used for testing once.
This is a time-consuming methodology, but for small datasets,
it is sometimes a viable option.
2. Bootstrapping.
With bootstrapping, a fixed number of instances from the
original data are sampled (with replacement) for training and the
rest of the dataset is used for testing.
This process is repeated as many times as desired.
3. Jackknifing.
Similar to the leave-one-out methodology;
The accuracy is calculated by leaving one sample out at each
iteration of the estimation process.
4. Area under the ROC curve.
The area under the ROC curve is a graphical assessment
technique
true positive rate is plotted on the Y-axis and false positive rate
is plotted on the X-axis.
The area under the ROC curve determines the accuracy
measure of a classifier: a value of 1 indicates a perfect
classifier, whereas 0.5 indicates a classifier no better than random
chance.
CLASSIFICATION TECHNIQUES
Decision tree analysis.
Statistical analysis.
Neural networks.
Case-based reasoning
Bayesian classifiers
Genetic algorithms
Rough sets
2. Cluster Analysis
Data mining method for classifying items, events, or concepts into
common groupings called clusters.
The method is commonly used in biology, medicine, genetics,
social network analysis, anthropology, archaeology, astronomy,
character recognition, and even in management information
system development.
Cluster analysis is an exploratory data analysis tool for solving
classification problems.
The objective is to sort cases (e.g., people, things, events) into
groups, or clusters, so that the degree of association is strong
among members of the same cluster and weak among members
of different clusters.
Each cluster describes the class to which its members belong.
Cluster analysis results may be used to:
Identify a classification scheme (e.g., types of customers)
Suggest statistical models to describe populations
Indicate rules for assigning new cases to classes for identification,
targeting, and diagnostic purposes
Provide measures of definition, size, and change in what were
previously broad concepts
Find typical cases to label and represent classes
Decrease the size and complexity of the problem space for other
data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS
Clustering algorithms usually require one to specify the number of
clusters to find.
If this number is not known from prior knowledge, it should be
chosen in some way
The following are among the most commonly referenced ones:
Look at the percentage of variance explained as a function of the
number of clusters; that is, choose a number of clusters so that
adding another cluster would not give much better modeling of the
data.
Set the number of clusters to √(n/2), where n is the number of
data points.
Use the Akaike Information Criterion, which is a measure of the
goodness of fit
Use Bayesian Information Criterion, which is a model-selection
criterion to determine the number of clusters
ANALYSIS METHODS
Statistical methods
Neural networks
Fuzzy logic
Each of these methods generally works with one of two general method
classes:
Divisive.
All items start in one cluster and are broken apart.
Agglomerative.
All items start in individual clusters, and the clusters are
joined together.
K-MEANS CLUSTERING ALGORITHM
The k-means clustering algorithm (where k stands for the
predetermined number of clusters) is arguably the most referenced
clustering algorithm.
It has its roots in traditional statistical analysis. As the name
implies, the algorithm assigns each data point (customer, event,
object, etc.) to the cluster whose center (also called centroid) is the
nearest.
The center is calculated as the average of all the points in the
cluster; that is, its coordinates are the arithmetic mean for each
dimension separately over all the points in the cluster.
Initialization step: Choose the number of clusters (i.e., the value
of k).
Step 1: Randomly generate k random points as initial cluster
centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Recompute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some convergence
criterion is met (usually that the assignment of points to clusters
becomes stable).
Association Rule Mining
Association rule mining is a popular data mining method
Association rule mining aims to find interesting relationships
(affinities) between variables (items) in large databases.
It is commonly called a market-basket analysis.
The main idea in market basket analysis is to identify strong
relationships among different products (or services) that are
usually purchased together
The outcome of the analysis is invaluable information that can be
used to better understand customer-purchase behavior in order to
maximize the profit from business transactions.
A business can take advantage of such knowledge by
(1) putting the items next to each other to make it more convenient
for the customers to pick them
(2) promoting the items as a package (do not put one on sale if
the other(s) is on sale); and
(3) placing them apart from each other so that the customer has to
walk the aisles to search for them, potentially seeing and buying
other items along the way.
Applications of market-basket analysis include cross-marketing,
cross-selling, store design, catalog design, e-commerce site
design, optimization of online advertising, product pricing, and
sales/promotion configuration.
"Are all association rules interesting and useful?"
In order to answer such a question, association rule mining uses
two common metrics: support and confidence.
Several algorithms are available for generating association rules.
Some well-known algorithms include
Apriori,
Eclat, and
FP-Growth.
These algorithms only do half the job, which is to identify the
frequent itemsets in the database.
A frequent itemset is an arbitrary number of items that frequently
go together in a transaction
Once the frequent itemsets are identified, they need to be
converted into rules with antecedent and consequent parts.
APRIORI ALGORITHM
The Apriori algorithm is the most commonly used algorithm to
discover association rules.
Given a set of itemsets (e.g., sets of retail transactions, each
listing individual items purchased)
The algorithm attempts to find subsets that are common to at
least a minimum number of the itemsets (i.e., complies with a
minimum support).
Apriori uses a bottom-up approach, where frequent subsets are
extended one item at a time (a method known as candidate
generation, whereby the size of frequent subsets increases
from one-item subsets to two-item subsets, then three-item
subsets, etc.),
Groups of candidates at each level are tested against the data
for minimum support. The algorithm terminates when no further
successful extensions are found.
TEXT MINING
Text mining (text data mining or knowledge discovery in textual
databases)
It is the semiautomated process of extracting patterns from large
amounts of unstructured data sources.
Text mining is essentially the same as data mining,
but with text mining, the input to the process is a collection of
unstructured (or less structured) data files such as Word
documents, PDF files, text excerpts, XML files, and so on.
Text mining has two main steps:
1. Imposing structure on the text-based data sources
2. Extracting relevant information and knowledge from this
structured text-based data using data mining techniques and
tools
TEXT MINING CONCEPTS AND DEFINITIONS
Benefits of text mining
Law (court orders)
Academic research (research articles)
Finance (quarterly reports)
Medicine (discharge summaries)
Biology (molecular interactions)
Technology (patent files), and
Marketing (customer comments)
Example
Free-form text-based interactions with customers in the form of
complaints
Electronic communications and e-mail.
Used to classify and filter junk e-mail
Used to automatically prioritize e-mail based on importance
level as well as to generate automatic responses
Application areas of text mining:
Information extraction - Identification of key phrases and
relationships
Topic tracking - Based on a user profile and documents that a
user views, text mining can predict other documents
Summarization - To save time on the part of the reader.
Categorization - Identifying the main themes of a document and
then placing them into a predefined set of categories based on
those themes.
Clustering - Grouping similar documents without having a
predefined set of categories.
Concept linking - Connects related documents by identifying their
shared concepts
Question answering - Finding the best answer to a given
question through knowledge-driven pattern matching.
NATURAL LANGUAGE PROCESSING
NLP is an important component of text mining
NLP is a subfield of artificial intelligence and computational
linguistics.
NLP studies the problem of "understanding" natural human
language.
NLP converts human language (such as textual documents)
into more formal representations (in the form of numeric and
symbolic data) that are easier for computer programs to
manipulate.
The goal of NLP is to move beyond syntax-driven text
manipulation (which is often called "word counting") to a true
understanding and processing of natural language.
Natural human language is vague, however, and a true understanding
of its meaning requires extensive knowledge of the topic.
Challenges associated with the implementation of NLP
Part-of-speech tagging - It is difficult to mark up terms in a text as
corresponding to a particular part of speech (such as nouns, verbs,
adjectives, and adverbs)
Text segmentation - Some written languages, such as Chinese,
Japanese, and Thai, do not have single-word boundaries. In these
instances, the text-parsing task requires the identification of word
boundaries, which is often a difficult task.
Word sense disambiguation - Many words have more than one
meaning. Selecting the meaning that makes the most sense can
only be accomplished by taking into account the context within
which the word is used.
Syntactic ambiguity - The grammar for natural languages is
ambiguous; that is, multiple possible sentence structures often
need to be considered.
Imperfect or irregular input - Foreign or regional accents and
vocal impediments in speech and typographical or grammatical
errors in texts make the processing of the language an even more
difficult task.
Speech acts - A sentence can often be considered an action by
the speaker.
The sentence structure alone may not contain enough
information to define this action.
For example, "Can you pass the class?" requests a
simple yes/no answer, whereas
"Can you pass the salt?" is a request for a physical
action to be performed.
WordNet is a laboriously hand-coded database of English words,
their definitions, sets of synonyms, and various semantic relations
between synonym sets.
It is a major resource for NLP applications, but it has proven to be
very expensive to build and maintain manually
An important area of CRM, where NLP is making a significant
impact, is sentiment analysis.
Sentiment analysis is a technique used to detect favorable and
unfavorable opinions toward specific products and services using
large numbers of textual data sources (e.g., customer feedback in
the form of Web postings).
NLP has successfully been applied to a variety of tasks via
computer programs to automatically process natural human
language.
Following are among the most popular of these tasks:
1. Information retrieval.
The science of searching for relevant documents, finding
specific information within them, and generating metadata as to their
contents.
2. Information extraction.
A type of information retrieval whose goal is to automatically
extract structured information, such as categorized and contextually and
semantically well-defined data from a certain domain, using unstructured,
machine-readable documents.
3. Named-entity recognition.
Also known as entity identification and entity extraction, this
subtask of information extraction seeks to locate and classify atomic
elements in text into predefined categories
4. Question answering.
The task of automatically answering a question posed in
natural language;
To find the answer to a question, the computer program may
use either a prestructured database or a collection of natural language
documents (a text corpus such as the World Wide Web).
5. Automatic summarization.
The creation by a computer program of a shortened version of a
textual document that contains the most important points of the
original document.
6. Natural language generation.
Systems convert information from computer databases into
readable human language.
7. Natural language understanding.
Systems convert samples of human language into more
formal representations that are easier for computer programs to
manipulate.
8. Machine translation.
The automatic translation of one human language to another
9. Foreign language reading.
A computer program that assists a nonnative language
speaker to read a foreign language with correct pronunciation and
accents on different parts of the words.
10. Foreign language writing.
A computer program that assists a nonnative language user
in writing in a foreign language.
11. Speech recognition.
Converts spoken words to machine-readable input.
12. Text-to-speech.
Also called speech synthesis, a computer program
automatically converts normal language text into human speech.
13. Text proofing.
A computer program reads a proof copy of a text in order to
detect and correct any errors
14. Optical character recognition.
The automatic translation of images of handwritten,
typewritten, or printed text (usually captured by a scanner) into
machine-editable textual documents
TEXT MINING APPLICATIONS
1. Marketing Applications
Text mining can be used to increase cross-selling and up-selling
by analyzing the unstructured data generated by call centers
Blogs, user reviews of products at independent Web sites, and
discussion board postings are a gold mine of customer sentiment.
Text mining is used to predict customer perceptions and subsequent
purchasing behavior.
2. Security Applications
ECHELON surveillance system - It is assumed to be capable of
identifying the content of telephone calls, faxes, e-mails, and other
types of data, and of intercepting information sent via satellites and
public switched telephone networks.
EUROPOL developed an integrated system capable of accessing,
storing, and analyzing vast amounts of structured and unstructured
data sources in order to track transnational organized crime.
The U.S. Federal Bureau of Investigation (FBI) and the Central
Intelligence Agency (CIA) jointly developed a supercomputer data
and text mining system. The system is
expected to create a gigantic data warehouse along with a variety
of data and text mining modules to meet the knowledge-discovery
needs of federal, state, and local law enforcement agencies.
Another security application of text mining is deception detection.
3. Biomedical Applications
Experimental techniques such as DNA microarray analysis, serial
analysis of gene expression (SAGE), and mass spectrometry
proteomics, among others, are generating large amounts of data
related to genes and proteins.
Knowing the location of a protein within a cell can help to
determine its potential as a drug target
4. Academic Applications
Text Mining provides semantic cues to machines to answer
specific queries.
TEXT MINING PROCESS
Context diagram for the text mining process
As the context diagram indicates:
The input into the text-based knowledge discovery process is the
unstructured as well as structured data collected, stored, and
made available to the process.
The output of the process is the context-specific knowledge that
can be used for decision making.
The controls, also called the constraints (inward connection to the
top edge of the box), of the process include
software and hardware limitations,
privacy issues, and the
difficulties related to processing the text
The mechanisms of the process include proper techniques,
software tools, and domain expertise.
The text mining process can be broken down into three
consecutive tasks,
Task 1 : Establish the Corpus
Task 2: Create the Term–Document Matrix
Task 3: Extract Knowledge
each of which has specific inputs to generate certain outputs.
The three-step text mining process
Task 1: Establish the corpus
Collect all relevant unstructured data
(e.g., textual documents, XML files, emails, Web pages,
short notes, voice recordings…)
Digitize, standardize the collection
(e.g., all in ASCII text files)
Place the collection in a common place
(e.g., in a flat file, or in a directory as separate files)
Task 2: Create the Term–Document Matrix
The digitized and organized documents (the corpus) are used to
create the term–document matrix (TDM).
In the TDM, rows represent the documents and columns represent
the terms.
The relationships between the terms and documents are
characterized by indices (i.e., a relational measure that can be as
simple as the number of occurrences of the term in respective
documents).
Terms:          investment   project      software      development   SAP   ...
Documents       risk         management   engineering
Document 1      1            1
Document 2                                1
Document 3                   3                          1
Document 4                                1
Document 5                   2                                        1
Document 6      1                                       1
...
Should all terms be included?
Stop words, include words
Synonyms, homonyms
Stemming
What is the best representation of the indices (values in
cells)?
Row counts; binary frequencies; log frequencies;
Inverse document frequency
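The choices above map directly onto code. A minimal sketch with scikit-learn (an assumed library choice; the three-document corpus is made up), showing raw counts and inverse-document-frequency weighting:

    # Build a term-document matrix, then reweight it with tf-idf.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    corpus = ["investment risk in software engineering projects",
              "project management for SAP development",
              "managing investment risk during development"]

    vectorizer = CountVectorizer(stop_words="english")  # drops stop words
    counts = vectorizer.fit_transform(corpus)           # rows=docs, cols=terms
    print(vectorizer.get_feature_names_out())
    print(counts.toarray())     # raw counts (binary=True gives binary indices)

    tfidf = TfidfTransformer().fit_transform(counts)    # idf-weighted indices
    print(tfidf.toarray().round(2))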
TDM is a sparse matrix. How can we reduce the
dimensionality of the TDM?
Manual – a domain expert goes through it
Eliminate terms with very few occurrences in very
few documents (?)
Transform the matrix using singular value
decomposition (SVD)
SVD is similar to principal component analysis
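A sketch of the SVD route, using scikit-learn's TruncatedSVD (commonly used for latent semantic analysis on sparse TDMs; the corpus and component count are illustrative):

    # Project a sparse tf-idf TDM onto a few latent dimensions with SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = ["investment risk in software projects",
              "project management and SAP development",
              "software engineering process improvement",
              "managing development risk"]

    tdm = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    svd = TruncatedSVD(n_components=2, random_state=0)  # keep 2 dimensions
    print(svd.fit_transform(tdm).round(2))              # dense docs-by-2 matrix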
Task 3: Extract patterns/knowledge
Classification (text categorization)
Clustering (natural groupings of text; see the sketch after this list)
Improve search recall
Improve search precision
Scatter/gather
Query-specific clustering
Association
Trend Analysis (…)
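As an illustration of the clustering task, a hedged k-means sketch over tf-idf vectors (the documents and cluster count are made up):

    # Group documents into natural clusters with k-means on tf-idf vectors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["stock market investment risk",
            "equity portfolio risk analysis",
            "agile software project management",
            "scrum practices for development teams"]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for label, doc in zip(labels, docs):
        print(label, doc)   # finance vs. software documents separate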
TEXT MINING TOOLS
The following are some of the popular text mining tools, which we
classify as:
Commercial software tools
Free software tools
1. Commercial Software Tools
The following are some of the most popular software tools used for
text mining.
Note that many companies offer demonstration versions of their
products on their Web sites.
1. ClearForest offers text analysis and visualization tools
(clearforest.com).
2. IBM Intelligent Miner Data Mining Suite, now fully integrated
into IBM’s InfoSphere Warehouse software, includes data and
text mining tools (ibm.com).
3. Megaputer Text Analyst offers semantic analysis of free-form
text, summarization, clustering, navigation, and natural language
retrieval with search dynamic refocusing (megaputer.com).
4. SAS Text Miner provides a rich suite of text processing and
analysis tools (sas.com).
5. SPSS Text Mining for Clementine extracts key concepts,
sentiments, and relationships from call-center notes, blogs, e-
mails, and other unstructured data and converts them to a
structured format for predictive modeling (spss.com).
6. The Statistica Text Mining engine provides easy-to-use text
mining functionality with exceptional visualization capabilities
(statsoft.com).
7. VantagePoint provides a variety of interactive graphical views
and analysis tools with powerful capabilities to discover knowledge
from text databases (vpvp.com).
8. The WordStat analysis module from Provalis Research analyzes
textual information such as responses to open-ended questions
and interviews (provalisresearch.com).
2. Free Software Tools
Free software tools, some of which are open source, are available from
a number of nonprofit organizations:
1. GATE is a leading open source toolkit for text mining. It has a
free open source framework (or SDK) and graphical development
environment (gate.ac.uk).
2. RapidMiner has a community edition of its software that includes
text mining modules (rapid-i.com).
3. LingPipe is a suite of Java libraries for the linguistic analysis of
human language (alias-i.com/lingpipe).
4. S-EM (Spy-EM) is a text classification system that learns from
positive and unlabeled examples (cs.uic.edu/~liub/S-EM/S-EM-download.html).
5. Vivisimo/Clusty is a Web search and text-clustering engine
(clusty.com).
WEB MINING OVERVIEW
Web mining (or Web data mining) is the process of discovering
intrinsic relationships (i.e., interesting and useful information) from
Web data, which are expressed in the form of textual, linkage, or
usage information.
The Web is perhaps the world’s largest data and text repository.
The amount of information on the Web is growing rapidly every
day.
A lot of interesting information can be found online:
whose homepage is linked to which other pages
how many people have links to a specific Web page
how a particular site is organized
The Web also poses great challenges for effective and efficient knowledge
discovery:
1. The Web is too big for effective data mining
2. The Web is too complex.
Web pages lack a unified structure. They contain far more
authoring-style and content variation than traditional text documents.
3. The Web is too dynamic.
Not only does the Web grow rapidly, but its content is constantly
being updated: blogs, news stories, stock market results, and
weather reports, for example.
4. The Web is not specific to a domain.
Web users have very different backgrounds, interests, and usage
purposes.
5. The Web has everything.
Only a small portion of the information on the Web is truly relevant
or useful to someone
Three main areas of Web mining:
Web content mining
Web structure mining
Web usage mining.
1. WEB CONTENT MINING
Web content mining refers to the extraction of useful information
from Web pages.
The documents may be extracted in some machine-readable
format so that automated techniques can generate some
information about the Web pages.
Web crawlers are used to read through the content of a Web site
automatically.
The information gathered may include document characteristics
similar to what is used in text mining, but it may include additional
concepts such as the document hierarchy
Web content mining can also be used to enhance the results
produced by search engines
In addition to text, Web pages also contain hyperlinks pointing from
one page to another.
Hyperlinks contain a significant amount of hidden human
annotation
When a Web page developer includes a link pointing to another
Web page, this can be regarded as the developer’s endorsement
of the other page.
Therefore, the vast amount of Web linkage information provides a
rich collection of information about the relevance, quality, and
structure of the Web’s contents.
A search on the Web to obtain information on a specific topic
usually returns a few relevant, high-quality Web pages and a larger
number of unusable Web pages.
Use of an index based on authoritative pages will improve the
search results and the ranking of relevant pages.
The idea of authority stems from earlier information retrieval work
using citations among journal articles to evaluate the impact of
research papers
There are significant differences between the citations in research
articles and hyperlinks on Web pages:
not every hyperlink represents an endorsement
one authority will rarely have its Web page point to rival
authorities in the same domain
authoritative pages are seldom particularly descriptive
The structure of Web hyperlinks has led to another important
category of Web pages called a hub.
A hub is one or more Web pages that provide a collection of
links to authoritative pages.
Hub pages provide links to a collection of prominent sites on a
specific topic of interest.
A hub could be a list of recommended links on an individual’s
homepage or recommended reference sites on a course Web page.
Hyperlink-induced topic search (HITS)
HITS is a link analysis algorithm that rates Web pages using the
hyperlink information contained within them.
The HITS algorithm collects a base document set for a specific
query. It then recursively calculates the hub and authority values
for each document.
To gather the base document set, a root set that matches the
query is fetched from a search engine.
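A minimal HITS sketch on a made-up link graph, using the networkx library (an assumed choice; edges point from the linking page to the linked page):

    # Compute hub and authority scores for a toy Web graph with HITS.
    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([("hub1", "authA"), ("hub1", "authB"),
                      ("hub2", "authA"), ("hub2", "authB"), ("hub2", "authC"),
                      ("authA", "authC")])

    hubs, authorities = nx.hits(G, max_iter=100)  # recursive score updates
    print(max(authorities, key=authorities.get))  # most authoritative page
    print(max(hubs, key=hubs.get))                # best hub page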
2. WEB STRUCTURE MINING
It is the process of extracting useful information from the links
embedded in Web documents.
It is used to identify authoritative pages and hubs.
Just as links going to a Web page may indicate a site’s popularity
(or authority), links within the Web page (or the complete Web site)
may indicate the depth of coverage of a specific topic.
Analysis of links is very important in understanding the
interrelationships among large numbers of Web pages, leading to
a better understanding of a specific Web community
3. WEB USAGE MINING
Web usage mining is the extraction of useful information from data
generated through Web page visits and transactions.
Three types of data are generated through Web page visits:
1. Automatically generated data stored in server access logs,
referrer logs, agent logs, and client-side cookies
2. User profiles
3. Metadata, such as page attributes, content attributes, and
usage data
Analysis of the information collected by Web servers can help us
better understand user behavior. Analysis of this data is often
called clickstream analysis.
By using the data and text mining techniques, a company might be
able to determine interesting patterns.
Clickstream analysis:
Useful in determining where to place online advertisements.
Clickstream analysis might also be useful for knowing when
visitors access a site.
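A toy clickstream sketch in plain Python; the simplified log records (visitor, timestamp, page) are an assumption about what a server access log might yield after parsing:

    # Count page views per page and per hour from simplified log records.
    from collections import Counter

    log = [("visitor1", "2024-01-05 09:12", "/products"),
           ("visitor2", "2024-01-05 09:48", "/products"),
           ("visitor1", "2024-01-05 10:03", "/checkout"),
           ("visitor3", "2024-01-05 21:30", "/products")]

    page_views = Counter(page for _, _, page in log)
    hour_views = Counter(ts[11:13] for _, ts, _ in log)   # hour of day

    print(page_views.most_common(1))  # busiest page: where to place ads
    print(hour_views.most_common(1))  # busiest hour: when visitors arrive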
Process of extracting knowledge from clickstream data
Applications of Web mining:
1. Determine the lifetime value of clients.
2. Design cross-marketing strategies across products.
3. Evaluate promotional campaigns.
4. Target electronic ads and coupons at user groups based on
user access patterns.
5. Predict user behavior based on previously learned rules and
users’ profiles.
6. Present dynamic information to users based on their interests
and profiles
Web usage mining software
SPATIAL DATA MINING
Spatial data mining is the process of discovering interesting, useful,
non-trivial patterns from large spatial datasets.
PROCESS MINING
"The idea of process mining is to discover, monitor and improve
real processes (i.e., not assumed processes) by extracting
knowledge from event logs readily available in today’s
(information) systems.
Process mining includes (automated) process discovery (i.e.,
extracting process models from an event log), conformance
checking (i.e., monitoring deviations by comparing model and log),
social network/organizational mining, automated construction of
simulation models, model extension, model repair, case prediction,
and history-based recommendations."
Events and event logs:
It is assumed that an event refers to a process activity or task,
which is a well-defined step in the process and is related to a
particular case, i.e., a process instance.
Another assumption is that these events are ordered.
The case or process instance is a specific occurrence or execution
of a business process, while an activity is an operation, part of a
case, that is being executed.
An event log stores information about cases and activities, but also
information about event performers, event timestamps (moment
when the event is triggered) or data elements recorded with the
event
Process mining activities such as extracting and filtering data from
information systems are not trivial.
Data may be distributed over a variety of sources, event data may
be incomplete, an event log may contain outliers, logs may contain
events at different levels of granularity, etc.
The Process Mining Manifesto gives the following guidelines
referring to the event data:
events should be trustworthy,
event logs should be complete,
any recorded event should have well-defined semantics
the event data should be safe
Process mining types
Three process mining types:
discovery,
conformance and
enhancement.
1. Process discovery:
A process discovery technique produces a process model from
an event log, without using any a priori information about the
process; it is the most prominent process mining technique
(see the sketch after this list).
2. Conformance
Conformance compares an existing process model with an
event log of the same process
It is used to check if reality, as recorded in the log, conforms to
the model and vice versa.
Conformance checking can be used to:
check the quality of documented processes (assess
whether they describe reality accurately);
identify deviating cases and understand what they have
in common, for auditing purposes;
judge the quality of a discovered process model
3. Enhancement:
Enhancement extends or improves an existing process model
using information about the actual process recorded in the event
log, with the aim of changing or extending the a priori model.
For instance, by using time stamps in the event log one can
extend the model to show bottlenecks, service levels,
throughput times and frequencies
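To give discovery a concrete flavor, here is a hedged plain-Python sketch that derives the directly-follows relation from a made-up event log; real discovery algorithms (e.g., those in ProM) go on to construct full process models from such relations:

    # Derive directly-follows counts from (case, activity) events,
    # assuming events are already ordered within each case.
    from collections import defaultdict

    event_log = [("case1", "register"), ("case1", "check"), ("case1", "approve"),
                 ("case2", "register"), ("case2", "check"), ("case2", "reject"),
                 ("case3", "register"), ("case3", "approve")]

    traces = defaultdict(list)
    for case, activity in event_log:
        traces[case].append(activity)

    follows = defaultdict(int)
    for activities in traces.values():
        for a, b in zip(activities, activities[1:]):
            follows[(a, b)] += 1

    for (a, b), n in sorted(follows.items()):
        print(f"{a} -> {b}: {n}")   # e.g. register -> check: 2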
Process mining software tools and techniques:
Many contemporary process mining software tools have been
developed and are continuously improved, such as: Celonis
(Celonis GmbH), Disco (Fluxicon), EDS (StereoLOGIC Ltd),
Fujitsu (Fujitsu Ltd), Icaro (Icaro Tech), Icris (Icris), LANA
(Lana Labs), Minit (Gradient ECM), myInvenio (Cognitive
Technology), ProcessGold (ProcessGold International B.V.),
ProM (open source, hosted at TU/e), ProM Lite (open source,
hosted at TU/e), QPR (QPR), RapidProM (open source, hosted at
TU/e), Rialto (Exeura), SNP (SNP Schneider-Neureither &
Partner AG), and ARIS PPM (Software AG).
Currently, the most prominent open-source tool is ProM (Process
Mining Framework), as it offers a variety of plug-ins that enable
application of various algorithms and the latest developments in
process mining research.
Three main categories of process mining algorithms:
Deterministic algorithms,
Heuristic algorithms and
Genetic algorithms.
Deterministic algorithms always generate repeatable models: all
of the data has to be known, and the mining output is constant
for a given input.
1.2. BUSINESS INTELLIGENCE PROCESS
What is Business Intelligence?
BI (Business Intelligence) is a set of processes, architectures, and
technologies that convert raw data into meaningful information that
drives profitable business actions.
It is a suite of software and services to transform data into
actionable intelligence and knowledge.
BI has a direct impact on organization's strategic, tactical and
operational business decisions.
BI supports fact-based decision making using historical data
rather than assumptions and gut feeling.
BI tools perform data analysis and create reports, summaries,
dashboards, maps, graphs, and charts to provide users with
detailed intelligence about the nature of the business.
Why is BI important?
Measurement: creating KPIs (Key Performance Indicators) based
on historical data
Identify and set benchmarks for varied processes.
With BI systems organizations can identify market trends and
spot business problems that need to be addressed.
BI helps on data visualization that enhances the data quality and
thereby the quality of decision making.
BI systems can be used not just by enterprises but also by SMEs
(Small and Medium Enterprises).
How are Business Intelligence systems implemented?
Here are the steps:
Step 1:
Raw data from corporate databases is extracted. The data could
be spread across multiple heterogeneous systems.
Step 2:
The data is cleaned and transformed into the data warehouse.
Tables can be linked, and data cubes are formed.
Step 3:
Using the BI system, the user can ask queries, request ad hoc
reports, or conduct any other analysis.
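A toy end-to-end sketch of the three steps, with SQLite standing in for both the source system and the warehouse (all table names and data are made up):

    # Step 1: extract raw operational rows; Step 2: clean/load into a
    # warehouse table; Step 3: answer an ad hoc analytical query.
    import sqlite3

    cur = sqlite3.connect(":memory:").cursor()
    cur.execute("CREATE TABLE raw_sales (product TEXT, amount REAL)")
    cur.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                    [("phone", 300.0), ("phone", None), ("laptop", 900.0)])

    # Transformation: drop rows with missing amounts while loading.
    cur.execute("CREATE TABLE dw_sales AS SELECT product, amount "
                "FROM raw_sales WHERE amount IS NOT NULL")

    for row in cur.execute("SELECT product, SUM(amount) FROM dw_sales "
                           "GROUP BY product"):
        print(row)   # revenue per product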
Examples of Business Intelligence Systems Used in Practice
In an Online Transaction Processing (OLTP) system, the
information fed into the product database could be:
add a product line
change a product price
Correspondingly, in a Business Intelligence system, a query
executed for the product subject area could be: "Did the addition
of the new product line or the change in product price increase
revenues?"
In the advertising database of an OLTP system, the information
fed in could be:
change an advertisement option
increase the radio budget
Four types of BI users
The following are the four key players who use a Business
Intelligence system:
1. The Professional Data Analyst:
The data analyst is a statistician who always needs to drill deep
down into data. A BI system helps them to get fresh insights to develop
unique business strategies.
2. The IT users:
The IT user also plays a dominant role in maintaining the BI
infrastructure.
3. The head of the company:
CEO or CXO can increase the profit of their business by improving
operational efficiency in their business.
4. The Business Users:
Business intelligence users can be found across the
organization. There are mainly two types of business users:
the casual business intelligence user and
the power user.
Advantages of Business Intelligence
Here are some of the advantages of using Business Intelligence System:
1. Boost productivity
With a BI program, businesses can create reports with a single
click, which saves lots of time and resources. It also allows
employees to be more productive on their tasks.
2. To improve visibility
BI also helps to improve the visibility of business processes and
makes it possible to identify any areas that need attention.
3. Fix accountability:
A BI system assigns accountability in the organization, as there
must be someone who owns accountability and ownership for the
organization’s performance against its set goals.
4. It gives a bird's eye view:
A BI system helps decision makers get an overall bird’s-eye view
of the organization through typical BI features like dashboards and
scorecards.
5. It streamlines business processes:
BI takes out all complexity associated with business processes. It
also automates analytics by offering predictive analysis, computer
modeling, benchmarking and other methodologies.
6. It allows for easy analytics:
BI software has democratized analytics, allowing even
nontechnical, non-analyst users to collect and process data quickly.
This puts the power of analytics into the hands of many people.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as medium-
sized enterprises. The use of such a system may be expensive for
routine business transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing the
data warehouse. It can be so complex that it makes business
techniques rigid to deal with.
3. Limited use
Like many new technologies, BI was first built with the buying
capacity of rich firms in mind. Therefore, a BI system is not yet
affordable for many small and medium-sized companies.
4. Time Consuming Implementation
It can take almost one and a half years for a data warehousing system
to be completely implemented. Therefore, it is a time-consuming process.
Trends in Business Intelligence
Artificial Intelligence:
Gartner’s report indicates that AI and machine learning now take on
complex tasks once done by human intelligence. This capability is
being leveraged to come up with real-time data analysis and
dashboard reporting.
Collaborative BI:
BI software combined with collaboration tools, including social
media and other latest technologies, enhances working and sharing
by teams for collaborative decision making.
Embedded BI:
Embedded BI allows the integration of BI software or some of its
features into another business application to enhance and extend
its reporting functionality.
Cloud Analytics:
BI applications will soon be offered in the cloud, and more
businesses will shift to this technology. Predictions suggest that
within a couple of years, spending on cloud-based analytics will
grow 4.5 times faster.
Prepared by,
D. DURAI KUMAR,
Head of the Department,
Department of Information Technology,
GTEC.