GANADIPATHY TULSI’S JAIN ENGINEERING COLLEGE
BA5021 DATA MINING FOR BUSINESS INTELLIGENCE
UNIT I
INTRODUCTION
Syllabus:
Data mining, Text mining, Web mining, Spatial mining, Process mining, BI
process – Private and public intelligence, Strategic assessment of
implementing BI
1.1. DATA MINING
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems,
Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific
simulation, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention": data mining is the automated
analysis of massive data sets
What is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
Watch out: Is everything "data mining"?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery in DB (KDD) Process
This is a view from typical database systems and data
warehousing communities
Data mining plays an essential role in the knowledge discovery
process
Knowledge discovery as a process involves the following steps:
1. Data cleaning
to remove noise and inconsistent data
2. Data integration
where multiple data sources may be combined
3. Data selection
where data relevant to the analysis task are retrieved from
the database
4. Data transformation
where data are transformed or consolidated into forms
appropriate for mining by performing summary or
aggregation operations, for instance
5. Data mining
an essential process where intelligent methods are applied in
order to extract data patterns
6. Pattern evaluation
to identify the truly interesting patterns representing
knowledge based on some interestingness measures
7. Knowledge presentation
where visualization and knowledge representation
techniques are used to present the mined knowledge to the
user
Figure: KDD Process – A Typical View from ML and Statistics
Figure: Data Mining in Business Intelligence
Architecture of typical DM Systems
Based on the KDD view, the architecture of a typical data mining system
has the following major components:
Database, data warehouse or other information repository:
This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information repositories.
Data cleaning and data integration techniques may be
performed on the data.
Database or data warehouse server:
The database or data warehouse server is responsible for
fetching the relevant data, based on the user’s data mining
request.
Knowledge base:
This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of
abstraction.
Other examples of domain knowledge are additional
interestingness constraints or thresholds, and metadata
Data mining engine:
This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis.
Pattern evaluation module:
This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the
search toward interesting patterns.
It may use interestingness thresholds to filter out discovered
patterns.
User interface:
This module communicates between users and the data
mining system, allowing the user to interact with the system
by specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on the intermediate data
mining results
Data Mining: On What Kinds of Data?
There are a number of data stores on which data mining can be performed:
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Text databases & WWW
Relational database
A relational database is a collection of tables, each of which
is assigned a unique name.
Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values
Data warehouse
A data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
that usually resides at a single site.
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data loading,
and periodic data refreshing.
Figure: Data Warehouse
To facilitate decision making, the data in a data warehouse are
organized around major subjects, such as customer, item,
supplier, and activity.
A data warehouse is usually modeled by a multidimensional
database structure, where each dimension corresponds to an
attribute or a set of attributes in the schema.
Each cell stores the value of some aggregate measure, such as
count or sales amount.
The actual physical structure of a data warehouse may be a
relational data store or a multidimensional data cube.
A data cube provides a multidimensional view of data and
allows the precomputation and fast accessing of summarized
data.
As an example, consider a data cube for the summarized sales data
of AllElectronics (a minimal code sketch follows the dimension list below).
The cube has three dimensions:
address (with city values Chicago, New York, Toronto,
Vancouver),
time (with quarter values Q1, Q2, Q3, Q4), and
item(with item type values home entertainment,
computer, phone, security).
The aggregate value stored in each cell of the cube is sales
amount (in thousands).
A data warehouse collects information about subjects that
span an entire organization, and thus its scope is enterprise-
wide.
A data mart, on the other hand, is a departmental subset of a
data warehouse. It focuses on selected subjects, and thus its
scope is department-wide.
By providing multidimensional data views and the
precomputation of summarized data, data warehouse systems are
well suited for on-line analytical processing, or OLAP.
Transactional database
In general, a transactional database consists of a file where each
record represents a transaction.
A transaction typically includes a unique transaction identity
number (trans ID) and a list of the items making up the transaction
(such as items purchased in a store).
Advanced database and information repository:
Object-Relational Databases
Constructed based on an object-relational data model.
This model extends the relational model by providing a rich data
type for handling complex objects and object orientation.
Object-relational data model inherits the essential concepts of
object-oriented databases, where, in general terms, each entity
is considered as an object.
Temporal Databases, Sequence Databases, and Time-Series
Databases
Temporal Databases
A temporal database typically stores relational data that
include time-related attributes.
These attributes may involve several timestamps, each
having different semantics
Sequence Databases
A sequence database stores sequences of ordered events,
with or without a concrete notion of time.
Examples include customer shopping sequences, Web click
streams, and biological sequences
Time-Series Databases
A time-series database stores sequences of values or events
obtained over repeated measurements of time (e.g., hourly,
daily, weekly).
Examples include data collected from the stock exchange,
inventory control, and the observation of natural phenomena
(like temperature and wind).
Spatial Databases
Spatial databases contain spatial-related information.
Examples include geographic (map) databases, very large-
scale integration (VLSI) or computer-aided design databases,
and medical and satellite image databases.
Text Databases and Multimedia Databases
Text databases are databases that contain word descriptions
for objects.
These word descriptions are usually not simple keywords but
rather long sentences or paragraphs, such as product
specifications, error or bug reports, warning messages,
summary reports, notes, or other documents.
Multimedia databases store image, audio, and video data. They
are used in applications such as picture content-based retrieval,
voice-mail systems, video-on-demand systems, the World Wide
Web, and speech-based user interfaces that recognize spoken
commands
Data Streams
Data flow in and out of an observation platform (or window)
dynamically.
Such data streams have the following unique features: huge
or possibly infinite volume, dynamically changing, flowing in
and out in a fixed order, allowing only one or a small number
of scans, and demanding fast (often real-time) response
time.
Examples: scientific and engineering data, time-series data,
and data produced in other dynamic environments.
DATA MINING CONCEPTS AND APPLICATIONS – Data Mining
Definitions, Characteristics, and Benefits
Data mining is a term used to describe discovering or "mining"
knowledge from large amounts of data.
Technically speaking, data mining is a process that uses
statistical, mathematical, and artificial intelligence techniques
to extract and identify useful information and subsequent
knowledge (or patterns) from large sets of data.
These patterns can be in the form of business rules, affinities,
correlations, trends, or prediction models
Most literature defines data mining as "the nontrivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data stored in structured databases,"
where the data are organized in records structured by
categorical, ordinal, and continuous variables.
The meanings of the key terms are as follows:
Nontrivial means that some experimentation-type search or
inference is involved;
Valid means that the discovered patterns should hold true on new
data with a sufficient degree of certainty.
Novel means that the patterns are not previously known.
Potentially useful means that the discovered patterns should lead
to some benefit to the user or task.
Ultimately understandable means that the pattern should make
business sense that leads to the user saying "mmm!"
Data mining is not a new discipline, but rather a new definition for
the use of many disciplines.
Data mining is tightly positioned at the intersection of many
disciplines, including statistics, artificial intelligence, machine
learning, management science, information systems, and
databases.
Major characteristics and objectives of data mining:
Data are often buried deep within very large databases
The data are cleansed and consolidated into a data warehouse.
Data may be presented in a variety of formats.
The data mining environment is usually a client/server architecture
or a Web-based information systems architecture.
Sophisticated new tools, including advanced visualization tools,
help to uncover the information buried in corporate files or archival
public records.
Data miners are exploring the usefulness of soft data.
The miner is often an end user, empowered by data drills and
other power query tools to ask ad hoc questions and obtain
answers quickly
Data mining tools are readily combined with spreadsheets and
other software development tools.
It is sometimes necessary to use parallel processing for DM
A Simple Taxonomy of Data
Data refers to a collection of facts usually obtained as the result
of experiences, observations, or experiments.
Data may consist of numbers, letters, words, images, voice
recordings, and so on as measurements of a set of variables.
Data are often viewed as the lowest level of abstraction from
which information and then knowledge is derived.
At the highest level of abstraction, one can classify data as
structured and unstructured
Structured data is what data mining algorithms use, and can be
classified as categorical or numeric.
The categorical data can be subdivided into nominal or ordinal
data, whereas numeric data can be subdivided into interval or ratio
data.
Categorical data
represent the labels of multiple classes used to divide a variable
into specific groups.
Examples of categorical variables include sex, age group, and
educational level.
Nominal data
contain simple codes assigned to objects as labels; these codes
are not measurements.
For example, the variable marital status can be generally
categorized as (1) single, (2) married, and (3) divorced.
Nominal data can be represented with binomial values having
two possible values (e.g., yes/no, true/false, good/bad), or
multinomial values having three or more possible values.
Ordinal data
contain codes assigned to objects or events as labels that also
represent the rank order among them.
For example, the variable credit score can be generally
categorized as (1) low, (2) medium, or (3) high.
Similar ordered relationships can be seen in variables such as
age group (i.e., child, young, middle-aged, elderly)
Numeric data
represent the numeric values of specific variables.
Examples of numerically valued variables include age, number
of children, total household income
Interval data
are variables that can be measured on interval scales.
A common example of interval scale measurement is
temperature on the Celsius scale.
Ratio data
include measurement variables commonly found in the physical
sciences and engineering. Examples include mass, length, time,
plane angle, energy, and electric charge.
How Does Data Mining Work?
Using existing and relevant data, data mining builds models to
identify patterns among the attributes presented in the data set.
Models are the mathematical representations that identify the
patterns among the attributes of the objects described in the data
set.
Some of these patterns are explanatory (explaining the
interrelationships and affinities among the attributes), whereas
others are predictive (foretelling future values of certain
attributes).
In general, data mining seeks to identify four major types of
patterns:
1. Associations
find the commonly co-occurring groupings of things, such as
beer and diapers going together in market-basket analysis.
2. Predictions
tell the nature of future occurrences of certain events based
on what has happened in the past, such as predicting the winner of a
game or forecasting the temperature on a particular day.
3. Clusters
identify natural groupings of things based on their known
characteristics, such as assigning customers to different segments
based on their demographics and past purchase behaviors.
4. Sequential relationships
discover time-ordered events, such as predicting that an
existing banking customer who already has a checking account will open
a savings account followed by an investment account within a year.
Data mining tasks can be classified into three main categories:
prediction,
association, and
clustering.
Based on the way in which the patterns are extracted from the
historical data, the learning algorithms of data mining methods
can be classified as either
supervised or
unsupervised.
Supervised learning algorithms - the training data
include both the descriptive attributes (i.e.,
independent variables or decision variables) and
the class attribute (i.e., output variable or result
variable).
Unsupervised learning algorithm - the training data
includes only the descriptive attributes.
PREDICTION
Prediction is commonly referred to as the act of telling about the
future.
It differs from simple guessing by taking into account the
experiences, opinions, and other relevant information in
conducting the task of foretelling.
A term that is commonly associated with prediction is forecasting.
Whereas prediction is largely experience and opinion based,
forecasting is data and model based.
That is, in order of increasing reliability, one might list the relevant
terms as guessing, predicting, and forecasting.
CLASSIFICATION (supervised induction)
The objective of classification is to analyze the historical data
stored in a database and automatically generate a model that can
predict future behavior.
This induced model consists of generalizations over the records of
a training dataset, which help distinguish predefined classes.
The hope is that the model can then be used to predict the classes
of other unclassified records and, more important, to accurately
predict actual future events.
Common classification tools include neural networks and decision
trees, as well as statistical techniques such as logistic regression
and discriminant analysis.
Emerging tools include rough sets, support vector machines, and
genetic algorithms.
Neural networks
Involve the development of mathematical structures
(somewhat resembling the biological neural networks
in the human brain) that have the capability to learn
from past experiences presented in the form of well-
structured datasets
Decision trees
Classify data into a finite number of classes based on
the values of the input variables.
Decision trees are essentially a hierarchy of if-then
statements
Faster than neural networks.
They are most appropriate for categorical and interval
data.
Therefore, incorporating continuous variables into a
decision tree framework requires discretization:
converting continuous valued numerical variables into
ranges and categories (a minimal sketch follows below).
CLUSTERING
Clustering partitions a collection of things (e.g., objects and
events presented in a structured dataset) into segments (or
natural groupings) whose members share similar characteristics.
Unlike in classification, in clustering, the class labels are unknown.
As the selected algorithms go through the dataset, identifying the
commonalities of things based on their characteristics, the clusters
are established.
Because the clusters are determined using a heuristic-type
algorithm, and because different algorithms may end up with
different sets of clusters for the same dataset, it may be
necessary for an expert to interpret, and potentially
modify, the suggested clusters before the results of clustering
techniques are put to actual use.
After reasonable clusters have been identified, they can be used to
classify and interpret new data.
The goal of clustering is to create groups so that the members
within each group have maximum similarity and the members
across groups have minimum similarity.
The most commonly used clustering techniques include k-means
(from statistics) and self-organizing maps (from machine
learning), which is a unique neural network architecture developed
by Kohonen.
ASSOCIATIONS
Associations, or association rule learning in data mining, is a
popular and well-researched technique for discovering interesting
relationships among variables in large databases.
In the context of the retail industry, association rule mining is often
called market-basket analysis.
Two commonly used derivatives of association rule mining are link
analysis and sequence mining.
Link analysis - the linkage among many objects of interest
is discovered automatically, such as the link between Web
pages and referential relationships among groups of
academic publication authors.
Sequence mining - relationships are examined in terms of
their order of occurrence to identify associations over time
HYPOTHESIS- OR DISCOVERY-DRIVEN DATA MINING
Data mining can be hypothesis driven or discovery driven.
Hypothesis-driven data mining begins with a proposition by the
user, who then seeks to validate the truthfulness of the proposition.
For example, a marketing manager may begin with the following
proposition: "Are DVD player sales related to sales of television
sets?"
Discovery-driven data mining finds patterns, associations, and
other relationships hidden within datasets. It can uncover facts that
an organization had not previously known or even contemplated
DATA MINING APPLICATIONS
• Customer relationship management.
• Banking.
• Retailing and logistics.
• Manufacturing and production.
• Brokerage and securities trading.
• Insurance.
• Computer hardware and software.
• Government and defense.
• Travel industry (airlines, hotels/resorts, rental car companies).
• Health care.
• Medicine.
• Entertainment industry.
• Homeland security and law enforcement.
• Sports.
Data mining has become a popular tool in addressing many complex
business issues.
1. Customer relationship management.
Customer relationship management (CRM) is the new and
emerging extension of traditional marketing.
The goal of CRM is to create one-on-one relationships with
customers by developing an intimate understanding of their needs
and wants.
As businesses build relationships with their customers over time
through a variety of transactions (e.g., product inquiries, sales,
service requests, warranty calls), they accumulate a wealth of
customer data.
When combined with demographic and socioeconomic attributes,
this information-rich data can be used to
(1) identify most likely responders / buyers of new
products/services (i.e., customer profiling);
(2) understand the root causes of customer attrition in order
to improve customer retention (i.e., churn analysis);
(3) discover time-variant associations between products and
services to maximize sales and customer value;
(4) identify the most profitable customers and their
preferential needs to strengthen relationships and to
maximize sales.
2. Banking.
Data mining can help banks with the following:
(1) automating the loan application process by accurately
predicting the most probable defaulters;
(2) detecting fraudulent credit card and online-banking
transactions;
(3) identifying ways to maximize customer value by selling them
products and services that they are most likely to buy;
(4) optimizing the cash return by accurately forecasting the cash
flow of banking entities (e.g., ATMs, banking branches).
3. Retailing and logistics.
In the retailing industry, data mining can be used to
(1) predict accurate sales volumes at specific retail locations in
order to determine correct inventory levels;
(2) identify sales relationships between different products (with
market-basket analysis) to improve the store layout and optimize
sales promotions;
(3) forecast consumption levels of different product types (based
on seasonal and environmental conditions) to optimize logistics
and hence maximize sales;
(4) discover interesting patterns in the movement of products
(especially for the products that have a limited shelf life because
they are prone to expiration, perishability, and contamination) in a
supply chain by analyzing sensory and RFID data.
4. Manufacturing and production.
Manufacturers can use data mining to
(1) predict machinery failures before they occur through the use of
sensory data (enabling what is called condition-based
maintenance);
(2) identify anomalies and commonalities in production systems to
optimize manufacturing capacity; and
(3) discover novel patterns to identify and improve product quality
5. Brokerage and securities trading.
Brokers and traders use data mining to
(1) predict when and how much certain bond prices will change;
(2) forecast the range and direction of stock fluctuations;
(3) assess the effect of particular issues and events on overall
market movements; and
(4) identify and prevent fraudulent activities in securities trading
6. Insurance.
The insurance industry uses data mining techniques to
(1) forecast claim amounts for property and medical coverage
costs for better business planning;
(2) determine optimal rate plans based on the analysis of claims
and customer data;
(3) predict which customers are more likely to buy new policies
with special features; and
(4) identify and prevent incorrect claim payments and fraudulent
activities.
7. Computer hardware and software.
Data mining can be used to
(1) predict disk drive failures well before they actually occur;
(2) identify and filter unwanted Web content and e-mail messages;
(3) detect and prevent computer network security breaches; and
(4) identify potentially insecure software products.
8. Government and defense.
Data mining also has a number of military applications. It can be
used to
(1) forecast the cost of moving military personnel and equipment;
(2) predict an adversary’s moves to develop more successful
strategies for military engagements;
(3) predict resource consumption for better planning and
budgeting; and
(4) identify classes of unique experiences, strategies, and lessons
learned from military operations for better knowledge sharing
9. Travel industry (airlines, hotels / resorts, rental car companies).
Data mining has a variety of uses in the travel industry. It is
successfully used to
(1) predict sales of different services (seat types in airplanes, room
types in hotels/resorts, car types in rental car companies) in order
to optimally price services to maximize revenues as a function of
time-varying transactions (commonly referred to as yield
management);
(2) forecast demand at different locations to better allocate limited
organizational resources;
(3) identify the most profitable customers and provide them with
personalized services to maintain their repeat business; and
(4) retain valuable employees by identifying and acting on the root
causes for attrition
10. Health care.
Data mining has a number of health care applications. It can be
used to
(1) identify people without health insurance and the factors
underlying this undesired phenomenon;
(2) identify novel cost-benefit relationships between different
treatments to develop more effective strategies;
(3) forecast the level and the time of demand at different service
locations to optimally allocate organizational resources; and
(4) understand the underlying reasons for customer and employee
attrition.
11. Medicine.
Data mining can be used to
(1) identify novel patterns to improve survivability of patients with
cancer;
(2) predict success rates of organ transplantation patients to
develop better donor-organ matching policies;
(3) identify the functions of different genes in the human
chromosome (known as genomics);
(4) discover the relationships between symptoms and illnesses to
help medical professionals make informed and correct decisions in
a timely manner.
12. Entertainment industry.
Data mining is successfully used by the entertainment industry to
(1) analyze viewer data to decide what programs to show during
prime time and how to maximize returns by knowing where to
insert advertisements;
(2) predict the financial success of movies before they are
produced to make investment decisions and to optimize the
returns;
(3) forecast the demand at different locations and different times to
better schedule entertainment events and to optimally allocate
resources; and
(4) develop optimal pricing policies to maximize revenues.
13. Homeland security and law enforcement.
Data mining has a number of homeland security and law
enforcement applications. Data mining is often used to
(1) identify patterns of terrorist behaviors
(2) discover crime patterns (e.g., locations, timings, criminal
behaviors, and other related attributes) to help solve criminal
cases in a timely manner;
(3) predict and eliminate potential biological and chemical attacks
to a nation’s critical infrastructure by analyzing special-purpose
sensory data; and
(4) identify and stop malicious attacks on critical information
infrastructures (often called information warfare).
14. Sports.
Data mining was used to improve the performance of National
Basketball Association (NBA) teams in the United States. The NBA
developed Advanced Scout, a PC-based data mining application that
coaching staff use to discover interesting patterns in basketball game
data. The pattern interpretation is facilitated by allowing the user to
relate patterns to videotape. See Bhandari et al. (1997) for details.
DATA MINING PROCESS
In order to systematically carry out data mining projects, a general
process is usually followed.
Based on best practices, data mining researchers and practitioners
have proposed several processes (workflows or simple step-by-
step approaches)
One such standardized process, arguably the most popular one, is the
Cross-Industry Standard Process for Data Mining (CRISP-DM).
CRISP-DM is a sequence of six steps.
It starts with a good understanding of the business and ends with
the deployment of the solution that satisfies the specific business
need.
Even though these steps are sequential in nature, there is usually
a great deal of backtracking.
Because data mining is driven by experience and
experimentation, depending on the problem situation and the
knowledge/experience of the analyst, the whole process can be
very iterative.
Step 1: Business Understanding
A thorough understanding of the managerial need for new
knowledge and an explicit specification of the business objective
Specific questions such as
"What are the common characteristics of the customers we
have lost to our competitors recently?" or
"What are typical profiles of our customers, and how much
value does each of them provide to us?"
need to be addressed.
Then a project plan for finding such knowledge is developed that
specifies the people responsible for collecting the data, analyzing
the data, and reporting the findings.
At this early stage, a budget to support the study should also be
established
Step 2: Data Understanding
Different business tasks require different sets of data.
The main activity of the data mining process is to identify the
relevant data from many available databases.
Some key points must be considered in the data identification
and selection phase.
First and foremost, the analyst should be clear and concise about
the description of the data mining task so that the most relevant
data can be identified.
For example, a retail data mining project may seek to identify
spending behaviors of female shoppers, who purchase seasonal
clothes, based on their demographics, credit card transactions,
and socioeconomic attributes.
Furthermore, the analyst should build an intimate understanding of
the data sources
Example:
where the relevant data are stored and in what form;
what the process of collecting the data is—automated versus
manual;
who the collectors of it are;
and how often it is updated, etc.
In order to better understand the data, the analyst often uses a
variety of statistical and graphical techniques
such as simple statistical summaries of each variable (e.g., for
numeric variables: the average, minimum/maximum, median, and
standard deviation).
Data sources for data selection can vary.
Normally, data sources for business applications include
demographic data (such as income, education, number of
households, and age),
sociographic data (such as hobby, club membership, and
entertainment),
transactional data (sales record, credit card spending, and issued
checks), and so on.
Data can be categorized as quantitative and qualitative.
Step 3: Data Preparation
The purpose of data preparation (more commonly called data
preprocessing) is to take the data identified in the previous step
and prepare them for analysis by data mining methods.
Data are generally
Incomplete (lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data),
Noisy (containing errors or outliers), and
Inconsistent (containing discrepancies in codes or names).
Four main steps are needed to convert the raw, real-world data
into minable datasets:
Data Collection / Selection
Data Cleaning
Data Transformation
Data Reduction
1. Data Collection / Selection
The relevant data are collected from the identified
sources
The necessary records and variables are selected, and
The records coming from multiple data sources are
integrated
2. Data Cleaning
The data are cleaned (this step is also known as
data scrubbing).
The problematic values in the dataset are identified and dealt
with:
Missing values need to be imputed (filled in with the
most probable value) or ignored.
Noisy values (i.e., outliers) need to be identified and
smoothed out.
Inconsistencies (unusual values within a
variable) should be handled using
domain knowledge and/or expert opinion.
3. Data Transformation
Data are transformed for better processing.
For instance, in many cases, the data are
normalized in order to mitigate the potential bias of
one variable with large numeric values (such as
household income) over others with smaller values.
Another transformation that takes place is
discretization and/or aggregation.
In some cases, the numeric variables are converted
to categorical values (e.g., low, medium, and high);
In other cases, a nominal variable’s unique value
range is reduced to a smaller set using concept
hierarchies.
4. Data Reduction
Even though data miners like to have large
datasets, too much data can also be a problem.
One can visualize the data commonly used in data
mining projects as a flat file consisting of two
dimensions: variables (columns) and cases/records
(rows).
In some cases (e.g., image processing and genome
projects with complex microarray data), the number
of variables can be rather large, and the analyst
must reduce the number down to a manageable
size.
Because the variables are treated as different dimensions
that describe the phenomenon from different
perspectives, this process is commonly called
dimensionality reduction in data mining.
Step 4: Model Building
In this step, various modeling techniques are selected and applied
to an already prepared dataset in order to address the specific
business need.
Depending on the business need, the data mining task can be of a
prediction (either classification or regression), an association, or a
clustering type.
Each of these data mining tasks can use a variety of data mining
methods and algorithms.
Among the most popular of these data mining methods and
algorithms are decision trees for classification, k-means for
clustering, and the Apriori algorithm for association rule mining.
Step 5: Testing and Evaluation
The developed models are assessed and evaluated for their
accuracy and generality.
This step assesses the degree to which the selected model (or
models) meets the business objectives and, if so, to what extent.
Another option is to test the developed model(s) in a real-world
scenario if time and budget constraints permit.
Even though the outcomes of the developed models are expected
to relate to the original business objectives, the testing and
evaluation step is a critical and challenging task.
No value is added by the data mining task until the business value
obtained from discovered knowledge patterns is identified and
recognized.
Determining the business value from discovered knowledge
patterns is somewhat similar to playing with puzzles.
The success of this identification operation depends on the
interaction among data analysts, business analysts, and
decision makers
Step 6: Deployment
Development and assessment of the models is not the end of the
data mining project.
The knowledge gained from such exploration will need to be
organized and presented in a way that the end user can
understand and benefit from it.
Depending on the requirements, the deployment phase can be as
simple as generating a report or as complex as implementing a
repeatable data mining process across the enterprise.
In many cases, it is the customer, not the data analyst, who carries
out the deployment steps.
The deployment step may also include maintenance activities for
the deployed models
Other Data Mining Standardized Processes and Methodologies
Figure: Ranking of Data Mining Processes and Methodologies
DATA MINING METHODS
A variety of methods are available for performing data mining
studies, including classification, regression, clustering, and
association.
Most data mining software tools employ more than one technique
(or algorithm) for each of these methods.
1. Classification
Classification is perhaps the most frequently used data mining
method
A popular member of the machine-learning family of techniques,
classification learns patterns from past data (a set of information—
traits, variables, features—on characteristics of previously
labeled items, objects, or events) in order to place new instances
(with unknown labels) into their respective groups or classes.
For example, one could use classification to predict whether the
weather on a particular day will be "sunny," "rainy," or "cloudy."
Popular classification tasks include credit approval (i.e., good or
bad credit risk),
store location (e.g., good, moderate, bad), target marketing (e.g.,
likely customer, no hope),
fraud detection (i.e., yes, no), and telecommunication (e.g., likely
to turn to another phone company, yes/no).
If what is being predicted is a class label (e.g., "sunny," "rainy," or
"cloudy"), the prediction problem is called a classification,
whereas if it is a numeric value (e.g., a temperature such as 68°F),
the prediction problem is called a regression.
The most common two-step methodology of classification-type
prediction involves
model development/training and
model testing/deployment.
In the model development phase, a collection of input data,
including the actual class labels, is used.
After a model has been trained, it is tested against the
holdout sample for accuracy assessment and can then be used to
predict classes of new data instances (where the class label is unknown).
Several factors are considered in assessing the model, including the
following:
1. Predictive accuracy.
The model’s ability to correctly predict the class label of new or
previously unseen data.
Prediction accuracy is the most commonly used assessment
factor
To compute this measure, actual class labels of a test dataset
are matched against the class labels predicted by the model.
The accuracy can then be computed as the accuracy rate,
which is the percentage of test dataset samples correctly
classified by the model
2. Speed.
The computational costs involved in generating and using the
model, where faster is deemed to be better.
3. Robustness.
The model’s ability to make reasonably accurate predictions,
given noisy data or data with missing and erroneous values.
4. Scalability.
The ability to construct a prediction model efficiently given a
rather large amount of data.
5. Interpretability.
The level of understanding and insight provided by the model
(e.g., how and/or what the model concludes on certain
predictions).
Estimating the True Accuracy of Classification Models
In classification problems, the primary source for accuracy
estimation is the confusion matrix (also called a classification
matrix or a contingency table).
The numbers along the diagonal (from upper left to lower right)
represent correct decisions, and the numbers outside this diagonal
represent the errors.
SIMPLE SPLIT
The simple split partitions the data into two mutually exclusive
subsets called a training set and a test set (or holdout set).
Typically, two-thirds of the data are used as the training set and the
remaining one-third as the test set.
The training set is used by the inducer (model builder), and the built
classifier is then tested on the test set.
K-FOLD CROSS-VALIDATION (rotation estimation)
The complete dataset is randomly split into k mutually exclusive
subsets of approximately equal size.
The classification model is trained and tested k times.
Each time, it is trained on all but one fold and then tested on the
remaining single fold.
ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES
1. Leave-one-out.
Similar to k-fold cross-validation, except that each fold contains
a single record, so every data point is used for testing once.
This is a time-consuming methodology, but for small datasets,
it is sometimes a viable option.
2. Bootstrapping.
With bootstrapping, a fixed number of instances from the
original data are sampled (with replacement) for training and the
rest of the dataset is used for testing.
This process is repeated as many times as desired.
3. Jackknifing.
Similar to the leave-one-out methodology;
The accuracy is calculated by leaving one sample out at each
iteration of the estimation process.
4. Area under the ROC curve.
The area under the ROC curve is a graphical assessment
technique
true positive rate is plotted on the Y-axis and false positive rate
is plotted on the X-axis.
The area under the ROC curve determines the accuracy
measure of a classifier: a value of 1 indicates a perfect
classifier, whereas 0.5 indicates a classifier no better than random
chance.
CLASSIFICATION TECHNIQUES
Decision tree analysis.
Statistical analysis.
Neural networks.
Case-based reasoning
Bayesian classifiers
Genetic algorithms
Rough sets
2. Cluster Analysis
Data mining method for classifying items, events, or concepts into
common groupings called clusters.
The method is commonly used in biology, medicine, genetics,
social network analysis, anthropology, archaeology, astronomy,
character recognition, and even in management information
system development.
Cluster analysis is an exploratory data analysis tool for solving
classification problems.
The objective is to sort cases (e.g., people, things, events) into
groups, or clusters, so that the degree of association is strong
among members of the same cluster and weak among members
of different clusters.
Each cluster describes the class to which its members belong.
Cluster analysis results may be used to:
Identify a classification scheme (e.g., types of customers)
Suggest statistical models to describe populations
Indicate rules for assigning new cases to classes for identification,
targeting, and diagnostic purposes
Provide measures of definition, size, and change in what were
previously broad concepts
Find typical cases to label and represent classes
Decrease the size and complexity of the problem space for other
data mining methods
Identify outliers in a specific domain (e.g., rare-event detection)
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS
Clustering algorithms usually require one to specify the number of
clusters to find.
If this number is not known from prior knowledge, it should be
chosen in some way
The following are among the most commonly referenced ones:
Look at the percentage of variance explained as a function of the
number of clusters; that is, choose a number of clusters so that
adding another cluster would not give much better modeling of the
data.
Set the number of clusters to √(n/2), where n is the number of
data points.
Use the Akaike Information Criterion, which is a measure of the
goodness of fit
Use Bayesian Information Criterion, which is a model-selection
criterion to determine the number of clusters
ANALYSIS METHODS
Statistical methods
Neural networks
Fuzzy logic
Each of these methods generally works with one of two general method
classes:
Divisive.
All items start in one cluster and are broken apart.
Agglomerative.
All items start in individual clusters, and the clusters are
joined together.
K-MEANS CLUSTERING ALGORITHM
The k-means clustering algorithm (where k stands for the
predetermined number of clusters) is arguably the most referenced
clustering algorithm.
It has its roots in traditional statistical analysis. As the name
implies, the algorithm assigns each data point (customer, event,
object, etc.) to the cluster whose center (also called centroid) is the
nearest.
The center is calculated as the average of all the points in the
cluster; that is, its coordinates are the arithmetic mean for each
dimension separately over all the points in the cluster.
Initialization step: Choose the number of clusters (i.e., the value
of k).
Step 1: Randomly generate k random points as initial cluster
centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Recompute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some convergence
criterion is met (usually that the assignment of points to clusters
becomes stable).
Association Rule Mining
Association rule mining is a popular data mining method
Association rule mining aims to find interesting relationships
(affinities) between variables (items) in large databases.
It is commonly called a market-basket analysis.
The main idea in market basket analysis is to identify strong
relationships among different products (or services) that are
usually purchased together
The outcome of the analysis is invaluable information that can be
used to better understand customer-purchase behavior in order to
maximize the profit from business transactions.
A business can take advantage of such knowledge by
(1) putting the items next to each other to make it more convenient
for the customers to pick them
(2) promoting the items as a package (do not put one on sale if
the other(s) is on sale); and
(3) placing them apart from each other so that the customer has to
walk the aisles to search for them, potentially seeing and buying
other items along the way.
Applications of market-basket analysis include cross-marketing,
cross-selling, store design, catalog design, e-commerce site
design, optimization of online advertising, product pricing, and
sales/promotion configuration.
"Are all association rules interesting and useful?"
In order to answer such a question, association rule mining uses
two common metrics: support and confidence.
Several algorithms are available for generating association rules.
Some well-known algorithms include
Apriori,
Eclat, and
FP-Growth.
These algorithms only do half the job, which is to identify the
frequent itemsets in the database.
A frequent itemset is an arbitrary number of items that frequently
go together in a transaction
Once the frequent itemsets are identified, they need to be
converted into rules with antecedent and consequent parts.
APRIORI ALGORITHM
The Apriori algorithm is the most commonly used algorithm to
discover association rules.
Given a set of itemsets (e.g., sets of retail transactions, each
listing individual items purchased)
The algorithm attempts to find subsets that are common to at
least a minimum number of the itemsets (i.e., complies with a
minimum support).
Apriori uses a bottom-up approach, where frequent subsets are
extended one item at a time (a method known as candidate
generation, whereby the size of frequent subsets increases
from one-item subsets to two-item subsets, then three-item
subsets, etc.),
Groups of candidates at each level are tested against the data
for minimum support. The algorithm terminates when no further
successful extensions are found.
TEXT MINING
Text mining (text data mining or knowledge discovery in textual
databases)
It is the semiautomated process of extracting patterns from large
amounts of unstructured data sources.
Text mining is essentially the same as data mining,
but with text mining, the input to the process is a collection of
unstructured (or less structured) data files such as Word
documents, PDF files, text excerpts, XML files, and so on.
Text mining has two main steps:
1. Imposing structure on the text-based data sources
2. Extracting relevant information and knowledge from this
structured text-based data using data mining techniques and
tools
TEXT MINING CONCEPTS AND DEFINITIONS
Benefits of text mining
Law (court orders)
Academic research (research articles)
Finance (quarterly reports)
Medicine (discharge summaries)
Biology (molecular interactions)
Technology (patent files), and
Marketing (customer comments)
Example
Free-form text-based interactions with customers in the form of
complaints
Electronic communications and e-mail.
Used to classify and filter junk e-mail
Used to automatically prioritize e-mail based on importance
level as well as to generate automatic responses
Application areas of text mining:
Information extraction - Identification of key phrases and
relationships
Topic tracking - Based on a user profile and documents that a
user views, text mining can predict other documents
Summarization - To save time on the part of the reader.
Categorization - Identifying the main themes of a document and
then placing them into a predefined set of categories based on
those themes.
Clustering - Grouping similar documents without having a
predefined set of categories.
Concept linking - Connects related documents by identifying their
shared concepts
Question answering - Finding the best answer to a given
question through knowledge-driven pattern matching.
NATURAL LANGUAGE PROCESSING
NLP is an important component of text mining
NLP is a subfield of artificial intelligence and computational
linguistics.
NLP studies the problem of "understanding" natural human
language.
NLP converts human language (such as textual documents)
into more formal representations (in the form of numeric and
symbolic data) that are easier for computer programs to
manipulate.
The goal of NLP is to move beyond syntax-driven text
manipulation (which is often called "word counting") to a true
understanding and processing of natural language.
Natural human language is vague, however, and a true understanding
of its meaning requires extensive knowledge of the topic.
Challenges associated with the implementation of NLP
Part-of-speech tagging - It is difficult to mark up terms in a text as
corresponding to a particular part of speech (such as nouns, verbs,
adjectives, and adverbs)
Text segmentation - Some written languages, such as Chinese,
Japanese, and Thai, do not have single-word boundaries. In these
instances, the text-parsing task requires the identification of word
boundaries, which is often a difficult task.
Word sense disambiguation - Many words have more than one
meaning. Selecting the meaning that makes the most sense can
only be accomplished by taking into account the context within
which the word is used.
Syntactic ambiguity - The grammar for natural languages is
ambiguous; that is, multiple possible sentence structures often
need to be considered.
Imperfect or irregular input - Foreign or regional accents and
vocal impediments in speech and typographical or grammatical
errors in texts make the processing of the language an even more
difficult task.
Speech acts - A sentence can often be considered an action by
the speaker.
The sentence structure alone may not contain enough
information to define this action.
For example, "Can you pass the class?" requests a
simple yes/no answer, whereas
"Can you pass the salt?" is a request for a physical
action to be performed.
WordNet is a laboriously hand-coded database of English words,
their definitions, sets of synonyms, and various semantic relations
between synonym sets.
It is a major resource for NLP applications, but it has proven to be
very expensive to build and maintain manually
An important area of CRM, where NLP is making a significant
impact, is sentiment analysis.
Sentiment analysis is a technique used to detect favorable and
unfavorable opinions toward specific products and services using
large numbers of textual data sources (e.g., customer feedback in
the form of Web postings).
NLP has successfully been applied to a variety of tasks via
computer programs to automatically process natural human
language.
Following are among the most popular of these tasks:
1. Information retrieval.
The science of searching for relevant documents, finding
specific information within them, and generating metadata as to their
contents.
2. Information extraction.
A type of information retrieval whose goal is to automatically
extract structured information, such as categorized and contextually and
semantically well-defined data from a certain domain, using unstructured,
machine-readable documents.
3. Named-entity recognition.
Also known as entity identification and entity extraction, this
subtask of information extraction seeks to locate and classify atomic
elements in text into predefined categories
4. Question answering.
The task of automatically answering a question posed in
natural language;
To find the answer to a question, the computer program may
use either a prestructured database or a collection of natural language
documents (a text corpus such as the World Wide Web).
5. Automatic summarization.
The creation by a computer program of a shortened version of a
textual document that contains the most important points of the
original document.
6. Natural language generation.
Systems convert information from computer databases into
readable human language.
7. Natural language understanding.
Systems convert samples of human language into more
formal representations that are easier for computer programs to
manipulate.
8. Machine translation.
The automatic translation of one human language to another
9. Foreign language reading.
A computer program that assists a nonnative language
speaker to read a foreign language with correct pronunciation and
accents on different parts of the words.
10. Foreign language writing.
A computer program that assists a nonnative language user
in writing in a foreign language.
11. Speech recognition.
Converts spoken words to machine-readable input.
12. Text-to-speech.
Also called speech synthesis, a computer program
automatically converts normal language text into human speech.
13. Text proofing.
A computer program reads a proof copy of a text in order to
detect and correct any errors
14. Optical character recognition.
The automatic translation of images of handwritten,
typewritten, or printed text (usually captured by a scanner) into
machine-editable textual documents
TEXT MINING APPLICATIONS
1. Marketing Applications
Text mining can be used to increase cross-selling and up-selling
by analyzing the unstructured data generated by call centers
Blogs, user reviews of products at independent Web sites, and
discussion board postings are a gold mine of customer sentiment.
Text mining is used to predict customer perceptions and subsequent
purchasing behavior.
2. Security Applications
ECHELON surveillance system - It is assumed to be capable of
identifying the content of telephone calls, faxes, e-mails, and other
types of data, and of intercepting information sent via satellites and
public switched telephone networks.
EUROPOL developed an integrated system capable of accessing,
storing, and analyzing vast amounts of structured and unstructured
data sources in order to track transnational organized crime.
The U.S. Federal Bureau of Investigation (FBI) and the Central
Intelligence Agency (CIA) jointly developed a supercomputer data
and text mining system. The system is
expected to create a gigantic data warehouse along with a variety
of data and text mining modules to meet the knowledge-discovery
needs of federal, state, and local law enforcement agencies.
Another security application of text mining is deception detection.
3. Biomedical Applications
Experimental techniques such as DNA microarray analysis, serial
analysis of gene expression (SAGE), and mass spectrometry
proteomics, among others, are generating large amounts of data
related to genes and proteins.
Knowing the location of a protein within a cell can help to
determine its potential as a drug target
4. Academic Applications
Text Mining provides semantic cues to machines to answer
specific queries.
TEXT MINING PROCESS
Context diagram for the text mining process
As the context diagram indicates:
The input into the text-based knowledge discovery process is the
unstructured as well as structured data collected, stored, and
made available to the process.
The output of the process is the context-specific knowledge that
can be used for decision making.
The controls, also called the constraints (inward connection to the
top edge of the box), of the process include
software and hardware limitations,
privacy issues, and the
difficulties related to processing the text
The mechanisms of the process include proper techniques,
software tools, and domain expertise.
The text mining process can be broken down into three
consecutive tasks,
Task 1 : Establish the Corpus
Task 2: Create the Term–Document Matrix
Task 3: Extract Knowledge
each of which has specific inputs to generate certain outputs.
The three-step text mining process
Task 1: Establish the corpus
Collect all relevant unstructured data
(e.g., textual documents, XML files, emails, Web pages,
short notes, voice recordings…)
Digitize, standardize the collection
(e.g., all in ASCII text files)
Place the collection in a common place
(e.g., in a flat file, or in a directory as separate files)
Task 2: Create the Term–Document Matrix
The digitized and organized documents (the corpus) are used to
create the term–document matrix (TDM).
In the TDM, rows represent the documents and columns represent
the terms.
The relationships between the terms and documents are
characterized by indices (i.e., a relational measure that can be as
simple as the number of occurrences of the term in respective
documents).
Terms:          investment   project      software      development   SAP   ...
Documents       risk         management   engineering
Document 1      1            1
Document 2                                1
Document 3                   3                          1
Document 4                                1
Document 5                   2                                        1
Document 6      1                                       1
...
Should all terms be included?
Stop words, include words
Synonyms, homonyms
Stemming
What is the best representation of the indices (values in
cells)?
Row counts; binary frequencies; log frequencies;
Inverse document frequency
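The choices above map directly onto code. A minimal sketch with scikit-learn (an assumed library choice; the three-document corpus is made up), showing raw counts and inverse-document-frequency weighting:

    # Build a term-document matrix, then reweight it with tf-idf.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    corpus = ["investment risk in software engineering projects",
              "project management for SAP development",
              "managing investment risk during development"]

    vectorizer = CountVectorizer(stop_words="english")  # drops stop words
    counts = vectorizer.fit_transform(corpus)           # rows=docs, cols=terms
    print(vectorizer.get_feature_names_out())
    print(counts.toarray())     # raw counts (binary=True gives binary indices)

    tfidf = TfidfTransformer().fit_transform(counts)    # idf-weighted indices
    print(tfidf.toarray().round(2))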
TDM is a sparse matrix. How can we reduce the
dimensionality of the TDM?
Manual – a domain expert goes through it
Eliminate terms with very few occurrences in very
few documents (?)
Transform the matrix using singular value
decomposition (SVD)
SVD is similar to principal component analysis
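A sketch of the SVD route, using scikit-learn's TruncatedSVD (commonly used for latent semantic analysis on sparse TDMs; the corpus and component count are illustrative):

    # Project a sparse tf-idf TDM onto a few latent dimensions with SVD.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = ["investment risk in software projects",
              "project management and SAP development",
              "software engineering process improvement",
              "managing development risk"]

    tdm = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    svd = TruncatedSVD(n_components=2, random_state=0)  # keep 2 dimensions
    print(svd.fit_transform(tdm).round(2))              # dense docs-by-2 matrix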
Task 3: Extract patterns/knowledge
Classification (text categorization)
Clustering (natural groupings of text; see the sketch after this list)
Improve search recall
Improve search precision
Scatter/gather
Query-specific clustering
Association
Trend Analysis (…)
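As an illustration of the clustering task, a hedged k-means sketch over tf-idf vectors (the documents and cluster count are made up):

    # Group documents into natural clusters with k-means on tf-idf vectors.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["stock market investment risk",
            "equity portfolio risk analysis",
            "agile software project management",
            "scrum practices for development teams"]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for label, doc in zip(labels, docs):
        print(label, doc)   # finance vs. software documents separate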
TEXT MINING TOOLS
The following are some of the popular text mining tools, which we
classify as:
Commercial software tools
Free software tools
1. Commercial Software Tools
The following are some of the most popular software tools used for
text mining.
Note that many companies offer demonstration versions of their
products on their Web sites.
1. ClearForest offers text analysis and visualization tools
(clearforest.com).
2. IBM Intelligent Miner Data Mining Suite, now fully integrated
into IBM’s InfoSphere Warehouse software, includes data and
text mining tools (ibm.com).
3. Megaputer Text Analyst offers semantic analysis of free-form
text, summarization, clustering, navigation, and natural language
retrieval with search dynamic refocusing (megaputer.com).
4. SAS Text Miner provides a rich suite of text processing and
analysis tools (sas.com).
5. SPSS Text Mining for Clementine extracts key concepts,
sentiments, and relationships from call-center notes, blogs, e-
mails, and other unstructured data and converts them to a
structured format for predictive modeling (spss.com).
6. The Statistica Text Mining engine provides easy-to-use text
mining functionality with exceptional visualization capabilities
(statsoft.com).
7. VantagePoint provides a variety of interactive graphical views
and analysis tools with powerful capabilities to discover knowledge
from text databases (vpvp.com).
8. The WordStat analysis module from Provalis Research analyzes
textual information such as responses to open-ended questions
and interviews (provalisresearch.com).
2. Free Software Tools
Free software tools, some of which are open source, are available from
a number of nonprofit organizations:
1. GATE is a leading open source toolkit for text mining. It has a
free open source framework (or SDK) and graphical development
environment (gate.ac.uk).
2. RapidMiner has a community edition of its software that includes
text mining modules (rapid-i.com).
3. LingPipe is a suite of Java libraries for the linguistic analysis of
human language (alias-i.com/lingpipe).
4. S-EM (Spy-EM) is a text classification system that learns from
positive and unlabeled examples (cs.uic.edu/~liub/S-EM/S-EM-download.html).
5. Vivisimo/Clusty is a Web search and text-clustering engine
(clusty.com).
WEB MINING OVERVIEW
Web mining (or Web data mining) is the process of discovering
intrinsic relationships (i.e., interesting and useful information) from
Web data, which are expressed in the form of textual, linkage, or
usage information.
The Web is perhaps the world’s largest data and text repository.
The amount of information on the Web is growing rapidly every
day.
A lot of interesting information can be found online:
whose homepage is linked to which other pages
how many people have links to a specific Web page
how a particular site is organized
The Web also poses great challenges for effective and efficient knowledge
discovery:
1. The Web is too big for effective data mining
2. The Web is too complex.
Web pages lack a unified structure. They contain far more
authoring-style and content variation than traditional text documents.
3. The Web is too dynamic.
Not only does the Web grow rapidly, but its content is constantly
being updated: blogs, news stories, stock market results, and
weather reports, for example.
4. The Web is not specific to a domain.
Web users have very different backgrounds, interests, and usage
purposes.
5. The Web has everything.
Only a small portion of the information on the Web is truly relevant
or useful to someone
Three main areas of Web mining:
Web content mining
Web structure mining
Web usage mining.
1. WEB CONTENT MINING
Web content mining refers to the extraction of useful information
from Web pages.
The documents may be extracted in some machine-readable
format so that automated techniques can generate some
information about the Web pages.
Web crawlers are used to read through the content of a Web site
automatically.
The information gathered may include document characteristics
similar to what is used in text mining, but it may include additional
concepts such as the document hierarchy
Web content mining can also be used to enhance the results
produced by search engines
In addition to text, Web pages also contain hyperlinks pointing from
one page to another.
Hyperlinks contain a significant amount of hidden human
annotation
When a Web page developer includes a link pointing to another
Web page, this can be regarded as the developer’s endorsement
of the other page.
Therefore, the vast amount of Web linkage information provides a
rich collection of information about the relevance, quality, and
structure of the Web’s contents.
A search on the Web to obtain information on a specific topic
usually returns a few relevant, high-quality Web pages and a larger
number of unusable Web pages.
Use of an index based on authoritative pages will improve the
search results and the ranking of relevant pages.
The idea of authority stems from earlier information retrieval work
using citations among journal articles to evaluate the impact of
research papers
There are significant differences between the citations in research
articles and hyperlinks on Web pages:
not every hyperlink represents an endorsement
one authority will rarely have its Web page point to rival
authorities in the same domain
authoritative pages are seldom particularly descriptive
The structure of Web hyperlinks has led to another important
category of Web pages called a hub.
A hub is one or more Web pages that provide a collection of
links to authoritative pages.
Hub pages provide links to a collection of prominent sites on a
specific topic of interest.
A hub could be a list of recommended links on an individual’s
homepage or recommended reference sites on a course Web page.
Hyperlink-induced topic search (HITS)
HITS is a link analysis algorithm that rates Web pages using the
hyperlink information contained within them.
The HITS algorithm collects a base document set for a specific
query. It then recursively calculates the hub and authority values
for each document.
To gather the base document set, a root set that matches the
query is fetched from a search engine.
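A minimal HITS sketch on a made-up link graph, using the networkx library (an assumed choice; edges point from the linking page to the linked page):

    # Compute hub and authority scores for a toy Web graph with HITS.
    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([("hub1", "authA"), ("hub1", "authB"),
                      ("hub2", "authA"), ("hub2", "authB"), ("hub2", "authC"),
                      ("authA", "authC")])

    hubs, authorities = nx.hits(G, max_iter=100)  # recursive score updates
    print(max(authorities, key=authorities.get))  # most authoritative page
    print(max(hubs, key=hubs.get))                # best hub page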
2. WEB STRUCTURE MINING
It is the process of extracting useful information from the links
embedded in Web documents.
It is used to identify authoritative pages and hubs.
Just as links going to a Web page may indicate a site’s popularity
(or authority), links within the Web page (or the complete Web site)
may indicate the depth of coverage of a specific topic.
Analysis of links is very important in understanding the
interrelationships among large numbers of Web pages, leading to
a better understanding of a specific Web community
3. WEB USAGE MINING
Web usage mining is the extraction of useful information from data
generated through Web page visits and transactions.
Three types of data are generated through Web page visits:
1. Automatically generated data stored in server access logs,
referrer logs, agent logs, and client-side cookies
2. User profiles
3. Metadata, such as page attributes, content attributes, and
usage data
Analysis of the information collected by Web servers can help us
better understand user behavior. Analysis of this data is often
called clickstream analysis.
By using the data and text mining techniques, a company might be
able to determine interesting patterns.
Clickstream analysis:
Useful in determining where to place online advertisements.
Clickstream analysis might also be useful for knowing when
visitors access a site.
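A toy clickstream sketch in plain Python; the simplified log records (visitor, timestamp, page) are an assumption about what a server access log might yield after parsing:

    # Count page views per page and per hour from simplified log records.
    from collections import Counter

    log = [("visitor1", "2024-01-05 09:12", "/products"),
           ("visitor2", "2024-01-05 09:48", "/products"),
           ("visitor1", "2024-01-05 10:03", "/checkout"),
           ("visitor3", "2024-01-05 21:30", "/products")]

    page_views = Counter(page for _, _, page in log)
    hour_views = Counter(ts[11:13] for _, ts, _ in log)   # hour of day

    print(page_views.most_common(1))  # busiest page: where to place ads
    print(hour_views.most_common(1))  # busiest hour: when visitors arrive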
Process of extracting knowledge from clickstream data
Applications of Web mining:
1. Determine the lifetime value of clients.
2. Design cross-marketing strategies across products.
3. Evaluate promotional campaigns.
4. Target electronic ads and coupons at user groups based on
user access patterns.
5. Predict user behavior based on previously learned rules and
users’ profiles.
6. Present dynamic information to users based on their interests
and profiles
Web usage mining software
SPATIAL DATA MINING
Spatial data mining is the process of discovering interesting, useful,
non-trivial patterns from large spatial datasets.
PROCESS MINING
"The idea of process mining is to discover, monitor and improve
real processes (i.e., not assumed processes) by extracting
knowledge from event logs readily available in today’s
(information) systems.
Process mining includes (automated) process discovery (i.e.,
extracting process models from an event log), conformance
checking (i.e., monitoring deviations by comparing model and log),
social network/organizational mining, automated construction of
simulation models, model extension, model repair, case prediction,
and history-based recommendations."
Events and event logs:
It is assumed that an event refers to a process activity or task,
which is a well-defined step in the process and is related to a
particular case, i.e., a process instance.
Another assumption is that these events are ordered.
The case or process instance is a specific occurrence or execution
of a business process, while an activity is an operation, part of a
case, that is being executed.
An event log stores information about cases and activities, but also
information about event performers, event timestamps (moment
when the event is triggered) or data elements recorded with the
event
Process mining activities such as extracting and filtering data from
information systems are not trivial.
Data may be distributed over a variety of sources, event data may
be incomplete, an event log may contain outliers, logs may contain
events at different levels of granularity, etc.
The Process Mining Manifesto gives the following guidelines
referring to the event data:
events should be trustworthy,
event logs should be complete,
any recorded event should have well-defined semantics
the event data should be safe
Process mining types
Three process mining types:
discovery,
conformance and
enhancement.
1. Process discovery:
A process discovery technique produces a process model from
an event log, without using any a priori information about the
process; it is the most prominent process mining technique
(see the sketch after this list).
2. Conformance
Conformance compares an existing process model with an
event log of the same process
It is used to check if reality, as recorded in the log, conforms to
the model and vice versa.
Conformance checking can be used to:
check the quality of documented processes (assess
whether they describe reality accurately);
identify deviating cases and understand what they have
in common, for auditing purposes;
judge the quality of a discovered process model
3. Enhancement:
Enhancement extends or improves an existing process model
using information about the actual process recorded in the event
log, with the aim of changing or extending the a priori model.
For instance, by using time stamps in the event log one can
extend the model to show bottlenecks, service levels,
throughput times and frequencies
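To give discovery a concrete flavor, here is a hedged plain-Python sketch that derives the directly-follows relation from a made-up event log; real discovery algorithms (e.g., those in ProM) go on to construct full process models from such relations:

    # Derive directly-follows counts from (case, activity) events,
    # assuming events are already ordered within each case.
    from collections import defaultdict

    event_log = [("case1", "register"), ("case1", "check"), ("case1", "approve"),
                 ("case2", "register"), ("case2", "check"), ("case2", "reject"),
                 ("case3", "register"), ("case3", "approve")]

    traces = defaultdict(list)
    for case, activity in event_log:
        traces[case].append(activity)

    follows = defaultdict(int)
    for activities in traces.values():
        for a, b in zip(activities, activities[1:]):
            follows[(a, b)] += 1

    for (a, b), n in sorted(follows.items()):
        print(f"{a} -> {b}: {n}")   # e.g. register -> check: 2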
Process mining software tools and techniques:
Many contemporary process mining software tools have been
developed and are continuously improved, such as: Celonis
(Celonis GmbH), Disco (Fluxicon), EDS (StereoLOGIC Ltd),
Fujitsu (Fujitsu Ltd), Icaro (Icaro Tech), Icris (Icris), LANA
(Lana Labs), Minit (Gradient ECM), myInvenio (Cognitive
Technology), ProcessGold (ProcessGold International B.V.),
ProM (open source, hosted at TU/e), ProM Lite (open source,
hosted at TU/e), QPR (QPR), RapidProM (open source, hosted at
TU/e), Rialto (Exeura), SNP (SNP Schneider-Neureither &
Partner AG), and ARIS PPM (Software AG).
Currently, the most prominent open-source tool is ProM (Process
Mining Framework), as it offers a variety of plug-ins that enable
application of various algorithms and the latest developments in
process mining research.
Three main categories of process mining algorithms:
Deterministic algorithms,
Heuristic algorithms and
Genetic algorithms.
Deterministic algorithms always generate repeatable models: all
of the data has to be known, and the mining output is constant
for a given input.
1.2. BUSINESS INTELLIGENCE PROCESS
What is Business Intelligence?
BI (Business Intelligence) is a set of processes, architectures, and
technologies that convert raw data into meaningful information that
drives profitable business actions.
It is a suite of software and services to transform data into
actionable intelligence and knowledge.
BI has a direct impact on organization's strategic, tactical and
operational business decisions.
BI supports fact-based decision making using historical data
rather than assumptions and gut feeling.
BI tools perform data analysis and create reports, summaries,
dashboards, maps, graphs, and charts to provide users with
detailed intelligence about the nature of the business.
Why is BI important?
Measurement: creating KPIs (Key Performance Indicators) based
on historical data
Identify and set benchmarks for varied processes.
With BI systems organizations can identify market trends and
spot business problems that need to be addressed.
BI helps on data visualization that enhances the data quality and
thereby the quality of decision making.
BI systems can be used not just by enterprises but also by SMEs
(Small and Medium Enterprises).
How are Business Intelligence systems implemented?
Here are the steps:
Step 1:
Raw data from corporate databases is extracted. The data could
be spread across multiple heterogeneous systems.
Step 2:
The data is cleaned and transformed into the data warehouse.
Tables can be linked, and data cubes are formed.
Step 3:
Using the BI system, the user can ask queries, request ad hoc
reports, or conduct any other analysis.
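A toy end-to-end sketch of the three steps, with SQLite standing in for both the source system and the warehouse (all table names and data are made up):

    # Step 1: extract raw operational rows; Step 2: clean/load into a
    # warehouse table; Step 3: answer an ad hoc analytical query.
    import sqlite3

    cur = sqlite3.connect(":memory:").cursor()
    cur.execute("CREATE TABLE raw_sales (product TEXT, amount REAL)")
    cur.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                    [("phone", 300.0), ("phone", None), ("laptop", 900.0)])

    # Transformation: drop rows with missing amounts while loading.
    cur.execute("CREATE TABLE dw_sales AS SELECT product, amount "
                "FROM raw_sales WHERE amount IS NOT NULL")

    for row in cur.execute("SELECT product, SUM(amount) FROM dw_sales "
                           "GROUP BY product"):
        print(row)   # revenue per product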
Examples of Business Intelligence Systems Used in Practice
In an Online Transaction Processing (OLTP) system, the
information fed into the product database could be:
add a product line
change a product price
Correspondingly, in a Business Intelligence system, a query
executed for the product subject area could be: "Did the addition
of the new product line or the change in product price increase
revenues?"
In the advertising database of an OLTP system, the information
fed in could be:
change an advertisement option
increase the radio budget
Four types of BI users
The following are the four key players who use a Business
Intelligence system:
1. The Professional Data Analyst:
The data analyst is a statistician who always needs to drill deep
down into data. A BI system helps them to get fresh insights to develop
unique business strategies.
2. The IT users:
The IT user also plays a dominant role in maintaining the BI
infrastructure.
3. The head of the company:
CEO or CXO can increase the profit of their business by improving
operational efficiency in their business.
4. The Business Users:
Business intelligence users can be found across the
organization. There are mainly two types of business users:
the casual business intelligence user and
the power user.
Advantages of Business Intelligence
Here are some of the advantages of using Business Intelligence System:
1. Boost productivity
With a BI program, businesses can create reports with a single
click, which saves lots of time and resources. It also allows
employees to be more productive on their tasks.
2. To improve visibility
BI also helps to improve the visibility of business processes and
makes it possible to identify any areas that need attention.
3. Fix accountability:
A BI system assigns accountability in the organization, as there
must be someone who owns accountability and ownership for the
organization’s performance against its set goals.
4. It gives a bird's eye view:
A BI system helps decision makers get an overall bird’s-eye view
of the organization through typical BI features like dashboards and
scorecards.
5. It streamlines business processes:
BI takes out all complexity associated with business processes. It
also automates analytics by offering predictive analysis, computer
modeling, benchmarking and other methodologies.
6. It allows for easy analytics:
BI software has democratized analytics, allowing even
nontechnical, non-analyst users to collect and process data quickly.
This puts the power of analytics into the hands of many people.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as medium-
sized enterprises. The use of such a system may be expensive for
routine business transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing the
data warehouse. It can be so complex that it makes business
techniques rigid to deal with.
3. Limited use
Like many new technologies, BI was first built with the buying
capacity of rich firms in mind. Therefore, a BI system is not yet
affordable for many small and medium-sized companies.
4. Time Consuming Implementation
It can take almost one and a half years for a data warehousing system
to be completely implemented. Therefore, it is a time-consuming process.
Trends in Business Intelligence
Artificial Intelligence:
Gartner’s report indicates that AI and machine learning now take on
complex tasks once done by human intelligence. This capability is
being leveraged to come up with real-time data analysis and
dashboard reporting.
Collaborative BI:
BI software combined with collaboration tools, including social
media and other latest technologies, enhances working and sharing
by teams for collaborative decision making.
Embedded BI:
Embedded BI allows the integration of BI software or some of its
features into another business application to enhance and extend
its reporting functionality.
Cloud Analytics:
BI applications will soon be offered in the cloud, and more
businesses will shift to this technology. Predictions suggest that
within a couple of years, spending on cloud-based analytics will
grow 4.5 times faster.
Prepared by,
D. DURAI KUMAR,
Head of the Department,
Department of Information Technology,
GTEC.