CHAPTER 2
DATA MINING AND ITS APPLICATION
2.1 Data Mining (DM)
“Knowledge shows the way to Power and Success”
Data mining (DM) technology originated to meet people’s needs. DM is sometimes also
called Knowledge Discovery in Databases (KDD). An enormous amount of data and
information is being collected with the help of computing devices and the latest
technologies. Data is now everywhere: business transactions, government, healthcare,
websites, scientific data, and so on. Retrieval alone is not enough for decision-making,
so DM is used to summarize data into valuable information, that is, to discover
knowledge and patterns in raw data [9].
In the beginning, we simply stored all data. Unfortunately, these gigantic collections of
data, accumulated in dissimilar data structures, very rapidly became overwhelming. DM can
extract implicit but potentially useful information and knowledge, which people do not
know in advance, from large amounts of noisy, incomplete, random and fuzzy data in
practical applications. DM is an active field and a powerful means of extracting useful
knowledge from massive amounts of data, bridging the gap between data and knowledge.
Another definition of DM is the exploration and analysis of huge quantities of data in
order to discover valid, novel, potentially useful, and ultimately understandable
patterns in data. It is the process of analyzing large databases with intelligent
algorithms to find patterns that are:
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Valuable: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
DM and KDD form a new interdisciplinary field, merging ideas from statistics, machine
learning, databases and parallel computing.
Researchers have defined the terms ‘data mining’ and KDD in many ways in the literature.
2.2 KDD (Knowledge Discovery in Databases)
The KDD process is a data mining methodology used to extract hidden knowledge from a
large database through pre-processing and data transformation steps.
Identification of Goal: Definition of Problem, Application Goal, Known Prior
Target Data Set: Data Set Selection, Data Set Creation
Data Pre-Processing: Removing Noisy Data, Handling Missing Data
Data Transformation: Find Useful Feature, Find Weighted Value
Data Mining: Choosing DM Fun., Search for Presentation
Presentation: Visualization, Replace Redundant Pattern
Figure 2.1: KDD Process
This research predicts diabetes using the Knowledge Discovery in Databases (KDD)
methodology. KDD is the process of extracting knowledge from large databases and
emphasizes the “high-level” application of particular data mining methods. The KDD
process consists of nine steps, which are iterative and interactive in nature [9].
Because the process is iterative at each step, one might have to move back to a previous
step. The process starts with determining the KDD goals and ends with the
implementation of the discovered knowledge.
KDD Steps:
1. Developing an understanding of
The relevant prior knowledge
The aim of the end-user
2. Creating a target data set, or selecting a data set, on which discovery is to be
performed.
3. Data cleaning and pre-processing.
Removal of noise in dataset.
Plan of action for handling missing data.
4. Data reduction
Finding useful features to represent the data depending on the aim of the
task.
Using dimensionality reduction methods to reduce the number of
variables needed to represent the data.
5. Choosing the data mining task.
Deciding whether the aim of the KDD process is classification, regression,
clustering or another task.
6. Choosing the data mining algorithms.
Selecting methods to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
7. Data mining.
Searching for patterns of interest in a representational form, or a set of such
representations, such as classification rules or trees, regression, or clustering.
8. Interpreting the mined patterns.
9. Consolidating the discovered knowledge.
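As a rough illustration of how the middle steps chain together, the following minimal sketch runs selection, cleaning, transformation and a very simple mining step on a toy table. The records, the field names (glucose, bmi, diabetic) and the single-threshold rule are assumptions made for illustration only; they are not the dataset or the model used in this research.

```python
from statistics import mean

# Hypothetical toy records; field names and values are illustrative only.
RAW = [
    {"id": 1, "glucose": 148.0, "bmi": 33.6, "diabetic": 1},
    {"id": 2, "glucose": 85.0, "bmi": None, "diabetic": 0},
    {"id": 3, "glucose": 183.0, "bmi": 23.3, "diabetic": 1},
    {"id": 4, "glucose": 89.0, "bmi": 28.1, "diabetic": 0},
    {"id": 5, "glucose": None, "bmi": 43.1, "diabetic": 1},
    {"id": 6, "glucose": 116.0, "bmi": 25.6, "diabetic": 0},
]
FEATURES = ["glucose", "bmi"]

def select(records):
    """Step 2: keep only the target attributes and the class label."""
    return [{k: r[k] for k in FEATURES + ["diabetic"]} for r in records]

def clean(records):
    """Step 3: replace missing values with the mean of the observed ones."""
    out = [dict(r) for r in records]
    for f in FEATURES:
        fill = mean(r[f] for r in records if r[f] is not None)
        for r in out:
            if r[f] is None:
                r[f] = fill
    return out

def transform(records):
    """Step 4: rescale every feature to the range [0, 1]."""
    out = [dict(r) for r in records]
    for f in FEATURES:
        lo = min(r[f] for r in records)
        hi = max(r[f] for r in records)
        for r in out:
            r[f] = (r[f] - lo) / (hi - lo)
    return out

def mine(records, feature):
    """Steps 5-7: find the threshold on one feature that best separates the classes."""
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(r[feature] for r in records):
        hits = sum((r[feature] >= t) == bool(r["diabetic"]) for r in records)
        accuracy = hits / len(records)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

prepared = transform(clean(select(RAW)))
threshold, accuracy = mine(prepared, "glucose")
print(f"rule: scaled glucose >= {threshold:.2f} -> diabetic (accuracy {accuracy:.2f})")
```

The accuracy here is measured on the same records that were used to find the rule, so it only illustrates the flow of the process; a real evaluation (steps 8 and 9) would use held-out data.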
2.3 Data mining process
Data mining is the process of extracting hidden, previously unknown patterns from a huge
database or data warehouse. Data mining is also known as knowledge discovery from
data (KDD). Data mining plays an important role in areas such as banking, education,
health care and medicine. Many organizations use data mining techniques to analyse large
datasets, to support decision making, and to get better results for their long-term
needs.
Data Selection -> Data Processing -> Data Transformation -> Data Mining -> Data Evaluation
Figure 2.2: Data Mining Process Steps
Health organizations use data mining techniques to identify hidden patterns in disease
and drug datasets, to predict and detect different diseases, and to support decision
making in clinical diagnosis. Different data mining techniques are used for the
prediction and detection of different diseases; some of these techniques are listed
below [24].
2.3.1 Data Mining Techniques
Classification is the process of finding a model which describes and distinguishes data
classes or concepts based on a class label. Classification algorithms include
Artificial Neural Networks (ANN), decision trees, Bayesian networks, and Naïve Bayes.
Clustering is the process of analysing data objects without consulting a class label. It
groups objects into new classes by maximizing the intra-class similarity and
minimizing the inter-class similarity. A common clustering algorithm is k-means
clustering; k-nearest neighbour, despite the similar name, is a classification method
that relies on class labels.
Association rule learning is a machine learning method used for finding frequent
patterns. Association rule algorithms include the Apriori algorithm, the Eclat algorithm
and the FP-growth algorithm.
2.3.2 Applications of Data mining
Application:
Traffic Prediction
Video Surveillance
Search Engine Result Refining
Online Fraud Detection
Product Recommendations
Figure 2.3: Areas where DM is used
Traffic Predictions: Google uses DM algorithms for traffic prediction. When we use a GPS
navigation system, the location of the vehicle is saved to a central database and kept
updated. The underlying problem is that only a small number of cars are equipped with
GPS. Machine learning in such scenarios helps to estimate the regions where congestion
can be found on the basis of daily experiences. [7]
Video Surveillance: Imagine a single person monitoring multiple video cameras: a
difficult and boring job. This is why the idea of training computers to do this
job makes sense.
Video surveillance systems nowadays are powered by AI, which makes it possible to
detect crimes before they happen. They track unusual behaviour such as standing
motionless for a long time, stumbling, or napping on benches.
Search Engine Result Refining: Google and other search engines use DM to improve
the search results for you. Every time you execute a search, the algorithms at the backend
keep watch on how you respond to the results. If you open the top results and stay on the
web page for long, the search engine assumes that the results it displayed matched the
query. Similarly, if you reach the second or third page of the search results but do not
open any of them, the search engine estimates that the results served did not match the
requirement. In this way, the algorithms working at the backend improve the search
results. [7]
Online Fraud Detection: DM is proving its potential to make cyberspace a secure place,
and tracking monetary fraud online is one example. For instance, PayPal uses ML for
protection against money laundering.
Product Recommendations: DM algorithms are used in product recommendations. For example,
a user sees on a social media account the same product that he viewed on an e-commerce
website.
Future Healthcare: Data mining can improve health systems. It uses data and analytics to
identify best practices that improve care and reduce costs. Researchers use data
mining approaches such as multi-dimensional databases, machine learning, soft computing,
data visualization and statistics. Mining can be used to predict the volume of patients
in every category. Processes are developed to make sure that patients receive
appropriate care at the right place and at the right time.
Market Basket Analysis: Market basket analysis is a modelling technique based on the
theory that if you buy a certain group of items, you are more likely to buy another group
of items. This technique may allow the shopkeeper to learn the purchase behaviour of a
buyer. This information can help the shopkeeper to understand the buyer’s
requirements and change the shop’s layout accordingly.
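The idea that buying one group of items makes another group more likely is usually quantified with support, confidence and lift. The sketch below computes these three measures on a handful of invented baskets; the transactions and item names are assumptions for illustration only.

```python
# Hypothetical transactions; item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule = ({"bread"}, {"butter"})
print("support   :", round(support(rule[0] | rule[1], transactions), 2))
print("confidence:", round(confidence(*rule, transactions), 2))
print("lift      :", round(lift(*rule, transactions), 2))
```

A lift above 1 suggests that the two item groups are bought together more often than would be expected if they were independent, which is the kind of signal a shopkeeper could use when arranging the shop.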
Education: An emerging field known as Educational Data Mining (EDM) is concerned
with developing techniques that discover knowledge from data obtained from
educational environments. The objectives of EDM are identified as predicting
students’ future learning behaviour, studying the effects of educational support, and
advancing scientific knowledge about learning. An institution can use data mining to
make correct decisions and to predict students’ progress reports. With the results,
the institution can focus on what to teach and how to teach it. [7]
CRM: Customer Relationship Management is about acquiring and retaining customers,
improving customer loyalty, and developing customer-focused strategies in order to
maintain a proper relationship with the customer.
2.3.3 Data Mining Challenges:
Developing a Unifying Theory of Data Mining.
Scaling Up for High Dimensional Data/High Speed Streams.
Mining Sequence Data and Time Series Data.
2.4 Introduction to Machine Learning
Machine learning works on a very simple concept: learning from experience. It teaches
computers to do what comes naturally to humans and animals, namely to learn from
experience. Machine learning comprises algorithms that learn from past data and make
predictions about future data: we train a computer with an algorithm on some data and
then predict future results. The algorithms adaptively improve their performance as
the number of samples available for learning increases.
2.4.1 Types of Techniques of Machine Learning
Machine Learning:
Supervised ML
Unsupervised ML
Semi-supervised ML
Reinforcement ML
Multitasking Learning
Ensemble Learning
Neural Network
Instance Based Learning
Figure 2.4: Types of Machine Learning
Supervised Learning: In supervised learning, we train the model with prior knowledge,
in the form of labelled examples, so that it can behave like an intelligent program.
Once trained, the model can be used on further data.
Unsupervised Learning: In unsupervised learning, we train the model without any prior
knowledge (no labels are provided), which makes it more difficult to make a program
behave intelligently.
Reinforcement Learning: In this type of learning, programs learn their steps on the basis
of their experience. It is often described as lying between supervised and unsupervised
learning. Here the notion of an agent comes into the picture and plays a very important
role: the agent takes actions or learns decisions on the basis of the outcomes of its
earlier actions.
Multitasking Learning: Multitask Learning (MTL) is an inductive transfer mechanism whose
main goal is to improve generalization performance. MTL improves generalization by
leveraging the domain-specific information contained in the training signals of related
tasks.
Decision Tree Model: A decision tree model is one of the most common data mining
models. It is popular because the resulting model is easy to understand. The algorithms
use a recursive partitioning approach. A decision tree is a type of supervised learning
algorithm that is mostly used in classification problems.
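Recursive partitioning can be sketched in a few lines: choose the split that makes the two resulting groups as pure as possible, then repeat inside each group. The version below uses Gini impurity as the purity measure, which is one common choice and an assumption here, as are the toy weather readings (humidity, temperature) and their labels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) minimising weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build(rows, labels):
    """Recursively partition until a node is pure or cannot be improved."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left]),
        "right": build([r for r, _ in right], [y for _, y in right]),
    }

def predict(node, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(node, dict):
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node

# Hypothetical readings: [humidity %, temperature C] -> will it rain today?
rows = [[85, 20], [90, 22], [60, 30], [55, 25], [80, 18], [50, 28]]
labels = ["yes", "yes", "no", "no", "yes", "no"]
tree = build(rows, labels)
print(tree)
print(predict(tree, [70, 21]))
```

Because the result is a nested set of threshold tests, it can be read directly as rules, which is the reason given above for the popularity of the model.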
The type of a decision tree depends on the type of its target variable; there are two
types:
Decision Tree: Categorical Decision, Continuous Decision
Figure 2.5: Types of Decision Tree
Categorical Variable Decision Tree: A decision tree with a categorical target
variable is called a categorical variable decision tree.
Example: The target variable is “It will rain today”, with values YES or NO.
Continuous Variable Decision Tree: A decision tree with a continuous target variable
is called a continuous variable decision tree. Example: the salary of a person.
Support Vector Machine Model: A Support Vector Machine (SVM) searches for so-called
support vectors, which are data points that lie at the edge of an area in space that
forms the boundary between one class of points and another. In SVM terminology, the
space between the regions containing data points of different classes is the margin
between those classes. The support vectors are used to identify a hyperplane (when the
data has many dimensions, or a line when the data is two-dimensional) that separates
the classes. [6]
[Figure axes: X-Axis, Y-Axis]
Figure 2.6: Model of Support Vector Machine
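The relationship between the hyperplane, the margin and the support vectors can be checked numerically. The sketch below does not train an SVM; it assumes a separating line chosen by hand (x + y = 4) and invented two-dimensional points, measures the distance of every point to that line, and reports the closest points as the support vectors.

```python
import math

# Hypothetical 2-D points in two classes (+1 and -1); values are illustrative.
points = [((1.0, 1.0), -1), ((1.0, 0.0), -1), ((0.0, 1.0), -1),
          ((3.0, 3.0), +1), ((4.0, 3.0), +1), ((4.0, 4.0), +1)]

# A candidate separating line w.x + b = 0, chosen by hand (not learned).
w, b = (1.0, 1.0), -4.0

def signed_distance(x, w, b):
    """Distance from point x to the line; the sign tells which side x is on."""
    return (w[0] * x[0] + w[1] * x[1] + b) / math.hypot(*w)

def separates(points, w, b):
    """True if every point lies on the side of the line matching its label."""
    return all(label * signed_distance(x, w, b) > 0 for x, label in points)

def margin_and_support_vectors(points, w, b, tol=1e-9):
    """Margin width for this line and the points that lie on the margin edges."""
    distances = [(abs(signed_distance(x, w, b)), x) for x, _ in points]
    closest = min(d for d, _ in distances)
    support = [x for d, x in distances if d - closest <= tol]
    return 2 * closest, support

margin, support_vectors = margin_and_support_vectors(points, w, b)
print("separates classes:", separates(points, w, b))
print("margin width     :", round(margin, 3))
print("support vectors  :", support_vectors)
```

An actual SVM solves an optimization problem to find the line with the widest such margin; here the line is simply given so that the geometric quantities are easy to inspect.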
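Artificial Neural Network
An artificial neural network (ANN) is a prediction algorithm that uses parameters such as
the learning rate and momentum during training to classify data accurately. An ANN learns
to predict the output by adjusting its weights. It consists of three layers: an input
layer, a hidden layer, and an output layer.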
INPUT LAYER -> HIDDEN LAYER -> OUTPUT LAYER
Figure 2.7: Layers of Artificial Neural Network
The backpropagation algorithm is a training algorithm for artificial neural networks in
which the weights associated with each neuron are adjusted in order to reduce the error.
It is a supervised learning algorithm which uses gradient descent optimization to adjust
the weights of the neurons by computing the gradient of the loss function. [6]
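A minimal sketch of this weight adjustment is given below for a network with two inputs, two hidden neurons and one output neuron. The sigmoid activation, the squared-error loss, the fixed starting weights, the learning rate of 0.5 and the logical-OR training set are all assumptions chosen to keep the example small; momentum is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(net, x):
    """Return hidden activations and the network output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    output = sigmoid(sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"])
    return hidden, output

def backprop_step(net, x, target, lr):
    """One gradient-descent update for the loss 0.5 * (output - target)**2."""
    hidden, output = forward(net, x)
    # Error signal at the output neuron: derivative of the loss w.r.t. its input.
    delta_out = (output - target) * output * (1.0 - output)
    # Error signals at the hidden neurons, propagated back through W2
    # (computed before W2 is changed).
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(net["W2"], hidden)]
    for j, h in enumerate(hidden):
        net["W2"][j] -= lr * delta_out * h
        net["b1"][j] -= lr * delta_hidden[j]
        for i, xi in enumerate(x):
            net["W1"][j][i] -= lr * delta_hidden[j] * xi
    net["b2"] -= lr * delta_out

def total_loss(net, data):
    return sum(0.5 * (forward(net, x)[1] - t) ** 2 for x, t in data)

# Logical OR as a toy training set; initial weights are arbitrary fixed values.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = {"W1": [[0.5, -0.4], [0.3, 0.8]], "b1": [0.0, 0.0],
       "W2": [0.6, -0.2], "b2": 0.0}

loss_before = total_loss(net, data)
for _ in range(2000):
    for x, t in data:
        backprop_step(net, x, t, lr=0.5)
loss_after = total_loss(net, data)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Each update moves every weight a small step against its own gradient, so the total loss on the training set should shrink as training proceeds.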
Advantages of Artificial Neural Networks
This study chooses the ANN algorithm because of advantages such as the following:
1) Ability to classify nonlinear data and model complex relationships.
2) High tolerance to noisy data and missing values.
3) Ability to classify data on which it was not trained.
Clustering: Clustering is the process of grouping physical or abstract objects into
classes of similar objects. It partitions a set of data (or objects) into a set of
meaningful sub-classes, called clusters. It is an unsupervised learning method: there
are no predefined classes. A good clustering technique generates high-quality clusters
in which intra-class similarity is high and inter-class similarity is low. The quality
of a clustering result depends on both the similarity measure used by the technique and
its implementation. The quality of a clustering technique is also measured by its
ability to discover some or all of the hidden patterns.
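As one concrete instance of such a technique, the sketch below implements a basic k-means loop: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. Euclidean distance as the similarity measure, the hand-picked starting centroids and the toy points are assumptions for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps until the centroids settle."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

The result depends on the starting centroids and on the distance function, which matches the remark above that cluster quality relies on the similarity measure and the implementation.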
Boosting: Boosting is one of the most important recent developments in classification
methods. It works by sequentially applying a classification algorithm to reweighted
versions of the training dataset and then taking a weighted majority vote of the
sequence of classifiers produced. For many classification algorithms, this simple
strategy results in a dramatic improvement in performance. This phenomenon can be
understood in terms of statistical principles, namely additive modelling on the logistic
scale using maximum Bernoulli likelihood as the criterion.
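The reweighting and the weighted vote can be sketched with the AdaBoost scheme, using one-threshold rules (decision stumps) as the base classifier. AdaBoost is one well-known boosting algorithm and is an assumption here, since the text does not name a specific variant; the one-dimensional data set is also invented, chosen so that no single stump can classify it correctly.

```python
import math

def stump_predict(stump, x):
    """A stump predicts `polarity` below its threshold and the opposite above."""
    threshold, polarity = stump
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Weak learner: the threshold rule with the lowest weighted error."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best, best_err = None, float("inf")
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict((t, polarity), x) != y)
            if err < best_err:
                best, best_err = (t, polarity), err
    return best, best_err

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data with labels +1 / -1.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost(xs, ys, rounds=3)
first_stump_only = [stump_predict(model[0][1], x) for x in xs]
print("errors of first stump alone:", sum(p != y for p, y in zip(first_stump_only, ys)))
print("errors of boosted ensemble :", sum(predict(model, x) != y for x, y in zip(xs, ys)))
```

Each round concentrates weight on the points the previous stumps got wrong, so later stumps specialise on the hard cases and the combined vote can be correct where every individual stump is not.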
Association Rule Mining: Association rule analysis is a technique to uncover how
items are associated with each other. Association rule mining finds frequent patterns,
associations, correlations, or causal structures among sets of items in transaction
databases. For example, it analyses what customers buy by finding associations and
correlations between the different items that they place in their baskets.
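Finding frequent patterns is typically done level by level: frequent single items first, then pairs built only from those items, and so on. The sketch below follows this Apriori-style idea on a few invented baskets; the transactions and the minimum support of 0.5 are assumptions for illustration, and the code is not tuned for large databases.

```python
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        previous = set(level)
        level = [c for c in candidates
                 if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
                 and support(c) >= min_support]
    return result

# Hypothetical transactions; item names are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Rules such as "bread implies butter" are then read off the frequent itemsets by comparing the support of the whole set with the support of its parts.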
Applications of association rule mining
1) Basket data analysis.
2) Cross-marketing.
3) Catalog design.
4) Loss-leader analysis.
2.5 Importance of Boosting Method
Boosting is a machine learning meta-algorithm for reducing bias and variance in
supervised learning; it converts weak learners into a strong learner. Kearns and Valiant
posed the question “Can a group of weak learners make a strong learner?” Here a weak
learner is defined as a classifier that is only slightly correlated with the right
classification (it can label examples better than random guessing). In contrast, a
strong learner is a classifier that is arbitrarily well correlated with the right
classification.
2.6 Types of Classification Algorithms
Classification Algorithms:
Naïve Bayes
Support Vector Machine
Logistic Regression
Decision Tree
Random Forest
K-Mean
Neural Network
Fuzzy k-NN
Genetic Algorithm
Figure 2.8: Types of Classification Algorithms