A Detailed Study On Machine Learning Techniques For Data Mining

This paper provides a comprehensive overview of machine learning techniques for data mining, emphasizing the importance of Knowledge Discovery in Databases (KDD). It discusses various approaches such as classification, clustering, and regression, detailing their processes, advantages, and disadvantages. The document also compares different data mining algorithms, highlighting their effectiveness and limitations in extracting valuable insights from large datasets.


International Conference on Trends in Electronics and Informatics

ICEI 2017

A DETAILED STUDY ON MACHINE LEARNING TECHNIQUES FOR DATA MINING
Sivaramakrishnan R Guruvayur
Research Scholar, Jain University, Bangalore, India

Dr. Suchithra R
Head, Department of MSc(IT), Bangalore, India

ABSTRACT: Data mining is the process of extracting useful information and patterns from large volumes of data by using various techniques. It is a powerful technology with great potential to help businesses make full use of their available data for competitive advantage. This paper discusses various machine learning techniques and the detailed processes of Knowledge Discovery in Databases (KDD). The study also focuses on data mining (DM) and machine learning (ML) approaches such as classification, clustering and regression, and discusses the different types of each approach together with their advantages and disadvantages.

Keywords: Data mining, Machine Learning, Bayesian Network, Decision tree Induction, Support Vector Machine.

I. INTRODUCTION
Data mining and machine intelligence are currently hotly debated research areas. They draw on databases, artificial intelligence, statistics and related fields to find important information and patterns in big data and make them accessible to users. Data mining is mainly about processing unstructured information and extracting important data from it so that end users can make better business decisions. Data mining methods use mathematical algorithms and machine intelligence techniques. The popularity of such techniques for analyzing business problems has been boosted by the arrival of big data [1].

Data mining has become one of the most important tools for extracting and handling data and for establishing patterns that produce useful information for decision making. Recently there have been considerable advances in data collection technology, for example standardized bar-code scanners in commercial settings and sensors in scientific and industrial settings, which have led to the generation of enormous amounts of data. This exceptional growth in big data and databases has created a huge demand for new methods and tools that can turn big data into useful information. Statisticians, database researchers, and the MIS and business communities were the first to use the term "data mining" for extracting valuable information from big data. One of the processes that includes data mining is known as Knowledge Discovery in Databases (KDD), which is used for finding useful knowledge in data. KDD includes data preparation, selection, cleaning, and proper interpretation of the results of the data mining process, to ensure that useful knowledge is derived from the data. Data mining differs from traditional data analysis and statistical approaches in that it uses analytical techniques from several disciplines, e.g. numerical analysis, pattern matching, and areas of artificial intelligence such as machine learning, neural networks and genetic algorithms [2],[3]. Data mining, or knowledge discovery in databases, is the process of extracting knowledge from large databases. In data mining, three types of approach are used to group objects into identified classes: classification, regression and clustering [12], as shown in Fig. 1.

Classification is used to separate the data into classes. A description of the classes can then be used to make predictions for new, unclassified data. Classes can be a simple binary partition, or can be complex and multi-valued. Classification involves two phases. The first is the learning phase, in which the training data are analyzed and rules and patterns are created. The second phase tests the data and records the accuracy of the classification patterns [3].
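The two phases of classification can be illustrated with a minimal sketch. It assumes a single numeric feature and two classes, and it learns a one-rule classifier (a threshold); the function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, class label 0 or 1)


def learn_threshold_rule(train: List[Sample]) -> float:
    """Learning phase: pick the threshold t that best separates the classes
    with the rule 'predict 1 if x > t, else 0' on the training data."""
    values = sorted(x for x, _ in train)
    # Candidate thresholds: midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def correct(t: float) -> int:
        return sum(1 for x, y in train if int(x > t) == y)

    return max(candidates, key=correct)


def accuracy(threshold: float, test: List[Sample]) -> float:
    """Testing phase: fraction of held-out samples the rule classifies correctly."""
    hits = sum(1 for x, y in test if int(x > threshold) == y)
    return hits / len(test)


# Hypothetical toy data: small values belong to class 0, large values to class 1.
train_data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
test_data = [(2.5, 0), (4.0, 0), (5.0, 1), (9.0, 1)]

rule = learn_threshold_rule(train_data)
print(f"learned threshold: {rule}")
print(f"test accuracy: {accuracy(rule, test_data)}")
```

The rule is created only from the training samples, and its accuracy is measured only on samples it has not seen, which mirrors the separation between the learning phase and the testing phase.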

978-1-5090-4257-9/17/$31.00 ©2017 IEEE 1187



[Figure 1: Data mining divided into three approaches: Classification, Regression, Clustering]
Fig. 1. Types of data mining approaches

• Clustering: This approach comes under unsupervised learning because there are no predefined classes. In this approach, data items are grouped together into clusters [4].
• Regression: This is used to map a data item to a real-valued prediction variable. Regression analysis can be used to model the relationship between one or more independent variables and dependent variables.

There is a significant overlap between machine learning and data mining. The two terms are often confused because they frequently use the same methods. The pioneer of ML, Arthur Samuel, defined ML as a "field of study that gives computers the ability to learn without being explicitly programmed." Machine learning focuses on prediction and classification, based on known properties learned from the training data. Machine learning algorithms need a goal from the domain (e.g., a dependent variable to predict). Data mining focuses on the discovery of previously unknown properties in the data. It does not need a specific goal from the domain, but instead focuses on finding new and interesting knowledge.

An ML approach generally consists of two stages: training and testing. Typically, the following steps are performed:
• Identify class attributes (features) and classes from the training data.
• Identify a subset of the attributes necessary for classification.
• Learn the model using the training data.
• Use the trained model to classify the unknown data.

II. TYPES OF MACHINE LEARNING ALGORITHMS
Machine learning algorithms can be categorized in the following way:
• Supervised Learning: These algorithms are trained on labeled examples, in which each input is provided together with the desired output, which is already known.
• Unsupervised Learning: Unsupervised machine learning is the task of inferring a function to describe hidden structure from "unlabeled" data. Since the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure produced by the algorithm, which is one way of distinguishing unsupervised learning from the other learning methods.

The traditional method of turning data into knowledge relied on manual analysis and interpretation by a domain expert in order to find useful patterns in data for decision support. Fig. 2 shows the overall process of KDD.

[Figure 2: Raw Data → Selection → Preprocessing → Transformation → Data Mining → Evaluation/Interpretation → Knowledge]
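The stages of this flow can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record fields (`age`, `spend`), the function names, the cutoff value and the toy "mining" step are all hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]

# Hypothetical raw data; one record has a missing value.
raw_data: List[Record] = [
    {"id": 1, "age": 25.0, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 40.0, "spend": 300.0},
    {"id": 4, "age": 31.0, "spend": 60.0},
]


def select(records: List[Record], fields: List[str]) -> List[Record]:
    """Selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def preprocess(records: List[Record]) -> List[Record]:
    """Preprocessing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records: List[Record], field: str) -> List[Record]:
    """Transformation: min-max scale one attribute to the range [0, 1].
    Assumes the attribute takes at least two distinct values."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]


def mine(records: List[Record], field: str, cutoff: float) -> List[Record]:
    """Data mining (toy stand-in): find records whose scaled value exceeds a cutoff."""
    return [r for r in records if r[field] > cutoff]


selected = select(raw_data, ["age", "spend"])
cleaned = preprocess(selected)
scaled = transform(cleaned, "spend")
pattern = mine(scaled, "spend", cutoff=0.5)

# Evaluation/interpretation: report what was found.
print(f"{len(pattern)} of {len(scaled)} clean records are high spenders: {pattern}")
```

Each function consumes the output of the previous one, so a stage can be replaced (for example, a different cleaning rule or a real mining algorithm) without touching the others.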


Fig. 2. Overall process of KDD

The KDD process consists of the following steps [5]:
1) Understanding the application domain: includes relevant prior knowledge and the goals of the application.
2) Extracting the target data set: includes selecting a data set or focusing on a subset of variables.
3) Data cleaning and preprocessing: includes basic operations such as noise removal and handling of missing data. Data from real-world sources are often erroneous, incomplete and inconsistent, perhaps because of operational errors or system implementation flaws. Such low-quality data need to be cleaned before data mining.
4) Data integration: includes integrating multiple, heterogeneous data sources.
5) Data reduction and projection: includes finding useful features to represent the data and using dimensionality reduction or transformation methods.
6) Choosing the function of data mining: includes deciding the purpose of the model derived by the data mining algorithm.
7) Choosing the data mining algorithm(s): includes selecting the method(s) to be used for searching for patterns in the data, for example deciding which models and parameters may be appropriate.
8) Data mining: includes searching for patterns of interest in a particular representational form or a set of such representations.
9) Interpretation: includes interpreting the discovered patterns and possibly visualizing the extracted patterns.
10) Using discovered knowledge: includes incorporating this knowledge into the performance system and taking actions based on the knowledge.

Data mining uses models, which consist of various components, to discover patterns.

A. Classification
Classification is a supervised type of machine learning in which a set of labeled data is available in advance. The classifier-training algorithm uses these pre-classified examples to determine the set of parameters required for proper discrimination. The algorithm then encodes these parameters into a model called a classifier.

There are many classification methods available in data mining; the common techniques are as follows:

(a) Decision tree induction: A decision tree can be built from class-labeled tuples. It is a tree-like structure made up of internal nodes, branches and leaf nodes. An internal node specifies a test on an attribute, a branch represents the outcome of the test, and a leaf node holds a class label. The two stages, learning and testing, are simple and fast. However, decision trees are less suitable for estimation tasks, where the main goal is to predict the value of a continuous attribute. The decision tree approach can also make errors in predicting classes. Pruning algorithms are expensive, and building a decision tree is also an expensive task because nodes are split at every level. Several algorithms exist, such as C4.5, ID3, CART, J48, NB Tree and REP Tree.

(b) Bayesian Network (BN): A Bayesian network is a graphical model of the relationships among a set of variables (features). The structure S of this graphical model is a directed acyclic graph (DAG), and the nodes in S are in one-to-one correspondence with the features of a data set. The arcs represent influences among the features, while the absence of arcs in S encodes conditional independence. Bayesian classifiers have shown high accuracy and speed when applied to large databases [6], [7]. Bayesian networks are used for modeling knowledge in bioinformatics, engineering, medicine and bio-monitoring.

(c) Support vector machine (SVM): An SVM is a classification training algorithm. It trains the classifier to predict the class of a new sample. SVM is based on the machine learning algorithm designed by Vapnik in the 1960s. It is also based on the structural risk minimization principle, which helps to prevent overfitting. The classifier works by finding a separating hyperplane in the feature space between two classes such that the distance between the hyperplane and the nearest data points of each class is maximized. The approach is based on minimizing classification risk [8] rather than on optimal classification.
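The quantity an SVM maximizes, the distance between the hyperplane and the nearest data points, can be made concrete with a short sketch. It does not train an SVM; it only measures the margin of two hand-picked hyperplanes on hypothetical two-dimensional data, and the function name `geometric_margin` is an assumption made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[Sequence[float], int]  # (feature vector, class label +1 or -1)


def geometric_margin(w: Sequence[float], b: float, data: List[Point]) -> float:
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0.

    A positive result means every sample lies on the correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in data
    )


# Hypothetical, linearly separable toy data.
data = [([1.0, 1.0], -1), ([2.0, 0.0], -1), ([4.0, 4.0], 1), ([5.0, 3.0], 1)]

# Two hand-picked separating hyperplanes for the same data.
candidates = {
    "x1 + x2 = 5": ([1.0, 1.0], -5.0),
    "x1 = 3": ([1.0, 0.0], -3.0),
}
for name, (w, b) in candidates.items():
    print(f"{name}: margin = {geometric_margin(w, b, data):.3f}")
```

Both hyperplanes classify every sample correctly, but the first leaves a wider gap to the nearest points. A maximum-margin classifier prefers the hyperplane with the larger value, on the expectation that it generalizes better to new samples.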


B. Clustering
Clustering is a data mining technique that groups a set of data items into clusters so that objects within a cluster are highly similar to each other but very dissimilar to objects in other clusters. Clustering algorithms are used to organize and categorize data, for data compression and model construction, for detection of outliers, etc.

Common data clustering techniques are discussed below.

(a) The k-means algorithm [9] is the most popular clustering method used today in scientific and industrial applications. The name comes from representing each of the k clusters C_j by the mean (or weighted average) c_j of its points, the so-called centroid. While this representation does not work well with every type of attribute, it works well from a geometrical and statistical point of view for numerical attributes. The sum of distances between the elements of a set of points and its centroid, expressed through an appropriate distance function, is used as the objective function.

(b) Hierarchical clustering combines data objects into clusters, those clusters into larger clusters, and so on, creating a hierarchy. A tree representing this hierarchy of clusters is known as a dendrogram. Individual data objects are the leaves of the tree, and the interior nodes are nonempty clusters. Sibling nodes partition the points covered by their common parent. This allows the data to be explored at different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down) approaches.

Various approaches are available for data clustering. In connectivity models (e.g., hierarchical clustering), data points are grouped by the distance between them. In centroid models (e.g., k-means), each cluster is represented by its mean vector. In distribution models (e.g., the Expectation Maximization algorithm), the groups are assumed to follow a statistical distribution. Density models group the data points as dense and connected regions (e.g., Density-Based Spatial Clustering of Applications with Noise [DBSCAN]). Finally, graph models (e.g., clique) define each cluster as a set of connected nodes (data points) where every node has an edge to at least one other node in the set [10].

C. Regression
There are two types of regression techniques, linear and non-linear [11].

(a) Linear regression: Linear regression is used where the relationship between the target and the predictor can be represented by a straight line. The advantage of linear regression is that the hypothesis function is easy to understand.

$$y = P_1 x + P_2 + e$$

(b) Multivariate linear regression: With more than one predictor, the regression line cannot be visualized in two-dimensional space.

$$y = P_1 + P_2 x_1 + P_3 x_2 + \cdots + P_n x_{n-1} + e$$

(c) Non-linear regression: Here the relationship is non-linear and cannot be represented as a straight line. It can be represented as a linear relationship by preprocessing the data.

III. COMPARISON OF DIFFERENT DATA MINING TECHNIQUES

TABLE I. COMPARISON OF DIFFERENT DATA MINING ALGORITHMS

| Algorithm | Findings | Drawbacks |
|---|---|---|
| Decision Tree | It can deal with both continuous and discrete data. It gives quick results when classifying unknown records. It gives good results with small trees. Results are not affected by outliers. It does not require preparation techniques such as normalization. It works better with numeric data. | It cannot predict the value of a continuous class attribute. It gives error-prone results when a large number of classes is used. Irrelevant attributes may lead to poorly built decision trees. Even small changes to the data can modify the complete decision tree. |
| Naïve Bayesian | Compared to other classifiers it gives a lower error rate. Easy to adapt. It handles continuous data well. When working with large databases it gives high accuracy and speed. It can handle discrete values. | Accuracy can be lower because it relies on the assumption of independent features. |
| Neural Networks | Used to classify patterns in untrained data. Works well with continuous values. | Less interpretability. It takes a long training time. |
| K-Means | Simple and efficient algorithm. It is relatively fast. It gives better results when the data are distinct. | Does not work with noisy data and non-linear datasets. |
| Support vector machine | It can produce highly accurate classification results even when input data are non-monotone and non-linearly separable. SVMs provide good out-of-sample generalization if the parameters are appropriately chosen. It can produce a unique solution, since the optimality problem is convex. | Lack of transparency of results. Scale dependent, iterative. Training is slow, notably in the nonlinear case. |
| Hierarchical Clustering | Embedded flexibility regarding the level of granularity. Well suited for problems involving point linkages, e.g. taxonomy trees. | Less interpretability with respect to cluster descriptors. Incorrect termination criterion. Inability to make corrections once the splitting/merging decision is made. |

IV. RELATED WORK
In [10] the authors present a literature survey of machine learning and data mining methods for cyber analytics in support of intrusion detection. They discuss various ML/DM methods, describe well-known cyber data sets, and address the complexity of ML/DM algorithms and the challenges of using ML/DM. In [8] the authors discuss various classification techniques and carry out a comparative analysis of different classification algorithms, including decision tree, support vector machine and nearest neighbor. In [9] the authors present two algorithms that extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. They use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. The paper in [5] gives an overview of the field of data mining and knowledge discovery in databases, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields such as machine learning, statistics and databases. That work focuses on specific real-world applications, particular data mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

V. CONCLUSION
This paper gives a survey of machine learning techniques for data mining. Over the years data mining has enjoyed tremendous success; its application domains have expanded continuously and mining techniques have kept advancing. Various problems have emerged and solutions have been found by data mining researchers. Nevertheless, there are areas and issues that still need attention for future improvement of this technology. More research needs to be conducted on how best to deal with the social issue of the privacy of people who are, in some cases, unaware and unsuspecting. Data mining techniques should evolve accordingly to meet this challenge.

REFERENCES
[1] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery in Databases," AI Magazine, vol. 17, no. 3, pp. 37-54, 1996.
[2] P. R. Peacock, "Data mining in marketing: Part 1," Marketing Management, pp. 9-18, 1998.
[3] Z. N. Balagatabi and H. N. Balagatabi, "Comparison of Decision Tree and SVM Methods in Classification of Researcher's Cognitive Styles in Academic Environment," Indian Journal of Automation and Artificial Intelligence, vol. 1, no. 1, pp. 31-43, 2013.
[4] P. Berkhin, "A Survey of Clustering Data Mining Techniques."


[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Commun. ACM, vol. 39, pp. 27-34, 1996.
[6] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131-163, 1997.
[7] F. V. Jensen, An Introduction to Bayesian Networks, vol. 210. London: UCL Press, 1996.
[8] S. Sharma, J. Agrawal, S. Agarwal, and S. Sharma, "Machine Learning Techniques for Data Mining: A Survey."
[9] Z. Huang, "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values," ACSys CRC, CSIRO, 1998.
[10] A. L. Buczak and E. Guven, "A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection."
[11] M. Gera and S. Goel, "Data Mining - Techniques, Methods and Algorithms: A Review on Tools and their Validity," International Journal of Computer Applications (0975-8887), vol. 113, no. 18, March 2015.
[12] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan-Kaufmann Academic Press, 2001.
[13] D. P. Bhukya and S. Ramachandram, "Decision Tree Induction: An Approach for Data Classification Using AVL-Tree," International Journal of Computer and Electrical Engineering, vol. 2, no. 4, August 2010, 1793-8163.
