

Recent Researches in Computers and Computing

A Genetic Algorithm for Classification

RAUL ROBU, ŞTEFAN HOLBAN


Department of Automation and Applied Informatics, Department of Computers
“Politehnica” University of Timisoara
Vasile Parvan Blvd.2, 300223, Timisoara
ROMANIA

Abstract: - The paper presents aspects regarding genetic algorithms, their use in data mining and especially their use in the discovery of classification rules. A synthetic overview of the fitness functions of the genetic algorithms used for mining classification rules is given. A genetic algorithm with a new fitness function for mining classification rules is proposed. The proposed algorithm was tested on the classic datasets Car, Zoo and Mushrooms. The same datasets were also classified with the classic algorithms NaiveBayes and J48. The results obtained by applying the three algorithms are presented.

Key-Words: - genetic algorithm, data mining, classification

1 Introduction
Genetic algorithms are adaptive techniques that can be successfully used to solve complex search and
optimization problems [1]. They are based on the principles of genetics and Darwin's theory of natural selection
("survival of the fittest"). Genetic algorithms have been successfully used in data mining
to determine classification rules [2], to search for appropriate cluster centers [1], and to select the
attributes of interest in predicting the value of a target attribute [3]. Classification of instances has also been
performed with hybrid algorithms that combine genetic algorithms with particle swarm optimization [4]
and with Naïve Bayes and k-Nearest Neighbors [5]. A few applications in which genetic algorithms were
successfully applied to solve classification problems are fingerprint classification, heart disease classification,
classification of emotions on the human face, etc.
The fitness functions of the genetic algorithms used for mining classification rules may contain metrics
concerning predictive accuracy, rule comprehensibility and rule interestingness [2][3]. Various studies
propose genetic algorithms whose fitness functions weigh these metrics in different ways. The paper
synthetically presents the fitness functions suggested for mining classification
rules and proposes a new fitness function. The genetic algorithm that incorporates the suggested function was
implemented in Java and tested on the 3 classic datasets Zoo, Car and Mushrooms. The paper presents the
steps of the proposed algorithm. The 3 datasets were also classified using the NaiveBayes and J48
algorithms from WEKA. The results obtained with the three algorithms are comparable and are presented in
section 4.

2 Fitness functions used in mining classification rules


Genetic algorithms are used to discover classification rules for data that will be used for predictions. The
discovered rules must have high prediction accuracy and must be comprehensible and interesting [2][3]. The
rules have the form IF X THEN Y, where X is the antecedent of the rule, formed from a conjunction of
conditions, and Y is the consequent of the rule, that is, the predicted class. The fitness functions used for the
discovery of classification rules often use the following factors from the confusion matrix:
• True positive (tp): the actual class is Y and the predicted class is also Y.
• False positive (fp): the actual class is not Y, but the predicted class is Y.


• True negative (tn): the actual class is not Y and the predicted class is also not Y.
• False negative (fn): the actual class is Y, but the predicted class is not Y.
In [3] the following fitness function is suggested:

Fitness=w1 x (CF x Comp) + w2 x Simp (1)

The first term of relation (1) measures prediction accuracy and the second one measures rule
comprehensibility. CF = tp / (tp + fp) is the confidence factor and Comp = tp / (tp + fn) is the completeness factor.
Simp represents the simplicity of the rule (it is inversely proportional to the number of conditions in the
antecedent of the rule). w1 and w2 are weights defined by the user.
In [6] the fitness function is defined as:

Fitness = tp / (tp + A x fn) * tn / (tn + B x fp) (2)

where 0.2 < A < 2 and 1 < B < 20. A and B are user-defined parameters that determine the convergence rate
of the solution and are determined experimentally according to the requirements.
In [7] the fitness function is calculated using the formula:

Fitness = w1 x (|A & C| - 1/2) / |A| + w2 x CM (3)

where |A & C| is the number of instances that satisfy both the antecedent and the consequent of the rule and |A| is the
number of instances that satisfy the antecedent of the rule. Comprehensibility is denoted by CM and is
calculated using the formula in relation (4).

CM = (L - n) / (L - 1) (4)

where L is the maximum number of conditions of the rule and n is the length of the rule.
A more complete fitness function, which considers not only prediction accuracy and rule comprehensibility but also
the interestingness of the rule, was suggested by Dehuri et al. in [2] and is presented in relation (5).

Fitness = (w1 x C(R) + w2 x PredAcc + w3 x RInt) / (w1 + w2 + w3) (5)

where C(R) is comprehensibility, PredAcc is predictive accuracy and RInt is the interestingness of the rule.
Comprehensibility is computed as in (6), where M is the maximum number of conditions of a rule.

C(R) = M - number of conditions(R) (6)

Prediction accuracy is calculated with relation (7).

PredAcc = |A & C| / |A| (7)

Rule interestingness is computed as in (8), where InfoGain(Ai) is the information gain brought by attribute Ai and
|dom(G)| is the domain cardinality of the class attribute G.


RInt = 1 - [ (sum over i=1..n-1 of InfoGain(Ai)) / (n - 1) ] / log2(|dom(G)|) (8)
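
To make the survey concrete, the sketch below (in Java, the language used later for the proposed algorithm) computes the fitness of relation (5) from relations (6)-(8); the InfoGain values are assumed to be precomputed, and all names and parameters are illustrative, not from [2].

// Sketch of the multi-criterion fitness in relation (5), with C(R),
// PredAcc and RInt computed as in relations (6)-(8). infoGains holds
// InfoGain(Ai) for the n-1 antecedent attributes; domG is |dom(G)|.
public class MultiCriterionFitness {
    static double fitness(double w1, double w2, double w3,
                          int maxConditions, int conditions,  // C(R), relation (6)
                          int coveredAC, int coveredA,        // |A & C|, |A|, relation (7)
                          double[] infoGains, int domG) {     // RInt, relation (8)
        double cr = maxConditions - conditions;
        double predAcc = coveredAC / (double) coveredA;
        double meanGain = java.util.Arrays.stream(infoGains).average().orElse(0.0);
        double rInt = 1.0 - meanGain / (Math.log(domG) / Math.log(2));
        return (w1 * cr + w2 * predAcc + w3 * rInt) / (w1 + w2 + w3);
    }
}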

3 The genetic algorithm


The general form of a genetic algorithm is the following [8]:

Algorithm
  Compute initial population;
  WHILE stopping condition not fulfilled DO
  BEGIN
    Select individuals for reproduction;
    Create offspring by crossing individuals;
    Possibly mutate some individuals;
    Compute new generation
  END
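
The loop above can be expressed as a compact Java skeleton. The sketch below only illustrates the control flow; the problem-specific steps (initial population, fitness, selection, crossover, mutation) are left abstract and are detailed in the rest of this section. All names are ours, not from the paper.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Skeleton of the general genetic algorithm loop. Population members
// are boolean arrays (rule chromosomes).
public abstract class GeneticAlgorithm {
    protected final Random rnd = new Random();

    abstract List<boolean[]> initialPopulation();
    abstract boolean[] select(List<boolean[]> population);     // e.g. roulette wheel
    abstract boolean[][] crossover(boolean[] a, boolean[] b);  // two offspring
    abstract void mutate(boolean[] chromosome);                // in-place bit flips

    public List<boolean[]> run(int generations, double pc) {
        List<boolean[]> population = initialPopulation();
        for (int g = 0; g < generations; g++) {        // stopping condition
            List<boolean[]> offspring = new ArrayList<>();
            while (offspring.size() < population.size()) {
                boolean[] parent = select(population); // fitness-proportional
                if (rnd.nextDouble() < pc) {           // cross with another parent
                    boolean[][] children = crossover(parent, select(population));
                    offspring.add(children[0]);
                    offspring.add(children[1]);
                } else {                               // copy parent unchanged
                    offspring.add(parent.clone());
                }
            }
            offspring.forEach(this::mutate);           // occasional bit flips
            population = offspring;                    // new generation
        }
        return population;
    }
}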

Data classification using a genetic algorithm involves the steps presented below. The dataset is
divided into a training set and a test set. The initial population is made of the rules obtained by codifying the
instances from the training set. The rules from the initial population are expressed as follows:

IF attribute1=value1 AND attribute2=value2 AND … THEN class=c1

A chromosome is represented by a classification rule. A gene is represented by the value of an attribute.


Each rule (chromosome) is codified in the form of a bit array [6].
An attribute that can take m values is codified on m+1 bits: one bit for each possible value and another
bit that marks the presence or absence of the attribute in the rule. A class attribute with m possible values is
codified on m bits (no presence bit is needed, since the consequent is always present).
If we have 3 attributes, A1 with 4 possible values, A2 with 2 possible values and A3 with 3 possible values,
and a class attribute with 3 possible values, the codification of a rule can be made as in Table 1 (the bits are
numbered from right to left). If the dataset contains numeric attributes, they are discretized.

Table 1 – Codification of chromosomes


Attribute   A1      A2    A3     C
Code        11000   001   1010   100
Rule        IF (A1=value4) AND (A3=value2) THEN C=value3
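
As an illustration of this encoding, the Java sketch below rebuilds the chromosome of Table 1. The method names are ours, and the handling of absent attributes assumes the value bits are irrelevant when the presence bit is 0 (Table 1 shows 001 for the absent A2, while the sketch emits 000).

// Sketch of the m+1-bit encoding from Table 1: for an attribute with m
// possible values, the leftmost bit marks presence in the rule and the
// remaining m bits (numbered right to left) one-hot encode the value;
// the class attribute uses m bits with no presence bit.
public class RuleEncoding {
    // value is 1-based; 0 means the attribute is absent from the rule.
    static String encodeAttribute(int numValues, int value) {
        char[] bits = new char[numValues + 1];
        java.util.Arrays.fill(bits, '0');
        if (value > 0) {
            bits[0] = '1';                      // presence bit (leftmost)
            bits[numValues + 1 - value] = '1';  // bits numbered right to left
        }
        return new String(bits);
    }

    static String encodeClass(int numValues, int value) {
        char[] bits = new char[numValues];
        java.util.Arrays.fill(bits, '0');
        bits[numValues - value] = '1';
        return new String(bits);
    }

    public static void main(String[] args) {
        // IF (A1=value4) AND (A3=value2) THEN C=value3, as in Table 1
        String rule = encodeAttribute(4, 4)  // A1 -> 11000
                    + encodeAttribute(2, 0)  // A2 absent -> 000
                    + encodeAttribute(3, 2)  // A3 -> 1010
                    + encodeClass(3, 3);     // C  -> 100
        System.out.println(rule);            // 110000001010100
    }
}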

In the following stage, the fitness of each rule is calculated. The function suggested for calculating the fitness
is the one in relation (9).

Fitness = w1 * PA + w2 * CPH + w3 * S (9)

where,
• PA is the predictive accuracy, PA = tp / (tp + fp)
• CPH is the comprehensibility, CPH = MNC - ANC
• MNC = maximum number of conditions
• ANC = actual number of conditions
• S is the sensitivity, S = tp / (tp + fn)


• w1, w2 and w3 are user-defined weights
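
A direct Java transcription of relation (9) might look as follows. The confusion-matrix counts are assumed to have been obtained beforehand by matching the rule against the training instances; the names and the example values in main() are illustrative.

// Sketch of the proposed fitness (relation 9).
public class RuleFitness {
    static double fitness(int tp, int fp, int fn,
                          int actualConditions, int maxConditions,
                          double w1, double w2, double w3) {
        double pa  = tp / (double) (tp + fp);          // predictive accuracy
        double cph = maxConditions - actualConditions; // comprehensibility
        double s   = tp / (double) (tp + fn);          // sensitivity
        return w1 * pa + w2 * cph + w3 * s;
    }

    public static void main(String[] args) {
        // Weights as chosen in section 4: w1=0.6, w2=0.2, w3=0.2
        System.out.println(fitness(40, 10, 5, 2, 6, 0.6, 0.2, 0.2));
    }
}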

Some chromosomes from the current population are selected to produce new descendants. The selection of
chromosomes is proportional to their fitness (roulette-wheel selection). For each selected
chromosome a number between 0 and 1 is generated. If this number is lower than pc (the crossover probability), the
chromosome is crossed with another one to form two descendants (the crossing point is chosen randomly);
otherwise a descendant identical to the parent is created.
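
A minimal sketch of roulette-wheel selection, assuming fitness values are positive; all names are illustrative.

import java.util.List;
import java.util.Random;

// Fitness-proportional selection: each chromosome occupies a slice of
// [0, totalFitness) proportional to its fitness, and a random "spin"
// picks the slice it lands in.
public class RouletteSelection {
    static final Random RND = new Random();

    static int select(List<Double> fitnesses) {
        double total = fitnesses.stream().mapToDouble(Double::doubleValue).sum();
        double spin = RND.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < fitnesses.size(); i++) {
            cumulative += fitnesses.get(i);
            if (spin < cumulative) return i;  // chromosome i is selected
        }
        return fitnesses.size() - 1;          // guard against rounding
    }

    public static void main(String[] args) {
        // The fitter chromosome (index 1) is picked roughly twice as often.
        System.out.println(select(List.of(1.0, 2.0, 1.0)));
    }
}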
An example of crossing after position 6 can be seen in Table 2.

Table 2 – Crossing chromosomes


Chromosome 1    100010011010100
Chromosome 2    110001011001010
Chromosome 1'   110001011010100
Chromosome 2'   100010011001010

Through crossing, new chromosomes (new rules) are created. The number of chromosomes from the
population expected to cross (nr_crom_cross) is determined as follows:

nr_crom_cross = nr_crom_pop * pc (10)

where nr_crom_pop is the number of chromosomes in the population and pc is the crossing probability.
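
The sketch below reproduces the single-point crossover of Table 2 on bit strings, with the crossing point counted from the right to match the right-to-left bit numbering used above; the names are illustrative.

// Single-point crossover on bit-string chromosomes.
public class Crossover {
    static String[] cross(String a, String b, int pointFromRight) {
        int cut = a.length() - pointFromRight;  // index of the split
        String child1 = b.substring(0, cut) + a.substring(cut);
        String child2 = a.substring(0, cut) + b.substring(cut);
        return new String[] { child1, child2 };
    }

    public static void main(String[] args) {
        String[] children = cross("100010011010100", "110001011001010", 6);
        System.out.println(children[0]);  // 110001011010100 (Chromosome 1')
        System.out.println(children[1]);  // 100010011001010 (Chromosome 2')
    }
}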
The mutation is performed by randomly choosing a bit from a chromosome and flipping its value. pm denotes
the mutation probability, a parameter usually set by the user. For each
gene of a chromosome a number between 0 and 1 is generated, and if the generated number is lower than pm,
the gene suffers a mutation.
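
A minimal sketch of the per-gene mutation just described:

import java.util.Random;

// Bit-flip mutation: every gene gets an independent chance pm of
// being flipped.
public class Mutation {
    static final Random RND = new Random();

    static void mutate(boolean[] chromosome, double pm) {
        for (int i = 0; i < chromosome.length; i++) {
            if (RND.nextDouble() < pm) {
                chromosome[i] = !chromosome[i];  // the gene suffers a mutation
            }
        }
    }

    public static void main(String[] args) {
        boolean[] c = { true, false, true, false };
        mutate(c, 0.03);
        System.out.println(java.util.Arrays.toString(c));
    }
}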
After the descendants are created, the n best-performing parents and children are selected to enter the
new population (n is specified by the user). The rest of the chromosomes of the new population are
chosen from the children. The algorithm repeats until a set number of generations is reached or until the
performance does not change over a number of generations.
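
The replacement step might be sketched as below; this is one possible reading of the description above (keep the n fittest among parents and children, then fill the rest with children), not the paper's exact implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Elitist replacement: the n fittest chromosomes among parents and
// children are kept; the rest of the new population is filled from
// the children.
public class Replacement {
    static List<boolean[]> newPopulation(List<boolean[]> parents,
                                         List<boolean[]> children,
                                         int n,
                                         ToDoubleFunction<boolean[]> fitness) {
        List<boolean[]> all = new ArrayList<>(parents);
        all.addAll(children);
        all.sort(Comparator.comparingDouble(fitness).reversed()); // best first
        List<boolean[]> next = new ArrayList<>(all.subList(0, n)); // n elites
        for (boolean[] child : children) {                         // fill with children
            if (next.size() >= parents.size()) break;
            if (!next.contains(child)) next.add(child);  // skip elite children
        }
        return next;
    }
}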
The user must choose the size of the initial population (P), the crossing probability (pc), the mutation
probability (pm), the number n of elite chromosomes and the number of generations for which the algorithm
runs.
The determined rules are used to classify the instances from the test set in order to find out the accuracy
with which the model performs the classification. If this is satisfactory, the built model is
used to make predictions.
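
The paper does not detail how the discovered rules are applied to a test instance; the sketch below assumes a simple first-match strategy over rules sorted by fitness, with a default class when no rule fires. All names are illustrative.

import java.util.List;
import java.util.Map;

// Classifying a test instance with the discovered rules: the first
// rule whose antecedent matches the instance predicts the class.
public class RuleClassifier {
    record Rule(Map<String, String> antecedent, String predictedClass) {
        boolean matches(Map<String, String> instance) {
            // every condition attribute=value must hold in the instance
            return antecedent.entrySet().stream()
                    .allMatch(c -> c.getValue().equals(instance.get(c.getKey())));
        }
    }

    static String classify(List<Rule> rulesByFitness, Map<String, String> instance,
                           String defaultClass) {
        for (Rule r : rulesByFitness) {
            if (r.matches(instance)) return r.predictedClass();
        }
        return defaultClass;  // no rule fires
    }

    public static void main(String[] args) {
        Rule r = new Rule(Map.of("safety", "high", "persons", "4"), "acc");
        System.out.println(classify(List.of(r),
                Map.of("safety", "high", "persons", "4", "doors", "2"), "unacc"));
    }
}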

4 Experimental results
The suggested algorithm was implemented in Java and tested on the 3 datasets Car, Zoo and Mushrooms. The
chosen crossing probability was 0.9, the mutation probability 0.03, the population size 100 and the number of generations
30. The weights of the fitness function were chosen as w1=0.6, w2=0.2 and w3=0.2,
emphasizing in this way the accuracy of the prediction. Tests were also made on
the same datasets using WEKA [9] and the classic algorithms NaiveBayes and J48. Each dataset was
divided equally into a training set and a test set. The results are presented in Table 3.
The Car dataset contains 1728 instances and 7 categorical attributes. These are buying (the purchasing price
of the car, with the possible values vhigh, high, med, low), maint (the price of the car maintenance, with the
possible values vhigh, high, med, low), doors (the number of doors: 2, 3, 4, 5more), persons (the number of persons
it may carry: 2, 4, more), lug_boot (the size of the luggage boot: small, med, big), safety (low, med, high), and
class (unacc, acc, good, vgood).


The Zoo dataset contains 101 instances and 18 attributes. These are animal name, hair, feathers, eggs, milk,
airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize and type (the
class attribute, with 7 possible values).
The class attribute in the Mushrooms dataset indicates whether a mushroom is edible or poisonous. The dataset has 8124
instances and 23 attributes such as cap-shape, cap-color, odor, gill-attachment, gill-spacing, gill-size, and so on.

Table 3 – Testing the algorithms (classification accuracy, %)


                     Car   Zoo   Mushrooms
NaiveBayes            86    90          94
J48                   85    82         100
Genetic Algorithm     89    85          96

As can be observed from Table 3, the results obtained by the three algorithms are good and comparable;
on different datasets, different algorithms perform better.

5 Conclusion
Data mining techniques are used to extract useful information and knowledge from databases. Classification
is a data mining technique that explores the data from the training set and builds a classifier model, based on
these data, that can be used for performing predictions. Genetic algorithms are adaptive techniques that can
be successfully used to solve complex search and optimization problems. A database can be viewed as a very
large search space and a genetic algorithm as a search strategy over that space. The classification
rules that are searched for are the ones that will constitute the classifier model.
The present paper presents the way in which genetic algorithms have been used to discover classification
rules. A synthesis of the fitness functions used in classification rule mining was presented and a new
fitness function was suggested. The genetic algorithm with the suggested fitness function was implemented in
Java and tested on the 3 classic datasets Car, Zoo and Mushrooms, and the results were good. The suggested fitness
function accounts for prediction accuracy and rule comprehensibility. The interestingness degree is another
important metric for classification rules. A future development direction would be to develop a metric for
the interestingness of the rule and to include it in the fitness function. Another future development direction
would be to determine the most appropriate weight coefficients for a given dataset.

References:
1. U. Maulik and S. Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recognition 33,
2000, pp. 1455-1465
2. S. Dehuri, A. Ghosh and R. Mall, Genetic Algorithms for Multi-Criterion Classification and Clustering in
Data Mining, International Journal of Computing & Information Sciences, 2006, pp. 143-154
3. A.A. Freitas, A survey of evolutionary algorithms for data mining and knowledge discovery, Advances in
Evolutionary Computation, Springer-Verlag, 2002, pp. 819-845.
4. R. Ding, H. Dong, X. Feng and G. Yin, A Hybrid Particle Swarm Genetic Algorithm for Classification,
Proceedings of the 2009 International Joint Conference on Computational Sciences and Optimization,
vol.2, 2009, pp. 301
5. M. Aci, C. Inan and M. Avci, A hybrid classification method of k nearest neighbor, Bayesian methods and
genetic algorithm, Expert Systems with Applications, Vol. 37, 2010, 5061-5067
6. P. Vivekanandan and D. R. Nedunchezhian, A New Incremental Genetic Algorithm Based Classification
Model to Mine Data With Concept Drift, Journal of Theoretical and Applied Information Technology,
2010, pp. 36-42
7. P. Vivekanandan and D. R. Nedunchezhian, A Fast Genetic Algorithm for Mining Classification Rules in
Large Datasets, International Journal on Soft Computing (IJSC), Vol.1, 2010, pp. 10-20
8. D.M. Mukhopadhyay, M.O. Balitanas, A. Farkhod, S.H. Jeon, and D. Bhattacharyya, Genetic Algorithm:
A Tutorial Review, International Journal of Grid and Distributed Computing, Vol.2, No.3, 2009, pp.
25-32
9. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I.H. Witten, The WEKA Data Mining
Software: An Update, ACM SIGKDD Explorations Newsletter, Volume 11 , Issue 1, 2009, pp. 10-18


