Study of Discretization Methods in Classification

Abstract—Classification is one of the important tasks in Data Mining or Knowledge Discovery, with prolific applications. Satisfactory classification also depends on the characteristics of the dataset. Numerical and nominal attributes commonly occur in datasets, and classification performance may be aided by discretization of the numerical attributes. At present, several discretization methods and numerous techniques for implementing classifiers exist. This study has three main objectives. The first is to study the effectiveness of discretization of attributes, and the second is to compare the efficiency of eight discretization methods. These are ChiMerge, Chi2, Modified Chi2, Extended Chi2, Class-Attribute Interdependence Maximization (CAIM), Class-Attribute Contingency Coefficient (CACC), Autonomous Discretization Algorithm (Ameva), and Minimum Description Length Principle (MDLP). Finally, the study investigates the suitability of the eight discretization methods when applied to five commonly known classifiers: Neural Network, K Nearest Neighbour (K-NN), Naive Bayes, C4.5, and Support Vector Machine (SVM).

I. INTRODUCTION

… discretization methods. These are ChiMerge, Chi2, Modified Chi2, Extended Chi2, Class-Attribute Interdependence Maximization (CAIM), Class-Attribute Contingency Coefficient (CACC), Autonomous Discretization Algorithm (Ameva), and Minimum Description Length Principle (MDLP). The final objective is to investigate how suitably these eight discretization methods perform together with five commonly known classifiers: Neural Network, K Nearest Neighbour (K-NN), Naive Bayes, C4.5, and Support Vector Machine (SVM).

The paper is organized as follows. Section II introduces the eight discretization methods. Section III presents the related work; this is followed by Section IV, where the datasets and classifiers selected are described. Section V explains the major steps in the study. Results and discussion are included in Section VI. Finally, the paper is concluded in Section VII.
II. DISCRETIZATION METHODS

ChiMerge: This is a bottom-up discretization method which uses the χ² value to determine the merging point [4]. The discretization process of ChiMerge begins with sorting the numerical attribute and calculating the χ² value for every pair of adjacent intervals, as shown in equation 1. The pair of adjacent intervals with the lowest χ² value is merged into one interval. A χ²-threshold is used as the stopping criterion while the merging process is conducted.

    \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}    (1)

where:
m = 2 (the two intervals being compared),
k = number of classes,
A_{ij} = number of examples in the ith interval and jth class,
E_{ij} = expected frequency of A_{ij} = R_i C_j / N,
R_i = number of examples in the ith interval = \sum_{j=1}^{k} A_{ij},
C_j = number of examples in the jth class = \sum_{i=1}^{m} A_{ij},
N = total number of examples = \sum_{j=1}^{k} C_j.
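As a concrete illustration of equation (1), the sketch below computes the χ² statistic for a pair of adjacent intervals from their per-class counts. It is a minimal sketch rather than the authors' code (the paper's experiments are run in R; Python is used here only for illustration), and the small guard for empty expected frequencies is an added safeguard, not something specified in the paper.

import numpy as np

def chi2_adjacent(counts):
    """Chi-square statistic of equation (1) for two adjacent intervals.

    counts: 2 x k array, counts[i][j] = A_ij, the number of examples
    of class j falling in interval i (m = 2, k = number of classes).
    """
    A = np.asarray(counts, dtype=float)      # 2 x k matrix of A_ij
    R = A.sum(axis=1, keepdims=True)         # R_i: examples per interval
    C = A.sum(axis=0, keepdims=True)         # C_j: examples per class
    N = A.sum()                              # N: total number of examples
    E = R * C / N                            # E_ij = R_i * C_j / N
    E[E == 0] = 1e-6                         # guard against empty expected cells (assumption)
    return float(((A - E) ** 2 / E).sum())

# Example: two adjacent intervals with counts over three classes.
print(chi2_adjacent([[4, 1, 0],
                     [2, 3, 1]]))            # a low value marks a candidate pair for merging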
Chi2: This is a bottom-up discretization method which also uses the χ² value to determine the merging point [5]. The process of Chi2 is similar to ChiMerge; thus, the χ² value is also calculated by (1). Instead of having the user specify the χ²-threshold as in ChiMerge, Chi2 introduces an inconsistency rate, which is used in the process to determine the proper χ²-threshold. Discretization in Chi2 also includes feature selection, which makes it perform well when the dataset contains noisy attributes.

Modified Chi2: This is a bottom-up discretization method which stems from Chi2; the χ² value is also used to determine the merging point [6]. Instead of using the inconsistency rate as a stopping criterion as in Chi2, the level of consistency (L_c), coined from Rough Sets Theory, is used, as shown in equation 2. Therefore, the inconsistency check (InConCheck(data) < δ) is replaced by (L_c-discretized ≤ L_c-original) after each step of discretization.

    L_c = \frac{card(\underline{B}X_i)}{card(U)}    (2)

where:
U is the set of all objects of the dataset,
X_i can be any subset of U.
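Read literally, equation (2) can be sketched as follows: given the equivalence classes induced by the condition attributes B and a subset X of U, the lower approximation B̲X is the union of the equivalence classes entirely contained in X, and L_c is its size relative to card(U). This is an illustrative sketch only, not code from [6]; the data layout (sets of object identifiers) is an assumption made for the example.

def level_of_consistency(equiv_classes, X, U):
    """L_c of equation (2) for one subset X of U.

    equiv_classes: the B-equivalence classes partitioning U (sets of object ids).
    X, U: sets of object identifiers.
    """
    X, U = set(X), set(U)
    lower = set()                      # lower approximation of X
    for E in equiv_classes:
        if set(E) <= X:                # keep only classes entirely contained in X
            lower |= set(E)
    return len(lower) / len(U)

# Example: U = {0,...,5}, B-equivalence classes {0,1}, {2,3}, {4,5}.
U = {0, 1, 2, 3, 4, 5}
classes = [{0, 1}, {2, 3}, {4, 5}]
print(level_of_consistency(classes, X={0, 1, 2}, U=U))   # only {0,1} fits inside X -> 2/6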
Extended Chi2: This is a bottom-up discretization method which extends Modified Chi2 [7]. Its merging criterion ξ is shown in equation 3.

    \xi_{C,D} = \max\{m_1, m_2\}    (3)

where:
m_1 = 1 - \min\{c(E,D) \mid E \in C^* \text{ and } 0.5 < c(E,D)\},
m_2 = \max\{c(E,D) \mid E \in C^* \text{ and } c(E,D) < 0.5\},
c(E,D) = 1 - \frac{card(E \cap D)}{card(E)},
C is the equivalence relation set,
D is the decision set,
C^* = \{E_1, E_2, \ldots, E_n\} is the set of equivalence classes.

CAIM: Class-Attribute Interdependence Maximization (CAIM) is a top-down discretization method proposed by Kurgan and Cios [8]. The main goal of CAIM is to maximize the class-attribute interdependence and to generate the minimal number of discrete intervals. The CAIM criterion is proposed as a heuristic measure used to quantify the interdependence between classes and discrete intervals; a larger CAIM value indicates higher interdependence. The CAIM method begins the discretizing process by selecting a numerical attribute and sorting it in ascending order; the CAIM value is then computed using equation 4 for the midpoint of every pair of adjacent values, and the cutting point is defined by selecting the midpoint that has the highest CAIM value.

    CAIM(C,D|F) = \frac{1}{n} \sum_{r=1}^{n} \frac{max_r^2}{M_{+r}}    (4)

where:
n is the number of intervals,
r iterates through all intervals,
max_r is the maximum value within the rth column of the quanta matrix,
M_{+r} is the total number of continuous values of attribute F that fall within the rth interval,
C is the class variable,
D is the discretization variable,
F is the attribute.
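As an illustration of equation (4), the sketch below evaluates the CAIM criterion on a quanta matrix with classes in rows and candidate intervals in columns. It is an illustrative sketch, not the authors' implementation, and it assumes every candidate interval contains at least one value.

import numpy as np

def caim(quanta):
    """CAIM criterion of equation (4).

    quanta: S x n array, quanta[i][r] = number of values of class i
    that fall in interval r of the candidate discretization.
    """
    Q = np.asarray(quanta, dtype=float)
    n = Q.shape[1]                      # number of intervals
    max_r = Q.max(axis=0)               # dominant class count per interval
    M_plus_r = Q.sum(axis=0)            # total count per interval (assumed non-zero)
    return float((max_r ** 2 / M_plus_r).sum() / n)

# Two candidate discretizations of the same attribute: the purer one scores higher.
print(caim([[8, 1],
            [0, 7]]))    # well-separated intervals -> larger CAIM
print(caim([[4, 5],
            [4, 3]]))    # mixed intervals -> smaller CAIM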
51
Authorized licensed use limited to: Bauman Moscow State Technical University. Downloaded on September 27,2023 at [Link] UTC from IEEE Xplore. Restrictions apply.
CACC: Class-Attribute Contingency Coefficient (CACC) is a top-down discretization method [9]. In CACC, the contingency coefficient is used to determine the cutting point in the discretizing process. To speed up the discretization and to reduce the number of intervals, the contingency coefficient formula has been customized by dividing y by log(n). The CACC value can then be calculated using equation 5.

    CACC = \sqrt{\frac{y'}{y' + M}}, \quad y' = \frac{y}{\log(n)}, \quad y = M\left(\sum_{i=1}^{S} \sum_{r=1}^{n} \frac{q_{ir}^2}{M_{i+} M_{+r}} - 1\right)    (5)

where:
q_{ir} is the number of samples of class i in the interval (d_{r-1}, d_r],
M_{i+} is the total number of samples of class i,
M_{+r} is the total number of samples in the interval (d_{r-1}, d_r],
S is the number of classes, n is the number of intervals, and M is the total number of samples.

Ameva: Autonomous Discretization Algorithm (Ameva) is a top-down discretization method based on the χ² statistic [10]. A low number of intervals is presented as a main feature of Ameva. Since Ameva is a top-down discretization method, the discretization begins with one interval and repeatedly divides it into more intervals by using the Ameva criterion shown in equation 6.

    Ameva(k) = \frac{\chi^2(k)}{k(l-1)}    (6)

where:
k is the number of intervals,
l is the number of classes.
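Equation (6) can be illustrated as follows, assuming (as in [10]) that χ²(k) is the Pearson χ² statistic of the interval-by-class contingency table for a candidate discretization with k intervals. This is a sketch for illustration, not the authors' code.

import numpy as np

def ameva(contingency):
    """Ameva criterion of equation (6).

    contingency: k x l array, contingency[r][j] = number of examples of
    class j in interval r; chi2(k) is taken as the Pearson chi-square
    statistic of this table (assumption based on [10]).
    """
    T = np.asarray(contingency, dtype=float)
    k, l = T.shape                                   # k intervals, l classes
    N = T.sum()
    expected = np.outer(T.sum(axis=1), T.sum(axis=0)) / N
    chi2 = ((T - expected) ** 2 / expected).sum()
    return float(chi2 / (k * (l - 1)))

# Example: three intervals, two classes.
print(ameva([[9, 1],
             [5, 5],
             [1, 9]]))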
MDLP: Minimum Description Length Principle (MDLP) is a top-down discretization method which uses the entropy minimization heuristic to discretize continuous or numeric data into ranges or intervals [11]. The process starts by sorting the selected numeric attribute, similar to the other top-down discretization methods; after that, the entropy value is computed for every pair of adjacent values to find the optimal cutting point to be a boundary of the new intervals. MDLP is used as a criterion to decide whether a given partition should be accepted or rejected. If the condition in equation 7 is true, the partition induced by a cut point is accepted; otherwise it is rejected.

    Gain(A,T;S) > \frac{\log_2(N-1)}{N} + \frac{\Delta(A,T;S)}{N}    (7)

where:
A is the selected attribute,
T is the cut point,
S is the set of examples,
N is the number of examples.
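The paper does not reproduce the definition of Δ(A,T;S); the sketch below follows Fayyad and Irani's formulation in [11], where Δ(A,T;S) = log2(3^k − 2) − [k·Ent(S) − k1·Ent(S1) − k2·Ent(S2)] and k, k1, k2 are the numbers of classes present in S, S1, S2. It is an illustrative sketch of the acceptance test in equation (7), not the authors' implementation.

import math
from collections import Counter

def entropy(labels):
    """Class entropy Ent(S) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_accept(S, S1, S2):
    """Acceptance test of equation (7) for a cut point that splits S into S1 and S2."""
    N = len(S)
    ent_S, ent_1, ent_2 = entropy(S), entropy(S1), entropy(S2)
    gain = ent_S - (len(S1) / N) * ent_1 - (len(S2) / N) * ent_2
    k, k1, k2 = len(set(S)), len(set(S1)), len(set(S2))
    # Delta(A,T;S) as defined by Fayyad and Irani [11] (not spelled out in this paper).
    delta = math.log2(3 ** k - 2) - (k * ent_S - k1 * ent_1 - k2 * ent_2)
    return gain > (math.log2(N - 1) / N) + (delta / N)

# Example: a cut that separates the two classes perfectly is accepted.
S = ['a'] * 10 + ['b'] * 10
S1, S2 = S[:10], S[10:]
print(mdlp_accept(S, S1, S2))    # True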
III. RELATED WORKS

Many discretization methods have been introduced and widely used in classification. In fact, the use of different discretization methods for classification purposes has been investigated with different objectives. Discretization methods have been compared by taxonomy, without experiments with classifiers, in several aspects such as merging or splitting, use of class information, stopping criterion, and sensitivity to outliers, in order to suggest discretization methods for users in various environments and datasets [3]. Extended Chi2 was found superior to Modified Chi2 and Boolean reasoning with the VPRS classifier [7]. Chi2-discretized data were compared with the original data with respect to accuracy and tree size in C4.5 [5]. A comparison between Chi2 and MDLP was carried out in [6], while the work in [10] compared Ameva with CAIM. Application of discretization in C4.5 and Naive Bayes using equal-width binning, RMEP, error-based, and SOM-based methods showed an improvement in accuracy and a reduction in construction time [12]. The work in [8] claimed that CAIM was the discretization method that guaranteed the lowest number of intervals and the highest interdependence between classes and discrete intervals when compared to Equal-Width, Equal-Frequency, Paterson-Niblett, Maximum Entropy, CADD, and IEM with the CLIP4 and C5.0 classifiers. Similar work confirmed these findings in [13]; however, Naive Bayes and Semi-Naive Bayes with IEM discretization were found to be the best for AODE classifiers in terms of accuracy. ChiMerge has been used together with ID3 in [4], while MDLP was adopted for a neural network classifier in [11]. MDLP was also used as the discretization method in a comparison of different variations of Naive Bayes classifiers on medical datasets in [14]. Naive Bayes with Minimal Entropy Partitioning yielded better results than Equal Width Interval and Holte's 1R discretization [15]. Combinations of Naive Bayes with the LD, NDD, WPKID, and WNDD discretization methods were adopted for natural datasets in [16]. Nevertheless, there is a study which contradicts the use of discretization by suggesting that using real data directly could yield better results for ID3 and C4.5 [17].

Therefore, it can be said that, in most cases, discretization of numerical attributes leads to better performance and higher accuracy in classification. While there have been some previous studies comparing discretization methods, they tend to have specific objectives and cover only a few types of classifiers. This work is the first that attempts to compare popular discretization methods together with popular classifiers, with the particular objective of finding suitable combinations of discretization method and type of classifier.

IV. DATASETS AND CLASSIFIERS USED IN THE INVESTIGATION

A. Datasets

In order to validate the study, the numerical datasets tested ought to be as many and as varied as possible, and the number of samples in each dataset must be sufficiently large. The datasets used in this study are taken from the well-known UCI Repository [18]. Attributes in these datasets have numerical values only; such datasets are relatively fewer than those with nominal or mixed attribute types. The number of samples in the datasets ranges from 106 to 20,000, the number of attributes from as few as 4 to as many as 90, and the number of categories from as few as 2 to as many as 30. Altogether, twenty-five datasets are selected; they are summarized in Table I.
TABLE I. DATASETS USED IN THE STUDY

     Dataset                           No. of samples   No. of attributes   No. of categories
 1   letter-recognition                     20000               16                  26
 2   pen-based-recognition                   7494               16                  10
 3   wall-following-robot-navigation         5456               24                   4
 4   waveform                                5000               21                   3
 5   winequality-white                       4898               11                   7
 6   winequality-red                         1599               11                   6
 7   yeast                                   1480                8                  10
 8   banknote-authentication                 1372                4                   2
 9   pima-indians-diabetes                    768                8                   2
10   transfusion                              748                4                   2
11   user-knowledge-modeling                  403                5                   5
12   movement-libras                          360               90                  15
13   leaf                                     340               15                  30
14   ecoli                                    336                7                   8
15   vertebral-column-3C                      310                6                   3
16   vertebral-column-2C                      310                6                   2
17   haberman                                 306                3                   2
18   glass                                    214                9                   6
19   seeds                                    210                7                   3
20   sonar-all-data                           208               60                   2
21   parkinson                                195               22                   2
22   planning-relax                           182               12                   2
23   wine                                     178               13                   3
24   iris                                     150                4                   3
25   breast-tissue                            106                9                   6

B. Classifiers

A classifier in this study is taken to mean any algorithm or technique which enables classification by means of supervised learning. Selecting an appropriate classifier for the task at hand is essential in order to ensure a satisfactory result. At present, many classifiers exist and new ones are continually being developed. Unless the work is about developing or researching particular aspects of classification, scientists and engineers tend to resort to popular, open-source classifiers available in the public domain. This study therefore selects popular and well-known classifiers which are likely to be chosen when classification is required. Five popular classifiers are selected for the study: K-Nearest Neighbor (K-NN), Naive Bayes, Decision Tree (C4.5), Neural Network, and Support Vector Machine (SVM). These classifiers represent different characteristics, from a basic classification algorithm (K-NN) to a mathematically oriented one (SVM). As each of these five classifiers is available in the public domain, users tend to assume that suitable discretization has been done to the numerical attributes in the dataset used, although some of the five classifiers, such as the neural network, are capable of dealing with numerical values directly. It is also an objective of this study, as stated in Section I, to investigate whether discretization of numerical attributes is preferable to using numerical values directly.

V. METHODOLOGY

This study intentionally chooses publicly available implementations of the five classifiers stated in Section IV instead of implementing them, because this is also an opportunity to test some of their default discretization methods. CRAN-R is selected as the main tool [19], with the additional library package caret [20] and the classification and regression libraries e1071 [21], klaR [22], and RWeka [23]. The study can be summarized in three steps as follows:

Data Preparation: This is the initial step, which comprises two tasks: replacing missing values and normalization. Each dataset was scanned, and it was found that some samples had missing attribute values. Missing values were replaced by the mean of that attribute in the dataset. After this was done, normalization was carried out on each attribute of each dataset in order to reduce bias between the attributes. The popular Z-score was adopted for normalization.
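The Data Preparation step was carried out in R in this study; the following sketch shows an equivalent mean imputation followed by Z-score normalization in Python/pandas, purely as an illustration. The column names in the example are hypothetical.

import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-impute missing numeric values, then Z-score normalize each attribute."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    # Replace missing values by the attribute mean (as in the Data Preparation step).
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    # Z-score: (x - mean) / standard deviation, to reduce bias between attributes.
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / out[num_cols].std()
    return out

# Hypothetical example with one missing value.
raw = pd.DataFrame({"attr_1": [5.1, None, 6.3], "attr_2": [0.2, 1.3, 2.5]})
print(prepare(raw))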
Discretization: The datasets are now ready for discretization. For each dataset, eight duplicates are made, so that each discretization method can be applied to its own duplicate. Therefore, for each type of dataset there are 9 datasets of that type: one is the original with real numerical attribute values, and the other eight are those whose attributes are discretized by the eight discretization methods described in Section II. As there are 25 different types of data used in this study, there are altogether 225 datasets.

Classification: This process is the application of the five different types of classifiers described in Section IV. As there are 9 datasets for each type of data, each classifier is applied to each dataset, with the original numerical attribute values and with the nominal attribute values produced by the eight different discretization methods. To ensure the validity of the classification, ten-fold cross validation is used and the process is repeated 5 times for each classifier. Therefore, each discretization is applied to 25 datasets and subjected to 5 classifiers, so 5,625 different classifications are carried out in this study.
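The Classification step was performed with CRAN-R, caret, and the classifier packages listed above; as a language-neutral illustration of the same evaluation protocol (ten-fold cross validation repeated 5 times), the sketch below uses Python/scikit-learn with a CART decision tree as a hypothetical stand-in for C4.5. It is not the pipeline actually used in the study.

from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Ten-fold cross validation repeated 5 times, as in the Classification step.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0)        # stand-in for C4.5 (assumption)

scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("mean accuracy over 50 folds: %.3f" % scores.mean())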
VI. RESULTS AND DISCUSSION

There are several perspectives from which the results of the study can be presented; however, it seems tedious to report them at a detailed level, such as the best discretization for each dataset, the best classifier for each dataset, and the best combination of both for each type of dataset. Hence, this section reveals the key findings pertaining to the study.

Among the six datasets for which Chi2 yielded the best performance, the classifiers that were used are shown in Table II.

TABLE II. CLASSIFIERS WITH WHICH CHI2 YIELDED THE BEST PERFORMANCE

     Classifier         No. of datasets
     C4.5                     2
     K-NN                     1
     C4.5                     1
     Neural Network           1
VII. CONCLUSION

This study investigates eight discretization methods together with five popular classifiers. It was found that the top three methods that yielded the best performance are Chi2, MDLP, and CACC, and that none of the eight discretization methods outperformed the others by a significant margin. Note that discretization of numerical attributes may not always guarantee better performance, especially when the classifiers used are those which were originally developed to deal with numerical values (i.e. Neural Network, K-NN, and SVM). Nevertheless, this study recommends Chi2 and MDLP for discretization. Hence, discretization of numerical attributes by these two methods ought to be carried out, especially when resorting to public-domain software.

With respect to classifiers, this study recommends C4.5 as the first choice of classifier for nominal attributes, together with either Chi2 or MDLP as the discretization method. It must be emphasized that the above are recommendations and are not meant to be an absolute rule, as the study reveals that the nature of the dataset tends to influence the choice of classifier as well as of discretization. If discretization of numerical attributes cannot be carried out or is disregarded, then classifiers such as Neural Network, SVM, and K-NN ought to be the preferred choice.

As discretization is an additional process in data preparation, the computational complexity of each discretization method must be considered, especially in applications where response time is critical, as some methods are likely to be more resource demanding than others. Future work can also be carried out on lesser-known classifiers, as it may reveal further useful facts about discretization.

ACKNOWLEDGMENT

The authors gratefully acknowledge the School of Information Technology (SIT), King Mongkut's University of Technology Thonburi (KMUTT) for the partial scholarship granted to Mr. Supharoek Chattanachot for this research and for the support of the computing facilities.

REFERENCES

[1] J. Han, J. Pei, and M. Kamber, Data Mining: Concepts and Techniques, Third Edition, Elsevier, 2011.
[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann, 2005.
[3] L. Peng, W. Qing, and G. Yujia, "Study on comparison of discretization methods," in Proc. International Conference on Artificial Intelligence and Computational Intelligence (AICI '09), 2009, vol. 4, pp. 380–384.
[4] R. Kerber, "ChiMerge: Discretization of numeric attributes," in Proc. Tenth National Conference on Artificial Intelligence (AAAI-92), San Jose, California, 1992, pp. 123–128.
[5] H. Liu and R. Setiono, "Chi2: Feature selection and discretization of numeric attributes," in Proc. Seventh International Conference on Tools with Artificial Intelligence (ICTAI '95), 1995, pp. 388–391.
[6] F. E. H. Tay and L. Shen, "A modified Chi2 algorithm for discretization," IEEE Trans. Knowledge and Data Engineering, vol. 14, no. 3, pp. 666–670, May 2002.
[7] C.-T. Su and J.-H. Hsu, "An extended Chi2 algorithm for discretization of real value attributes," IEEE Trans. Knowledge and Data Engineering, vol. 17, no. 3, pp. 437–441, Mar. 2005.
[8] L. A. Kurgan and K. J. Cios, "CAIM discretization algorithm," IEEE Trans. Knowledge and Data Engineering, vol. 16, no. 2, pp. 145–153, Feb. 2004.
[9] C.-J. Tsai, C.-I. Lee, and W.-P. Yang, "A discretization algorithm based on Class-Attribute Contingency Coefficient," Information Sciences, vol. 178, no. 3, pp. 714–731, Feb. 2008.
[10] L. Gonzalez-Abril, F. J. Cuberos, F. Velasco, and J. A. Ortega, "Ameva: An autonomous discretization algorithm," Expert Systems with Applications, vol. 36, no. 3, part 1, pp. 5327–5332, Apr. 2009.
[11] U. M. Fayyad and K. B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proc. 13th International Joint Conference on Artificial Intelligence, 1993, pp. 1022–1027.
[12] F. Kaya, "Discretizing continuous features for naive Bayes and C4.5 classifiers," University of Maryland publications, College Park, MD, USA, 2008.
[13] M. Mizianty, L. Kurgan, and M. Ogiela, "Comparative analysis of the impact of discretization on the classification with Naive Bayes and Semi-Naive Bayes classifiers," in Proc. Seventh International Conference on Machine Learning and Applications (ICMLA '08), 2008, pp. 823–828.
[14] R. Abraham, J. B. Simha, and S. S. Iyengar, "A comparative analysis of discretization methods for medical data mining with naive Bayesian classifier," in Proc. Information Technology (ICIT '06), 2006, pp. 235–236.
[15] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proc. Twelfth International Conference on Machine Learning (ML-95), 1995, vol. 12, pp. 194–202.
[16] Y. Yang and G. I. Webb, "A comparative study of discretization methods for naive-Bayes classifiers," in Proc. 2002 Pacific Rim Knowledge Acquisition Workshop (PKAW 2002), 2002, vol. 2002.
[17] D. Ventura and T. R. Martinez, "An empirical comparison of discretization methods," in Proc. Tenth International Symposium on Computer and Information Sciences (ISCIS X), 1995, pp. 443–450.
[18] M. Lichman, UCI Machine Learning Repository, 2013. [Online]. Available: [Link]
[19] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2016. [Online]. Available: [Link]
[20] M. Kuhn, The caret Package, 2016. [Online]. Available: [Link]/web/packages/caret/[Link]
[21] D. Meyer et al., e1071, 2015. [Online]. Available: [Link]/web/packages/e1071
[22] C. Roever et al., klaR, 2014. [Online]. Available: [Link]/web/packages/klaR
[23] K. Hornik, C. Buchta, and A. Zeileis, "Open-source machine learning: R meets Weka," Computational Statistics, vol. 24, no. 2, pp. 225–232, May 2008.