Data mining (DM) transforms raw data into higher-level quality information [2]. As distinguished from the data warehouse, which takes a uniform approach to the data, DM gives results that show relations and interdependences in the data. These dependences are mostly based on various mathematical and statistical relations [3]. Figure 9 represents the process of knowledge discovery in data.
Figure 9: Process of knowledge discovery in data, based on [10]
The data for this concrete research are collected from the internal database system of the Students' Service Department and from external sources in the form of various technological documents, decisions, reports, lists, etc. After the selection of the data for analysis, a DM method is applied, leading to appropriate rules of behavior and appropriate patterns. Knowledge about the observed features is presented through the discovered patterns. In the literature, DM is also known as "extraction of knowledge", "pattern analysis", or "data archaeology" [3].
3.1. Regression trees
A regression tree is, following [7], a nonparametric model which looks for the best local prediction, or explanation, of a continuous response through recursive partitioning of the space of predictor variables. The fitted model is usually displayed as a graph in the format of a binary decision tree which grows from the root node to the terminal nodes, also called leaves.
Figure 10: Graphical Display of a Regression tree based on [7]
Figure 10 illustrates the features of a model provided by a regression tree that explains the relationship between a response variable $y$ and a set of two explanatory variables $x_1$ and $x_2$. In the example above, the predicted values for $y$ are obtained through a chain of logical statements that split the data into four subsets.
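Such a chain of logical statements is easy to express directly in code. The following is a minimal sketch, with purely hypothetical thresholds and leaf values, of how a four-leaf tree like the one in Figure 10 turns two predictors into a piecewise-constant prediction:

```python
def predict(x1: float, x2: float) -> float:
    """Piecewise-constant prediction of a hypothetical four-leaf tree."""
    if x1 <= 5.0:              # root node: split on x1
        if x2 <= 3.0:          # left child: split on x2
            return 1.2         # leaf 1: local constant model
        return 2.7             # leaf 2
    if x2 <= 8.0:              # right child: split on x2
        return 4.1             # leaf 3
    return 6.3                 # leaf 4

print(predict(4.0, 2.0))  # falls into leaf 1 -> 1.2
```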
Mathematical Formulation
Let $\mathbf{x}_t = (x_{1t}, x_{2t}, \ldots, x_{mt})'$ be a vector which contains $m$ explanatory variables for a continuous univariate response $y_t$. The relationship between $y_t$ and $\mathbf{x}_t$ follows the regression model:

$$y_t = f(\mathbf{x}_t) + \varepsilon_t \qquad (3.1.)$$

where the functional form $f$ is unknown and there are no assumptions about the random term $\varepsilon_t$. Following [14], a regression tree model with $k$ leaves is a recursive partitioning model that approximates $f$ by fitting functions in subregions of some domain $D \subseteq \mathbb{R}^m$, so that:

$$f(\mathbf{x}_t) = \sum_{i=1}^{k} f_i(\mathbf{x}_t)\, I_i(\mathbf{x}_t) \qquad (3.2.)$$
where $I_j(\mathbf{x}_t)$ indicates the membership of the $t$th observation in the $j$th leaf, which constitutes a subregion of $D$. The functional form of $f_i$ is usually taken to be a constant and, conditionally on the knowledge of the subregions, the relationship between $y$ and $\mathbf{x}$ in (3.1.) is approximated by a linear regression on a set of $k$ dummy variables. The authors of [5] present evidence that there is not much gain in choosing $f_i$ to be a linear or a more complex function of $\mathbf{x}_t$.
3.1.1. CART Regression Tree Approach
The most important reference on regression tree models is the CART (Classification and Regression Trees) approach of [6], so the discussion from now on is based entirely on that work. The top-down method of growing the tree implies first specifying a model associated with the simplest tree, as in Figure 11.
Figure 11: Simplest tree structure based on [7]
To locate the model parameters in the tree structure above, we adopt a labelling scheme similar to the one used in [8]. Here, the root node is at position 0, and a parent node at position $i$ generates the left child node and the right child node at positions $2i+1$ and $2i+2$, respectively. Therefore, a tree model equation for Figure 11, which fits a constant model in each leaf, may be written as:

$$y_t = \beta_{01} I_{01}(\mathbf{x}_t; \mathbf{w}_0, c_0) + \beta_{02} I_{02}(\mathbf{x}_t; \mathbf{w}_0, c_0) + \varepsilon_t \qquad (3.3.)$$
$$I_{01}(\cdot) = \begin{cases} 1, & \text{if } \mathbf{w}_0'\mathbf{x}_t \le c_0 \\ 0, & \text{if } \mathbf{w}_0'\mathbf{x}_t > c_0 \end{cases} \qquad (3.4.)$$
$$I_{02}(\cdot) = 1 - I_{01}(\cdot) \qquad (3.5.)$$
In (3.4.), the hyperplane $\mathbf{w}_0'\mathbf{x}_t = c_0$ which defines the binary partition of the variable space is obtained from a vector of weights $\mathbf{w}_0 = (w_{01}, w_{02}, \ldots, w_{0m})'$ and a scalar threshold $c_0$.
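The indicator pair (3.4.)-(3.5.) is straightforward to implement. Below is a minimal sketch, with a hypothetical weight vector and threshold, of the membership test for a general (oblique) hyperplane split:

```python
import numpy as np

def I01(x: np.ndarray, w0: np.ndarray, c0: float) -> float:
    """Indicator (3.4): 1 if the observation lies on or below the
    hyperplane w0'x = c0, else 0."""
    return 1.0 if w0 @ x <= c0 else 0.0

def I02(x: np.ndarray, w0: np.ndarray, c0: float) -> float:
    """Complementary indicator (3.5)."""
    return 1.0 - I01(x, w0, c0)

x = np.array([2.0, 5.0])          # one observation
w0 = np.array([1.0, 0.5])         # hypothetical oblique weight vector
print(I01(x, w0, 5.0), I02(x, w0, 5.0))  # 1.0 0.0, since w0'x = 4.5
```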
The constant $\beta_{01}$ is the local model for the observations that are assigned to the left child node, that is, those for which $I_{01}(\cdot) = 1$; the analogous statement holds for the parameter $\beta_{02}$. Hence the model in (3.3.) approximates the unknown function $f$ in (3.1.) by a non-smooth and discontinuous function. In the CART context, it is usual to obtain the recursive partitioning by using hyperplanes that are perpendicular to the predictor variable axes, which means that the vector $\mathbf{w}_0$ is composed of $m-1$ zeros and a unity element in the $s_0$th position. This simplifies the specification of the model, allowing a far less expensive search than one over the continuous space of hyperplanes. For practical purposes, it means replacing the vector of parameters $\mathbf{w}_0$ in (3.3.) by a scalar parameter $s_0$ (the index of the splitting variable) which takes values in the set of integers $S = \{1, 2, \ldots, m\}$, so that:
$$I_{01}(\cdot) = \begin{cases} 1, & \text{if } x_{s_0 t} \le c_0 \\ 0, & \text{if } x_{s_0 t} > c_0 \end{cases} \qquad (3.6.)$$

$$I_{02}(\cdot) = 1 - I_{01}(\cdot) \qquad (3.7.)$$
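In the axis-parallel case of (3.6.)-(3.7.), the membership test reduces to comparing one coordinate against the threshold, which vectorizes naturally over all observations. A minimal sketch, with a hypothetical splitting variable index and threshold:

```python
import numpy as np

def indicators(X: np.ndarray, s0: int, c0: float):
    """Return I01 and I02 of (3.6)-(3.7) for each row of X
    (rows = observations, columns = explanatory variables)."""
    I01 = (X[:, s0] <= c0).astype(float)  # 1 if x_{s0,t} <= c0, else 0
    I02 = 1.0 - I01                       # complementary indicator
    return I01, I02

X = np.array([[1.0, 9.0], [4.0, 2.0], [7.0, 5.0]])
I01, I02 = indicators(X, s0=0, c0=3.5)   # split on the first variable
print(I01, I02)                          # [1. 0. 0.] [0. 1. 1.]
```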
By omitting the arguments of the indicator functions, more complex tree structures are represented in the equations that follow Figure 12:

$$y_t = \beta_{02} I_{02} + (\beta_{11} I_{11} + \beta_{12} I_{12})\, I_{01} + \varepsilon_t$$
$$y_t = (\beta_{11} I_{11} + \beta_{12} I_{12})\, I_{01} + (\beta_{21} I_{21} + \beta_{22} I_{22})\, I_{02} + \varepsilon_t$$
Figure 12: Regression trees with 3 and 4 terminal nodes based on [7]
$\beta_{ij}$ and $I_{ij}$ denote, respectively, the constant model and the indicator function for the subset of observations located at the node at position $i$ that are assigned to the child node $j$ ($j = 1$ left, $j = 2$ right).
3.1.2. Growing the Tree by the CART Algorithm
The tree-growing induction can be seen as an iterative and recursive nonparametric modeling cycle which, at each iteration, specifies a tree architecture by selecting a node to be split, a splitting variable, and a threshold. Then, local models are
estimated for the observations assigned to the generated nodes. This procedure is repeated recursively in the created nodes until a point is reached where there is no gain in partitioning the observed data. The final model can be evaluated through cost-complexity measures and re-specified by cutting some branches of the tree. The cycle starts with the specification of the simplest model (Figure 11) by selecting a splitting variable $x_{s_0}$ and a threshold $c_0$ and estimating the parameters $\beta_{01}$ and $\beta_{02}$. The selection and estimation procedures are carried out simultaneously by exhaustively searching for the pair $(s_0, c_0)$ that minimizes the sum of squared errors or the sum of absolute deviations for the model in equation (3.3.). The tree obtained in the former case is called an LS (Least Squares) regression tree. Conditionally on $s_0$ and $c_0$, it is straightforward to show that the best estimators of $\beta_{01}$ and $\beta_{02}$, in the least squares sense, are the sample means of the response variable within the child nodes. The tree grown by minimizing the sum of absolute deviations is called an LAD (Least Absolute Deviation) regression tree, and it can be shown that the sample median of the response variable within the terminal nodes is the best estimator in this case. More about absolute deviation regression can be found in [16] and [4]. After splitting the root node (node 0), and if we use the least squares criterion, the tree model may be re-specified by selecting the tree architecture on the left side of Figure 12.
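The split search described above is simple to prototype. The following is a minimal sketch (not the authors' implementation) of the exhaustive search for the pair $(s_0, c_0)$ that minimizes the sum of squared errors; the best leaf constants are the child sample means, and swapping mean/squared error for median/absolute deviation would give the LAD variant:

```python
import numpy as np

def best_split(X: np.ndarray, y: np.ndarray):
    """Exhaustive LS search: return (s0, c0, sse, left_mean, right_mean)."""
    best = (None, None, np.inf, None, None)
    for s in range(X.shape[1]):               # candidate splitting variables
        for c in np.unique(X[:, s])[:-1]:     # candidate thresholds
            left = X[:, s] <= c
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if sse < best[2]:
                best = (s, c, sse, y[left].mean(), y[~left].mean())
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = np.where(X[:, 0] <= 5, 1.0, 4.0) + rng.normal(0, 0.1, 50)
print(best_split(X, y))  # should recover a split near x1 = 5
```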
The recursive partitioning implies that all parameters not involved in the current splitting are kept fixed. Thus, in the equation on the left side of Figure 12, the estimates of $\beta_{11}$ and $\beta_{12}$ are found by selecting the pair $(s_1, c_1)$ that minimizes the overall sum of squared errors conditionally on $s_0$, $c_0$ and $\beta_{02}$. The specification/estimation process continues by maximizing the decrease in the sum of squared errors when adding a terminal node. This procedure forces the sum of squared errors to decrease at each iteration, so it becomes necessary to establish a stopping rule to decide whether a generated node shall be split recursively or declared terminal; otherwise the procedure leads to overfitting. In [6] it is suggested that a generated node containing five or fewer observations shall be declared terminal. To reduce the complexity of the tree, a final diagnostic check can be performed through a pruning technique.
We present in Figures 13 and 14 a version of the CART algorithm for fitting a regression tree model. The tree-growing algorithm is shown in Figure 13, while Figure 14 highlights the node-splitting procedure. In the node at the $i$th position, the algorithm executes an exhaustive search to find the index $s_i$ of the univariate splitting variable and the threshold $c_i$ that provide the minimum value of a loss function $L$, usually the sum of squared errors or of absolute deviations. A node is declared terminal if its number of observations is five or fewer. According to the adopted labelling scheme, we control the status of the $i$th node by assigning 0 (test node) or 1 (terminal node). During the execution of the algorithm, the status of a node can change from terminal (1) to test (0), and node positions that have to be skipped are labelled with -1.
Figure 13: General form of CART algorithm based on [7]
Figure 14: CART splitting algorithm based on [7]
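As a complement to the pseudocode in Figures 13 and 14, here is a self-contained sketch of the growing loop under the conventions stated above (children of node $i$ at positions $2i+1$ and $2i+2$; a node with five or fewer observations declared terminal; status 0 for test nodes and 1 for terminal nodes). It is an illustrative reconstruction, not the authors' code:

```python
import numpy as np

def best_split(X, y):
    """Exhaustive LS search for (s0, c0), as in the previous sketch."""
    best = (None, None, np.inf)
    for s in range(X.shape[1]):
        for c in np.unique(X[:, s])[:-1]:
            left = X[:, s] <= c
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if sse < best[2]:
                best = (s, c, sse)
    return best

def grow(X, y, node=0, tree=None):
    """Recursively grow an LS regression tree using the position
    labelling 2i+1 / 2i+2 and status codes 0 (test) and 1 (terminal)."""
    if tree is None:
        tree = {"status": {}, "split": {}, "value": {}}
    tree["value"][node] = float(y.mean())     # local constant model
    s, c, _ = best_split(X, y)
    if len(y) <= 5 or s is None:              # stopping rule from the text
        tree["status"][node] = 1
        return tree
    tree["status"][node] = 0
    tree["split"][node] = (s, c)
    left = X[:, s] <= c
    grow(X[left], y[left], 2 * node + 1, tree)
    grow(X[~left], y[~left], 2 * node + 2, tree)
    return tree

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (40, 2))
y = np.where(X[:, 1] <= 4, 2.0, 5.0)
print(grow(X, y)["split"])  # the root split should be near x2 = 4
```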
3.2. Illustrated examples of DM
More than one hundred cluster models and decision tree models have been implemented in this project. For practical reasons, only some representative examples are presented here. The implementation uses Microsoft Decision Trees (MDT), which is based on the CART algorithm [17] described in Section 3.1.
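MDT itself ships with SQL Server and is not reproduced here; as an illustration of the same kind of analysis, the sketch below fits scikit-learn's CART-based regressor to synthetic student records. All column names, encodings, and data are hypothetical, not the paper's data set:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "attempts": rng.integers(1, 5, n),      # number of exam attempts
    "department": rng.integers(0, 4, n),    # department, integer-encoded
    "gender": rng.integers(0, 2, n),        # gender, integer-encoded
})
# Synthetic rule mirroring the paper's finding: fewer attempts -> better marks.
df["mark"] = (10 - df["attempts"] - 0.3 * df["department"]
              + rng.normal(0, 0.5, n))

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5)
tree.fit(df[["attempts", "department", "gender"]], df["mark"])
print(export_text(tree, feature_names=["attempts", "department", "gender"]))
```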
The first example is the evaluation of students' marks based on the following criteria: exam, department, number of attempts, and gender. MDT, shown in Figure 15, chooses the number of attempts as the most significant attribute for the first level. The algorithm then chooses department and gender as the next attributes by importance. Finally, the expected marks are evaluated depending on the values of the analyzed attributes.
Figure 15: MDT for evaluation of marks
Based on the decision tree, the following can be concluded: students who pass an exam in fewer attempts get better marks, and students from the first and second year of studies (Common Bases Department, CB) get worse marks than students from the third and fourth year.
Figure 16 represents the correlation between the criteria for evaluating the attained mark. Moving the cursor towards the left-hand side of the figure displays the intensity of the correlation between the observed criteria. It is easily perceived that the obtained mark depends least on the student's gender, somewhat more on the student's department, and most on the number of attempts.
Figure 16: Analysis of correlation between criteria for evaluation of the attained mark
Figure 17: MDT for evaluation of length of studying period based on [13]
The second example represents an MDT for evaluating the length of the studying period (Figure 17).
The relevant criteria for decision-making are, firstly, the observed year of studies, secondly, the observed average student's mark, and then, for one category, the student's department. The total population observed in this analysis is 5524 students (Figure 17).
Table 1: Interval limits of the length of studies [13]

Interval             I             II           III          IV            V
Span (years)         up to 0.856   0.86-2.758   2.76-4.760   4.761-7.761   Noise
Number of students   1627          1495         1248         1152          2
%                    29.45         27.06        22.59        20.858        0.04
Accordingly, out of the total number of observed students, 1627 students or 29.45% belong to the first interval, 1495 students or 27.06% to the second, 1248 students or 22.59% to the third, 1152 students or 20.858% to the fourth, and 2 students or 0.04% to the fifth interval. The two students in the fifth interval have erroneously entered data.
For example, if it is necessary to evaluate the length of studies of a second-year student whose average mark is over 7.70, it can be claimed that his length of studies will be up to 2.758 years, with a probability of 0.2592, i.e. 25.92%.
Figure 18 shows the correlation between the criteria for evaluating the length of studies. Moving the cursor towards the left-hand side of the figure displays the intensity of the correlation between the observed criteria. It is easily perceived that the length of studies depends least on the student's department and gender, and more on the average mark and the student's current status.
Figure 18: Analysis of dependence between criteria for evaluation of the length of studies, based on [18]
The next example shows clustering by the criterion of the average mark for the students whose data are in the data warehouse.
Table 2: Average student's mark, based on [18]

Student type   Cluster No.   Avg. mark              Number of students   Min. avg.   Max. avg.
The worst      1             Without passed exams   1176                 ------      ------
Poor           2             6.54                   1760                 6.000       7.093
Average        3             7.32                   1140                 7.095       7.556
Very good      4             7.77                   1080                 7.558       8.000
The best       5             8.89                   1121                 8.023       10.000
From Table 2 we can see that 1176 students belong to the first category, without passed exams and therefore without any possibility of having an average mark calculated. Students with an average mark of 6.54, i.e. within the interval from 6.000 to 7.093, belong to the second category; their total number is 1760. The 1140 students with an average mark of 7.32, i.e. within the interval from 7.095 to 7.556, belong to the third category. The 1080 students with an average mark of 7.77, i.e. within the interval from 7.558 to 8.000, belong to the fourth category. Finally, the fifth category comprises the students with the best average marks, within the interval from 8.023 to 10.000; their total number is 1121.
The second clustering example groups students by the criterion of length of studies.
Table 3: The results of clustering by length of studies for bachelor students, based on [18]

Length of studies   Cluster No.   Count of students   Avg. length (years)   Min. length (years)   Max. length (years)   Percent
Short               1             84                  4.67                  4.03                  5.30                  25.15%
Average             2             83                  5.69                  5.31                  6.09                  24.85%
Long                3             84                  6.44                  6.091                 6.79                  25.15%
Very long           4             83                  8.17                  6.80                  9.55                  24.85%
The result of this clustering comprises four categories. The first category consists of the students with the shortest length of studies: the average is 4.67 years, and the total number of students within the interval from 4.03 to 5.30 years is 84, or 25.15%. The second category comprises students with an average length of studies of 5.69 years, within the interval from 5.31 to 6.09 years; their total number is 83, or 24.85%, and so on. It has to be noted that the observed data refer to bachelor students over a period of not more than 10 years, so the interval of the last category runs from 6.80 years to approximately 10 years.
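A minimal sketch of this kind of clustering is given below, using scikit-learn's KMeans on synthetic lengths of studies (the data are generated to resemble Table 3 and are not the paper's warehouse data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
years = np.concatenate([               # hypothetical lengths of studies
    rng.normal(4.7, 0.3, 84), rng.normal(5.7, 0.2, 83),
    rng.normal(6.4, 0.2, 84), rng.normal(8.2, 0.7, 83),
]).reshape(-1, 1)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(years)
for c in range(4):                     # per-cluster summary, as in Table 3
    members = years[km.labels_ == c]
    print(f"cluster {c}: n={len(members)}, avg={members.mean():.2f}, "
          f"min={members.min():.2f}, max={members.max():.2f}")
```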
4. IMPLEMENTATION
In the implementation of DW and DM we have used MS SQL Server 2000, SQL Server 2000 DTS services, OLAP Services 8.0, Excel 2000, ProClarity, and a Web application. Figure 19 shows the schema of the data warehouse and data mining solution for the student data service.
Figure 19: Scheme of the data warehouse and data mining solution, based on [11]
In conclusion, the data warehouse is a very flexible solution for end-user purposes: with everyday tools (e.g. Microsoft Excel) the user can explore the database more efficiently than with any tool from the OLTP environment. A significant advantage of this approach to discovering knowledge and information in databases is that the user does not have to possess knowledge of the relational model and complex query languages.

It is obvious that MDT algorithms give remarkable results for analyses aimed at good prediction.
MDT is based on the probabilities of various attributes and is used when the prediction of cases is necessary. The MDT algorithm also generates rules: every node in the tree can be expressed as a set of rules that describe it and the nodes that lead to it. The clusters, on the other hand, are based on the statistics of various attributes. They provide a model that is not used for prediction but is very efficient in finding records with common attributes, so that such records can be put in the same group.
MDT enables the user to analyze a great number of various DM problems with the aim of timely prediction. Using the sound elements that accompany the prediction, it is possible to make good business plans and lead the business system towards the benchmark. It can be said with satisfaction that this method of data analysis and decision-making is becoming more and more popular in solving new problems.
5. COMPARATIVE ANALYSIS
The following table contains the comparative features of the existing and the new solution.
Table 4: Comparative features of the existing and the new solution, based on [11]

                               Existing system                     New system
Memory space                   100 MB                              150 MB
Avg. query time                3 sec                               < 0.5 sec
User-defined queries           NO                                  YES
Detection of faulty data       NO                                  YES
Sophisticated analysis         NO                                  YES
Used for                       Data processing and data analysis   Data analysis
Data security                  NO                                  YES
Data granularity               Detailed data                       Aggregated data
Dependency analysis            NO                                  YES
Complex statistical analysis   NO                                  YES
Current technique              Relational databases                Data warehouse and data mining
Referring to Table 4, we can conclude that the features of the new system are much better with respect to data analysis; the only disadvantages are that it takes up more memory space for storing aggregated data and that it cannot be used for data processing.
6. CONCLUSION
This paper shows the phases through which a DW and DM solution is formed. Based on the demonstration, we can conclude that DW offers the user a flexible solution: with tools like Excel and user-defined queries, the database can be explored more efficiently than with any other tool from the OLTP environment.

The significant benefit of this approach to information and knowledge retrieval in databases is that the user does not need to possess knowledge of the relational model and complex query languages.

This approach to data analysis is becoming more and more popular because it enables OLTP systems to stay optimized for their purpose while data analysis is transferred to OLAP systems.
REFERENCES
[1] Devlin, B., Data Warehouse: From Architecture to Implementation, Addison-Wesley, 1997.
[2] Berry, M.J.A., and Linoff, G., Mastering Data Mining: The Art and Science of Customer Relationship Management, 1999.
[3] Thuraisingham, B., Data Mining: Technologies, Techniques, Tools and Trends, 1999.
[4] Birkes, D., and Dodge, Y., Alternative Methods of Regression, John Wiley & Sons, 1993.
[5] Breiman, L., and Meisel, W.S., "General estimates of the intrinsic variability of data in nonlinear regression models", Journal of the American Statistical Association, 71 (1976) 301-307.
[6] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., Classification and Regression Trees, Wadsworth, Belmont, 1984.
[7] da Rosa, J.C., Veiga, A., and Medeiros, M.C., Tree-Structured Smooth Transition Regression Models Based on CART Algorithm, Department of Economics, Rio de Janeiro, 2003.
[8] Denison, D.G.T., Mallick, B.K., and Smith, A.F.M., "A Bayesian CART algorithm", Biometrika, 85 (1998) 363-377.
[9] Gunderloy, M., and Sneath, T., SQL Server Developer's Guide to OLAP with Analysis Services, Sybex, 2001.
[10] Han, J., and Kamber, M., Data Mining: Concepts and Techniques, Simon Fraser University, 2001.
[11] Krulj, D., "Design and implementation of data warehouse systems", M.Sc. Thesis, Faculty of Organizational Sciences, Belgrade, 2003.
[12] Krulj, D., Suknović, M., Čupić, M., Martić, M., and Vujnović, T., "Design and development of OLAP system FOS student service", INFOFEST, Budva, 2002.
[13] Krulj, D., Vujnović, T., Suknović, M., Čupić, M., and Martić, M., "Algorithm of Data Mining, good base for decision making", SYM-OP-IS, Tara, 2002.
[14] Lewis, P.A.W., and Stevens, J.G., "Nonlinear modeling of time series using multivariate adaptive regression splines (MARS)", Journal of the American Statistical Association, 86 (1991) 864-877.
[15] Lory, O., and Crandall, M., Programmer's Guide for Microsoft SQL Server 2000, Microsoft Press, 2001.
[16] Narula, S.C., and Wellington, J.F., "The minimum sum of absolute errors regression: A state of the art survey", International Statistical Review, 50 (1982) 317-326.
[17] Seidman, C., Data Mining with Microsoft SQL Server 2000, Microsoft Press, 2001.
[18] Suknović, M., Čupić, M., Martić, M., and Krulj, D., "Design and development of FOS Data Warehouse", SYM-OP-IS, Tara, 2002.
[19] Suknović, M., Krulj, D., Čupić, M., and Martić, M., "Design and development of FOS Data Warehouse", SYMORG, Zlatibor, 2002.
[20] Poe, V., Building a Data Warehouse for Decision Support, Prentice Hall, 1996.