

Forecasting to Classification: Predicting the direction of
stock market price using Xtreme Gradient Boosting

Working Paper, October 2016. DOI: 10.13140/RG.2.2.15294.48968

Shubharthi Dey∗, PESIT South Campus, Bangalore, Karnataka, India (deysubharthi15@[Link])
Yash Kumar†, PESIT South Campus, Bangalore, Karnataka, India (yash.kumar1396@[Link])
Snehanshu Saha‡, PESIT South Campus, Bangalore, Karnataka, India (snehanshusaha@[Link])
Suryoday Basak§, PESIT South Campus, Bangalore, Karnataka, India (suryodaybasak@[Link])

ABSTRACT
Stock market prediction is the art of determining the future value of a company stock or other financial instrument traded on an exchange. It has been a real challenge for analysts and traders to predict the trends of the stock market due to its uncertain nature. Stock prices are likely to be influenced by factors like product demand, sales, manufacturing, investor sentiment, the ruling government, recession, etc. The successful prediction of a stock's future price could yield significant profit. The main aim of this paper is to design an efficient model which accurately predicts the trend of the stock market using eXtreme Gradient Boosting (XGBoost), which has proved to be an efficient algorithm with over 87% accuracy for 60-day and 90-day periods and has proved to be much better than traditional non-ensemble learning techniques. The prediction problem has been reconstructed as a classification problem, and XGBoost turned out to be significantly better than the algorithms found in the literature. The proposed model outperforms all existing forecasting models in the literature and is able to forecast on a long-term basis, an added feature absent in the literature.

Keywords
XGBoost, ensemble learning, exponential smoothing

∗First author and second author share equal credit.
†First author and second author share equal credit.
‡Dr. Saha is the corresponding author.
§Suryoday is with the Center of Applied Mathematical Modeling.

1. INTRODUCTION
The stock market has always been an area of interest for people around the world. Perception and experience are put to good use by non-practitioners to invest in a particular company hoping for a multiplied return. Stock market trend prediction has also been an area of keen interest for statisticians and computer scientists, simply because the area throws up complex modeling questions. There exist methods or algorithms which can predict stock valuation with a fair degree of accuracy. However, a question still persists: if a person decides to buy shares of a particular company, what is the probability that it turns out to be a successful expedition or a mere failure? An informed guess works on a broad basis, i.e., considering the production, sales and demand of the organization in the present scenario, it may be fine to invest in a particular stock. However, it is too much to expect this to work in complex situations, ignoring certain nuanced concepts and factors that govern the market. For example, the political situation in a country may be inefficient and too volatile to handle the economy of the country. A fall in the economy triggers a fall in the stock value of a company. Due to these minute and chaotic parameters, prediction becomes increasingly difficult. Traders tend to invest in a firm which has potential or a history of good returns based on the current situation. However, there always exists a possibility that a company which appears to have incurred a loss may be the potential firm in which traders/investors have continued faith.
Over the past years, there has been enormous research in this field pivoted around statistical machine learning. Different predictive models and algorithms, accurate to a certain degree, have been proposed and tested. Implementation of machine learning techniques is an evolving concept. This is somewhat different from traditional forecasting and diffusion methods. Early models used in stock forecasting involved statistical methods such as time series models and multivariate analysis (Gencay (1999), Timmermann and Granger (2004), Bao and Yang (2008)). Our paper mainly focuses on the machine learning approach of analysis, as it is evident that the historical data set obtained (say, from the date when the company came into existence) is impossible to analyze without data mining methods.
The prediction produced by our proposed algorithm may help people decide whether to invest in a particular company, taking into consideration the chaos and volatility of stocks. We adopted a machine learning approach different from the commonly practiced ones such as Support Vector Machines, Neural Networks, the Naive Bayesian Classifier, Linear Discriminant Analysis, etc. The next section discusses related literature.

2. LITERATURE SURVEY
Recent years witnessed considerable traction in the field of stock market prediction, especially from the machine learning point of view. Prediction of stock market behavior is used to determine future trends. Prediction is usually accomplished by analyzing historic time series data.
Several algorithms have been used in stock prediction, such as SVM, Neural Networks, Linear Discriminant Analysis, Logistic Regression, Linear Regression, KNN and the Naive Bayesian Classifier. It was found that logistic regression was one of the best, with a success rate of 55.65%. Dai and Zhang (2013) used training data from 3M stock data. The data contains daily stock information ranging from 1/9/2008 to 11/8/2013 (1471 data points). Multiple algorithms were chosen to train the prediction system. These algorithms include Logistic Regression, Quadratic Discriminant Analysis and SVM. The algorithms were applied to a next-day model, which predicted the outcome of the stock price on the next day, and a long-term model, which predicted the outcome of the stock price for the next n days. The next-day prediction model produced accuracy results ranging from 44.52% to 58.2%. SVM reported the highest accuracy of 79.3% for the long-term prediction, where the time window taken was [Link]. In one of the published papers [11] on ANN (artificial neural networks), the model used by the authors to forecast the direction of the Japanese stock market gave an accuracy of 81.27%. In the published paper [10] on Random Forest (RF), an ensemble technique is used to predict stock market prices (trend up or down) and returned the highest accuracy of 78.81% for the ith day on the direction of movement of the daily TSE (Tehran Stock Exchange) index.
Ensemble learning algorithms remain largely unexplored, however. The focus of this paper is to implement the ensemble learning technique and to discuss its advantages over non-ensemble techniques. We will be using an ensemble learning method known as XGBoost to build our predictive model. Our model has been trained for 60, 90 and 120 days respectively, and the results were impressive. Moreover, the majority of the related work focused on a time window of 10 to 44 days on average, as most of the authors preferred to use metric classifiers on time series data which is not smoothed. Therefore, those models are unable to learn from the data set when it comes to predicting over a long-term window. Our model first smooths the data and, being built on a non-metric classifier, is capable of predicting accurately over a long-term time window.
The paper will highlight certain critical aspects largely ignored by most of the literature. These include analyzing the non-linearity in the features used for analysis and the futility of employing linear classifiers, and long-run predictions running into 90 days, where related manuscripts considered time windows of up to 44 days. The significant improvement in accuracy obtained by our approach embellishes these claims.
The remainder of the paper is organized as follows. Section 3 contains key definitions. Section 4 describes data preprocessing and feature extraction in detail. Section 5 describes the methods and algorithm used. The results of the applied algorithm are obtained and analyzed in Section 6. The next section is a comparative study establishing the superiority of our proposed algorithm. The authors conclude with a comprehensive summary of the work.

3. KEY DEFINITIONS
Time Series Data: A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
Note: We used the Apple and Yahoo data sets. These are time series data sets which are further smoothed exponentially, as discussed in Section 4.
Gradient Boosting: Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Technical Indicators: Technical indicators are important parameters that are calculated from time series stock data and that aim to forecast financial market direction. They are tools which are widely used by investors to check for bearish or bullish signals.
Relative Strength Index: The relative strength index (RSI) is calculated as

RSI = 100 − 100 / (1 + RS)

RS = (average gain over past 14 days) / (average loss over past 14 days)

The RSI is classified as a momentum oscillator, measuring the velocity and magnitude of directional price movements. Momentum is the rate of the rise or fall in price. The RSI computes momentum as the ratio of higher closes to lower closes: stocks which have had more or stronger positive changes have a higher RSI than stocks which have had more or stronger negative changes.
The RSI is most typically used on a 14-day timeframe, measured on a scale from 0 to 100, with high and low levels marked at 70 and 30, respectively. Shorter or longer timeframes are used for alternately shorter or longer outlooks. More extreme high and low levels - 80 and 20, or 90 and 10 - occur less frequently but indicate stronger momentum.
Stochastic Oscillator: The stochastic oscillator is given by

%K = 100 * (C − L14) / (H14 − L14)

where C = current closing price, L14 = lowest low over the past 14 days, and H14 = highest high over the past 14 days.
The term stochastic refers to the position of the current price in relation to its price range over a period of time. This method attempts to predict price turning points by comparing the closing price of a security to its price range.
William %R: Williams %R is given by

%R = −100 * (H14 − C) / (H14 − L14)
where C = current closing price, L14 = lowest low over the past 14 days, and H14 = highest high over the past 14 days.
Williams %R ranges from -100 to 0. When its value is above -20, it indicates a sell signal, and when its value is below -80, it indicates a buy signal.
Moving Average Convergence Divergence (MACD): The formula for calculating MACD is:

MACD = EMA12(C) − EMA26(C)
SignalLine = EMA9(MACD)

where MACD = Moving Average Convergence Divergence, C = closing price series, and EMAn = n-day exponential moving average.
When the MACD goes below the SignalLine, it indicates a sell signal. When it goes above the SignalLine, it indicates a buy signal.
Price Rate of Change: It is calculated as follows:

PROC(t) = (C(t) − C(t−n)) / C(t−n)

where PROC(t) = price rate of change at time t and C(t) = closing price at time t. It measures the most recent change in price with respect to the price n days ago.
On Balance Volume: This technical indicator is used to find the buying and selling trends of a stock.

OBV(t) = OBV(t−1) + Vol(t)   if C(t) > C(t−1)
OBV(t) = OBV(t−1) − Vol(t)   if C(t) < C(t−1)
OBV(t) = OBV(t−1)            if C(t) = C(t−1)

where OBV(t) = on-balance volume at time t, Vol(t) = trading volume at time t, and C(t) = closing price at time t.
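The indicators defined above are straightforward to compute from an OHLCV price series. The following sketch, using pandas, illustrates one possible implementation of RSI, %K, Williams %R, MACD, PROC and OBV; the column names ('close', 'high', 'low', 'volume') and the 14-day window are assumptions for illustration and are not taken from the paper.

```python
import numpy as np
import pandas as pd

def technical_indicators(df, n=14):
    """Illustrative computation of the indicators defined above.
    Assumes columns 'close', 'high', 'low', 'volume'."""
    out = pd.DataFrame(index=df.index)

    # RSI = 100 - 100 / (1 + RS), RS = average gain / average loss over n days
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(n).mean()
    loss = (-delta.clip(upper=0)).rolling(n).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)

    # Stochastic oscillator %K = 100 * (C - L14) / (H14 - L14)
    l14 = df["low"].rolling(n).min()
    h14 = df["high"].rolling(n).max()
    out["stoch_k"] = 100 * (df["close"] - l14) / (h14 - l14)

    # Williams %R = -100 * (H14 - C) / (H14 - L14)
    out["williams_r"] = -100 * (h14 - df["close"]) / (h14 - l14)

    # MACD = EMA12(C) - EMA26(C); signal line = EMA9(MACD)
    ema12 = df["close"].ewm(span=12, adjust=False).mean()
    ema26 = df["close"].ewm(span=26, adjust=False).mean()
    out["macd"] = ema12 - ema26
    out["macd_signal"] = out["macd"].ewm(span=9, adjust=False).mean()

    # Price rate of change PROC(t) = (C(t) - C(t-n)) / C(t-n)
    out["proc"] = df["close"].pct_change(periods=n)

    # On-balance volume: add volume on up days, subtract on down days
    direction = np.sign(df["close"].diff()).fillna(0)
    out["obv"] = (direction * df["volume"]).cumsum()

    return out
```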
Convex Hull: The convex hull of a set of points X is the subset of points which forms the smallest convex polygon containing all the points in X. A polygon is said to be convex if a line joining any two points of the polygon also lies within the polygon. The significance of the convex hull is explained in Section 5.1.
Linear Separability: Two sets of points X0 and X1 in n-dimensional Euclidean space are said to be linearly separable if there exists an n-dimensional normal vector W of a hyperplane and a scalar k such that every point x ∈ X0 gives W^T x > k and every point x ∈ X1 gives W^T x < k.
Bagging: Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

4. DATA PREPROCESSING AND FEATURE EXTRACTION
The data set is borrowed from Yahoo Finance and includes the closing price, opening price, high, low and volume (Apple, Yahoo). The time series historical stock data is exponentially smoothed. Exponential smoothing applies more weight to recent observations and exponentially decreasing weights to past observations. The exponentially smoothed statistic of a series Y can be calculated recursively as:

S0 = Y0;   St = α · Yt + (1 − α) · St−1

where α is the smoothing factor, 0 < α < 1. Larger values of α reduce the level of smoothing. When α = 1, the smoothed statistic becomes equal to the actual observation. The smoothed statistic St can be calculated as soon as two observations are available. This smoothing removes random variation or noise from the historical data, allowing the model to easily identify the long-term price trend in the stock price behavior. Technical indicators are then calculated from the exponentially smoothed time series data and are later organized into a feature matrix. The target to be predicted for the ith day is calculated as follows:

target_i = Sign(close_{i+d} − close_i)

where d is the number of days after which the prediction is to be made. When the value of target_i is +1, it indicates that there is a positive shift in the price after d days, and -1 indicates that there is a negative shift after d days. The target_i values are assigned as labels to the ith row of the feature matrix.
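As a concrete illustration, the smoothing recursion and the target labelling above can be written as follows. This is only a sketch: the value of α, the file name and the column layout are assumptions, not settings reported in the paper.

```python
import numpy as np
import pandas as pd

def exponential_smoothing(series, alpha=0.9):
    """S_0 = Y_0; S_t = alpha * Y_t + (1 - alpha) * S_{t-1}.
    pandas' ewm with adjust=False implements exactly this recursion."""
    return series.ewm(alpha=alpha, adjust=False).mean()

def make_targets(close, d=60):
    """target_i = sign(close_{i+d} - close_i): +1 for a rise after d days, -1 for a fall."""
    return np.sign(close.shift(-d) - close)

# Usage sketch: smooth the closing prices, then label each row.
# prices = pd.read_csv("apple.csv")           # hypothetical file with a 'close' column
# smoothed = exponential_smoothing(prices["close"], alpha=0.9)
# labels = make_targets(smoothed, d=60)       # the last d rows are NaN and are dropped
```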
We typically expect to have a huge data set in order to enable the algorithm to recognize the pattern in the data. This analysis becomes cumbersome, with a high possibility of error. Machine learning proposes solutions for handling such huge data in an efficient manner. There is a range of methods, classified into metric and non-metric classifiers based on their working principle.
XGBoost is an ensemble ML technique based on the concept of the decision tree, with some advanced modifications that are designed to differentiate the performance of an XGBoost model from that of a simple decision tree model. We present a brief overview of decision trees from the perspective of XGBoost.
Technical indicators are important parameters that are calculated from time series stock data and aim to forecast financial market direction. They are tools which are widely used by investors to check for bearish or bullish signals. These technical indicators are calculated from the time series stock market data available for predicting the direction of the stock market. The technical indicators used are the RSI indicator, Stochastic Oscillator, Williams %R, Moving Average Convergence Divergence (MACD), Price Rate of Change and On Balance Volume, as defined in the previous section.
Technical indicators calculated using past observations have been used as features. Thus, the order of the dates becomes irrelevant when bagging is performed. We use these indicators that were calculated with t−n data and reuse them to predict the t+1 event. An extreme example is that we use features of day 3 and features of day 30 to predict day 45. However, doing this does not ignore the information embedded in the correlation of consecutive days. The features of the ith day are calculated using the OHLCV data of the past n days. For example, for the sake of simplicity, let us assume we have a technical indicator called F. This indicator is calculated using the closing prices of the past n days, i.e., Close(i-1), Close(i-2), ..., Close(i-n).
Feature extraction is a mechanism that computes numeric or symbolic information from the observations. The main task is to select or combine the features that preserve most of the information and remove the redundant components, in order to improve the efficiency of the subsequent classifiers without degrading their performance.
It is the process of acquiring higher-level information. The dimensionality of the feature space may be reduced by the selection of subsets of good features. Feature extraction plays an important role in improving classification performance and reducing computational complexity. It also improves computational speed, since with fewer features, fewer parameters have to be estimated.
Feature Extraction Algorithm: According to the published paper [12] on feature selection, also known as feature subset selection (FSS) or attribute selection, it is a method to select a feature subset from all the input features to make the constructed model better. In the practical application of machine learning, the quantity of features is normally very large, among which there may exist irrelevant features, or the features may depend on each other. Feature selection can remove irrelevant or redundant features, and thus decrease the number of features to improve the accuracy of the model. Selecting the really relevant features can also simplify the model and make the data generation process easier to understand for the researchers. The XGBoost algorithm is able to rank the various features based on their importance, and based on this rank the high or low importance of each feature is established.

5. METHOD
We begin by presenting a high-level workflow diagram of the method employed.

Fig.1: Illustration of the proposed methodology (workflow: Data Collection → Exponential Smoothing → Feature Extraction → Ensemble Learning → Stock Market Prediction)

The first two blocks in the flowchart are Data Collection and Exponential Smoothing, which have been discussed in Section 4. The third block, Feature Extraction, is basically the selection of features from the data set and the building of derived values so that the data becomes more informative and non-redundant. Once the features are extracted and the feature matrix is prepared, the data is used to train the algorithm, which is done through the process of ensemble learning. After the model recognizes a pattern, 20% of the data set is used to test the robustness of the model, and finally the prediction made is checked for accuracy, specificity and sensitivity, which takes place in the last block of the flowchart.
Surveying various machine learning algorithms was a key motivation, even though some methods and algorithms could have been dispensed with. This explains the reason for describing methods such as SVM or LDA even though their results are not very promising, as they fall under the category of metric classifiers as mentioned in [1]. We reiterate that any learning method is only as good as the data, and without a balanced data set there could not exist any reasonable scrutiny of the efficiency of the methods used in this manuscript or elsewhere. Non-metric classifiers, which include decision trees and boosted trees, bolster the logic behind discouraging "black-box" approaches to data analytics in the context of this problem or otherwise.
Before feeding the training data to the XGBoost model, the two classes of data are tested for linear separability by finding their convex hulls. Linear separability is a property of two sets of data points where the two sets are said to be linearly separable if there exists a hyperplane such that all the points in one set lie on one side of the hyperplane and all the points in the other set lie on the other side. The separability of the training set determines whether the hypothesis space can solve a particular binary classification problem or not. The separability test can provide a set of hypotheses (initial solutions) which can be refined to minimize the generalization error.

5.1 Test for linear separability
In order to check for linear separability, the convex hulls of the two classes are constructed. If the convex hulls intersect each other, then the classes are said to be linearly inseparable. Principal component analysis is performed to reduce the dimensionality of the extracted features to two dimensions. This is done so that the convex hulls can be easily visualized in 2 dimensions. The convex hull test reveals that the classes are not linearly separable, as the convex hulls almost overlap.

Fig.2: Convex Hull Test for Linear Separability
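A sketch of this test is given below, assuming a NumPy feature matrix X and labels y in {-1, +1}; it is an illustration under those assumptions, not the authors' original code.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import ConvexHull, Delaunay

def convex_hull_overlap(X, y):
    """Project the features to 2-D with PCA and check whether the convex hulls
    of the two classes overlap (a rough indication of linear inseparability)."""
    X2 = PCA(n_components=2).fit_transform(X)
    pos, neg = X2[y == 1], X2[y == -1]

    # Hull objects, useful mainly for plotting the two regions.
    hull_pos, hull_neg = ConvexHull(pos), ConvexHull(neg)

    # A point of one class falling inside the other class's hull is a
    # sufficient condition for the hulls to intersect, i.e. the classes
    # are not linearly separable in this 2-D projection.
    tri_pos, tri_neg = Delaunay(pos), Delaunay(neg)
    overlap = (tri_pos.find_simplex(neg) >= 0).any() or \
              (tri_neg.find_simplex(pos) >= 0).any()
    return overlap, hull_pos, hull_neg
```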

This observation leads to the conclusion that Linear Discriminant Analysis cannot be applied to classify our data and hence provides a stronger justification for why XGBoost is used.
Another important reason is that, since each decision tree in the algorithm operates on a random subspace of the feature space, it leads to the automatic selection of the most relevant subset of features.
Since the analysis shows that the data is not linearly separable, metric methods are not preferred. Therefore, the implementation of non-metric methods comes into the picture, which is discussed in the next section.

5.2 Non-Metric Classifiers
A few significant and often used non-metric classifiers include: Decision Tree, Random Forest and Xtreme Gradient Boost.
Decision Tree
Decision trees can be used for various machine learning applications. A decision tree constructs a tree that is used for classification and regression. But trees that are grown really deep to learn highly irregular patterns tend to over-fit the training set. A slight noise in the data may cause the tree to grow in a completely different manner. This is because decision trees have very low bias and high variance. Each of the nodes in the tree is split on the basis of training set attributes. Every node of the tree is then split into child nodes based on certain splitting criteria or decision rules, which determine the allegiance of a particular object (data point) to a feature class. The leaf nodes must be pure nodes once any feature vector that is to be classified reaches a leaf node. Splitting is done on the basis of the highest importance of the attribute, which is determined using Gini impurity or Shannon entropy. Information gain is calculated and the attributes are selected. One significant advantage of decision trees is that both categorical and numerical data can be dealt with; a disadvantage is that decision trees tend to over-fit the training data. In order to prevent overfitting of the model, pruning must be done while constructing the tree or after the tree is constructed.
Gini impurity is used as the function to measure the quality of the split at each node. The Gini impurity at node N is given by

g(N) = Σ_{i≠j} P(wi) P(wj)

where P(wi) is the proportion of the population with class label i. Another function which can be used to judge the quality of a split is Shannon entropy. It measures the disorder in the information content. In decision trees, Shannon entropy is used to measure the unpredictability of the information contained in a particular node of the tree (in this context, it measures how mixed the population in a node is). The entropy of a node N can be calculated as follows:

H(N) = − Σ_{i=1}^{d} P(wi) log2(P(wi))

where d is the number of classes considered and P(wi) is the proportion of the population labeled as i. Entropy is highest when all the classes are contained in equal proportion in the node. It is lowest when there is only one class present in a node (when the node is pure).
The best split is characterized by the highest gain in information or the highest reduction in impurity. The information gain due to a split can be calculated as follows:

ΔI(N) = I(N) − PL · I(NL) − PR · I(NR)

where I(N) is the impurity measure (Gini or Shannon entropy) of node N, PL is the proportion of the population in node N that goes to the left child of N after the split and, similarly, PR is the proportion of the population in node N that goes to the right child after the split. NL and NR are the left and right children of N, respectively.
can be dealt with; a disadvantage is that decision trees tend hk (x) = h(x|θk ) is a predictor of the number of training
to over-fit the training data. In order to prevent over fitting samples. y =+ − 1 is the outcome associated with input data,
of the model pruning must be done, while constructing the x for the final classification function, f (x).
tree or after the tree is constructed. Next, we describe the working of the Xgboost learner by
Gini impurity is used as the function to measure the quality exploiting the key concepts defined above.
of split in each node. Gini impurity at node N is given by
X Xtreme Gradient Boost(XGBoost)
g(N ) = P (wi )P (wj ) XGBoost is another method which comes under non-metric
i6= j
classifier family which is based on the concept of Decision
where P (wi ) is the proportion of the population with class Tree, but there are significant differences between the two.
label i. Another function which can be used to judge the XGBoost is an ensemble of decision trees wherein weighted
quality of split is Shannon Entropy. It measures the disor- combinations of predictors is taken. XGBoost works on the
der in the information content. In Decision trees, Shannon same lines of Random Forest, but there is a difference in
entropy is used to measure the unpredictability in the in- working procedures. The similarities are that the features
formation contained in a particular node of a tree (In this extracted in both the cases is completely random in nature.
context, it measures how mixed the population in a node If n is the total number of attributes in the feature matrix
is). The entropy in a node N can be calculated as follows. then lets say m is the number of attributes which are finally
chosen to determine the split at each node. Here m is related
d
X to n in the following way
H(N ) = − P (wi )log2 (P (wi )) n
i=1 m=
3
where d is number of classes considered and P (wi ) is the XGBoost basically is a collection of weak classifier decision
proportion of the population labeled as i. Entropy is the trees and it primarily focuses to train the new decision tree
highest when all the classes are contained in equal propor- to learn from the errors committed by the previous tree[s].
tion in the node. It is the lowest when there is only one class The learning trees are trained sequentially. Initially, a re-
present in a node (when the node is pure). gression function is drawn which is fitted to the data set and
The best split is characterized by the highest gain in infor- due to random plotting of regression function errors occur
mation or the highest reduction in impurity. The informa- which are referred to as residual errors. Subsequently, a
Subsequently, a plot of all the residual errors is considered and another regression function is fitted to the model; the residual errors that occur in that case are taken care of by the combination of the previous regression function and the current regression function. Continuing in this manner, the regression function becomes more and more complex in nature, and the root mean squared error is observed to be significantly reduced. The following are the basic steps followed while executing XGBoost.

5.3 Algorithm
The following steps are carried out recursively throughout the process.
Step 1: Learn a regression predictor.
Step 2: Compute the error residual.
Step 3: Learn to predict the residual.
The error rate is calculated using the parameters mentioned below. The error in prediction is given by

J = J(y, ŷ),   where   J(y, ŷ) = Σ_i (y[i] − ŷ[i])²

ŷ can be adjusted in order to reduce the error by using the following update:

ŷ[i] = ŷ[i] + α f[i],   where   f[i] ≈ −∇J(y, ŷ)

Each learner estimates the (negative) gradient of the loss function. Gradient descent is used to take a sequence of steps that reduce the loss; the resulting model is a sum of predictors weighted by the step size α. We present the proposed algorithm for XGBoost below.

Algorithm 1 Xtreme Gradient Boosting
1: procedure XtremeGradientBoost(D)        ▷ D is the labeled training data
2:   Initialize the model with a constant value:
       F0(x) = argmin_γ Σ_{i=1}^{n} L(yi, γ)
3:   for m = 1 to M do
4:     Compute the pseudo-residuals
5:     Fit a base learner to the pseudo-residuals:
6:       Ti = new DecisionTree()
7:       features_i = RandomFeatureSelection(Di)
8:       Ti.train(Di, features_i)
9:     Compute the multiplier γm
10:    Update the model
11:  output FM(x)

L(y, γ) is the differentiable loss function and hm(x) is the base learner, connected by the relation Fm(x) = Fm−1(x) + γm hm(x).
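A minimal training sketch following these steps with the xgboost library is given below. The hyperparameter values, the shuffled 80/20 split and the choice of 'rmse' as the tracked evaluation metric (to mirror the train-RMSE curves in Figs. 3 and 4) are illustrative assumptions, not the exact settings reported by the authors; the constructor-level eval_metric argument assumes a recent version of the xgboost package.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def train_xgboost(X, y, test_size=0.2, seed=42):
    """Train a boosted-tree classifier on the technical-indicator feature matrix.
    y is expected in {-1, +1}; it is mapped to {0, 1} for the classifier."""
    y01 = (y > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y01, test_size=test_size, random_state=seed)

    model = XGBClassifier(
        n_estimators=300,      # number of sequentially trained trees
        max_depth=4,           # depth of each weak learner
        learning_rate=0.1,     # step size alpha applied to each new tree
        eval_metric="rmse",    # metric tracked during training
    )
    model.fit(X_train, y_train, eval_set=[(X_train, y_train)], verbose=False)

    accuracy = (model.predict(X_test) == y_test).mean()
    return model, accuracy
```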

5.4 Analysis & Data Discovery


The following aspects deserve some discussion in the context
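A sketch of how such a ranking by Gain, Cover and Frequency can be obtained from a trained booster is shown below; the variable and feature names are assumptions for illustration, not the authors' code.

```python
from xgboost import XGBClassifier

def rank_features(model: XGBClassifier, feature_names):
    """Return per-feature Gain, Cover and Frequency (weight) scores
    from a fitted XGBClassifier."""
    booster = model.get_booster()
    booster.feature_names = list(feature_names)
    ranking = {}
    for metric in ("gain", "cover", "weight"):   # 'weight' is xgboost's name for frequency
        ranking[metric] = booster.get_score(importance_type=metric)
    return ranking

# Usage sketch (model from the previous listing):
# scores = rank_features(model, ["rsi", "stoch_k", "williams_r", "macd", "proc", "obv"])
```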
6. RESULT
The main aim of this paper is to predict the rise and fall of the stock market. Hence, as a measuring parameter, we have used +1 to indicate a rise in stock valuation in the future and -1 to indicate a fall in the prices. The following results were observed after the computation of the data set using XGBoost. We obtain the root mean squared error (RMSE) for the 60-day and 90-day predictions on the Apple Inc. data set as shown below.

Fig. 3: RMSE plot for Apple Inc. data set
Fig. 4 shows the reduction in RMSE for the 28-day, 60-day and 90-day predictions, respectively, for the Yahoo! Inc. data set for each iteration.

Fig. 4: RMSE plot for Yahoo! Inc. data set

It is clear from these graphs that there is a decreasing trend in the RMSE value as the number of iterations increases.
The parameters that are used to evaluate the robustness of a binary classifier are accuracy, precision and recall (also known as sensitivity), along with specificity. These parameters are calculated from the confusion matrix. The formulas to calculate these parameters are given below:

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

where
tp = number of true positive values,
tn = number of true negative values,
fp = number of false positive values,
fn = number of false negative values.
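In code, these four quantities follow directly from the confusion matrix. The brief sketch below assumes label arrays in {-1, +1}, with +1 as the positive class; it is an illustration, not the authors' evaluation code.

```python
import numpy as np

def classification_report(y_true, y_pred):
    """Accuracy, precision, recall and specificity from the confusion matrix,
    with +1 treated as the positive (price-rise) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```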
The accuracy results for the two data sets are tabulated below:

Days   accuracy   precision   recall     specificity
60     0.879918   0.773997    0.856182   0.890330
90     0.897095   0.756569    0.888198   0.901730

Table 1: Accuracy results for 60 & 90 days for Apple Inc.

Days   accuracy   precision   recall     specificity
28     0.9995     1.0         0.99916    1.0
60     0.99918    0.99915     0.99915    0.99920
90     0.99917    1.0         0.9982     1.0

Table 2: Accuracy results for 28, 60 & 90 days for Yahoo! Inc.

6.1 Area Under ROC Curve
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall or probability of detection [1] in machine learning. The false positive rate is also known as the fall-out or probability of false alarm [1] and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability distribution from -∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
There are four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however, if the actual value is n, then it is said to be a false positive (FP). Conversely, a true negative (TN) has occurred when both the prediction outcome and the actual value are n, and a false negative (FN) occurs when the prediction outcome is n while the actual value is p. To draw a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR defines how many correct positive results occur among all positive samples available during the test. The FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test.
A ROC space is defined by FPR and TPR as the x and y axes respectively, and depicts the relative trade-offs between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 - specificity, the ROC graph is sometimes called the sensitivity vs. (1 - specificity) plot. Each prediction result, or instance of a confusion matrix, represents one point in the ROC space. The diagonal divides the ROC space: points above the diagonal represent good classification results (better than random), and points below the line represent poor results (worse than random).
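A sketch of how ROC curves and AUC values of the kind reported below can be produced with scikit-learn is shown here; it assumes the fitted classifier from the Section 5.3 sketch and is not the authors' plotting code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(model, X_test, y_test01):
    """Plot TPR vs FPR for a fitted classifier and report the area under the curve."""
    scores = model.predict_proba(X_test)[:, 1]      # probability of the +1 (rise) class
    fpr, tpr, _ = roc_curve(y_test01, scores)
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.4f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
    plt.xlabel("False positive rate (1 - specificity)")
    plt.ylabel("True positive rate (sensitivity)")
    plt.legend()
    plt.show()
    return roc_auc
```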
We can infer that our model performed significantly well compared to non-ensemble techniques as well as other ensemble techniques, for the short-term window of 28 days as well as the long-term windows of 60 and 90 days. For the Apple data set, the time windows taken for prediction were 60 and 90 days, which gave better results compared to non-ensemble and other ensemble techniques. The four ROC curves showcase the performance of the algorithm. The diagonal line in each graph is the threshold line: the higher the performance curve is above the diagonal line, the higher the predicted accuracy, and vice versa.

7. DISCUSSION AND CONCLUSION
The algorithm used here has to be checked for its robustness. This is accomplished by checking the accuracy in predicting the result, or it can be achieved by analyzing the receiver operating characteristic (ROC) curve. It is evident that XGBoost has outperformed the metric as well as the other non-metric classifiers in accuracy.
From Fig. 9 and Fig. 10 it is clear that our model gave higher accuracy.
According to the area under the ROC curve obtained for the 60-day and 90-day predictions in Fig. 5 and Fig. 6, the results are clearly better than the previously used machine learning methods (the AUC for the 60-day prediction is 0.8435109, and for the 90-day prediction XGBoost gives an AUC of 0.7127071). Fig. 7 and Fig. 8 (the AUC for the 28-day prediction is 0.94, and for the 60-day prediction XGBoost gives an AUC of 1.0) show the same evidence. Also, the 90-day prediction on the Yahoo! data set gave an AUC of 1.0. Logistic Regression performs poorly compared to SVM and XGBoost. Also, from the bar graph, it is evident that SVM performs well with an accuracy of nearly 80%, but XGBoost nevertheless outperforms SVM in terms of accuracy (≈88%).
In the published paper [10] on ANN (Artificial Neural Networks), which predicted for the ith day, the results showed an accuracy of 81%, which is less than the accuracy obtained from our model. Likewise, in [11], the Random Forest (RF) model could predict the stock market movement direction with an accuracy of 78.81%, which is still less than XGBoost's predicted accuracy.
The remarkably high accuracy of prediction in the case of the Yahoo! Inc. data set could be a matter of concern. A natural suspicion about inherent bias in the training data could arise. However, we have checked the data set and confirm the non-existence of heavy bias in the data set. The proportions of positive and negative data are in the range of 45:55.
The literature survey helps us conclude that ensemble learning algorithms have remained unexploited in the problem of stock market prediction. We have used an ensemble learning method known as XGBoost to build our predictive model. The comparative analysis testifies to the efficacy of our model, as it outperforms the models discussed in the literature survey. We believe that this is due to the lack of proper data processing in [7][15][16]. In this paper, we have performed exponential smoothing, which is a rule-of-thumb technique for smoothing time series data. Exponential smoothing removes random variation in the data and makes the learning process easier. To our surprise, very few papers found in the literature survey exploited the technique of smoothing. Another important reason could be the inherent non-linearity in the data. This fact discourages the use of linear classifiers. However, in [15] the authors used a linear classifier as the supervised learning algorithm, which yielded a highest accuracy of 55.65%. We believe that the use of SVM in [7] and [13] is ill-advised. Due to the fact that the two classes under consideration (rise or fall) are linearly inseparable, researchers are compelled to use SVM with non-linear kernels such as the Gaussian kernel or the Radial Basis Function. Despite the many advantages of SVMs, from a practical point of view they have some limitations. From these arguments we can also conclude why the predictions by these classifiers were limited to a maximum time window of 44 days, which qualifies our learning model to surpass all these metric classifiers in terms of long-term prediction. An important practical question that is not entirely solved is the selection of the kernel function parameters - for Gaussian kernels, the width parameter σ - and the value of ε in the ε-insensitive loss function (Horvath (2003) in Suykens et al.).
Predicting the stock market is really difficult due to its non-linear, dynamic and complex nature. However, in recent years machine learning techniques have proved effective in stock forecasting. Many algorithms, such as SVM, ANN, etc., have been studied for robustness in predicting the stock market. However, ensemble learning methods have remained unexploited in this field. In this paper, we have used the XGBoost classifier to build our predictive model, and our model has produced really impressive results. The model is found to be robust in predicting the future direction of stock movement. The robustness of our model has been evaluated by calculating various parameters such as accuracy, precision, recall and specificity. For both the data sets we have used, i.e., Apple and Yahoo, we were able to achieve accuracy in the range of 87-99% for long-term prediction. ROC curves were also plotted to evaluate our model. The curves demonstrate the fidelity of our model graphically.
Our model can be used for devising new strategies for trading or for performing stock portfolio management, changing stocks according to trend predictions. In the future, we could build boosted-tree models to predict trends for really short time windows, in terms of hours or minutes. Ensembles of different machine learning algorithms can also be checked for robustness in stock prediction. We also recommend exploring the application of deep learning practices in stock forecasting, involving learning weight coefficients on large, directed and layered graphs.

Fig.5: ROC curve for 60 days (Apple Inc.). AUC is 0.8435
Fig.6: ROC curve for 90 days (Apple Inc.). AUC is 0.7127
Fig.7: ROC curve for short-term prediction of 28 days (Yahoo! Inc.). AUC is 0.94
Fig.8: ROC curve for 60 days (Yahoo! Inc.). AUC is 1.0
Fig.9: Result for long-term prediction on Apple Inc.: XGBoost beats all other predictive algorithms reported in the literature.
Fig.10: Result for long-term prediction on Yahoo! Inc.: XGBoost beats all other predictive algorithms reported in the literature by quite a margin.

The proposed model indicates, to the best of our knowledge, the nonlinear nature of the problem and the futility of using linear-discriminant-type machine learning algorithms. The accuracy reported is not pure chance but is based solidly on the understanding that the problem is not linearly separable, and hence the entire suite of SVM-type classifiers or related machine learning algorithms should not work very well. The solution approach adopted is a paradigm shift in this class of problems, and minor modifications may work very well for slight variations in the problem statement.

8. REFERENCES
[1] Das, Shom Prasad and Padhy, Sudersan. Support Vector Machines for Prediction of Future Prices in Indian Stock Market. International Journal of Computer Applications (0975-8887), March 2012.
[2] Chauhan, Bhagwant, Bidave, Umesh, Gangathade, Ajit and Kale, Sachin. Stock Market Prediction Using Artificial Neural Networks. International Journal of Computer Science and Information Technology, Vol. 5(1), 2014, 904-907.
[3] Kar, Abhishek. Stock Market Prediction using Artificial Neural Networks. Y8021.
[4] S. R. Y. Mayankkumar B Patel. Stock prediction using artificial neural network. International Journal of Innovative Research in Science, Engineering, and Technology, 2014.
[5] Mingyue Qiu and Yu Song. Predicting the Direction of Stock Market Index Movement Using an Optimized Artificial Neural Network Model. Published online 2016 May 19. doi: 10.1371/[Link].0155133.
[6] Elisa Siqueira, Thiago Otuki and Newton da Costa Jr. Stock Return and Fundamental Variables: A Discriminant Analysis Approach. Applied Mathematical Sciences, Vol. 6, 2012, no. 115, 5719-5733.
[7] Yuqing Dai and Yuning Zhang (2013). Machine Learning in Stock Price Trend Forecasting. Stanford University.
[8] Hellstrom, T. and Holmstrom, K. (1998). Predictable Patterns in Stock Returns. Technical Report Series IMa-TOM-1997-09.
[9] R. Gencay. Linear, non-linear and essential foreign exchange rate prediction with simple technical trading rules. Journal of International Economics, vol. 47, pp. 91-107, 1999.
[10] A. Timmermann and C. W. Granger. Efficient market hypothesis and forecasting. International Journal of Forecasting, vol. 20, pp. 15-27, 2004.
[11] Sadegh Bafandeh Imandoust and Mohammad Bolandraftar. Forecasting the direction of stock market index movement using three data mining techniques: the case of Tehran Stock Exchange. International Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 4, Issue 6 (Version 2), June 2014, pp. 106-117.
[12] Yuqing He, Kamaladdin Fataliyev and Lipo Wang. Feature Selection for Stock Market Analysis.
[13] Phichhang Ou and Hengshan Wang. Prediction of Stock Market Index Movement by Ten Data Mining Techniques.
[14] Khan, W., Ghazanfar, M. A., Asam, M., Iqbal, A., Ahmed, S. and Javed Ali Khan. Predicting Trend In Stock Market Exchange Using Machine Learning Classifiers. [Link] (Lahore), 28(2), 1363-1367, 2016.
[15] Haoming Li, Zhijun Yang and Tianlun Li (2014). Algorithmic Trading Strategy Based On Massive Data Mining. Stanford University.
[16] Xinjie (2014). Stock Trend Prediction With Technical Indicators using SVM. Stanford University.

