IEEE 2021 DQL Extracting Decision Tree
Abstract—Deep reinforcement learning (DRL) has achieved promising results on traffic signal control systems. However, due to the complexity of the decisions of deep neural networks, it is a great challenge to explain and visualize the policy of reinforcement learning (RL) agents. The decision tree can provide useful information for experts responsible for making reliable decisions. In this paper, we employ decision trees to extract models with readable interpretations from the expert policy obtained by DRL methods. We evaluate our methods via a single-intersection traffic signal control task on the Simulation of Urban MObility (SUMO) platform. The experimental results demonstrate that the extracted decision trees can be used to understand the learning process and the learned optimal policy of the DRL methods.

Index Terms—Deep reinforcement learning, decision tree, traffic signal control systems

I. INTRODUCTION

With increasing traffic volumes and limited road capacities, inefficient traffic light cycle control causes numerous problems, such as long delays and wasted energy. To address the traffic congestion problem, researchers have developed intelligent transportation systems (ITSs) that utilize machine learning techniques to control traffic flow based on collected traffic information [1], [2].

Recent developments in sensing and communication technologies have brought great progress in the field of ITSs, owing to the ability to collect traffic data in real time, especially in traffic signal control systems (TSCS) [3], [4]. Many algorithms have been proposed for systems operating in TSCS. Fuzzy logic methods were first proposed to relieve traffic congestion at a single intersection [5] and at multiple intersections [6]. Expert systems were designed to solve congested traffic tasks with a knowledge-based methodology [7]. Besides, techniques inspired by biological behavior, such as genetic algorithms [8], swarm optimization algorithms [9], and ant colony optimization algorithms [10], have been employed to solve traffic congestion. These methods need prior knowledge of the system model and must consider an extensive range of possible situations in advance to mimic the cognitive process of human experts in solving decision-making problems. Although most of these methods are good at explainability, their performance may not be sufficient to meet the requirements due to the complexity of high-dimensional problems.

In recent years, the notion of eXplainable Artificial Intelligence (XAI), and especially eXplainable reinforcement learning (RL), has attracted renewed interest with the recent successes of deep neural networks. There is a great demand for non-expert users to be able to trust and manage intelligent systems. A major factor contributing to the tension between performance and interpretability is the increasing opacity of machine learning models, especially neural networks defined by complex structures and a large number of parameters [11]. Currently, many state-of-the-art deep reinforcement learning (DRL) methods employ neural networks to approximate their functions. With stronger function approximators, DRL has shown great promise in various domains, e.g., complex high-dimensional games [12], [13] and high-dimensional continuous control tasks [14]. It may be acceptable in some cases to give up explanation capability in favor of RL performance, especially in video games or board games. However, in many other practical applications, especially medicine, law, finance, and transportation, assurance about the reasoning behind these complex models is necessary to reduce decision risk. Generally, the real context is simulated by oversimplifying complex dynamics when formulating the digital environment.

Compared to reinforcement learning, decision trees are regarded as models with readable interpretations for humans, since they visualize the decision path [15]. However, decision trees suffer from low accuracy because of their weak expressivity. The soft decision tree method creates a type of soft decision tree by distilling a trained neural network, which generalizes better than a tree learned directly from the training data [16]. Besides, verifiable reinforcement learning via policy extraction (VIPER) achieves performance equal to that of the original policy [17]. Hence, we employ the VIPER method to improve the interpretability of the trained DRL model for traffic signal control systems.
In this paper, we propose an algorithm for extracting decision trees from trained reinforcement learning agents in TSCS to improve interpretability. First, we employ the Deep Q-Network (DQN) algorithm to learn the expert policy in TSCS. Second, we utilize the expert policy model to collect state-action pairs by interacting with the TSCS. Third, we use decision tree methods, e.g., Iterative Dichotomiser 3 (ID3) and Classification and Regression Trees (CART), to fit the dataset of collected state-action pairs and thereby explain the expert policy. Finally, we evaluate the proposed method on a single-intersection traffic signal control task on the Simulation of Urban MObility (SUMO) platform.

II. PRELIMINARIES

A. Reinforcement learning

We model the traffic signal control task as a Partially Observable Markov Decision Process (POMDP). A POMDP is composed of six components: 1) a set of real states Z; 2) a set of observed states S; 3) a set of actions A; 4) a state transition function P; 5) a reward function R; and 6) a discount factor γ ∈ [0, 1). It can be represented as a tuple (Z, S, A, P, R, γ).

A policy, π : S × A → [0, 1], is defined as a mapping from the state space S to the action space A obtained by interacting with the environment, and we have

$$\sum_{a \in A} \pi(a \mid s) = 1, \quad \forall s \in S. \tag{1}$$

The goal is to obtain an optimal policy

$$\pi^{*} = \arg\max_{\pi} J(\pi), \tag{2}$$

which maximizes the discounted reward function

$$J(\pi) = \mathbb{E}_{\tau \sim \pi(\tau)}\big[r(\tau)\big] = \mathbb{E}_{\tau \sim \pi(\tau)}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right], \tag{3}$$

where τ = (s_0, a_0, s_1, a_1, ...) is the learning episode, $\pi(\tau) = p(s_0)\prod_{t=0}^{\infty}\pi(a_t \mid s_t)\,p(s_{t+1} \mid s_t, a_t)$, and r_t is the immediate reward received on the transition from s_t to s_{t+1} under action a_t.

Deep Q-Network (DQN) is a variation of the classic Q-learning algorithm which utilizes a deep neural network with parameters θ to approximate the Q-function Q(s, a) [12], i.e.,

$$Q(s, a) \approx Q(s, a; \theta). \tag{4}$$

It learns the action-value function Q* corresponding to the optimal policy by minimizing the loss

$$L(\theta) = \mathbb{E}_{s,a,r,s'}\!\left[\big(Q^{*}(s, a \mid \theta) - y\big)^{2}\right], \tag{5}$$

$$y = r + \gamma \max_{a'} Q^{*}(s', a'), \tag{6}$$

where y is the target Q-value, whose parameters are periodically updated with the most recent θ to help stabilize the learning process. In general, the agent trains the neural network so that it predicts the cumulative rewards for all actions. After training the network, we obtain the optimal policy by selecting the action with the largest Q-value.
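To make Eqs. (4)-(6) concrete, the following is a minimal PyTorch-style sketch of the DQN loss with a periodically updated target network. The network objects and batch layout are assumptions for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One-minibatch DQN loss following Eqs. (5)-(6).

    `q_net` and `target_net` are assumed to map a batch of states to
    Q-values for every action; `batch` holds (s, a, r, s_next, done),
    where `done` is a 0/1 float mask. The target network's parameters
    are copied from `q_net` only periodically (not shown), which
    stabilizes learning.
    """
    s, a, r, s_next, done = batch
    # Q(s, a; theta) for the actions that were actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y = r + gamma * max_a' Q(s', a'), evaluated with the target network.
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    return F.mse_loss(q_sa, y)
```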
B. Decision tree: ID3 and CART

Decision trees provide a white-box structure for approximating the target function, in which the learned function is represented by a binary tree. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The regression or classification rule is generated as the tree is traversed down from the root to a leaf node. Decision tree algorithms split each node recursively. Iterative Dichotomiser 3 (ID3) [18] selects the attribute with the highest information gain as the split attribute, using the entropy function and the information gain:

$$\mathrm{Entropy} = -\sum_{i=1}^{m} p_i \log_2 p_i, \tag{7}$$

where p_i is the probability of class i and m is the number of classes of the target attribute.

The Classification and Regression Trees (CART) method [19] measures the divergence between the probability distributions of the target attribute's values with the Gini index:

$$\mathrm{Gini\ index} = 1 - \sum_{i=1}^{k} p_i^{2}, \tag{8}$$

where p_i is the probability of class i. The tree recursively selects the attribute with the lowest Gini index [20].
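The two impurity measures in Eqs. (7) and (8) can be computed directly from the empirical class distribution. The short NumPy sketch below illustrates those formulas; it is an illustration, not code from this work.

```python
import numpy as np

def entropy(labels):
    """Entropy of Eq. (7): -sum_i p_i * log2(p_i) over the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini_index(labels):
    """Gini index of Eq. (8): 1 - sum_i p_i^2 over the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())
```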
III. EXTRACTING DECISION TREE FROM TRAINED DRL

In this section, we present the framework for extracting decision trees from trained DRL, with specific implementations of obtaining the expert policy model, getting trajectories from the environment, and extracting the decision tree.

A. Overall architecture

We focus on the interpretability of deep reinforcement learning (DRL) in the traffic signal control system. Reinforcement learning techniques promise to produce traffic signal controllers that perceive, learn, decide, and act on their own. However, they are unable to explain their decisions and actions to human users. We propose extracting a decision tree from the trained DRL model in the traffic signal control system to improve the interpretability of DRL.

The whole learning process of our method is shown in Fig. 1. First, we employ the DQN algorithm to learn the expert policy π* for traffic signal control. Considering that different quality levels of the expert policy may have different effects on the collected state-action pair dataset, we save the expert policy once the total reward reaches a high level with only tiny fluctuation. Next, after obtaining the expert policy, we employ the policy π* to collect data by interacting with the environment for a few episodes. To better explain the expert policy, we simulate all the cases in the same traffic signal control task. Finally, we employ the decision tree algorithm to fit the collected dataset and explain the DRL model, disregarding its inner processing or internal representations.

Fig. 1. Overall architecture of extracting decision trees from the trained deep Q-network (traffic light control task → deep Q-network → expert policy π* → collect the trajectories → dataset → extract the decision tree → explainable policy).

B. Obtaining the expert policy model

In this stage, we can employ any learning method to obtain the expert policy. For example, it could be created by any traffic signal control algorithm (e.g., evolutionary algorithms or artificial neural networks) or by a reinforcement learning algorithm. The core task of this stage is to use an existing method to obtain a traffic signal control policy. The expert policy, π* : S × A → [0, 1], defines how a model or agent interacts with the environment by mapping perceived environmental states to actions, with $\sum_{a \in A} \pi(a \mid s) = 1, \forall s \in S$. That is, at each step t, the traffic signal controller observes state s_t, queries the expert policy π* to obtain action a_t, receives a new observation s_{t+1} after taking action a_t, and repeats.

To obtain the expert policy, we employ DRL, specifically the DQN algorithm, which combines Q-learning and deep neural networks.¹ The DQN algorithm is widely used for discrete control problems. When the total reward of the agent gets close to the optimum without diverging, we consider the learned policy to be efficient and of sufficient quality; as sketched below, this convergence criterion also determines when the expert policy is saved.

¹ See Appendix A for more details of the DQN algorithm.
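One way to implement the rule of saving the expert policy when the total reward is high and stable is to monitor a sliding window of episode returns. The sketch below is an illustrative assumption; the window size and thresholds are placeholders rather than values reported in this work.

```python
from collections import deque

class StableRewardMonitor:
    """Signal when DQN training has converged well enough to save pi*.

    The policy is considered ready once the mean return over the last
    `window` episodes exceeds `target` while its standard deviation
    stays below `tol`. All three values are illustrative placeholders.
    """
    def __init__(self, window=20, target=-500.0, tol=25.0):
        self.returns = deque(maxlen=window)
        self.target = target
        self.tol = tol

    def update(self, episode_return):
        self.returns.append(episode_return)

    def should_save(self):
        if len(self.returns) < self.returns.maxlen:
            return False
        mean = sum(self.returns) / len(self.returns)
        std = (sum((r - mean) ** 2 for r in self.returns) / len(self.returns)) ** 0.5
        return mean >= self.target and std <= self.tol
```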
C. Getting trajectories

After obtaining the expert policy π*, we are still unable to understand why the traffic signal control system takes its decisions and actions. To better explain the expert policy, we employ a policy extraction method to transform the expert policy into a different format, which could be a smaller neural network or another representation such as a decision tree. There are many methods for replicating the expert policy. Since the VIPER method fits a single decision tree regardless of the internal structure of the expert policy, we employ the VIPER method to implement policy extraction.

The VIPER method obtains trajectories by executing the expert policy π* in the environment E. The agent observes state s_t and then takes action a_t by querying the expert policy π* at every time step. We obtain a state-action pair (s_t, a_t), which we save for fitting a decision tree. The agent then transitions to a new state s_{t+1} by executing the action a_t, and the process repeats until reaching the terminal state. We record the sequence of states and actions as the trajectory T = {(s_1, a_1), (s_2, a_2), ..., (s_t, a_t), s_terminal}. Thus we obtain a dataset of state-action pairs by interacting with the environment using the expert policy for each episode.
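The rollout step described above can be sketched as follows. The Gym-style env.reset()/env.step() interface and the expert_policy(state) callable are assumed wrappers around the traffic simulation, not the exact code used in this work.

```python
import numpy as np

def collect_trajectories(env, expert_policy, n_episodes=10):
    """Roll out the expert policy and record (state, action) pairs.

    Returns arrays X (states) and y (actions) that can later be used
    as features and labels when fitting a decision tree.
    """
    states, actions = [], []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = expert_policy(state)   # query the expert policy pi*
            states.append(state)
            actions.append(action)
            state, reward, done, info = env.step(action)
    return np.array(states), np.array(actions)
```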
D. Extracting the decision tree

The dataset is then used to learn a decision tree with a supervised learning method, where the states and actions serve as features and labels, respectively. The decision tree is built by splitting the dataset with the CART and ID3 methods, starting from the root node of the tree and selecting the successor children.² To measure the divergence between the probability distributions of the features, we employ the Gini index as an impurity-based criterion. Besides, we utilize pre-pruning to prevent the generation of unimportant branches, which would overfit the training data and generalize poorly to test data. Compared to neural networks, a decision tree can explain how it arrives at a decision. The obtained decision tree is a binary tree in which the branch nodes represent conditions on the input features (i.e., the state space) and the leaf nodes represent the action labels.

² See Appendices B and C for more details of the ID3 and CART algorithms, respectively.
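A minimal scikit-learn sketch of this extraction step is shown below, using the Gini criterion and a max_depth limit as the pre-pruning rule; the depth value is illustrative and the helper name is ours.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_tree(states, actions, max_depth=5):
    """Fit a CART-style tree to the expert's state-action pairs.

    States are the features and actions are the labels; the Gini
    criterion is the impurity measure and max_depth acts as the
    pre-pruning rule (the depth value here is illustrative).
    """
    tree = DecisionTreeClassifier(criterion="gini", max_depth=max_depth)
    tree.fit(states, actions)
    return tree

# Usage sketch: tree = extract_tree(X, y); print(export_text(tree))
```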
E. Integrated Algorithm

The integrated algorithm for extracting the decision tree from DRL is summarized in Algorithm 1. First, in the traffic signal control task, the agent obtains the expert policy π* by using the DQN algorithm (Line 3). Second, we employ the expert policy π* to generate the state-action pair dataset by interacting with the environment (Lines 4-12). Third, after obtaining the dataset, we employ the decision tree methods to fit the expert policy and improve the explainability of the DRL model (Lines 13 and 14).

Algorithm 1: Extracting Decision Tree from DRL
Input: Learning rate α; replay memory size M; ε-greedy exploration rate ε; minibatch size B; discount factor γ.
Output: Expert policy π*, decision tree.
1  Initialize parameters θ, θ' of the primary and target neural networks with random values.
2  Initialize the replay memory m to be empty.
3  π* ← expert policy obtained using the DQN algorithm.
4  for each episode, while the total reward keeps a high and stable performance do
5      Initialize state s.
6      for each step η of the episode, up to T_max do
7          Choose an action a according to the expert policy π*.
8          Take action a; observe reward r' and new state s'.
9          Store the state-action pair (s, a) into the dataset D.
10         s ← s'
11     end
12 end
13 Clean the data.
14 Learn the decision tree from D using the CART or ID3 algorithm.

Fig. 2. Cases A, B, C and D are the four possible actions.
IV. EXPERIMENT

In this section, we conduct experiments on a single-intersection traffic signal control task on the Simulation of Urban MObility (SUMO) platform [21]. The SUMO platform provides real-time visualization of the network and the simulated entities. It also offers various vehicle behaviour models, roads with several lanes, and traffic lights. Besides, SUMO supports built-in computational frameworks for RL and provides APIs for road network design, traffic volume simulation, and traffic light control. Specifically, SUMO can control the timing of traffic signals according to a given policy. In addition, real-world road networks can be imported from OpenStreetMap, and vehicles can enter the system from any position in the network.

• [...] an all-yellow phase, lasting t_y seconds, appears before the second action is taken. Here, we prefer a duration of t_y = 2 s to ensure a safe switch between two different consecutive phases.
• Reward setting: The reward is formulated as follows:

$$r_t^{h} = -\sum_{i_h \in \varepsilon,\; l \in L_{i_h}} \left( \mathrm{queue}_{t+\Delta t}[l] + a \cdot \mathrm{wait}_{t+\Delta t}[l] \right), \tag{9}$$

where a is a trade-off coefficient, and wait and queue measure the cumulative delay of the first vehicle and the queue length along each incoming lane, respectively. We set Δt = 5 s, which means the interaction between the RL agent and the traffic environment occurs every 5 s; a sketch of this computation is given below.
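As a sketch of how the reward in Eq. (9) could be computed at each interaction step, the helper below sums queue lengths and waiting times over the incoming lanes. The queue_length and waiting_time callables stand in for the corresponding simulator queries (with TraCI these would be lane-level measurements), and the coefficient value is illustrative.

```python
def compute_reward(incoming_lanes, queue_length, waiting_time, a=0.5):
    """Reward of Eq. (9): negative weighted sum of queue length and
    cumulative waiting time over all incoming lanes.

    `queue_length` and `waiting_time` are callables returning the
    measurements for a lane id; `a` is the trade-off coefficient
    (0.5 is an illustrative value, not one reported in the paper).
    """
    return -sum(queue_length(l) + a * waiting_time(l) for l in incoming_lanes)
```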
[...] the E-W lanes, the decision tree model gives action 2, which means executing the E-W straight phase to reduce the traffic waiting time. To present the explainability of the expert policy more visually, we set the max depth of the CART algorithm to 3. As shown in Fig. 4, we can easily understand the decision process in this reduced case.

Fig. 3. The performance of the tested methods implemented by DQN, ID3 and CART in a single-intersection traffic signal control task.

While lower complexity brings more understandability, the accuracy of the decision tree slips. As shown in Fig. 5, CART achieves better performance as its max depth increases. The decision tree with max depth 5 is specified by 190 parameters, which means this tree requires only about 13% of the parameters needed to specify the DQN. Besides, with fewer parameters to learn, the decision tree methods require less training time than the DQN. Generally, an appropriate tree depth setting gives a better explanation without losing performance.

Fig. 5. The performance of the CART methods implemented with different max tree depth settings in a single-intersection traffic signal control task.
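As an aside, a tree like the one in Fig. 4 can be rendered and roughly sized with scikit-learn. The sketch below assumes a counting convention of one feature index and one threshold per internal node plus one class-count vector per leaf; this is our assumption and not necessarily how the 190-parameter figure above was obtained.

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

def show_and_size_tree(tree: DecisionTreeClassifier):
    """Plot an extracted tree (as in Fig. 4) and print a rough parameter count."""
    plot_tree(tree, filled=True)   # each box shows the split, gini, samples, value
    plt.show()

    n_internal = int((tree.tree_.children_left != -1).sum())
    n_leaves = int(tree.get_n_leaves())
    # Assumed convention: (feature index + threshold) per internal node,
    # one class-count vector per leaf.
    n_params = 2 * n_internal + n_leaves * tree.n_classes_
    print(f"internal nodes={n_internal}, leaves={n_leaves}, ~{n_params} parameters")
```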
V. CONCLUSION
Fig. 4. Demonstration of how a CART makes a decision given the information of the state; the max depth of the CART is set to 3. (Root node: state[49] ≤ 0.5, gini = 0.704, samples = 13852, value = [3713, 5347, 3612, 1180], class = action 1.)
REFERENCES

[4] S. G. Shelby, D. M. Bullock, D. Gettman, R. S. Ghaman, Z. A. Sabra, and N. Soyke, "An overview and performance evaluation of ACS Lite–a low cost adaptive signal control system," in Transp. Res. Board, vol. 190, 2008, pp. 130–137.
[5] C. P. Pappis and E. H. Mamdani, "A fuzzy logic controller for a traffic junction," IEEE Trans. Syst., Man, Cybern., vol. 7, no. 10, pp. 707–717, 1977.
[6] W. Choi, H. Yoon, K. Kim, I. Chung, and S. Lee, "A traffic light controlling FLC considering the traffic congestion," in Proc. AFSS Int. Conf. Fuzzy Syst. Springer, 2002, pp. 69–75.
[7] N. V. Findler, S. Surender, and Ş. Catrava, "On-line decisions about permitted/protected left-hand turns in distributed traffic signal control," Eng. Appl. Artif. Intell., vol. 10, no. 3, pp. 315–320, 1997.
[8] Z. Liu, "A survey of intelligence methods in urban traffic signal control," Int. J. Comput. Sci. Netw. Secur., vol. 7, no. 7, pp. 105–112, 2007.
[9] C. Dong, Z. Liu, and X. Liu, "Chaos-particle swarm optimization algorithm and its application to urban traffic control," Int. J. Comput. Sci. Netw. Secur., vol. 6, no. 1B, pp. 97–101, 2006.
[10] Y. Wen and T. Wu, "Reduced-order rolling horizon optimization of traffic control based on ant algorithm," J. Zhejiang Univ. Eng. Sci., vol. 39,
[12] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[13] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Int. Conf. Mach. Learn. PMLR, 2015, pp. 1889–1897.
[15] Z. Wang, C. Chen, H.-X. Li, D. Dong, and T.-J. Tarn, "Incremental reinforcement learning with prioritized sweeping for dynamic environments," IEEE/ASME Trans. Mechatronics, vol. 24, no. 2, pp. 621–632, 2019.
[16] N. Frosst and G. Hinton, "Distilling a neural network into a soft decision tree," arXiv preprint arXiv:1711.09784, 2017.
[17] O. Bastani, Y. Pu, and A. Solar-Lezama, "Verifiable reinforcement learning via policy extraction," arXiv preprint arXiv:1805.08328, 2018.
[18] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.
[19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Routledge, 2017.
[20] M. Bramer, Principles of Data Mining. Springer, 2007, vol. 180.
[21] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO–simulation of urban mobility: an overview," in Proc. 3rd Int. Conf. Adv. Syst. Simul. ThinkMind, 2011.
[22] L. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 412–421, 2010.
[23] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transp. Res. Pt. C-Emerg. Technol., vol. 85, pp. 732–752, 2017.

Algorithm 3: Iterative Dichotomiser 3 (ID3)
Input: Attribute set A = {a_1, a_2, ..., a_d}; training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}.
Output: A decision tree with node as the root.
1  Function TreeGenerate(D, A):
2      Generate a node for the tree.
3      if all examples in D belong to the same class C then
4          Label node with class C and return node.
5      else if A is empty then
6          Label node with the most common class in D and return node.
7      else
8          Let a* be the attribute in A that maximizes the information gain.
9          Set a* as the decision attribute for node.
10         for each possible value v of a* do
11             Add a branch below node, testing for a* = v.
12             Let D_v be the subset of examples in D with a* = v.
13             if D_v is empty then
14                 Add a leaf below the branch, labeled with the most common class in D.
15             else
16                 Add the subtree TreeGenerate(D_v, A \ {a*}) below the branch.
17             end
18         end
19     end
20 end

Algorithm 4: Classification and Regression Trees (CART)
Input: Attribute set A = {a_1, a_2, ..., a_d}; training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}.
Output: A decision tree with node as the root.
1  Function TreeGenerate(D, A):
2      Generate a node for the tree.
3      if all stopping criteria have been met then
4          Label node with the most common class in D and return node.
5      else
6          Find the attribute a ∈ A that best splits D according to the impurity function.
7          Label node with a.
8          for each possible value v of a do
9              Let D_v be the subset of D for which a = v.
10             Grow the subtree TreeGenerate(D_v, A).
11             Connect the subtree to node with a branch labeled v.
12         end
13     end
14 end
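For concreteness, the split-selection step shared by Algorithms 3 and 4 can be written as a small Python function: ID3 picks the attribute with the highest information gain, while CART picks the one with the lowest weighted Gini index. This sketch reuses the entropy and gini_index helpers from the Section II-B example and assumes categorical attributes.

```python
import numpy as np

def best_split(X, y, criterion="gini"):
    """Choose the attribute that ID3/CART would split on.

    For 'entropy' (ID3) the attribute with the highest information gain
    is selected; for 'gini' (CART) the one with the lowest weighted Gini
    index of its children. X holds categorical attribute values;
    entropy() and gini_index() are the helpers defined in Section II-B.
    """
    impurity = entropy if criterion == "entropy" else gini_index
    parent = impurity(y)
    best_attr, best_score = None, None
    for j in range(X.shape[1]):
        values, counts = np.unique(X[:, j], return_counts=True)
        weighted = sum(c / len(y) * impurity(y[X[:, j] == v])
                       for v, c in zip(values, counts))
        score = parent - weighted if criterion == "entropy" else weighted
        better = best_score is None or (
            score > best_score if criterion == "entropy" else score < best_score)
        if better:
            best_attr, best_score = j, score
    return best_attr
```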