
Extracting Decision Tree from Trained Deep Reinforcement Learning in Traffic Signal Control

Yuanyang Zhu, Xiao Yin, Ruyu Li, and Chunlin Chen
Department of Control and Systems Engineering, Nanjing University, Nanjing, China

2021 International Conference on Cyber-Physical Social Intelligence (ICCSI) | 978-1-6654-2621-3/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICCSI53130.2021.9736263

Abstract—Deep reinforcement learning (DRL) has achieved promising results on traffic signal control systems. However, due to the complexity of the decisions of deep neural networks, it is a great challenge to explain and visualize the policy of reinforcement learning (RL) agents. Decision trees can provide useful information for experts responsible for making reliable decisions. In this paper, we employ decision trees to extract models with readable interpretations from the expert policy obtained by DRL methods. We evaluate our methods on a single-intersection traffic signal control task on the Simulation of Urban MObility (SUMO) platform. The experimental results demonstrate that the extracted decision trees can be used to understand the learning process and the learned optimal policy of the DRL methods.

Index Terms—Deep reinforcement learning, decision tree, traffic signal control systems.

I. INTRODUCTION

With increasing traffic volumes and limited road capacities, inefficient traffic light cycle control causes numerous problems, such as long delays and wasted energy. To address the traffic congestion problem, researchers have developed intelligent transportation systems (ITSs) that utilize machine learning techniques to control traffic flow based on collected traffic information [1], [2].

Recent developments in sensing and communication technologies have brought great progress in the field of ITSs, owing to the ability to collect traffic data in real time, especially in traffic signal control systems (TSCS) [3], [4]. Many algorithms have been proposed for systems operating in TSCS. Fuzzy logic methods were first proposed to relieve traffic congestion at a single intersection [5] and at multiple intersections [6]. Expert systems were designed to solve congested traffic tasks with a knowledge-based methodology [7]. Besides, techniques inspired by biological behavior, such as genetic algorithms [8], swarm optimization algorithms [9] and ant colony optimization algorithms [10], have been employed to relieve traffic congestion. These methods require prior knowledge of the system model and must consider extensive possible situations in advance to mimic the cognitive process of human experts in solving decision-making problems. Although most of these methods are good at explainability, their performance may not be sufficient to meet the requirements due to the complexity of high-dimensional problems.

In recent years, the notion of eXplainable Artificial Intelligence (XAI), and especially eXplainable reinforcement learning (RL), has attracted renewed interest following the recent successes of deep neural networks. There is a great demand for non-expert users to be able to trust and manage intelligent systems. A major factor contributing to the tension between performance and interpretability is the increasing opacity of machine learning models, especially neural networks defined by complex structures and a large number of parameters [11]. Currently, many state-of-the-art deep reinforcement learning (DRL) methods employ neural networks as function approximators. With stronger function approximators, DRL has shown great promise in various domains, e.g., complex high-dimensional games [12], [13] and high-dimensional continuous control tasks [14]. Trading explanation capability for RL performance may be acceptable in some cases, especially in video games or board games. However, in many other practical applications, especially medicine, law, finance, and transportation, assurance about the reasoning behind these complex models is necessary to reduce decision risk. Generally, the real context is simulated by oversimplifying complex dynamics when building the digital environment.

Compared to reinforcement learning, decision trees are regarded as models with readable interpretations for humans, since the decision path can be visualized [15]. However, decision trees suffer from low accuracy due to their weak expressivity. The soft decision tree method creates a type of soft decision tree by using a trained neural network, which generalizes better than a tree learned directly from the training data [16]. Besides, verifiable reinforcement learning via policy extraction (VIPER) achieves performance equal to that of the original policy [17]. Hence, we employ the VIPER method to improve the interpretability of the trained DRL model for traffic signal control systems.

In this paper, we propose an algorithm for extracting decision trees from trained reinforcement learning in TSCS to improve interpretability. First, we employ the Deep Q-network (DQN) algorithm to learn the expert policy in TSCS. Second, we utilize the expert policy model to collect state-action pairs by interacting with the TSCS. Third, we use decision tree methods, e.g., iterative dichotomiser 3 (ID3) and classification and regression trees (CART), to fit the dataset of collected state-action pairs and thereby explain the expert policy. Finally, we evaluate the proposed method on a single-intersection traffic signal control task on the Simulation of Urban MObility (SUMO) platform.

II. PRELIMINARIES

A. Reinforcement learning

We model the traffic signal control task as a Partially Observable Markov Decision Process (POMDP). A POMDP is composed of six components: 1) a set of real states Z; 2) a set of observed states S; 3) a set of actions A; 4) a state transition function P; 5) a reward function R; 6) a discount factor γ ∈ [0, 1). It can be represented as a tuple (Z, S, A, P, R, γ).

A policy, π : S × A → [0, 1], is defined as a mapping from the state space S to the action space A by interacting with the environment, and we have

\sum_{a \in A} \pi(a|s) = 1, \quad \forall s \in S.   (1)

The goal is to obtain an optimal policy

\pi^* = \arg\max_{\pi} J(\pi),   (2)

which maximizes the discounted reward function

J(\pi) = \mathbb{E}_{\tau \sim \pi(\tau)}[r(\tau)] = \mathbb{E}_{\tau \sim \pi(\tau)}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],   (3)

where τ = (s_0, a_0, s_1, a_1, ...) is the learning episode, \pi(\tau) = p(s_0) \prod_{t=0}^{\infty} \pi(a_t|s_t) p(s_{t+1}|s_t, a_t), and r_t is the immediate reward received on the transition from s_t to s_{t+1} under action a_t.

Deep Q-network (DQN) is a variation of the classic Q-learning algorithm, which utilizes a deep learning architecture with parameters θ to approximate the Q-function Q(s, a) [12], i.e.,

Q(s, a) \approx Q(s, a; \theta).   (4)

It learns the action-value function Q^* corresponding to the optimal policy by minimizing the loss

L(\theta) = \mathbb{E}_{s,a,r,s'}\left[\left(Q^*(s, a \mid \theta) - y\right)^2\right],   (5)

y = r + \gamma \max_{a'} Q^*(s', a'),   (6)

where y is the target Q-value, whose parameters are periodically updated with the most recent θ to help stabilize the learning process. In general, the agent trains the neural network so that it predicts the cumulative reward for every action. After training the network, we obtain the optimal policy by selecting the action with the largest Q-value.

B. Decision tree: ID3 and CART

Decision trees provide a white-box structure for approximating the target function, in which the learned function is represented by a binary tree. A decision tree is a flowchart-like structure in which each internal node represents a test of an attribute, each branch represents the result of the test, and each leaf node represents a class label. The regression and classification rules are generated as the tree is traversed from the root down to the leaf nodes. According to decision tree algorithms, we can split each node recursively. Iterative dichotomiser 3 (ID3) [18] selects the attribute with the highest information gain as the split attribute, using the entropy function and information gain:

\text{Entropy} = -\sum_{i=1}^{m} p_i \log_2 p_i,   (7)

where p_i is the probability of class i and m is the number of classes of the target attribute.

The classification and regression trees (CART) method [19] measures the divergence between the probability distributions of the target attribute's values with the Gini index:

\text{Gini index} = 1 - \sum_{i=1}^{k} p_i^2,   (8)

where p_i is the probability of class i. The tree recursively selects the attribute with the lowest Gini index [20].
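As a concrete reading of Eqs. (7) and (8), the following Python sketch computes the entropy and Gini index of a vector of action labels. The function names and the toy label vector are ours, not part of the paper.

import numpy as np

def entropy(labels):
    """Entropy of a label vector, Eq. (7): -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini_index(labels):
    """Gini index of a label vector, Eq. (8): 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Example: four samples of action 1 and two of action 2.
actions = [1, 1, 1, 1, 2, 2]
print(entropy(actions))     # ~0.918 bits
print(gini_index(actions))  # ~0.444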
III. EXTRACTING DECISION TREE FROM TRAINED DRL

In this section, we present the framework of extracting decision trees from trained DRL, with specific implementations of obtaining the expert policy model, getting trajectories from the environment, and extracting the decision tree.

A. Overall architecture

We focus on the interpretability of deep reinforcement learning (DRL) in the traffic signal control system. Reinforcement learning techniques promise to produce traffic signal controllers that perceive, learn, decide, and act on their own. However, they are unable to explain their decisions and actions to human users. We propose extracting a decision tree from the trained DRL model in the traffic signal control system to improve the interpretability of DRL.

The whole learning process of our method is shown in Fig. 1. First, we employ the DQN algorithm to learn the expert policy π* for traffic signal control. Considering that different levels of the expert policy may have different effects on the collected state-action pairs dataset, we save the expert policy when the total reward reaches a high level with only tiny fluctuations. Next, after obtaining the expert policy, we employ the policy π* to collect data by interacting with the environment for a few episodes. To better explain the expert policy, we simulate all the cases in the same traffic signal control task. Finally, we employ the decision tree algorithm to fit the collected dataset and explain the DRL model, disregarding its inner processing or internal representations.

Fig. 1. Overall architecture of extracting decision trees from the trained deep Q-network. (The diagram shows the pipeline: traffic light control task → Deep Q-network → expert policy π* → collect the trajectories → dataset → extract the decision tree → explainable policy.)

B. Obtaining the expert policy model

In this stage, we can employ any learning method to obtain the expert policy. For example, it could be created by utilizing any traffic signal control algorithm (e.g., evolutionary algorithms and artificial neural networks) or any reinforcement learning algorithm. The core task of this stage is to use an existing method to obtain a traffic signal control policy. The expert policy, π* : S × A → [0, 1], defines how a model or object interacts with the environment by mapping from perceived environmental states to actions, with \sum_{a \in A} \pi(a|s) = 1, ∀s ∈ S. That is, at each step t, the traffic signal controller observes state s_t, queries the expert policy π* to obtain action a_t, receives a new observation s_{t+1} by taking action a_t, and repeats.

To obtain the expert policy, we employ DRL, specifically the DQN algorithm, which combines Q-learning and deep neural networks (see Appendix A for more details of the DQN algorithm). This DRL algorithm is widely used for discrete control problems. When the total reward of the agent gets close to the optimal value without diverging, we consider the learned policy to have reached sufficient efficiency and quality.

C. Getting trajectories

After obtaining the expert policy π*, we are still unable to understand why the traffic signal control system takes particular decisions and actions. To better explain the expert policy, we employ a policy extraction method that transforms the expert policy into a different format, which could be a smaller neural network or another representation such as a decision tree. There are many methods to replicate the expert policy. Considering that the VIPER method fits a single decision tree regardless of the internal structure of the expert policy, we employ the VIPER method to implement policy extraction.

The VIPER method obtains trajectories by executing the expert policy π* in the environment E. The agent observes state s_t and then takes action a_t by querying the expert policy π* at every time step. We obtain a state-action pair (s_t, a_t), which we save for fitting a decision tree. The agent transits to a new state s_{t+1} by executing action a_t, and the process repeats until the terminal state is reached. We record the sequence of states and actions as the trajectory T = {(s_1, a_1), (s_2, a_2), ..., (s_t, a_t), s_terminal}. Thus, we obtain a dataset of state-action pairs by interacting with the environment using the expert policy for each episode.

D. Extracting the decision tree

The dataset is then utilized to learn a decision tree with a supervised learning method, where the states and actions are used as features and labels, respectively. The decision tree is built by splitting the dataset with the CART and ID3 methods, starting from the root node of the tree and selecting the successor children (see Appendices B and C for more details of the ID3 and CART algorithms, respectively). To measure the divergence between the probability distributions of the features, we employ the Gini index as an impurity-based criterion. Besides, we utilize pre-pruning methods to prevent the generation of unimportant branches, which would overfit the training data and generalize poorly to test data. Compared to neural networks, decision trees can explain how they arrive at a decision. The obtained decision tree is a binary tree, in which the branch nodes represent a condition on the input features (i.e., the state space) and the leaf nodes represent the action labels.

E. Integrated Algorithm

The integrated algorithm of extracting the decision tree from DRL is summarized in Algorithm 1. First, in the traffic signal control task, the agent obtains the expert policy π* by using the DQN algorithm (Line 3). Second, we employ the expert policy π* to generate a state-action pairs dataset by interacting with the environment (Lines 4-12). Third, after obtaining the dataset, we employ the decision tree methods to fit the expert policy and improve the explainability of the DRL model (Lines 13 and 14).
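To make steps III-C and III-D concrete, here is a minimal Python sketch that rolls out a given expert policy to collect (state, action) pairs and then fits a pre-pruned CART-style tree with scikit-learn. It assumes a Gym-style environment with the classic reset()/step() API and a callable expert_action(state); the helper names, the minimum leaf size, and the use of DecisionTreeClassifier (with criterion="entropy" as an ID3-like stand-in) are our assumptions rather than the authors' code. The 80:20 split, maximum depth 9, and minimum split size 10 mirror the settings reported in Section IV.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def collect_trajectories(env, expert_action, n_episodes=20):
    """Roll out the expert policy and record (state, action) pairs (Section III-C)."""
    states, actions = [], []
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = expert_action(state)            # query the expert policy pi*
            states.append(np.asarray(state).ravel())
            actions.append(action)
            state, reward, done, info = env.step(action)
    return np.array(states), np.array(actions)

def extract_tree(states, actions, criterion="gini", max_depth=9):
    """Fit a pre-pruned CART-style tree on the dataset (Section III-D)."""
    X_train, X_test, y_train, y_test = train_test_split(
        states, actions, test_size=0.2, random_state=0)
    tree = DecisionTreeClassifier(
        criterion=criterion,          # "entropy" gives an ID3-like impurity measure
        max_depth=max_depth,          # pre-pruning to avoid overfitting
        min_samples_split=10,
        min_samples_leaf=5)           # illustrative value; the paper does not report it
    tree.fit(X_train, y_train)
    print("test accuracy:", tree.score(X_test, y_test))
    return tree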

Fig. 2. Cases A, B, C and D are the four possible actions ((a) Case A, (b) Case B, (c) Case C, (d) Case D).

Algorithm 1: Extracting Decision Tree from DRL
Input: Learning rate α; replay memory size M; greedy ε; minibatch size B; discount factor γ.
Output: Expert policy π*, decision tree.
1  Initialize the parameters θ, θ' of the primary and target neural networks with random values.
2  Initialize the replay memory m to be empty.
3  π* ← expert policy obtained with the DQN algorithm.
4  for each episode, while the total reward keeps a high and stable performance do
5      Initialize state s
6      for each step η of the episode, up to T_max do
7          Choose an action a according to the expert policy π*.
8          Take action a, observe reward r', new state s'.
9          Store the state-action pair (s, a) in the dataset D.
10         s ← s'
11     end
12 end
13 Clean the data.
14 Learn the decision tree using the CART or ID3 algorithm.
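For orientation, the following sketch strings Algorithm 1 together end to end, reusing the collect_trajectories and extract_tree helpers sketched in Section III; train_dqn_expert is a hypothetical placeholder for the DQN training loop of Algorithm 2 and is not a function defined in the paper.

def extract_tree_from_drl(env):
    # Line 3: obtain the expert policy pi* (train_dqn_expert is a hypothetical
    # stand-in that returns a callable mapping a state to an action).
    expert_policy = train_dqn_expert(env)
    # Lines 4-12: build the state-action dataset by rolling out the expert policy.
    states, actions = collect_trajectories(env, expert_policy, n_episodes=20)
    # Line 13 ("clean the data") could drop duplicate or terminal transitions here.
    # Line 14: fit a CART/ID3-style tree on the dataset.
    tree = extract_tree(states, actions, criterion="gini", max_depth=9)
    return expert_policy, tree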

IV. EXPERIMENT

In this section, we conduct experiments on a single-intersection traffic signal control task on the Simulation of Urban MObility (SUMO) platform [21]. The SUMO platform provides real-time visualization of the network and the simulated entities. It also offers various vehicle behaviour models, roads with several lanes, and traffic lights. Besides, SUMO supports built-in computational frameworks for RL and provides APIs for road network design, traffic volume simulation and traffic light control. Specifically, SUMO can control the timing of traffic signals according to a given policy. In addition, we can import real-world road networks from OpenStreetMap, and vehicles can enter the system from any position in the network.

A. RL settings

To implement the DQN algorithm for the single-intersection traffic signal control task, the settings of the states, actions and reward function are as follows.

• State space: The definition of the state combines the ideas of [22] and [23]. The state is defined for one intersection using position information within 150 m of the intersection. The position dimension is a binary value: if there is a vehicle in a grid cell, the value of that cell is 1; otherwise it is 0.

• Action space: As shown in Fig. 2, in our experimental settings there are four phases for the traffic lights: the N-S straight phase, the N-S left-turn phase, the E-W straight phase and the E-W left-turn phase. If two consecutively executed actions are different, we insert an all-yellow phase lasting t_y seconds before the second action is taken. Here, we set the duration to t_y = 2 s to ensure a safe switch between two consecutive phases.

• Reward setting: The reward is formulated as follows:

r_t = -\sum_{ih \in \varepsilon,\, l \in L_{ih}} \left(\text{queue}_{t+\Delta t}[l] + a \cdot \text{wait}_{t+\Delta t}[l]\right),   (9)

where a is a trade-off coefficient, and wait and queue measure the cumulative delay of the first vehicle and the queue length along each incoming lane, respectively. We set ∆t = 5 s, which means the interaction between the RL agent and the traffic environment occurs every 5 s.

B. Experimental settings

As shown in Fig. 2a, the single intersection is made up of N-S and E-W four-lane arterial streets, with speed limits of 20 m/s and 11 m/s, respectively. Besides, the length of each lane is 750 m and the turn speed limit is 5.5 m/s. There are 1000 cars appearing randomly on the four approaches in each episode.

To better explain the expert policy, we employ the DQN algorithm to learn the expert policy in the single-intersection traffic signal control task, where the neural network has 1419 parameters, consisting of two hidden layers with 24 and 48 units and ReLU activation. The hyperparameters are the same for all tested algorithms in each group of experiments: learning rate α = 0.01, memory size M = 50000, batch size 100 and discount factor γ = 0.75. The maximum number of steps per episode is set to 5400, which means that each episode lasts 5400 seconds.
Fig. 3. The performance of the tested methods implemented by DQN, ID3 and CART in a single-intersection traffic signal control task.

C. Experiment Results

In the experiment, we first utilize the DQN algorithm to obtain the expert policy. As shown in Fig. 3, the performance of the DQN algorithm becomes stable when the number of episodes reaches about 200, and we save that model as the expert policy model.

We employ the expert policy to generate 20 trajectories for training the decision tree. When training the decision tree, the dataset is divided into training and testing sets at a ratio of 8 : 2. In the ID3 algorithm, we employ the information gain as the impurity measure. To prevent overfitting, we control the size of the tree by limiting the maximum tree depth to 9. We also set the minimum number of samples per split and per leaf, to avoid creating arbitrarily small leaves and to guarantee that each leaf has a minimum size, respectively. In the CART algorithm, we employ the Gini index to measure the divergence between the probability distributions of the target attribute's values. We set the minimum number of samples per split to 10, and the other parameters are the same as those of ID3.

First, we employ the ID3 and CART methods to fit the training dataset. To avoid overfitting, we set stricter limiting parameters, at some cost in accuracy, and obtain 83.32% and 84.25% accuracy on the testing dataset. Then we use the learned ID3 and CART models to verify their performance in the traffic signal control task. As shown in Fig. 3, compared to the expert policy π*, the ID3 and CART methods obtain slightly superior performance. The total rewards of the ID3 and CART methods are increased by 4.93% and 3.94%, respectively, which may benefit from their structure for learning the domain knowledge of the expert policy.

With the obtained decision tree, we can easily fulfill every constraint for transparency. For example, when the state satisfies the condition state[40] = 1 and state[49] = 1 and state[50] = 0 and state[20] = 1 and state[1] = 1 and state[2] = 0 and state[21] = 0 and state[41] = 1 and state[42] = 0, the agent will take action 2 given by the decision tree. Owing to the fact that there are more vehicles in the E-W lanes, the decision tree model gives action 2, which means executing the E-W straight phase to reduce the traffic waiting time. To better visually present the explainability of the expert policy, we set the maximum depth of the CART algorithm to 3. As shown in Fig. 4, we can easily understand the decision process in this reduced case.

While lower complexity brings more understandability, the accuracy of the decision tree slips. As shown in Fig. 5, CART achieves better performance as its maximum depth increases. The decision tree with maximum depth 5 is specified by 190 parameters, which means this tree requires about 13% of the parameters needed to specify the DQN. Besides, with fewer parameters to learn, the decision tree methods require less training time than the DQN. Generally, using an appropriate tree depth setting for the decision tree can provide a better explanation without losing performance.

V. CONCLUSION

In this paper, we propose a method of extracting decision trees from deep reinforcement learning for the traffic signal control task. We obtain the expert policy from the trained DQN method and employ the ID3 and CART methods to extract the domain knowledge for the traffic signal control task. The extracted decision tree has a simpler structure and is more understandable regardless of the structure of the complex expert policy model, which can push such models toward algorithmic transparency. Our future work will focus on more explainable models for more practical machine learning application scenarios.

APPENDIX A: DQN ALGORITHM

Algorithm 2 shows the pseudocode of DQN; more details can be found in [12].

APPENDIX B: ID3 ALGORITHM

Algorithm 3 shows the pseudocode of ID3; more details can be found in [18].

APPENDIX C: CART ALGORITHM

Algorithm 4 shows the pseudocode of CART; more details can be found in [19].

ACKNOWLEDGMENT

This work was supported in part by the National Key Research and Development Program of China (No. 2018AAA0101100) and the National Natural Science Foundation of China (Nos. 62073160 & 62006111).

Fig. 4. Demonstration of how the CART makes a decision given the state information. The maximum depth of the CART is set to 3. (Each node of the tree shows its split condition on a state feature, e.g., state[49] ≤ 0.5 at the root, together with its Gini impurity, sample count, per-class sample distribution and majority action.)
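A tree visualization like Fig. 4, and the if-then rule quoted in Section IV-C, can be read directly out of a fitted scikit-learn tree. The sketch below assumes the tree object produced by the earlier extraction sketch and names the features state[i] to match the paper's notation.

from sklearn.tree import export_text, plot_tree
import matplotlib.pyplot as plt

feature_names = [f"state[{i}]" for i in range(tree.n_features_in_)]

# Textual if-then rules, one line per split, similar to the rule quoted in Sec. IV-C.
print(export_text(tree, feature_names=feature_names, max_depth=3))

# Graphical view similar to Fig. 4 (split condition, Gini, sample counts, class per node).
plot_tree(tree, feature_names=feature_names,
          class_names=[str(c) for c in tree.classes_],
          max_depth=3, filled=True)
plt.show()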

Algorithm 2: Deep Q-learning with experience replay
Input: Learning rate α; replay memory size M; greedy ε; minibatch size B; discount factor γ.
Output: Expert policy π*.
1  Initialize replay memory D to capacity N
2  Initialize action-value function Q with random weights θ
3  Initialize target action-value function Q̂ with weights θ⁻ = θ
4  for episode = 1, M do
5      Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
6      for t = 1, T do
7          With probability ε select a random action a_t
8          Otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
9          Execute action a_t in the emulator and observe reward r_t and observation x_{t+1}
10         Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
11         Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
12         Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
13         Set y_j = r_j if the episode terminates at step j + 1, and y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻) otherwise
14         Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
15         Every C steps reset Q̂ = Q
16     end
17 end
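As a rough PyTorch rendering of the inner loop of Algorithm 2 (lines 12-15), under the hyperparameters reported in Section IV-B (learning rate 0.01, replay memory 50000, minibatch 100, γ = 0.75); the optimizer choice, the target-sync interval C, and the layer sizes are our assumptions.

import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 80, 4                     # illustrative sizes, as in the earlier sketch

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 24), nn.ReLU(),
                         nn.Linear(24, 48), nn.ReLU(),
                         nn.Linear(48, N_ACTIONS))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())   # line 3: initialize Q_hat with Q's weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.01)   # optimizer choice is an assumption
replay = deque(maxlen=50000)                     # replay memory D of (s, a, r, s', done) tuples
gamma, batch_size, sync_every = 0.75, 100, 500   # sync interval C (500) is an assumption

def train_step(step):
    """One DQN update, corresponding to lines 12-15 of Algorithm 2."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)    # line 12: sample a random minibatch
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.tensor([b[3] for b in batch], dtype=torch.float32)
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    with torch.no_grad():                        # line 13: y_j = r_j (+ gamma * max_a' Q_hat if non-terminal)
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)       # line 14: gradient step on (y_j - Q(s_j, a_j))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:                   # line 15: every C steps reset Q_hat = Q
        target_net.load_state_dict(q_net.state_dict())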

Fig. 5. The performance of the CART method implemented with different maximum tree depth settings in a single-intersection traffic signal control task.

Algorithm 3: Iterative Dichotomiser 3 (ID3)
Input: Attribute set A = {a_1, a_2, ..., a_d}; training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}.
Output: A decision tree with node as the root.
1  Function: TreeGenerate(D, A)
2  Generate node for the tree
3  if all members of the examples are in the same class C then
4      Root node = C
5  else if the attribute set is empty then
6      Set the root node to the most common value of the target attribute in the examples
7  else
8      Set A to the element of the attribute set that maximizes the information gain
9  end
10 A is the decision attribute for the root node
11 for each possible value v of A do
12     Add a branch below the root node, testing for A = v
13     Set D_v as the subset of examples with A = v
14     if D_v is empty then
15         Below this branch, add a leaf with the most common value of the target attribute in the examples
16     else
17         Below this branch, add a sub-node TreeGenerate(D_v, A)
18     end
19 end
20 end

Algorithm 4: Classification and Regression Trees (CART)
Input: Attribute set A = {a_1, a_2, ..., a_d}; training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}.
Output: A decision tree with node as the root.
1  Function: TreeGenerate(D, A)
2  Generate node for the tree
3  if all stopping criteria have been met then
4      Return a single node with the most common class in D as its label
5  else
6      Find a ∈ A that best splits D using the impurity function
7      Label node with a
8      for each possible value v of a do
9          D_v = the subset of D with value v on attribute a
10         A' = the attribute set A with the best split attribute a
11         Grow TreeGenerate(D_v, A')
12         Connect the new node to the root node with label v
13     end
14 end
15 end

REFERENCES

[1] Z. Cao, H. Guo, J. Zhang, D. Niyato, and U. Fastenrath, "Finding the shortest path in stochastic vehicle routing: A cardinality minimization approach," IEEE Trans. Intell. Transp. Syst., vol. 17, no. 6, pp. 1688-1702, 2015.
[2] D. Wilkie, J. Van den Berg, M. Lin, and D. Manocha, "Self-aware traffic route planning," in Proc. 25th AAAI Conf. Artif. Intell., 2011.
[3] S. Sen and K. L. Head, "Controlled optimization of phases at an intersection," Transp. Sci., vol. 31, no. 1, pp. 5-17, 1997.
[4] S. G. Shelby, D. M. Bullock, D. Gettman, R. S. Ghaman, Z. A. Sabra, and N. Soyke, "An overview and performance evaluation of ACS Lite, a low cost adaptive signal control system," in Transp. Res. Board, vol. 190, 2008, pp. 130-137.
[5] C. P. Pappis and E. H. Mamdani, "A fuzzy logic controller for a traffic junction," IEEE Trans. Syst., Man, Cybern., vol. 7, no. 10, pp. 707-717, 1977.
[6] W. Choi, H. Yoon, K. Kim, I. Chung, and S. Lee, "A traffic light controlling FLC considering the traffic congestion," in Proc. AFSS Int. Conf. Fuzzy Syst., Springer, 2002, pp. 69-75.
[7] N. V. Findler, S. Surender, and Ş. Catrava, "On-line decisions about permitted/protected left-hand turns in distributed traffic signal control," Eng. Appl. Artif. Intell., vol. 10, no. 3, pp. 315-320, 1997.
[8] Z. Liu, "A survey of intelligence methods in urban traffic signal control," Int. J. Comput. Sci. Netw. Secur., vol. 7, no. 7, pp. 105-112, 2007.
[9] C. Dong, Z. Liu, and X. Liu, "Chaos-particle swarm optimization algorithm and its application to urban traffic control," Int. J. Comput. Sci. Netw. Secur., vol. 6, no. 1B, pp. 97-101, 2006.
[10] Y. Wen and T. Wu, "Reduced-order rolling horizon optimization of traffic control based on ant algorithm," J. Zhejiang Univ. Eng. Sci., vol. 39, no. 6, p. 835, 2005.
[11] R. K.-M. Sheh, "“Why did you do that?” Explainable intelligent robots," in Proc. 31st AAAI Conf. Artif. Intell., 2017.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[13] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[14] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., PMLR, 2015, pp. 1889-1897.
[15] Z. Wang, C. Chen, H.-X. Li, D. Dong, and T.-J. Tarn, "Incremental reinforcement learning with prioritized sweeping for dynamic environments," IEEE/ASME Trans. Mechatronics, vol. 24, no. 2, pp. 621-632, 2019.
[16] N. Frosst and G. Hinton, "Distilling a neural network into a soft decision tree," arXiv preprint arXiv:1711.09784, 2017.
[17] O. Bastani, Y. Pu, and A. Solar-Lezama, "Verifiable reinforcement learning via policy extraction," arXiv preprint arXiv:1805.08328, 2018.
[18] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1, pp. 81-106, 1986.
[19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Routledge, 2017.
[20] M. Bramer, Principles of Data Mining. Springer, 2007, vol. 180.
[21] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "SUMO, simulation of urban mobility: An overview," in Proc. 3rd Int. Conf. Adv. Syst. Simul., ThinkMind, 2011.
[22] L. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 412-421, 2010.
[23] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transp. Res. Pt. C-Emerg. Technol., vol. 85, pp. 732-752, 2017.

