
This work is accepted by IEEE Transactions on Machine Learning in Communications and Networking (TMLCN).

Copyright is owned by IEEE.

A Deep Reinforcement Learning-Based Resource Scheduler for Massive MIMO Networks

Qing An, Santiago Segarra, Chris Dick, Ashutosh Sabharwal, Rahman Doost-Mohammady

arXiv:2303.00958v2 [cs.IT] 14 Sep 2023

Qing An, Santiago Segarra, Ashutosh Sabharwal and Rahman Doost-Mohammady are with the Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA. E-mail: {qa4, segarra, ashu, doost}@rice.edu. Chris Dick is with NVIDIA Corporation, Santa Clara, CA, USA. E-mail: [email protected]. This work was supported by the U.S. National Science Foundation under Grants CNS-1827940, CNS-2016727, and CNS-2120447.

Abstract—The large number of antennas in massive MIMO systems allows the base station to communicate with multiple users at the same time and frequency resource with multi-user beamforming. However, highly correlated user channels could drastically impede the spectral efficiency that multi-user beamforming can achieve. As such, it is critical for the base station to schedule a suitable group of users in each time and frequency resource block to achieve maximum spectral efficiency while adhering to fairness constraints among the users. In this paper, we consider the resource scheduling problem for massive MIMO systems, whose optimal solution is known to be NP-hard. Inspired by recent achievements in deep reinforcement learning (DRL) to solve problems with large action sets, we propose SMART, a dynamic scheduler for massive MIMO based on the state-of-the-art Soft Actor-Critic (SAC) DRL model and the K-Nearest Neighbors (KNN) algorithm. Through comprehensive simulations using realistic massive MIMO channel models as well as real-world datasets from channel measurement experiments, we demonstrate the effectiveness of our proposed model in various channel conditions. Our results show that our proposed model performs very close to the optimal proportionally fair (Opt-PF) scheduler in terms of spectral efficiency and fairness with more than one order of magnitude lower computational complexity in medium network sizes where Opt-PF is computationally feasible. Our results also show the feasibility and high performance of our proposed scheduler in networks with a large number of users and resource blocks.

Index Terms—Massive MIMO, Resource Scheduling, Deep Reinforcement Learning.

I. INTRODUCTION

MASSIVE multiple-input multiple-output (MIMO) is one of the key technologies poised to radically improve the spectral efficiency of the current 5G networks and beyond. Through the use of tens or hundreds of antennas at the base station, it can perform multi-user beamforming to serve tens of users in the same time-frequency resource block (RB). However, scheduling which users to serve simultaneously in each RB plays an important role in achieving the large throughput gains promised by the massive MIMO technology. Beamforming performance can be significantly degraded if there is a substantial correlation in the wireless channels among the scheduled users, as this correlation makes it challenging to effectively focus signal energy when transmitting toward scheduled users. Similarly, separating the signals received from multiple users becomes challenging when their channels are correlated. In networks with high user mobility, the channels of individual users and their correlations with other users within each RB are rapidly fluctuating. This dynamic nature of channel characteristics substantially increases the challenges associated with achieving optimal resource scheduling for massive MIMO networks. Specifically, fair scheduling of radio resources while maximizing spectral efficiency is essential in real deployments. The formulation of the optimal Proportionally Fair (Opt-PF) scheduling problem typically results in an integer linear optimization (ILP) problem with an NP-hard solution [1]. The large complexity associated with solving an ILP, when the number of users and resource blocks is large, prohibits designing optimal yet computationally feasible schedulers that can work in the time-stringent 5G and beyond standards. There is a large body of work [2]–[5] that designs heuristics or approximation algorithms with low complexity to optimize the spectral efficiency of the networks. However, they either do not evaluate fairness at all or demonstrate poor fairness. This is due to the fact that designing low-complexity approximation algorithms for multi-objective combinatorial optimization problems is typically hard [6].

In the field of artificial intelligence and machine learning, Markov Decision Processes (MDPs) [7] have emerged as a powerful mathematical framework for modeling decision-making problems under uncertainty. MDPs represent sequential decision processes as a set of states, actions, and transition probabilities, where the goal is to find an optimal policy that maximizes a predefined objective function, such as expected cumulative rewards. However, solving MDPs can be computationally demanding, especially for complex problems with large state and action spaces. To address this challenge, Deep Reinforcement Learning (DRL) [8] has gained significant attention in recent years. DRL combines reinforcement learning algorithms with deep neural networks to approximate value functions or policies, enabling the handling of high-dimensional state spaces. By leveraging the representation power of deep neural networks, DRL algorithms have achieved remarkable successes in solving continuous and discrete action space problems in various domains, including robotics [9], game playing [10], and energy management [11]. Notably, DRL has also been applied to solve complex combinatorial optimization tasks. For instance, [12] has adopted DRL to solve the traveling salesman problem, a classic combinatorial optimization problem. Similarly, [13] solves the covering salesman problem through a DRL model. This motivates the need to explore DRL as a potential tool to solve the optimal proportionally fair resource scheduling for massive
MIMO networks. Instead of using an explicit mathematical model, decision optimization in a wireless resource scheduler can be represented as a Markov Decision Process (MDP) whose observations and actions are guided by a well-defined reward function. A DRL agent can then approach an optimum MDP solution by learning from its interactions with the wireless environment. The choice of the DRL model to solve the resource scheduling problem is crucial in achieving high performance and scalability in terms of the number of users in real-world massive MIMO networks. In recent years, many DRL models for decision making in discrete action spaces that fit the resource scheduling problem have been proposed. Deep Q-Network (DQN) [10], Double DQN [14], Advantage Actor-Critic (A2C), Asynchronous Advantage Actor-Critic (A3C) [15], Actor-Critic with Experience Replay (ACER) [16], and Proximal Policy Optimization (PPO) [17] are a few examples. However, all these models are shown to struggle with large discrete action spaces that are typically present in combinatorial optimization problems, a phenomenon known as action dimensional disaster [18]. Another class of DRL models that deal with continuous action spaces has been used and adapted for discrete action spaces in various domains. For instance, Deep Deterministic Policy Gradient (DDPG) [19] is a popular continuous-based DRL model used to solve a variety of decision problems with large discrete action spaces [20], including resource scheduling in massive MIMO [21], [22]. However, DDPG is known to be very sensitive to hyper-parameter tuning in actual training, especially in high-dimensional and complicated tasks [23].

In this paper, we present a novel DRL framework for the resource scheduling problem in massive MIMO networks. The novelty of our framework is three-fold:

First, we propose a DRL-based scheduler design named SMART, based on the recently proposed soft actor-critic (SAC) model [24]. The SAC model has superior sample efficiency by incorporating an entropy term in its value function and automatic tuning of hyper-parameters. Therefore, it can converge to the optimal solution in large multi-dimensional action spaces much faster than existing models such as DDPG. Given that SAC is by design used for continuous space problems, we propose to combine SAC with the K-Nearest Neighbors (KNN) algorithm to generate discrete outputs corresponding to user scheduling decisions in massive MIMO networks. To achieve the scalability required for real-world massive MIMO networks with a large number of users, we propose a novel dimension division strategy that maps the discrete action set for scheduling to multiple dimensions.

Second, we significantly reduce the state space and, thus, the complexity of the proposed SMART model for massive MIMO by using user grouping labels as the model states instead of the raw channel state information (CSI) matrix. The user grouping labels indicate which users have less correlated channel vectors and, hence, are more suitable to be scheduled at the same time. This reduces the computational complexity of the model in both training and inference by 2x without sacrificing spectral efficiency or fairness.

Third, we demonstrate the scalability of SMART to a large number of resource blocks consistent with 5G systems. We demonstrate that our scheduler framework can operate independently on different resource blocks and, at the same time, achieve close to optimal performance.

We evaluate the effectiveness of SMART in various channel conditions in both simulated as well as real-world channel traces through a comparison of its performance with state-of-the-art scheduling algorithms, including heuristic-based and DRL-based models. We comprehensively demonstrate the effectiveness of our proposed method in achieving near-optimal spectral efficiency while simultaneously maintaining superior inter-user fairness very close to the Opt-PF scheduler. We experimentally analyze the computational complexity of our method and demonstrate its efficiency. We also provide guidelines on how our proposed system can be deployed on real-world 5G and beyond systems while achieving the latency required for the 5G new radio (NR) standard.

Fig. 1: System Model.

II. SYSTEM MODEL AND EXISTING WORK

A. System Model

We consider a single-cell network with a massive MIMO base station (BS) with M antennas serving L single-antenna users in its cell. The base station uses orthogonal frequency division multiplexing (OFDM) and performs MU-MIMO transmission and reception to N < L users such that N ≤ M. We consider time-division duplex (TDD) operation, where all L users periodically send orthogonal pilot sequences to the BS for channel estimation. We assume that the scheduler possesses full knowledge of the channel condition of all users associated with the BS and that the channel for each user does not change during a transmission time interval (TTI). Subsequently, the BS selects a set of N users for data transmission and reception through beamforming based on their estimated channel, assigns their modulation schemes, and communicates that information through the control channel. Using their assigned modulation scheme, the selected users will transmit their symbols at the same RB in the uplink and receive them simultaneously in the downlink. A simplified system model is depicted in Fig. 1. For the uplink, we consider the following signal model

y = Hu + n,  (1)

where y is the M × 1 received signal vector at the BS, H is the M × N channel matrix, and u is the N × 1 vector of symbols transmitted by the users. Additionally, n is the M × 1 receiver complex noise vector with a circular Gaussian distribution, n ∼ CN(0, σ²I), where σ² is the noise variance and I is the identity matrix.
Note that the value of N can vary in each TTI depending on the current channel condition, and it can be bounded by a maximum value Nmax. We assume the BS uses zero forcing (ZF) for beamforming. The BS calculates the ZF beamformer using the estimated channel Ĥ as

W = Ĥ(Ĥ^H Ĥ)^{-1}.  (2)

The BS then performs receive beamforming on the received signal to estimate the transmitted symbol vector û as

û = W^H y.  (3)

For simplicity, we only consider the uplink, but the above model is extendable to the downlink as well. The above signal model is for a single subcarrier in an OFDM system, but the same model applies to all subcarriers.
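As a concrete illustration (not code from the paper), the ZF combining in (2) and (3) can be written in a few lines of NumPy; the array shapes follow the signal model above, and the function name is ours:

```python
import numpy as np

def zf_receive(H_hat: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Zero-forcing receive combining for one subcarrier.

    H_hat: (M, N) estimated channel of the N scheduled users.
    y:     (M,)   received signal vector at the BS.
    Returns the (N,) estimate of the transmitted symbol vector.
    """
    # W = H_hat (H_hat^H H_hat)^{-1}, as in Eq. (2)
    W = H_hat @ np.linalg.inv(H_hat.conj().T @ H_hat)
    # u_hat = W^H y, as in Eq. (3)
    return W.conj().T @ y
```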
An RB is the smallest scheduling granularity in 5G NR, which contains resources in the time and frequency domain. One RB in 5G is made up of 12 consecutive subcarriers in the frequency domain [25]. In the time domain, the composition of RBs in 5G is more flexible and can vary between one OFDM symbol and the entire slot (1 ms in numerology 0). The quality of the wireless channel changes dramatically over time, across users, and among different frequency bands. It is shown in [26] that wireless channel capacity might fluctuate by up to 9 times in 20 MHz LTE bandwidth with over 100 RBs. This effect is more pronounced in 5G since it typically has a wider bandwidth (i.e., 40 MHz to 400 MHz). Consequently, user selection decisions will vary across RBs due to the frequency selectivity of the channel. Thus, it is essential to take into account resource scheduling for every RB individually. In our design, we first focus on resource scheduler design on a single RB and then extend to many RBs to show the adaptability of our proposed scheduler to 5G massive MIMO networks.

Optimal Schedulers: In the literature, multiple schedulers are defined as optimal. The rate-optimal scheduler, known as Optimal Maximum Rate (Opt-MR), finds the resource scheduling solution in each TTI that maximizes the sum rate

$$\underset{x_{l,b}^{t}}{\arg\max}\; \sum_{b=1}^{B}\sum_{l=1}^{L} r_{l,b}^{t}\, x_{l,b}^{t}, \qquad (4)$$
$$\text{s.t.}\quad \sum_{l=1}^{L} x_{l,b}^{t} \leq N_{\max}, \qquad x_{l,b}^{t} \in \{0,1\}$$

where $x_{l,b}^{t}$ represents the binary selection of user l at TTI t and RB b, and $r_{l,b}^{t}$ is the instantaneous rate achieved by user l at TTI t and RB b. We calculate the instantaneous rate as $r_{l,b}^{t} = \log_2(1+\mathrm{SINR}_{l,b}^{t})$, where $\mathrm{SINR}_{l,b}^{t}$ is the received signal to interference-plus-noise ratio from each beamformed user l at TTI t and RB b. We consider B as the maximum number of RBs being used in the system.

Simply maximizing the sum rate ignores the notion of fairness where, depending on the channel conditions, some users may never get selected. Therefore, a commonly used scheduler, known as the Optimal Proportionally Fair (Opt-PF) scheduler, finds the resource scheduling solution that maximizes the following objective [27], [28]

$$\underset{x_{l,b}^{t}}{\arg\max}\; \sum_{b=1}^{B}\sum_{l=1}^{L} w_{l,b}^{t}\, x_{l,b}^{t}, \qquad (5)$$
$$\text{s.t.}\quad \sum_{l=1}^{L} x_{l,b}^{t} \leq N_{\max}, \qquad x_{l,b}^{t} \in \{0,1\}$$
$$w_{l,b}^{t} = \frac{r_{l,b}^{t}}{\sum_{b=1}^{B} p_{l,b}^{t}}, \qquad (6)$$
$$p_{l,b}^{t} = \begin{cases} p_{l,b}^{t-1} + r_{l,b}^{t-1}, & \text{if } x_{l,b}^{t-1} = 1 \\ p_{l,b}^{t-1}, & \text{otherwise} \end{cases}$$

where $w_{l,b}^{t}$ denotes the weighted rate, which we calculate as the ratio of the instantaneous rate $r_{l,b}^{t}$ to the received rate $p_{l,b}^{t}$ until TTI t on all RBs. Normalizing the instantaneous rate with the total received rate guarantees that all users have a fair chance of getting selected by the scheduler even when they are experiencing a poor channel.

Both optimization problems in (4) and (5) are NP-hard since they can be reformulated as an Integer Linear Programming (ILP) problem [29]. Specifically, we can reformulate (5) when B = 1 as the following ILP problem,

$$\underset{\mathbf{x}}{\arg\max}\; \mathbf{w}^{T}\mathbf{x} \qquad (7)$$
$$\text{s.t.}\quad \mathbf{J}_{L,L}\,\mathbf{x} \leq N_{\max}\mathbf{J}_{L}, \qquad \mathbf{x} \in \{0,1\}^{L}$$

where w is a vector of all users' instantaneous rates and x is the user binary selection vector. Also, J_{L,L} and J_{L} are a square matrix and a vector of all ones with size L, respectively.

Solving (7) by exhaustively searching through the combinations of vector x has the complexity of O(2^L). Solving (4) and (5) through an exhaustive search, when B RBs are considered, increases the complexity to O(2^{LB}). However, there are approximate algorithms for the Opt-PF problem with polynomial complexity, such as the one proposed in [28]. We discuss and evaluate an approximate algorithm in §IV along with other benchmarks.
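To make the combinatorial cost concrete, the sketch below enumerates every candidate user set on a single RB, which is exactly the O(2^L) search described above. The helper `weighted_sum_rate` is a placeholder of ours, assumed to re-evaluate the ZF rates and proportional-fairness weights of Eq. (6) for a given candidate set; it is not part of the paper.

```python
from itertools import combinations

def opt_pf_single_rb(weighted_sum_rate, L, N_max):
    """Exhaustive Opt-PF search on one RB (illustrative only).

    weighted_sum_rate(S): assumed helper returning the sum of weighted
    rates w_l for candidate user set S (rates recomputed under ZF for S).
    """
    best_set, best_val = (), float("-inf")
    for n in range(1, N_max + 1):              # all set sizes up to N_max
        for S in combinations(range(L), n):    # all user subsets of size n
            val = weighted_sum_rate(S)
            if val > best_val:
                best_set, best_val = S, val
    return best_set, best_val
```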
B. Existing Work and Motivation

Recent work on resource scheduling in massive MIMO and MU-MIMO can be classified into two general categories: heuristics schedulers and AI-based schedulers. In this section, we provide an overview of some of the most relevant works in each category.

Heuristics Scheduler Designs: Many existing MU-MIMO scheduling works provide heuristics-based approximations to the Opt-PF scheduler [5], [30], [31]. While they try to strike a balance between complexity and performance, often their complexity does not scale to large networks or they significantly underperform the optimal scheduling policies.
The scheduler proposed in [31] implements a multi-phase optimization to solve Eq. (5) in MU-MIMO settings. It narrows down the exhaustive search needed for the Opt-PF solution using some relaxations of the optimization problem. For example, it decouples the user selection in different RBs. Moreover, in each RB, it reduces the number of choices based on the channel quality of each user before deciding the user selection action based on the correlation of the remaining users. Through these sub-optimal relaxations, their method can be parallelized and efficiently implemented on a powerful GPU, and hence can meet the stringent 5G-NR latency constraints (i.e., nearly 1 ms). Despite the low-latency implementation, this scheduler only scales to M = 12 and N = 4, and as a result, it has limited scalability to massive MIMO. In [5], two heuristics-based user scheduling algorithms are proposed and evaluated on channel datasets collected from a dense indoor massive MIMO network with stationary users. However, the algorithms sacrifice fairness in favor of spectral efficiency. They are also not evaluated under mobility scenarios. The work in [32] proposed a scheduler for massive MIMO that schedules users with low correlation channels in the same time slot. It first partitions users into groups through a user grouping algorithm. The scheduler then goes through all groups and schedules all users in each group with a rate-fair method. As we discuss later in the paper, this scheduling algorithm fails to work well in fast-varying channel environments when inter-user channel correlations are continuously changing, and it is unable to fairly allocate users across channel coherence blocks.

AI-based Scheduler Designs: Due to the huge complexity of the optimization-based methods, several recent works [18], [21], [22], [33]–[36] have proposed DRL models for MIMO scheduling. A Q-learning-based DRL resource scheduler is proposed in [34]. It models the user scheduling problem as a Markov Decision Process (MDP) and outperforms the round-robin scheduler in terms of sum rate. However, discrete DRL models are known to have difficulty converging in large action sets [37]. The convergence issue is also true for more advanced discrete DRL models, such as DQN and Double DQN. As such, discrete DRL models have limited scalability to a large number of users for multi-user scheduling in massive MIMO networks. We will also demonstrate these limitations in §IV.

The work in [21] proposes a DDPG-based user scheduler for massive MIMO networks. Its model outputs a probability distribution over all selectable users and chooses the most promising UE combinations at each TTI. However, it includes a raw channel matrix in the state space, and the number of elements in the action space equals the number of UEs. Large state and action spaces hinder its scalability. This algorithm is extended in [22] for both user scheduling and transmit precoding based on DDPG. It considers multiple antennas and antenna correlation on the UE side as well. However, their proposed scheduler has limited scalability and does not consider the evaluation of user fairness. We implement a DDPG-based scheduler as one of our benchmarks and discuss its performance with respect to our proposed scheduler.

A pointer network is investigated in [18] as the actor in an actor-critic framework to convert the combinatorial problem in multi-user scheduling into a sequential selection problem. However, sequential scheduling has slow inference, which makes it undesirable for latency-sensitive 5G networks. Additionally, applying the model to large networks results in a complicated network structure and a long model update time due to the use of a raw channel matrix as the input. This is exacerbated further by complex-valued channels, which need to be separated into real and imaginary parts before being fed to the model. We implement a pointer network-based DRL scheduler as a benchmark and discuss these limitations in more detail in §IV.

Our Proposed Method: We propose SMART, a massive MIMO user scheduler based on the recently proposed soft actor-critic (SAC) DRL model [23], [24] and the KNN algorithm [38]. SAC has gained traction in several real-time control problems such as robotic locomotion [39]. SAC was originally designed to handle continuous action spaces. However, user scheduling is a discrete decision problem where an appropriate set of users must be selected at each TTI. The work in [40] provides a modification of SAC for discrete action spaces, but we find that their modification is still not suitable for large discrete action sets as it has serious convergence issues in large-scale networks. Inspired by the approach in [20], we use KNN [38] to discretize SAC and adapt it to discrete action spaces. The basic idea is to use a continuous-based algorithm to generate an initial or "proto" continuous action first. Then, the K nearest discrete actions are found by using the KNN algorithm. Among the K nearest discrete actions, the one with the maximum Q value is selected. We further propose a novel dimension division strategy that helps to scale up the size of the combinatorial action set (i.e., the number of users in the network) and enhance model convergence capability. Using this approach, we enable our model to dynamically select the users to maximize system spectral efficiency and inter-user fairness. More details are illustrated in §III-C. In contrast to prior work, our proposed scheduler is more scalable and performs very close to the Opt-PF solution.

III. SMART: A SCALABLE SAC-KNN-BASED MASSIVE MIMO SCHEDULER

In this section, we first provide a brief introduction to SAC. Subsequently, we describe the design of our proposed scheduler based on the SAC DRL framework. We discuss how we discretize the output of the SAC framework by applying the KNN algorithm and propose a dimension division strategy to scale up the supported size of the action set. We also propose to reduce the complexity of the framework by using user grouping instead of the raw channel matrix as the input to the framework. Additionally, we discuss how we scale up the model to support as many RBs as needed for realistic 5G networks.

A. A Primer on SAC

SAC is an off-policy Deep Reinforcement Learning (DRL) model that employs a stochastic policy, in contrast to the deterministic policy used in Deep Deterministic Policy Gradient (DDPG). Instead of selecting the optimal action, a stochastic policy outputs probabilities for all possible actions.
The optimal policy in SAC, defined in (8), aims to maximize both the cumulative reward R and the policy entropy H.

$$\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{(s_t,a_t)\sim\rho_{\pi}}\left[\sum_{t} R(s_t,a_t) + \alpha H(\pi(\cdot|s_t))\right]. \qquad (8)$$

where the policy entropy H is defined as

$$H(\pi(\cdot|s_t)) = -\sum_{a_t} P(a_t|s_t)\log\big(P(a_t|s_t)\big) \qquad (9)$$
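For reference, Eq. (9) amounts to a one-line computation over the action probabilities produced by the stochastic policy; this is a sketch of ours, not taken from the authors' implementation:

```python
import numpy as np

def policy_entropy(probs: np.ndarray) -> float:
    # H(pi(.|s_t)) = - sum_a P(a|s_t) * log P(a|s_t), as in Eq. (9)
    p = np.clip(probs, 1e-12, 1.0)   # numerical guard for zero probabilities
    return float(-(p * np.log(p)).sum())
```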
By maximizing policy entropy, SAC encourages the model to extensively explore the action space, facilitating the discovery of global optima and enhancing sample efficiency. Moreover, SAC samples transitions from replay memory to learn from past experience, similar to other off-policy algorithms like DQN [10] and Double DQN [14]. In contrast to on-policy models such as PPO [17] and A3C [15], which update their policies based on experiences generated by the current policy, SAC has the ability to learn from a broader spectrum of experiences. This characteristic enhances sample efficiency and aids in facilitating convergence, especially in high-dimensional action spaces as demonstrated in [24].

In general, SAC has the following two major benefits:

1) Strong exploration capability. SAC does not discard any action, even if it is not the best one. If multiple promising actions are found, the stochastic policy will choose them with equal probability. This feature helps SAC explore more and not easily get trapped in local optima. In contrast, deterministic policy-based algorithms, such as DDPG [41], save the action with the highest value, resulting in fewer exploration opportunities.

2) High robustness. Most applications of RL require the agent to perform well in the presence of disturbances in the environment. Because of the adopted stochastic and entropy-maximizing algorithm, SAC explores as many potential actions as possible and, hence, it is able to deal with complicated and dynamic environments (e.g., mobility scenarios in wireless communication), including scenarios it has never encountered [42].

Fig. 2 shows the block diagram of the SAC framework. Similar to any actor-critic architecture in DRL, the actor in SAC generates a policy from which an action is drawn based on the current state. The role of the critic is to assess the actor's policy and guide the actor toward the optimal path through feedback. Unlike other actor-critic models, SAC adjusts the Q function by a temperature coefficient (α in (8)), which represents the weight of entropy. Furthermore, in [23], the authors improve SAC with automatic entropy coefficient adjustment. This method significantly reduces the burden of manually adjusting hyper-parameters in training and stabilizes its convergence. In contrast, hyper-parameter tuning and unstable environments are still big challenges for the majority of state-of-the-art DRL models such as DDPG [43]. Another advantage of SAC is its robustness in handling multi-dimensional tasks. High-dimensional tasks are generally challenging for DRL models due to a phenomenon known as the curse of dimensionality [44]. However, due to the high sample efficiency boosted by entropy maximization, SAC has been demonstrated to perform very well in high-dimensional tasks with up to 21 action dimensions [24]. Specifically, SAC is demonstrated to work well in the design of autonomous robots where the actions of multiple parts of the robot must be decided simultaneously. As we discuss later, we use this feature of SAC to our advantage to deal with large discrete action sets in massive MIMO user scheduling.

Fig. 2: Soft Actor-Critic Framework.

B. SMART Scheduler Core Design

In this section, we adapt the discretized SAC algorithm [23] to formulate and build a Markov Decision Process (MDP) model to solve the user scheduling problem in massive MIMO networks.

State space. We define the state space of user l at TTI t as $s_t^l := [\gamma_t^l, f_t^l, g_t^l] \in \mathcal{S} := [\Gamma, F, G]$, where $\gamma_t^l$ indicates the maximum achievable spectral efficiency of user l at TTI t, $f_t^l$ indicates the total amount of transmitted data by user l up until TTI t, and $g_t^l$ is the user group label of user l at TTI t. The value of $\gamma_t^l$ can be calculated as the spectral efficiency of user l in SU-MIMO, where only user l is scheduled at TTI t. The users with the same user grouping label $g_t^l$ have low channel correlation, so they are preferred to be scheduled together. We will introduce more details on the user grouping strategy in §III-E.

Action space. The action space set A consists of discrete values, encoding the user-selection decision. We denote the action at time t as $a_t \in \mathcal{A}$. Due to its combinatorial nature, the action set grows exponentially with the number of users in the system. For instance, with a total of L users available, any number of users between 1 and Nmax can be scheduled at each TTI t, and thus the total number of possible selections is $\sum_{i=1}^{N_{\max}} \binom{L}{i}$.

Reward. Our ultimate objective for resource scheduling is to maximize both the system's spectral efficiency and fairness among users. By system spectral efficiency, we refer to the sum rate achieved by all users scheduled together at TTI t. We use a normalized version of this quantity expressed by $\gamma_t^{\text{total}}$. The normalization factor is calculated as follows. We measure the achievable rates for each user in the system if that user were scheduled individually (SU-MIMO). We then use the sum of the N largest rates out of the total L users as the normalization factor.
This will guarantee a value in [0, 1], which can then be used in the reward function. To quantify fairness, we use Jain's fairness index (JFI) [45], which can be expressed at each TTI t as

$$\mathrm{JFI}_t = \frac{\left(\sum_{l=1}^{L} f_t^l\right)^{2}}{L\sum_{l=1}^{L}\left(f_t^l\right)^{2}}. \qquad (10)$$

As such, we include the normalized spectral efficiency and the JFI in the reward function of the MDP model. The reward $R_t$ achieved at TTI t can then be formulated as

$$R_t = \beta\,\gamma_t^{\text{total}} + (1-\beta)\,\mathrm{JFI}_t. \qquad (11)$$

In (11), β determines the relative importance of each item in the reward function based on the preference of the system operator. Note that both items are in the range [0, 1] so that we can effectively adjust their weights in the reward function with the parameter β.
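Putting (10) and (11) together, the per-TTI reward can be computed as in the sketch below. Variable names and the small numerical guard are ours; beta = 0.5 mirrors the setting later listed in Table I.

```python
import numpy as np

def jain_fairness(f: np.ndarray) -> float:
    # Eq. (10): JFI_t = (sum_l f_l)^2 / (L * sum_l f_l^2)
    return float(f.sum() ** 2 / (len(f) * np.sum(f ** 2) + 1e-12))

def reward(rates_scheduled, su_rates_all, f, N, beta=0.5):
    """Eq. (11): R_t = beta * normalized SE + (1 - beta) * JFI_t.

    rates_scheduled: rates of the users scheduled in this TTI
    su_rates_all:    SU-MIMO rates of all L users (normalization basis)
    f:               accumulated transmitted data of all L users
    N:               number of largest SU-MIMO rates used for normalization
    """
    norm = np.sort(su_rates_all)[-N:].sum()        # sum of the N largest SU rates
    gamma_total = np.sum(rates_scheduled) / norm   # normalized spectral efficiency in [0, 1]
    return beta * gamma_total + (1 - beta) * jain_fairness(np.asarray(f))
```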
C. Discrete Action SAC Design

Originally, SAC is a continuous action space model and thus, it cannot be directly applied to the massive MIMO user scheduling problem. There are existing discrete action space models, such as DQN [10] and Double DQN [14], that could potentially be used to solve the problem. But as we will show in §IV, none of these methods can handle the large action set in massive MIMO user scheduling. Note that the discrete action space set in multi-user scheduling in massive MIMO increases exponentially as the number of users grows. For example, with M = 64 BS antennas and L = 64 single-antenna users, or simply a 64 × 64 network size, and Nmax = 16 in each TTI, the action set has up to $\sum_{i=1}^{16}\binom{64}{i} \approx 7 \times 10^{14}$ actions.
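The size of this action set can be checked directly (a two-line Python calculation, included here only to make the number concrete):

```python
import math

# Scheduling actions for L = 64 users with at most N_max = 16 scheduled per TTI
num_actions = sum(math.comb(64, i) for i in range(1, 17))
print(f"{num_actions:.1e}")  # about 7e14, matching the figure quoted above
```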
Several recent works have attempted to solve the large discrete action space problem by discretizing the continuous-control-based DRL model. In this direction, [20] combines DDPG with KNN to solve problems with large discrete action sets (e.g., recommender systems and language models). More precisely, a KNN approximation [38] is used because of its agile search in logarithmic time. Its fundamental idea is to first generate a so-called proto continuous action (i.e., a real number in [−1, 1]) from the continuous action space DRL model. Then, KNN is used to calculate the l2-norm between the proto action and the actions in the discrete space, represented by integer numbers corresponding to different actions, sort them in ascending order, and pick the first K ones. Here, K is a system hyper-parameter. Finally, after comparing the Q values of these K discrete actions in the critic network, the one with the highest Q value is chosen as the final action. Similarly, we propose to augment the SAC model with a KNN approximation model that can map the continuous action space to a discrete one. However, the model in [20] is shown to be effective for tasks with up to one million actions, far below the number of scheduling actions encountered in a large massive MIMO network. Next, we propose an idea to scale the feasibility of the model to much larger action sets.
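The sketch below illustrates the proto-action-to-discrete-action step just described for a single output dimension; the critic is abstracted behind a `q_value` callable, and the uniform grid over [−1, 1] is an assumption of ours rather than a detail given in the paper.

```python
import numpy as np

def knn_discretize(proto: float, num_actions: int, K: int, q_value) -> int:
    """Map a proto continuous action in [-1, 1] to a discrete action index.

    proto:       scalar action produced by the continuous-action actor
    num_actions: size of the discrete action set
    K:           number of nearest neighbours passed to the critic
    q_value(a):  assumed helper returning the critic's Q value of action a
    """
    grid = np.linspace(-1.0, 1.0, num_actions)       # discrete actions embedded in [-1, 1]
    nearest = np.argsort(np.abs(grid - proto))[:K]   # K nearest actions (l2 distance)
    return int(max(nearest, key=q_value))            # keep the one with the largest Q value
```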
D. Dimension Division

One major drawback of mapping continuous actions to discrete actions is the decision accuracy loss. The reason is that, as the size of the discrete action set increases, the corresponding distance between discrete actions in the continuous domain will become extremely small. The precision of each discrete action when mapped from a continuous action space in the range [−1, 1] is equal to (1 − (−1))/2^L, where 2^L is the total number of discrete actions. When this precision is smaller than the network output precision, it will lead to decision accuracy loss. This precision loss prohibits scaling up the size of the discrete action set. In order to improve the scalability of our model to much larger action sets, i.e., a larger number of users, we propose a novel strategy that we call dimension division, where we break up the action space into multiple dimensions. As discussed in §III-A, high-dimensional tasks are generally challenging to deal with in DRL models. But here, we particularly rely on the strength of the SAC model in handling multiple dimensions. The difference in our approach is that we use this strength in a multi-dimensional discrete action space. With D dimensions, we can reduce the number of actions in each dimension from 2^L to (2^L)^{1/D} actions. As such, the mapping precision is also changed from (1 − (−1))/2^L to (1 − (−1))/(2^L)^{1/D} in each dimension. Based on this strategy, the continuous-action DRL model will generate proto actions in D dimensions. We apply the approximate KNN to each proto action to generate the K nearest discrete actions in each dimension. Finally, the critic network will pick the discrete action with the maximum Q value to form the final action (i.e., an integer number between 1 and 2^L). This final discrete action is then mapped to a specific user combination from all possible combinations of L users to be scheduled. Fig. 3 illustrates the proposed workflow. In general, to scale up the number of supported users, it is important to strike a balance between the number of dimensions and the size of each dimension. In §IV, we demonstrate that the SMART scheduler is able to perform well with a number of users as high as L = 128, whereas DDPG is unable to converge in that scenario.

Fig. 3: SMART Architecture.
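A minimal sketch of the dimension division mapping is given below: each of the D proto actions is quantized within its own dimension, and the per-dimension indices are combined, mixed-radix style, into one global action index. For clarity the sketch uses K = 1 per dimension; in the full design the K nearest candidates per dimension would be ranked by the critic as described above.

```python
import numpy as np

def divide_and_map(protos, actions_per_dim):
    """Combine D per-dimension proto actions into one global discrete action.

    protos:          iterable of D proto actions, each in [-1, 1]
    actions_per_dim: discrete actions kept per dimension, roughly (2**L)**(1/D)
    Returns an integer index in [0, actions_per_dim ** D).
    """
    grid = np.linspace(-1.0, 1.0, actions_per_dim)
    index = 0
    for p in protos:
        d_idx = int(np.argmin(np.abs(grid - p)))   # nearest discrete action in this dimension
        index = index * actions_per_dim + d_idx    # accumulate as a base-(actions_per_dim) digit
    return index
```

With D = 8 and 256 actions per dimension, as used later for the 64 × 64 network, the combined index spans 256^8 = 2^64 combinations, which comfortably covers the roughly 7 × 10^14 scheduling actions counted in §III-C.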
E. User Grouping

Previous works on DRL-based massive MIMO scheduling [18], [21] use the full channel matrix as the input to their DRL model. The size of the channel matrix is 2 × M × L. The factor of 2 denotes the real and imaginary components of the channel estimate, since neural networks are usually designed and trained for real values. As the size of the system (M, L) increases and correspondingly the input size of the DRL model grows, the model convergence becomes more difficult. In order to scale up the model to support large network sizes, the input size must be reduced. To reduce the input size, we adopt user grouping labels calculated from the inter-user channel correlation matrix to guide the DRL model.

The inter-user channel correlation matrix measures the correlation between each pair of users in the network. Specifically, it is calculated as

$$c_{i,j} = \left\langle \frac{\mathbf{h}_i}{\|\mathbf{h}_i\|_2}, \frac{\mathbf{h}_j}{\|\mathbf{h}_j\|_2} \right\rangle = \frac{\mathbf{h}_i^{H}\mathbf{h}_j}{\|\mathbf{h}_i\|_2\,\|\mathbf{h}_j\|_2} \qquad (12)$$

where h_i and h_j are the channel vectors of user i and user j in the channel matrix H, and c_{i,j} is their channel correlation.

To reduce the complexity of the channel matrix, we adopt a user grouping method similar to that of [32], as shown in Algorithm 1. The algorithm uses the inter-user channel correlation matrix calculated through equation (12) to partition users with low correlation into separate sets, where the partitioning threshold is c_th. During grouping, users in the same group (less correlated users) are assigned the same label. As discussed in §III-B, we then only need to assign a user group label to each user in the state space instead of its complete channel vector. With user grouping labels as the input of the DRL model, the state space size will be significantly reduced. As an example, in a 64 × 64 network size, at each TTI, the state of each user includes three variables: maximum achievable spectral efficiency, the total amount of transmitted data by the user, and the user group label. Thus, the total state space size is 192. However, without user grouping, the real and imaginary parts of the raw channel matrix must be fed to the DRL model separately, which leads to a state space size of 8320. Such large-scale inputs will lead to a complicated neural network structure, high computation complexity in model updating, and excessive running time (cf. §IV-C).

Algorithm 1 User Grouping Algorithm
Input: Channel matrix at TTI t: H_t, user set L, and channel correlation threshold c_th
Output: User group set G
 1: Calculate channel correlations of all UE pairs c_{i,j}, ∀ i, j ∈ L using Eq. (12)
 2: Initialize G = ∅
 3: Let L_c = L
 4: while L_c ≠ ∅ do
 5:   Randomly pick UE i ∈ L_c and add it to the empty user group G_i
 6:   Iteratively search in L_c to find all UEs whose channel correlations with all existing UEs in G_i are smaller than c_th and add them to G_i
 7:   User group G = G ∪ {G_i}
 8:   Update L_c = L_c \ G_i
 9: end while
10: return User group set G
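Algorithm 1 translates directly into a short Python routine. The sketch below is our own; in particular, thresholding on the magnitude of c_{i,j} is our assumption, since Eq. (12) yields a complex-valued correlation.

```python
import numpy as np

def user_grouping(H: np.ndarray, c_th: float):
    """Runnable sketch of Algorithm 1.

    H:    (M, L) channel matrix, one column per user.
    c_th: channel correlation threshold.
    Returns a list of sets of user indices (the user group set G).
    """
    Hn = H / np.linalg.norm(H, axis=0, keepdims=True)   # normalize each user's channel
    C = np.abs(Hn.conj().T @ Hn)                        # |c_ij| from Eq. (12)
    remaining = set(range(H.shape[1]))                  # L_c (step 3)
    groups = []                                         # G (step 2)
    while remaining:                                    # step 4
        i = remaining.pop()                             # step 5: pick a seed user
        group = {i}
        for j in sorted(remaining):                     # step 6: add low-correlation users
            if all(C[j, k] < c_th for k in group):
                group.add(j)
        remaining -= group                              # step 8
        groups.append(group)                            # step 7
    return groups                                       # step 10
```

With the threshold c_th = 0.5 listed in Table I, every pair of users inside a group is guaranteed to have a correlation magnitude below 0.5.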
F. Scheduling Across RBs

As mentioned in §II-A, user channel quality varies significantly across RBs. Consequently, the channel correlation among users varies across the RBs as well. This leads to different optimal scheduling solutions for each RB. However, the scheduling decision on each RB will affect the decisions on other RBs, particularly as it relates to rate fairness. Since the goal is to maximize both system spectral efficiency and fairness for the whole system, as expressed in equations (4) and (5), the optimal scheduling on all RBs needs to be jointly considered. One way to model this problem is to have independent SMART frameworks, as described in §III-B–§III-E, make decisions on each RB, with the additional modification that each framework uses the decisions from other frameworks running on other RBs to calculate the new fairness in its state space and the new reward, akin to the formulation of weighted rates in Eq. (5). A block diagram of such a model is depicted in Fig. 4a. This model can be regarded as a cooperative multi-agent DRL framework where each SMART framework responsible for a different RB acts as a separate agent that shares its decisions with other agents. We refer to this overall model as SMART-MA. In SMART-MA, the agents of all RBs are jointly optimized. The instantaneous spectral efficiency of each user is aggregated from all RBs and the JFI is updated based on a global user scheduling decision rather than an individual RB's decision. Consequently, the SMART agents of all RBs share the same reward function and engage in cooperative learning. However, multi-agent DRL models are known to be difficult to converge, especially as the number of agents scales up [46]. We demonstrate this by employing a multi-agent model in §IV. For fading channel models, the inter-user channel correlation across RBs will be largely random, and when dealing with a large number of RBs, it is expected that the fairness across RBs will be smoothed out. With this assumption, and given the limitation of the multi-agent model, we propose to use a fully independent model for each RB, referred to as SMART-SA and depicted in Fig. 4b. In SMART-SA, an independent SMART DRL model is implemented for each RB. Each RB possesses its own distinct state space (not depicted in the diagram) and generates a scheduled user set specific to that RB. Based on the user scheduling decision made by the model, selected users are allocated resources within the wireless environment, and the instantaneous spectral efficiency γ^total of each scheduled user can be determined. Subsequently, the accumulated amount of transmitted data and the JFI are updated in the respective fairness update block.

In §IV, we demonstrate the effectiveness of SMART-SA for a large number of RBs in getting close-to-optimal results.

Fig. 4: Multi-agent SMART-MA (a) and fully independent SMART-SA (b) frameworks for scheduling users across RBs.
IV. PERFORMANCE EVALUATION

In this section, we perform a comprehensive evaluation of our proposed scheduler design. We compare SMART with multiple different schedulers with respect to their achieved normalized spectral efficiency and JFI in various channel conditions. We also provide a comparison of the computational complexity of our DRL-based scheduler with other methods and discuss the feasibility of our scheduler in real-time 5G settings.

A. Experimental Setup

We perform our evaluations in both simulated channels as well as real-world channels measured with a massive MIMO hardware platform. For simulated wireless channels, we use the Quasi Deterministic Radio Channel Generator (QuaDRiGa) [47] software. Specifically, we generate the 3D Urban Micro (UMi) Line Of Sight (LOS) channel model. We consider two channel scenarios: static and mobile. For static channels, we consider two different modes: 1) four user clusters, and 2) random user placement. In the mobile scenario, the base station is positioned at the center of a circular area with a radius of 300 meters. Users within this circle move in various directions at different speeds, with an average speed of 2.8 m/s. The initial positions of the users are randomly assigned, and they will bounce back into the area upon reaching the boundary. We describe the experimental setup for the real-world measured channels in §IV-C4. We implement the system model in §II-A using Python. In terms of the modulation scheme, we adopt 16-QAM in our wireless channel simulator and use the Error Vector Magnitude (EVM) of the received constellation to derive the SNR as demonstrated in [48].

We run our experiments on an NVIDIA DGX A100 server [49]. Both the actor and critic networks implement neural nets with two hidden fully connected layers and ReLU activation functions. We use the Adam optimizer [50] to train our DRL model in PyTorch [51]. The most relevant parameters used in our simulations are shown in Table I.

TABLE I: Simulation and Training Parameters

  Parameter                                    Value
  Channel Model                                3GPP 3D UMi LOS
  System Bandwidth                             20 MHz
  System Carrier Frequency                     3.6 GHz
  TTI Duration                                 1 ms
  Modulation                                   16QAM
  Cell Radius                                  300 m
  UE Speed                                     0 & 2.8 m/s
  Number of BS Antennas                        16 & 64
  Number of UEs                                16 & 64
  Batch Size                                   256
  Actor Learning Rate                          5e-4
  Critic Learning Rate                         5e-4
  Alpha Learning Rate                          3e-4
  Automatic Entropy Tuning                     True
  Optimizer                                    Adam
  Episodes                                     800
  Iterations In Episode                        400
  Correlation Threshold c_th in Algorithm 1    0.5
  β in Eq. (11)                                0.5

B. Benchmarks

In order to do a thorough comparison, we implement various scheduler models as benchmarks, including classical and heuristics-based schedulers, discrete-control-based DRL schedulers, continuous-control-based DRL schedulers, and attention-mechanism-based RL schedulers.

Classical Schedulers: We consider Opt-PF, Opt-MR, an approximate PF (Approx-PF), and a heuristics-based algorithm as classical schedulers. The algorithms of Opt-PF and Opt-MR are introduced in §II-A. Given the exceedingly high computational complexity involved in employing optimal schedulers for large-scale networks, we devise a variation of an approximate Proportional Fairness (Approx-PF) scheduler in [28] that offers reduced complexity compared to the Opt-PF scheduler presented in §II-A. The algorithmic details of this particular implementation can be found in Algorithm 2. In this approach, we first calculate a weighted-rate matrix similar to Opt-PF in (5) and then select Nmax users with the highest weighted rates. Consequently, the computational complexity is reduced significantly from O(2^L) to O(2^{Nmax}). However, this is still too complex in large-scale networks and thus needs to be simplified further. Unlike the approximate scheduler described in [28], we do not consider the individual data load of each user in our work. Instead, we implement the user grouping in Algorithm 1 in this user subset and select the group with the most users. The user grouping strategy helps Approx-PF to avoid scheduling highly inter-correlated users, thereby improving overall system performance and reducing the complexity to O(N²).
Algorithm 2 Approximate Proportional Fairness (Approx-PF) Algorithm
Input: Resource block set B, channel matrix of resource block b at TTI t: H_{t,b}, and user set L
Output: Scheduled user set on resource block b: U_b
 1: Calculate the weighted rate w^t_{l,b} for all L users on resource block b at TTI t using (5)
 2: Sort and select the N users with the highest weighted rate on resource block b to construct a subset of users N_b
 3: Do user grouping in the user subset N_b as in Algorithm 1
 4: Find the user group U_b with the most users as the scheduled user set on resource block b at TTI t
 5: return U_b
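For illustration, a compact sketch of Algorithm 2 for one RB follows. It reuses the `user_grouping` sketch from Section III-E and assumes the instantaneous and accumulated rates are supplied, so it is a sketch under those assumptions rather than a faithful reproduction of our implementation.

```python
import numpy as np

def approx_pf(H_b: np.ndarray, r_inst: np.ndarray, p_acc: np.ndarray,
              N: int, c_th: float):
    """Approx-PF user selection on one resource block b.

    H_b:    (M, L) channel matrix on RB b
    r_inst: (L,) instantaneous rates; p_acc: (L,) accumulated received rates
    """
    w = r_inst / np.maximum(p_acc, 1e-12)              # step 1: weighted rates as in Eq. (6)
    candidates = np.argsort(w)[-N:]                    # step 2: N users with largest weights
    groups = user_grouping(H_b[:, candidates], c_th)   # step 3: Algorithm 1 on the subset
    best = max(groups, key=len)                        # step 4: group with the most users
    return [int(candidates[i]) for i in best]          # scheduled user set U_b
```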
As for the heuristics-based benchmark, we use the algorithm in [32]. This algorithm groups users based on their channel correlation and allocates power to the users in the selected group. It then proposes to schedule the groups in a round-robin fashion. We implement a variation of the scheduler proposed in [32]. We assume perfect power control in our model to enable a fair comparison with the modified algorithm. We refer to this benchmark algorithm as RR-UG. As we demonstrate later, this algorithm, while effective in static user scenarios, becomes ineffective in highly mobile channel scenarios where channel correlations are continuously changing. We expect a similar behavior by other heuristic methods that rely on channel correlation-based user grouping.

Discrete-control-based DRL Scheduler: There are several DRL models for discrete action spaces in the literature. We select DQN [10] and Double DQN [14] with a Prioritized Experience Replay Buffer (PERB) [52] as two representative discrete-control-based DRL algorithms. The study in [16] shows a comparison of these two models with other discrete DRL models such as ACER and A3C and shows the superior performance and convergence of our selected benchmarks. We implement both discrete-control-based DRL models as benchmarks and refer to them as PRTY-DQN and PRTY-DDQN. To balance exploration and exploitation, we adopt the epsilon-greedy algorithm in both models. For a fair comparison against other benchmarks, we tune the hyper-parameters so as to achieve the best possible performance [16], [17], [24]. Because of the simple neural network structure of PRTY-DQN and PRTY-DDQN, we adopt grid search to comprehensively identify the optimal hyper-parameters. For PRTY-DQN, we implement 2-hidden-layer neural networks with 32 neurons in each layer. We use the same settings in the main network and the target network of PRTY-DDQN. For both models, we set the same state space, action space, and reward function as our proposed scheduler.

Continuous-control-based DRL Scheduler: Similar to SAC, DDPG is also a continuous-control-based DRL model that has been used to solve optimization problems with large action sets, e.g., on massive MIMO user scheduling [20], [21]. To compare SAC with a DDPG-based scheduler, we replace the SAC module in our design with DDPG and use it as our benchmark. For fairness of comparison, this benchmark adopts the same dimension division strategy as our design to generate multi-dimensional scheduling actions, particularly in evaluating the 64 × 64 network size. Furthermore, we use the same state space and reward function as well as the epsilon-greedy algorithm for this benchmark algorithm as in our proposed scheduler.

Attention-mechanism-based RL Schedulers: We implement a pointer-network-based scheduler (PN) as proposed in [18] in an actor-critic architecture. The PN is used as the actor network, which consists of a long short-term memory (LSTM)-based encoder and decoder. The critic network is a multi-layer perceptron (MLP) and is trained using stochastic gradient descent. A limitation of this model is that the number of scheduled users needs to be fixed. Thus, in our evaluation of the PN scheduler, we set the number of scheduled users N such that M/N ≈ 4.5, which is shown to be the near-optimal ratio for the ZF beamformer [53].

Our Proposed Scheduler: We implement two variants of our scheduler: 1) a variant with the raw channel matrix as input that we call SMART-Vanilla, and 2) a variant with user grouping labels as input (as described in §III-E) that we simply call SMART.

In our evaluations, the Opt-PF scheduler serves as the optimal benchmark for fairness while the Opt-MR scheduler is optimal for spectral efficiency. For thoroughness, we first rule out the discrete DRL-based schedulers, i.e., DQN and Double DQN, due to their inability to scale to large network sizes. Second, we compare the remaining benchmarks in a medium 16 × 16 network size and in different channel conditions. This allows comparison of the AI-based benchmarks with the Opt-PF and Opt-MR schedulers when they are still in a computationally feasible range. Lastly, we increase the size of the network to 64 × 64, which we consider a real-world network size. In this network size, both the Opt-PF and Opt-MR schedulers become computationally infeasible and thus, we only compare our proposed schedulers with PN, DDPG, and RR-UG.

C. Results

1) Model Training and Convergence

We trained the SMART model, in a 64 × 64 network size, for 800 epochs with 400 iterations in each epoch. To ensure model convergence and learning performance, we divide the action space into 8 dimensions with 256 actions in the action set of each dimension, as discussed in §III-C. The training takes about five hundred epochs, which is when the DRL model converges. During the training process, we employ the epsilon-greedy algorithm to effectively manage the trade-off between exploration and exploitation. This is achieved by selecting random actions or utilizing learned actions that yield the highest reward. The value of epsilon denotes the probability of selecting random actions for exploration purposes. Initially, we set epsilon to 1, and gradually decrease it to zero over a span of five hundred epochs.
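The epsilon schedule described above can be expressed as a simple linear decay; the linear shape is our assumption, since the text only states the endpoints and the five-hundred-epoch span.

```python
def epsilon(epoch: int, decay_epochs: int = 500) -> float:
    """Probability of taking a random (exploratory) action at a given training epoch."""
    return max(0.0, 1.0 - epoch / decay_epochs)
```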
We also trained SMART for a 128 × 128 network. To deal
with this extremely large action set, we break it down into 16
dimensions with 256 actions in each dimension for sufficient
decision accuracy. With these parameters, we find that our
DRL model can still converge. Conversely, all other RL-based
benchmarks, except PN, fail to converge in this scenario.
However, as we show later, the training and inference time
for PN is significantly larger and its performance in terms of
fairness is inferior to our scheduler. It is important to highlight
that SMART-Vanilla cannot converge in networks of this size (a) (b)
either due to the excessive state space. This observation further Fig. 5: Spectral Efficiency and JFI comparison of SMART
emphasizes the motivation behind incorporating user grouping with DQN and Double DQN in user mobility scenario and
in SMART. 4 × 4 network size.
Convergence of PRTY-DQN and PRTY-DDQN: Discrete-
based DRL is intuitively a suitable choice to deal with dis-
crete combinatorial optimization problems, such as resource channel matrix as in SMART-Vanilla simplifies neural network
scheduling, by modeling them as MDPs. However, in problems with large action sets, the discrete-based DRL model has been shown to be unable to converge during the training process [18], [54], an effect known as the action dimension disaster [18]. We also demonstrate this effect by training PRTY-DQN and PRTY-DDQN on multiple network sizes. Our experiments show that the largest network size for which these models can converge is 4 × 4 with Nmax = 2. In this configuration, the size of the action set is 10 (the C(4,1) + C(4,2) = 4 + 6 candidate user subsets).

2) Performance Comparison in Various Network Sizes

In the testing phase, we run our simulation environment for an additional 400 TTIs in the same cell and use the trained model to schedule users while recording the spectral efficiency and the JFI values across TTIs. For a fair comparison, we use the exact same generated channels as input to all benchmarks. It is important to note that partial or outdated channel information could impair the performance of the resource scheduler, particularly in scenarios involving high-speed mobility. This impacts any system that relies on channel information for scheduling decisions and is thus beyond the scope of our work. Nevertheless, in such cases, complementary methods that perform channel prediction based on the partial or outdated channel information, such as the ones proposed in [55]–[57], can be used to enhance the performance of the scheduler.

In the following, we provide evaluation results of the various benchmarks for multiple network sizes. For each network size, we plot the average spectral efficiency and JFI over all TTIs. We also display error bars in each plot indicating the minimum and maximum values across TTIs.

Small network size: We consider the 4 × 4 network configuration in a mobile scenario to compare the performance of PRTY-DQN and PRTY-DDQN with our proposed scheduler. Fig. 5 shows that PRTY-DDQN outperforms PRTY-DQN and SMART-Vanilla on both spectral efficiency and JFI. This is due to the decision accuracy loss imposed by mapping the SAC output from continuous space to discrete space in our scheduler, as discussed in §III-C. However, the limited scalability of PRTY-DDQN makes it impractical to use in real-world network sizes. Importantly, we observe that the performance of SMART is almost the same as that of SMART-Vanilla. This is an important finding since it shows that using user grouping labels as input to our model instead of the raw channel matrix simplifies the model structure while not impairing model performance.
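The continuous-to-discrete mapping mentioned above can be made concrete with a small, self-contained sketch. It only illustrates the nearest-neighbor lookup idea of [20]: retrieve the K valid user-selection vectors closest to the actor's continuous output and keep the one a critic scores highest. The toy sizes, the placeholder critic, and the function names below are assumptions for illustration, not the actual SMART implementation.

```python
import numpy as np
from itertools import combinations

L, N_max, K = 4, 2, 3          # users, max co-scheduled users, neighbors kept

# Enumerate every valid discrete action as a binary user-selection vector.
actions = []
for k in range(1, N_max + 1):
    for subset in combinations(range(L), k):
        a = np.zeros(L)
        a[list(subset)] = 1.0
        actions.append(a)
actions = np.stack(actions)    # 10 actions for L = 4, N_max = 2

def knn_discretize(proto_action, actions, critic, k=K):
    """Keep the k discrete actions closest to the continuous proto-action
    (Euclidean distance) and return the one the critic scores highest."""
    order = np.argsort(np.linalg.norm(actions - proto_action, axis=1))
    candidates = actions[order[:k]]
    return candidates[np.argmax([critic(a) for a in candidates])]

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, size=L)        # stand-in per-user utility weights
critic = lambda a: float(a @ w)          # placeholder for a learned Q(s, a)
proto = rng.uniform(0.0, 1.0, size=L)    # stand-in for the continuous SAC output
print(knn_discretize(proto, actions, critic))
```

The same enumeration grows combinatorially with the number of users, which is why purely discrete formulations such as PRTY-DQN and PRTY-DDQN stop converging beyond small networks, and why SMART instead relies on the SAC-plus-KNN combination to manage the large action space.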
Medium network size: For a thorough comparison of all the other benchmarks, we consider the case of a medium 16 × 16 network size with Nmax = 4. We only compare the benchmarks with SMART-Vanilla for a fair comparison with the other AI-based schedulers, which use the raw channel matrix as input. To be able to reason about the performance of each scheduler, we start with a toy network scenario where the users are static and placed in four clusters (4-cluster). The users in each cluster share the same scatterers and experience similar small-scale fading, and thus their channel vectors are highly correlated. Fig. 6 shows the spectral efficiency and JFI results in the four-cluster channel mode. It is evident from Fig. 6a that SMART-Vanilla performs very close to the Opt-PF scheduler, which shows SMART-Vanilla is able to converge to the Opt-PF solution almost perfectly. In terms of JFI, Fig. 6b shows that SMART-Vanilla closely follows the Opt-PF scheduler as well. Both schedulers underperform the Opt-MR scheduler in terms of spectral efficiency, but the Opt-MR scheduler does not do well with respect to JFI, as expected, since it only optimizes the spectral efficiency. Interestingly, Fig. 6a also shows that the DDPG-based scheduler significantly underperforms SMART-Vanilla. This shows that DDPG fails to explore widely enough because of its sample inefficiency and therefore gets stuck in a local optimum. Lastly, we observe that RR-UG achieves good spectral efficiency and is almost close to SMART-Vanilla. This is expected, as the user grouping algorithm groups the users into exactly four groups based on the four clusters. Since the users do not move, RR-UG will continue to serve each group at a time. The results also show that SMART-Vanilla can learn the inter-user correlation well, despite using the raw channel matrix from each user. PN is able to achieve near-optimal spectral efficiency but undesirable JFI. The reason is that PN cannot deal with varying state representations of the input [58]. Specifically, sequentially selecting the users will affect the fairness in the state space of the MDP model. Therefore, PN fails to optimize the JFI, while still performing well in terms of achieved spectral efficiency.

Fig. 6: Spectral Efficiency and JFI comparison of SMART and existing methods in 16 × 16 network size and Nmax = 4 in 4-clusters topology.

Figs. 7a and 7c show the normalized spectral efficiency for random placement of static users in the cell and mobile users moving in random directions within the cell, respectively. In both scenarios, we observe that SMART-Vanilla still performs very closely to the Opt-PF scheduler, while the DDPG scheduler significantly underperforms SMART-Vanilla. The PN performance also slightly drops compared to the 4-cluster scenario. This can be attributed to the limitation of this scheduler with respect to its predefined number of selected users. Note that in the 4-cluster scenario, the predefined number of scheduled users for PN is exactly the same as the number of users in each user group, where users have very low correlation. However, in the random placement scenario, this condition does not necessarily hold, and the number of scheduled users by PN could be smaller or larger than the optimal set of users. The PN performance gets worse in the mobility scenario since the user grouping is changing over time. For instance, PN could select user sets with high correlation in most cases.

Fig. 7: Spectral Efficiency and JFI comparison of SMART and existing methods in 16 × 16 network size and Nmax = 4 in static random user topology (a) and (b), and user mobility scenario (c) and (d).

RR-UG achieves relatively good performance in the random placement topology, but it does not achieve the same level of performance as in the 4-cluster channel mode. The reason is that in setups with random user locations, the user groups could include a larger number of users than Nmax = 4, and thus the groups have to be broken into smaller subgroups to be scheduled sequentially. This impairs the performance of RR-UG. In the mobility scenario, the performance of RR-UG drops even more. This is due to the variations in channels and user groupings caused by mobility in each TTI. It shows that while RR-UG might be a favorable scheduler in static scenarios (due to its lower computational complexity, as we show later), in mobility scenarios it does not perform that well. In Figs. 7b and 7d, we see that SMART-Vanilla and DDPG achieve high fairness values. A good fairness result for DDPG is expected, as fairness is accounted for in the reward function. Opt-MR and RR-UG do not achieve high fairness in either scenario. For RR-UG, the fairness drops since the user groupings change continuously, and thus rate fairness cannot be met efficiently despite the time fairness due to the Round-Robin scheduling of groups. It is evident that PN performs very poorly with respect to JFI, as discussed earlier.

Real-world network size: We consider a more realistic network size with a 64-antenna massive MIMO base station¹ at the center of the cell. We also consider L = 64 connected users, which is also a realistic number in small cells [5]. In this case, we assume Nmax = 64, which means the scheduler can choose to beamform to up to all 64 users in one TTI. In this network size, the complexity of calculating the results for Opt-MR and Opt-PF is too high. Thus, we include Approx-PF as a benchmark instead of Opt-PF, along with the results for SMART-Vanilla, SMART, PN, DDPG, and RR-UG. As shown in Figs. 8a and 8c, SMART-Vanilla outperforms PN, DDPG, RR-UG, and Approx-PF. By foregoing the exhaustive search, Approx-PF aims to reduce computational complexity. However, we can see that its performance falls short compared to SMART. Similar to our earlier results for the medium network size, the performance of RR-UG is close to SMART-Vanilla in static random user placement but drops significantly in the mobility scenario. To enable DDPG to converge in this scenario, we applied the dimension division presented in §III-C to its implementation. However, DDPG is unable to perform well in multi-dimensional action sets, as discussed earlier. This explains the observation that DDPG does not perform well in terms of spectral efficiency. As we observed in the small and medium networks, the performance of SMART is comparable to that of SMART-Vanilla in both channel scenarios. This demonstrates the effectiveness of using user grouping labels in the state space of SMART.

All schedulers, except PN, achieve high fairness in the static random user placement scenario. In the mobility scenario, the fairness for RR-UG also drops significantly due to the varying user groupings across TTIs. Here, PN has the worst JFI for the same reason as we mentioned for the medium network size.

¹ Most commercial deployments of massive MIMO include 64-antenna base stations.
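Since JFI is the fairness metric in every comparison above, a one-line reference implementation may be useful. This is simply Jain's index from [45] applied to per-user average rates; the example values are arbitrary and not taken from our experiments.

```python
import numpy as np

def jain_fairness(rates):
    """Jain's fairness index [45]: (sum r_i)^2 / (L * sum r_i^2).
    Ranges from 1/L (one user receives all the rate) to 1.0 (equal rates)."""
    r = np.asarray(rates, dtype=float)
    return float(r.sum() ** 2 / (r.size * np.square(r).sum()))

print(jain_fairness([1.0, 1.0, 1.0, 1.0]))   # 1.0   -> perfectly fair
print(jain_fairness([4.0, 0.0, 0.0, 0.0]))   # 0.25  -> only one user served
```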
Fig. 8: Spectral efficiency and JFI comparison of SMART and existing methods in 64 × 64 network size in random user topology (a) and (b), and user mobility scenario (c) and (d).

Fig. 9: Spectral Efficiency and JFI comparison of SMART and existing methods in 8 × 8 network size and Nmax = 4 with 2 Resource Blocks in static random user topology (a) and (b), and user mobility scenario (c) and (d).

3) Multi-RB Scheduling Performance

Here, we consider the multi-RB scenario and evaluate the performance of our model presented in §III-F. As discussed, multi-agent DRL models are generally difficult to train to convergence. In fact, our SMART-MA model only converged with 2 RBs (B = 2) when M = 8, L = 8, and Nmax = 4. Thus, we use this configuration to demonstrate the efficacy of SMART-SA with respect to SMART-MA. The computational complexities of Opt-PF and Opt-MR were also acceptable in this configuration, as presented in §II-A, and thus we include them in the evaluation along with RR-UG. Since we showed the underwhelming performance of DDPG and PN in the single-RB case, we exclude them from this evaluation. Fig. 9 shows the experiment results for B = 2. It is evident that SMART-SA outperforms SMART-MA on spectral efficiency but has a slightly lower JFI. The reason is that SMART-SA tries to maximize spectral efficiency on each RB and sacrifices fairness, as opposed to SMART-MA, which balances the two metrics across RBs. SMART-SA performs much better in terms of both JFI and spectral efficiency compared to RR-UG. For B > 2, SMART-MA, Opt-PF, and Opt-MR become infeasible. However, to demonstrate the performance of SMART-SA, we evaluate it for B = 100 with a 64 × 64 network size and compare it with RR-UG. The evaluation results are shown in the simulation column of Table II. From the results, it is evident that a large number of RBs does not degrade JFI in SMART-SA, while desirable spectral efficiency is still maintained. It also reaffirms our previous finding on the low performance of RR-UG in the mobility scenario.

4) Real-World Data Evaluation

To evaluate our proposed scheduler in real-world environments, we conducted a massive MIMO channel measurement experiment in an indoor setting on the Rice University campus. We used a 64-antenna RENEW [59] software-defined massive MIMO base station and seven software-defined clients in a large open area inside a building hall. We fixed six of the clients in a circle, 15 m away from the base station. The seventh node was placed on a robot, and we moved the robot across the hall starting from the location of the first client to the last. A drawing of the BS and client placements is shown in Fig. 10. We moved the robot along the path at different speeds, i.e., 0.5 m/s, 1 m/s, and 2 m/s. The mobile node's antenna was facing the base station in all the experiments (LoS channel). We repeated the experiments to measure both LoS and NLoS channels for the fixed clients. In each measurement, we transmitted time-orthogonal uplink pilots from all clients to the BS. The uplink pilots were based on the 802.11 LTS OFDM signal, which contains 52 non-zero subcarriers. We consider each subcarrier as an RB in our evaluation, i.e., B = 52. Based on the collected real-world dataset, we train and evaluate the performance of SMART in the 64 × 7 MIMO configuration with 52 RBs in a slow-speed mobility scenario.
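The measured per-subcarrier channels can be screened for inter-user correlation, which is what the user grouping behind RR-UG and SMART's grouping labels relies on. The sketch below is a simplified, hypothetical stand-in for Algorithm 1, shown only to make the idea concrete: a user joins a group only if its pairwise channel correlation with every existing member stays below a threshold, so that each group can be co-scheduled.

```python
import numpy as np

def correlation_matrix(H):
    """Pairwise correlation |h_i^H h_j| / (||h_i|| ||h_j||) for an M x L
    channel matrix H with one column per user."""
    Hn = H / np.linalg.norm(H, axis=0, keepdims=True)
    return np.abs(Hn.conj().T @ Hn)

def greedy_grouping(H, corr_threshold=0.5):
    """Greedy illustration (not the paper's Algorithm 1): place each user in
    the first group where its correlation with all members is low enough."""
    R = correlation_matrix(H)
    groups = []
    for u in range(H.shape[1]):
        for g in groups:
            if all(R[u, v] < corr_threshold for v in g):
                g.append(u)
                break
        else:
            groups.append([u])
    return groups

# Toy example: 8 antennas, 6 users with i.i.d. random channels.
rng = np.random.default_rng(1)
H = rng.standard_normal((8, 6)) + 1j * rng.standard_normal((8, 6))
print(greedy_grouping(H))
```

The structure Algorithm 1 actually extracts from the measured channels is discussed after Table II below.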
TABLE II: Spectral Efficiency and JFI comparison of SMART and RR-UG with multiple RBs in simulation discussed in §IV-C3 and with real-world data discussed in §IV-C4

Scenario                                 | Normalized System Spectral Efficiency (SMART-SA / RR-UG) | JFI (SMART-SA / RR-UG)
Simulation (B = 100), Random Placement   | 0.500 / 0.254                                             | 0.977 / 0.940
Simulation (B = 100), Mobility Scenario  | 0.400 / 0.063                                             | 0.950 / 0.696
Real-world (B = 52), LoS Slow-speed      | 0.713 / 0.662                                             | 0.996 / 0.952
Real-world (B = 52), LoS High-speed      | 0.670 / 0.584                                             | 0.995 / 0.951
Real-world (B = 52), NLoS Slow-speed     | 0.488 / 0.481                                             | 0.986 / 0.980
Fig. 10: Topology of the real-world indoor experiment.

Using these datasets, we evaluate the performance of SMART. Due to convergence issues and the excessive computational complexity of the other schedulers for B > 2, as discussed in §II-A, we only compare SMART-SA with RR-UG. The results, listed in Table II, show that RR-UG underperforms SMART-SA in both spectral efficiency and JFI. More importantly, SMART-SA is capable of achieving near-optimal (i.e., about 0.996) JFI, which demonstrates the effectiveness of SMART-SA when applied to multiple RBs. However, we can anticipate that the RR-UG performance will get worse as the number of mobile users increases, which is consistent with the results of the mobility scenarios in the medium and real network size experiments. By running Algorithm 1 on the datasets, we observe just one or two user groups in most TTIs. Thus, RR-UG schedules all seven clients in one or sometimes two TTIs. Therefore, RR-UG is rather competitive with SMART-SA here. For the purpose of showing the generality of our model, we use the model trained on the LoS slow-speed dataset and test it in the LoS high-speed mobility scenario. The results in Table II demonstrate the adaptability of SMART-SA to different mobility scenarios. Compared with the slow-speed mobility scenario, it is obvious that the performance gap between SMART and RR-UG in the high-speed scenario is larger. This is because high speed makes the channel conditions and the inter-user channel correlation vary more quickly than low speed. Faster-varying inter-user channel correlation results in quicker variations of the user grouping, which makes it challenging for RR-UG to adapt fast enough. However, SMART is capable of dealing with this rapid change. For comprehensiveness, we also test the trained model on the NLoS slow-speed topology. The results in Table II show SMART-SA's superiority over RR-UG and its generality on real-world data, albeit not as good as in the LoS high-speed case.

5) Computational Complexity

We measure the average wall-clock time per TTI for all the schedulers discussed in §IV-C2. For comparison fairness, we run all implementations on a single CPU core on the NVIDIA DGX server. The runtime values are listed in Table III for the three network sizes considered in §IV-C2. The results show that the runtimes of the schedulers are widely different and also vary with the network size. For Opt-MR and Opt-PF, the runtime increases exponentially with the network size and thus is not listed for network sizes beyond 16 × 16. Even though Approx-PF is feasible in real-world-size networks with much less complexity than Opt-PF, it still takes about 20× longer than SMART to execute. Regarding the other schedulers, the runtime seems to increase linearly. Both DDPG and SMART-Vanilla show similar results. Comparing the SMART and SMART-Vanilla results shows that using user grouping labels instead of the raw channel matrix reduces the runtime of the model by up to 50%. Tuning hyper-parameters to achieve the best performance for both SMART and SMART-Vanilla, SMART has three fewer hidden layers and half the number of neurons in each layer while remaining on par with the performance of SMART-Vanilla. User grouping, on the other hand, requires only an additional 3.5 ms in the 64 × 64 network size, a negligible portion of the total runtime. The runtime of PN is about 1.6× and 4× that of SMART-Vanilla in 16 × 16 and 64 × 64, respectively. This is due to the fact that pointer networks are auto-regressive and make decisions sequentially, and thus have slow inference. RR-UG shows the smallest runtime among all, but it is not as spectrally efficient as SMART, especially in mobility scenarios.

TABLE III: Wall-clock time in seconds per TTI

System Configuration | Opt-MR | Opt-PF | Approx-PF | RR-UG  | DDPG  | PN    | SMART-Vanilla | SMART
16 × 16              | 0.15   | 0.21   | -         | 0.0013 | 0.034 | 0.059 | 0.036         | 0.024
64 × 64              | -      | -      | 0.604     | 0.0043 | 0.058 | 0.235 | 0.057         | 0.030
128 × 128            | -      | -      | -         | -      | -     | -     | -             | 0.071
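The wall-clock figures reported in Table III are averages over TTIs. A minimal timing harness of the following form is enough to reproduce that style of measurement; the scheduler callable, the dummy channel matrices, and the TTI count are placeholders rather than the benchmarked implementations, which were run on a single CPU core of the NVIDIA DGX server.

```python
import time
import numpy as np

def avg_wallclock_per_tti(scheduler_step, channels_per_tti):
    """Average wall-clock seconds per TTI for one scheduler callable."""
    start = time.perf_counter()
    for H in channels_per_tti:
        scheduler_step(H)                      # one scheduling decision per TTI
    return (time.perf_counter() - start) / len(channels_per_tti)

# Toy stand-in: "schedule" the 4 users with the largest channel norms.
rng = np.random.default_rng(2)
dummy_scheduler = lambda H: np.argsort(np.linalg.norm(H, axis=0))[-4:]
ttis = [rng.standard_normal((64, 64)) for _ in range(400)]
print(f"{avg_wallclock_per_tti(dummy_scheduler, ttis):.6f} s per TTI")
```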
D. Discussion and Future Work

The results presented earlier offer good insights into the performance and computational complexity of the proposed SMART scheduler with respect to the existing methods. However, an important question is whether SMART can be deployed to operate in time-stringent 5G-NR systems. For a realistic network size, Table III shows that SMART takes as much as 30 ms to run an iteration, 30× longer than one TTI in the least time-stringent mode of 5G-NR [31]. This may seem problematic for the adoption of SMART. To investigate this, we run an experiment in a mobility scenario. We first train SMART offline as before and test the trained model on the testing dataset without online updates to the model. We compare the spectral efficiency results for the offline-trained model with the previously presented results that include the online updates. The results are shown in Fig. 11. We observe that, even when we use the offline-trained model with no online updates, the performance is remarkably close to when the model is continuously updated. The performance can get even closer when we do updates every few tens of TTIs. This finding means that we only need to consider the inference time of the model as the scheduling decision time. For the 16 × 16 and 64 × 64 network sizes, the inference times for SMART are 5.4 and 8.7 ms, respectively. Running the model on a single GPU core on the NVIDIA DGX A100 server reduces the inference time values to 1.2 and 1.6 ms, respectively. The inference runtime values can be further reduced to sub-millisecond levels, as required in 5G-NR, by a more efficient implementation, such as with the CUDA [60] framework and parallelizing the DRL model on several GPU cores. More importantly, the reassuring performance of SMART-SA, demonstrated in §IV-C3, shows that we can get similar runtime values for 100s of RBs, as its architecture allows us to fully parallelize it on different GPU cores.

Fig. 11: Evaluation of SMART with and without a model (online vs. offline) update in the user mobility scenario and 64 × 64 network size.

Lastly, we have only considered saturated traffic for each user. A more generic design should consider the incoming traffic model as well as the quality of service (QoS) requirements, e.g., data rate and latency, for each user. Formulating the scheduling problem and formally solving it using optimization techniques or heuristics-based approximation is a difficult task. We believe AI-based methods such as the one proposed in this paper provide a more promising avenue for solving the generic case if enough training data exists. We leave the design of a more comprehensive scheduler that considers parameters in the higher layers of the network, such as traffic models and QoS constraints, as future work.

V. CONCLUSION

In this paper, we presented SMART, a resource scheduler for massive MIMO networks based on the soft actor-critic DRL model. We demonstrated the effectiveness of our scheduler in achieving both spectral efficiency and fairness very close to those of the optimal proportionally fair scheduler. We also showed that our model outperforms state-of-the-art massive MIMO schedulers in all scenarios, and particularly in mobility scenarios. We removed the need for raw channel matrices in training our DRL model by utilizing a user grouping algorithm based on the inter-user correlation matrix and, thus, significantly reduced the complexity of our model. We also provided guidelines as to how our scheduling model can be deployed in time-stringent 5G-NR systems.

REFERENCES

[1] L. Li, M. Pal, and Y. R. Yang, "Proportional fairness in multi-rate wireless LANs," in IEEE INFOCOM 2008 - The 27th Conference on Computer Communications, 2008, pp. 1004–1012.
[2] S. Huang, H. Yin, J. Wu, and V. C. M. Leung, "User selection for multiuser MIMO downlink with zero-forcing beamforming," IEEE Transactions on Vehicular Technology, vol. 62, no. 7, pp. 3084–3097, 2013.
[3] K. Ko and J. Lee, "Multiuser MIMO user selection based on chordal distance," IEEE Transactions on Communications, vol. 60, no. 3, pp. 649–654, 2012.
[4] N. Prasad, H. Zhang, H. Zhu, and S. Rangarajan, "Multiuser scheduling in the 3GPP LTE cellular uplink," IEEE Transactions on Mobile Computing, vol. 13, no. 1, pp. 130–145, 2014.
[5] C.-M. Chen, Q. Wang, A. Gaber, A. P. Guevara, and S. Pollin, "User scheduling and antenna topology in dense massive MIMO networks: An experimental study," IEEE Transactions on Wireless Communications, vol. 19, no. 9, pp. 6210–6223, 2020.
[6] A. Chassein, M. Goerigk, A. Kasperski, and P. Zieliński, "Approximating combinatorial optimization problems with the ordered weighted averaging criterion," European Journal of Operational Research, vol. 286, no. 3, pp. 828–838, 2020.
[7] R. Bellman, "A Markovian decision process," Journal of Mathematics and Mechanics, vol. 6, no. 5, pp. 679–684, 1957. [Online]. Available: http://www.jstor.org/stable/24900506
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
[9] A. Dargazany, "DRL: Deep reinforcement learning for intelligent robot control – concept, literature, and future," 2021.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] P. Lissa, C. Deane, M. Schukat, F. Seri, M. Keane, and E. Barrett, "Deep reinforcement learning for home energy management system control," Energy and AI, vol. 3, p. 100043, 2021.
[12] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," arXiv preprint arXiv:1611.09940, 2016.
[13] K. Li, T. Zhang, R. Wang, Y. Wang, Y. Han, and L. Wang, "Deep reinforcement learning for combinatorial optimization: Covering salesman problems," IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 13 142–13 155, 2022.
[14] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," 2016.
[16] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, "Sample efficient actor-critic with experience replay," 2017.
[17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[18] L. Chen, F. Sun, K. Li, R. Chen, Y. Yang, and J. Wang, "Deep reinforcement learning for resource allocation in massive MIMO," in 2021 29th European Signal Processing Conference (EUSIPCO), 2021, pp. 1611–1615.
[19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015. [Online]. Available: https://arxiv.org/abs/1509.02971
[20] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," 2015. [Online]. Available: https://arxiv.org/abs/1512.07679
[21] X. Guo, Z. Li, P. Liu, R. Yan, Y. Han, X. Hei, and G. Zhong, "A novel user selection massive MIMO scheduling algorithm via real time DDPG," in GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2020, pp. 1–6.
[22] H. Chen, Y. Liu, Z. Zheng, H. Wang, X. Liang, Y. Zhao, and J. Ren, "Joint user scheduling and transmit precoder selection based on DDPG for uplink multi-user MIMO systems," in 2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall). IEEE, 2021, pp. 1–5.
[23] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, "Soft actor-critic algorithms and applications," 2018. [Online]. Available: https://arxiv.org/abs/1812.05905
[24] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018. [Online]. Available: https://arxiv.org/abs/1801.01290
[25] "5G NR physical channels and modulation (3GPP TS 38.211 version 16.2.0 release 16)," https://www.etsi.org/deliver/etsi_ts/138200_138299/138211/16.02.00_60/ts_138211v160200p.pdf, accessed: 2023-02-13.
[26] Y. Chen, R. Yao, H. Hassanieh, and R. Mittal, "Channel-aware 5G RAN slicing with customizable schedulers."
[27] V. Lau, "Proportional fair space-time scheduling for wireless communications," IEEE Transactions on Communications, vol. 53, no. 8, pp. 1353–1360, 2005.
[28] P. R. M., M. R., A. Kumar, and K. Kuchi, "Downlink resource allocation for 5G-NR massive MIMO systems," in 2021 National Conference on Communications (NCC), 2021, pp. 1–6.
[29] C. Blair, "Theory of linear and integer programming (Alexander Schrijver)," SIAM Review, vol. 30, no. 2, pp. 325–326, 1988. [Online]. Available: https://doi.org/10.1137/1030065
[30] H. Liu, H. Gao, S. Yang, and T. Lv, "Low-complexity downlink user selection for massive MIMO systems," IEEE Systems Journal, vol. 11, no. 2, pp. 1072–1083, 2017.
[31] Y. Chen, Y. Wu, Y. T. Hou, and W. Lou, "mCore: Achieving sub-millisecond scheduling for 5G MU-MIMO systems," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
[32] H. Yang, "User scheduling in massive MIMO," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[33] J. Shi, W. Wang, J. Wang, and X. Gao, "Machine learning assisted user-scheduling method for massive MIMO system," in 2018 10th International Conference on Wireless Communications and Signal Processing (WCSP), 2018, pp. 1–6.
[34] G. Bu and J. Jiang, "Reinforcement learning-based user scheduling and resource allocation for massive MU-MIMO system," in 2019 IEEE/CIC International Conference on Communications in China (ICCC), 2019, pp. 641–646.
[35] V. H. L. Lopes, C. V. Nahum, R. M. Dreifuerst, P. Batista, A. Klautau, K. V. Cardoso, and R. W. Heath, "Deep reinforcement learning-based scheduling for multiband massive MIMO," IEEE Access, vol. 10, pp. 125 509–125 525, 2022.
[36] C.-W. Huang, I. Althamary, Y.-C. Chou, H.-Y. Chen, and C.-F. Chou, "A DRL-based automated algorithm selection framework for cross-layer QoS-aware scheduling and antenna allocation in massive MIMO systems," IEEE Access, vol. 11, pp. 13 243–13 256, 2023.
[37] Z. Zhao, Y. Liang, and X. Jin, "Handling large-scale action space in deep Q network," in 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), 2018, pp. 93–96.
[38] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227–2240, 2014.
[39] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine, "Learning to walk via deep reinforcement learning," 2018. [Online]. Available: https://arxiv.org/abs/1812.11103
[40] P. Christodoulou, "Soft actor-critic for discrete action settings," 2019. [Online]. Available: https://arxiv.org/abs/1910.07207
[41] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015. [Online]. Available: https://arxiv.org/abs/1509.02971
[42] B. Eysenbach and S. Levine, "Maximum entropy RL (provably) solves some robust RL problems," arXiv preprint arXiv:2103.06257, 2021.
[43] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[44] X. Hao, H. Mao, W. Wang, Y. Yang, D. Li, Y. Zheng, Z. Wang, and J. Hao, "Breaking the curse of dimensionality in multi-agent state space: A unified agent permutation framework," 2022. [Online]. Available: https://arxiv.org/abs/2203.05285
[45] R. Jain, D. Chiu, and W. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer systems," CoRR, vol. cs.NI/9809099, 1998. [Online]. Available: https://arxiv.org/abs/cs/9809099
[46] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multi-agent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[47] S. Jaeckel, L. Raschkowski, K. Börner, and L. Thiele, "QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials," IEEE Transactions on Antennas and Propagation, vol. 62, no. 6, pp. 3242–3256, 2014.
[48] R. A. Shafik, M. S. Rahman, A. R. Islam, and N. S. Ashraf, "On the error vector magnitude as a performance metric and comparative analysis," in 2006 International Conference on Emerging Technologies, 2006, pp. 27–31.
[49] "NVIDIA DGX Station A100," https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-a100-datasheet.pdf, accessed: 2022-07-27.
[50] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[51] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[52] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2016.
[53] E. Björnson, E. G. Larsson, and T. L. Marzetta, "Massive MIMO: ten myths and one critical question," IEEE Communications Magazine, vol. 54, no. 2, pp. 114–123, 2016.
[54] T. V. de Wiele, D. Warde-Farley, A. Mnih, and V. Mnih, "Q-learning in enormous action spaces via amortized approximate maximization," 2020.
[55] C. Wu, X. Yi, Y. Zhu, W. Wang, L. You, and X. Gao, "Channel prediction in high-mobility massive MIMO: From spatio-temporal autoregression to deep learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 1915–1930, 2021.
[56] Y. Han, S. Jin, C.-K. Wen, and X. Ma, "Channel estimation for extremely large-scale massive MIMO systems," IEEE Wireless Communications Letters, vol. 9, no. 5, pp. 633–637, 2020.
[57] C.-J. Chun, J.-M. Kang, and I.-M. Kim, "Deep learning-based channel estimation for massive MIMO systems," IEEE Wireless Communications Letters, vol. 8, no. 4, pp. 1228–1231, 2019.
[58] M. Nazari, A. Oroojlooy, L. V. Snyder, and M. Takác, "Deep reinforcement learning for solving the vehicle routing problem," arXiv preprint arXiv:1802.04240, 2018.
[59] R. Doost-Mohammady, O. Bejarano, L. Zhong, J. R. Cavallaro, E. Knightly, Z. M. Mao, W. W. Li, X. Chen, and A. Sabharwal, "RENEW: Programmable and observable massive MIMO networks," in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 1654–1658.
[60] "CUDA Toolkit Documentation," https://docs.nvidia.com/cuda/, accessed: 2022-08-01.