A Deep Reinforcement Learning-Based Resource Scheduler for Massive MIMO Networks
Abstract—The large number of antennas in massive MIMO systems allows the base station to communicate with multiple users at the same time and frequency resource with multi-user beamforming. However, highly correlated user channels could drastically impede the spectral efficiency that multi-user beamforming can achieve. As such, it is critical for the base station to schedule a suitable group of users in each time and frequency resource block to achieve maximum spectral efficiency while adhering to fairness constraints among the users. In this paper, we consider the resource scheduling problem for massive MIMO systems, whose optimal solution is known to be NP-hard. Inspired by recent achievements in deep reinforcement learning (DRL) in solving problems with large action sets, we propose SMART, a dynamic scheduler for massive MIMO based on the state-of-the-art Soft Actor-Critic (SAC) DRL model and the K-Nearest Neighbors (KNN) algorithm. Through comprehensive simulations using realistic massive MIMO channel models as well as real-world datasets from channel measurement experiments, we demonstrate the effectiveness of our proposed model in various channel conditions. Our results show that our proposed model performs very close to the optimal proportionally fair (Opt-PF) scheduler in terms of spectral efficiency and fairness, with more than one order of magnitude lower computational complexity in medium network sizes where Opt-PF is computationally feasible. Our results also show the feasibility and high performance of our proposed scheduler in networks with a large number of users and resource blocks.

Index Terms—Massive MIMO, Resource Scheduling, Deep Reinforcement Learning.

I. INTRODUCTION

… from multiple users becomes challenging when their channels are correlated. In networks with high user mobility, the channels of individual users and their correlations with other users within each RB are rapidly fluctuating. This dynamic nature of channel characteristics substantially increases the challenges associated with achieving optimal resource scheduling for massive MIMO networks. Specifically, fair scheduling of radio resources while maximizing spectral efficiency is essential in real deployments. The formulation of the optimal Proportionally Fair (Opt-PF) scheduling problem typically results in an integer linear programming (ILP) problem with an NP-hard solution [1]. The large complexity associated with solving an ILP, when the number of users and resource blocks is large, prohibits designing optimal yet computationally feasible schedulers that can work in the time-stringent 5G and beyond standards. There is a large body of work [2]–[5] that designs heuristics or approximation algorithms with low complexity to optimize the spectral efficiency of the network. However, these works either do not evaluate fairness at all or demonstrate poor fairness. This is due to the fact that designing low-complexity approximation algorithms for multi-objective combinatorial optimization problems is typically hard [6].

In the field of artificial intelligence and machine learning, Markov Decision Processes (MDPs) [7] have emerged as a powerful mathematical framework for modeling decision-making problems under uncertainty. MDPs represent sequential decision processes as a set of states, actions, and transition probabilities, where the goal is to find an optimal policy …
Fig. 3: SMART Architecture. (The state is formed from the user set and the wireless channels; the actor outputs a proto continuous action whose d dimensions D1, …, Dd each lie in [-1, 1]; KNN maps it to the K nearest discrete actions in the discretized action space, and the final discrete action is the one with the maximum Q value.)
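To make the action-selection step in Fig. 3 concrete, the sketch below shows one way the actor's proto continuous action can be mapped to a final discrete action: the K nearest discrete candidates are retrieved and the one with the highest critic value is kept, following the large-discrete-action-space approach of [20]. The function names, the Euclidean distance metric, and the shape of the candidate table are our own assumptions for illustration, not the paper's exact implementation.

    import numpy as np

    def knn_action_selection(proto_action, discrete_actions, q_value, state, k=10):
        # discrete_actions: (N, d) array, one row per candidate discrete action embedding.
        # q_value(state, action): critic estimate for a single candidate action.
        dists = np.linalg.norm(discrete_actions - proto_action, axis=1)
        nearest = np.argsort(dists)[:k]                   # K nearest discrete actions
        q_vals = [q_value(state, discrete_actions[i]) for i in nearest]
        return nearest[int(np.argmax(q_vals))]            # final action with max Q value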
Algorithm 1
Input: Channel matrix at TTI t: H_t, user set L, and channel correlation threshold c_th
Output: User group set G
1: Calculate channel correlations of all UE pairs c_i,j, ∀i, j ∈ L using Eq. (12)
2: Initialize G = ∅
3: Let L_c = L
4: while L_c ≠ ∅ do
5:   Randomly pick UE i ∈ L_c and add it to the empty user group G_i
6:   Iteratively search in L_c to find all UEs whose channel correlations with all existing UEs in G_i are smaller than c_th and add them to G_i
7:   User group G = G ∪ {G_i}
8:   Update L_c = L_c \ G_i
9: end while
10: return User group set G
where h_i and h_j are the channel vectors of user i and user j in the channel matrix H, and c_i,j is their channel correlation.

To reduce the complexity of the channel matrix, we adopt a user grouping method similar to that of [32], as shown in Algorithm 1. The algorithm uses the inter-user channel correlation matrix calculated through equation (12) to partition users with low correlation into separate sets, where the partitioning threshold is c_th. During grouping, users in the same group (i.e., less correlated users) are assigned the same label.
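As a companion to Algorithm 1, the following Python sketch implements the same greedy grouping. Since Eq. (12) is not reproduced in this excerpt, the correlation is assumed here to be the magnitude of the normalized inner product of two users' channel vectors; the function names and the (num_users × num_antennas) layout of H are illustrative assumptions.

    import numpy as np

    def channel_correlation(h_i, h_j):
        # Assumed form of Eq. (12): magnitude of the normalized inner product.
        return np.abs(np.vdot(h_i, h_j)) / (np.linalg.norm(h_i) * np.linalg.norm(h_j))

    def user_grouping(H, c_th=0.5):
        # H: (num_users, num_antennas) channel matrix at the current TTI.
        users = list(range(H.shape[0]))
        c = np.array([[channel_correlation(H[i], H[j]) for j in users] for i in users])
        groups, remaining = [], set(users)
        while remaining:
            i = remaining.pop()                      # pick a UE and open a new group
            group = [i]
            for j in sorted(remaining):
                # add j only if it is weakly correlated with every UE already in the group
                if all(c[j, m] < c_th for m in group):
                    group.append(j)
            remaining -= set(group)
            groups.append(group)
        return groups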
As discussed in §III-B, we then only need to assign a user group label to each user in the state space instead of its complete channel vector. With user grouping labels as the input of the DRL model, the state space size is significantly reduced. As an example, for a 64 × 64 network size, at each TTI the state of each user includes three variables: the maximum achievable spectral efficiency, the total amount of data transmitted by the user, and the user group label. Thus, the total state space size is 192. Without user grouping, however, the real and imaginary parts of the raw channel matrix must be fed to the DRL model separately, which leads to a state space size of 8320. Such large-scale inputs lead to a complicated neural network structure, high computational complexity in model updating, and excessive running time (cf. §IV-C).
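A minimal sketch of the two state encodings discussed above is given below. The 8320-entry figure is read here as the 2 × 64 × 64 real and imaginary channel entries (8192) plus the two remaining per-user scalars (128), which matches the stated totals but is our own decomposition; array names are illustrative.

    import numpy as np

    def compact_state(se_max, tx_data, group_label):
        # With user grouping: 3 features x 64 users = 192 entries.
        return np.concatenate([se_max, tx_data, group_label]).astype(np.float32)

    def raw_state(H, se_max, tx_data):
        # Without grouping: 2 * 64 * 64 channel entries (8192) + 2 * 64 scalars = 8320 entries.
        return np.concatenate([H.real.ravel(), H.imag.ravel(),
                               se_max, tx_data]).astype(np.float32)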
F. Scheduling Across RBs

As mentioned in §II-A, user channel quality varies significantly across RBs. Consequently, the channel correlation among users varies across the RBs as well. This leads to different optimal scheduling solutions for each RB. However, the scheduling decision on each RB will affect the decisions on other RBs, particularly as it relates to rate fairness. Since the goal is to maximize both spectral efficiency and fairness for the whole system, as expressed in equations (4) and (5), the optimal scheduling on all RBs needs to be jointly considered. One way to model this problem is to have independent SMART frameworks, as described in §III-B to §III-E, make decisions on each RB, with the additional modification that each framework uses the decisions from the frameworks running on the other RBs to calculate the new fairness in its state space and the new reward, akin to the formulation of weighted rates in Eq. (5). A block diagram of such a model is depicted in Fig. 4a. This model can be regarded as a cooperative multi-agent DRL framework where each SMART framework responsible for a different RB acts as a separate agent that shares its decisions with the other agents. We refer to this overall model as SMART-MA. In SMART-MA, the agents of all RBs are jointly optimized. The instantaneous spectral efficiency of each user is aggregated from all RBs, and the JFI is updated based on the global user scheduling decision rather than an individual RB's decision. Consequently, the SMART agents of all RBs share the same reward function and engage in cooperative learning. However, multi-agent DRL models are known to be difficult to converge, especially as the number of agents scales up [46]. We demonstrate this by employing a multi-agent model in §IV. For fading channel models, the inter-user channel correlation across RBs will be largely random, and when dealing with a large number of RBs, it is expected that the fairness across RBs will be smoothed out. With this assumption, and given the limitations of the multi-agent model, we propose to use a fully independent model for each RB, referred to as SMART-SA and depicted
in Fig. 4b. In SMART-SA, an independent SMART DRL model is implemented for each RB. Each RB possesses its own distinct state space (not depicted in the diagram) and generates a scheduled user set specific to that RB. Based on the user scheduling decision made by the model, the selected users are allocated resources within the wireless environment, and the instantaneous spectral efficiency γ_total of each scheduled user can be determined. Subsequently, the accumulated amount of transmitted data and the JFI are updated in the respective fairness update block. In §IV, we demonstrate the effectiveness of SMART-SA in achieving close-to-optimal results for a large number of RBs.

Fig. 4: Multi-agent SMART (a) and fully independent SMART (b) frameworks for scheduling users across RBs. (In SMART-MA (a), the per-RB SMART agents for RBs 1 to B share a central fairness update block that computes a single JFI from the γ_total of all user sets interacting with the wireless channels; in SMART-SA (b), each per-RB SMART agent has its own fairness update block and its own JFI.)
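The per-RB fairness update block described above can be sketched as follows: the accumulated transmitted data of the scheduled users is updated with their instantaneous rates, and Jain's fairness index [45] is recomputed. The helper names and the per-TTI accounting granularity are assumptions for illustration.

    import numpy as np

    def jain_fairness(x):
        # Jain's fairness index [45]: (sum x)^2 / (n * sum x^2).
        x = np.asarray(x, dtype=float)
        return (x.sum() ** 2) / (len(x) * (x ** 2).sum() + 1e-12)

    def fairness_update(tx_data, scheduled_users, rates, tti=1e-3):
        # tx_data: accumulated transmitted data per user, updated in place.
        for u, r in zip(scheduled_users, rates):
            tx_data[u] += r * tti                    # data delivered to user u in this TTI
        return jain_fairness(tx_data)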
… randomly assigned, and they will bounce back into the area upon reaching the boundary. We describe the experimental setup for the real-world measured channels in §IV-C4. We implement the system model of §II-A in Python. In terms of the modulation scheme, we adopt 16-QAM in our wireless channel simulator and use the Error Vector Magnitude (EVM) of the received constellation to derive the SNR, as demonstrated in [48].
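Since [48] is the stated basis for the EVM-to-SNR conversion, a common form of that relation is sketched below, assuming the EVM is an RMS value normalized to the reference constellation power; whether the simulator uses exactly this normalization is not specified in this excerpt.

    import numpy as np

    def evm_rms(rx_symbols, ref_symbols):
        # RMS error vector magnitude normalized to the reference constellation power.
        err = rx_symbols - ref_symbols
        return np.sqrt(np.mean(np.abs(err) ** 2) / np.mean(np.abs(ref_symbols) ** 2))

    def snr_db_from_evm(evm):
        # With that normalization, SNR ~ 1 / EVM^2, i.e. SNR_dB = -20 log10(EVM).
        return -20.0 * np.log10(evm)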
We run our experiments on an NVIDIA DGX A100 server [49]. Both the actor and critic networks implement neural nets with two hidden fully connected layers and ReLU activation functions. We use the Adam optimizer [50] to train our DRL model in PyTorch [51]. The most relevant parameters used in our simulations are shown in Table I.

TABLE I: Simulation and Training Parameters
Channel Model: 3GPP 3D UMi LOS
System Bandwidth: 20 MHz
System Carrier Frequency: 3.6 GHz
TTI Duration: 1 ms
Modulation: 16QAM
Cell Radius: 300 m
UE Speed: 0 & 2.8 m/s
Number of BS Antennas: 16 & 64
Number of UEs: 16 & 64
Batch Size: 256
Actor Learning Rate: 5e-4
Critic Learning Rate: 5e-4
Alpha Learning Rate: 3e-4
Automatic Entropy Tuning: True
Optimizer: Adam
Episodes: 800
Iterations In Episode: 400
Correlation Threshold c_th in Algorithm 1: 0.5
β in Eq. (11): 0.5
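For concreteness, a sketch of the actor and critic networks and optimizers matching Table I is shown below: two hidden fully connected layers with ReLU, Adam, actor/critic learning rates of 5e-4, and an alpha (entropy temperature) learning rate of 3e-4. The hidden width of 256 and the proto-action dimensionality are placeholders, as neither is listed in Table I.

    import torch
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=256):
        # Two hidden fully connected layers with ReLU activations.
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, hidden), nn.ReLU(),
                             nn.Linear(hidden, out_dim))

    state_dim, action_dim = 192, 8          # 64 x 64 state with grouping labels; action dim is a placeholder
    actor = mlp(state_dim, action_dim)
    critic = mlp(state_dim + action_dim, 1)  # Q(s, a)

    actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=5e-4)
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)   # automatic entropy tuning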
Fig. 10: Topology of the real-world indoor experiment (BS location, fixed clients, and mobile clients, numbered 1–7).

… and real network size experiments. By running Algorithm 1 on the datasets, we observe just one or two user groups in most TTIs. Thus, RR-UG schedules all seven clients in one or sometimes two TTIs. Therefore, RR-UG is rather competitive with SMART-SA here. To show the generality of our model, we use the model trained on the LoS slow-speed dataset and test it on the LoS high-speed mobility dataset. The results in Table II demonstrate the adaptability of SMART-SA to different mobility scenarios. Compared with the slow-speed mobility scenario, it is evident that the performance gap between SMART and RR-UG in the high-speed scenario is larger. This is because high speed makes the channel conditions and inter-user channel correlations vary more quickly than low speed. Faster-varying inter-user channel correlation results in quicker variations of the user grouping, which makes it challenging for RR-UG to adapt fast enough. SMART, however, is capable of dealing with this rapid change. For comprehensiveness, we also test the trained model on the NLoS slow-speed topology. The results in Table II show SMART-SA's superiority over RR-UG and its generality on real-world data, albeit not as good as in the LoS high-speed case.
5) Computational Complexity
We measure the average wall-clock time per TTI for all the schedulers discussed in §IV-C2. For fairness of comparison, we run all implementations on a single CPU core of the NVIDIA DGX server. The runtime values are listed in Table III for the three network sizes considered in §IV-C2. The results show that the runtimes of the schedulers differ widely and also vary with the network size. For Opt-MR and Opt-PF, the runtime increases exponentially with the network size and is thus not listed for network sizes beyond 16 × 16. Even though Approx-PF is feasible in real-world-size networks with much less complexity than Opt-PF, it still takes about 20× longer than SMART to execute. For the other schedulers, the runtime appears to increase linearly. Both DDPG and SMART-Vanilla show similar results. Comparing the SMART and SMART-Vanilla results shows that using user grouping labels instead of the raw channel matrix reduces the runtime of the model by up to 50%. When hyper-parameters are tuned for the best performance of both SMART and SMART-Vanilla, SMART needs 3 fewer hidden layers and half as many neurons per layer to remain on par with the performance of SMART-Vanilla. However, user grouping requires only an additional 3.5 ms in the 64 × 64 network size, a negligible portion of the total runtime. The runtime of PN is about 1.6× and 4× that of SMART-Vanilla in the 16 × 16 and 64 × 64 network sizes, respectively. This is because pointer networks are auto-regressive and make decisions sequentially, and thus have slow inference. RR-UG shows the smallest runtime of all, but it is not as spectrally efficient as SMART, especially in mobility scenarios.

TABLE III: Wall-clock time in seconds per TTI
System Configuration | Opt-MR | Opt-PF | Approx-PF | RR-UG | DDPG | PN | SMART-Vanilla | SMART
16 × 16 | 0.15 | 0.21 | - | 0.0013 | 0.034 | 0.059 | 0.036 | 0.024
64 × 64 | - | - | 0.604 | 0.0043 | 0.058 | 0.235 | 0.057 | 0.030
128 × 128 | - | - | - | - | - | - | - | 0.071

D. Discussion and Future Work
The results presented earlier offer good insights into the performance and computational complexity of the proposed SMART scheduler with respect to the existing methods. However, an important question is whether SMART can be deployed to operate in time-stringent 5G-NR systems. For a realistic network size, Table III shows that SMART takes as much as 30 ms to run an iteration, 30× longer than one TTI in the least time-stringent mode of 5G-NR [31]. This may seem problematic for the adoption of SMART. To investigate this, we run an experiment in a mobility scenario. We first train SMART offline as before and test the trained model on the testing dataset without online updates to the model. We compare the spectral efficiency results for the offline-trained model with the previously presented results that include the online updates. The results are shown in Fig. 11. We observe that, even when we use the offline-trained model with no online updates, the performance is remarkably close to when
the model is continuously updated. The performance can get even closer when we do updates every few tens of TTIs. This finding means that only the inference time of the model needs to be counted as the scheduling decision time. For the 16 × 16 and 64 × 64 network sizes, the inference times for SMART are 5.4 and 8.7 ms. Running the model on a single GPU core on the NVIDIA DGX A100 server reduces the inference time values to 1.2 and 1.6 ms, respectively. The inference runtime values can be further reduced to sub-millisecond levels, as required in 5G-NR, by a more efficient implementation, such as with the CUDA framework [60], and by parallelizing the DRL model over several GPU cores. More importantly, the reassuring performance of SMART-SA, demonstrated in §IV-C3, shows that we can get similar runtime values for 100s of RBs, as its architecture allows us to fully parallelize it on different GPU cores.

Fig. 11: Evaluation of SMART with and without model updates (online vs. offline) in a user mobility scenario and 64 × 64 network size.
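As a sketch of the parallelization argument above, if the per-RB actors share one architecture (and, in the simplest case, one set of weights), their per-TTI inference can be batched into a single GPU forward pass; with distinct per-RB weights the same idea applies by running each model on its own GPU stream. The batching below is our illustration, not the deployment used in the paper.

    import torch

    @torch.no_grad()
    def schedule_all_rbs(actor, per_rb_states, device="cuda"):
        # per_rb_states: list of B per-RB state vectors for the current TTI.
        states = torch.stack([torch.as_tensor(s, dtype=torch.float32)
                              for s in per_rb_states]).to(device)
        proto_actions = actor(states)    # one batched forward pass, shape (B, action_dim)
        # Each row is then mapped to a scheduled user set via the KNN step of Fig. 3.
        return proto_actions.cpu()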
Lastly, we have only considered saturated traffic for each user. A more generic design should consider the incoming traffic model as well as the quality of service (QoS) requirements, e.g., data rate and latency, for each user. Formulating such a scheduling problem and formally solving it using optimization techniques or heuristics-based approximation is a difficult task. We believe AI-based methods such as the one proposed in this paper provide a more promising avenue for solving the generic case if enough training data exists. We leave the design of a more comprehensive scheduler that considers parameters in the higher layers of the network, such as traffic models and QoS constraints, as future work.

V. CONCLUSION

In this paper, we presented SMART, a resource scheduler for massive MIMO networks based on the soft actor-critic DRL model. We demonstrated the effectiveness of our scheduler in achieving both spectral efficiency and fairness very close to those of the optimal proportionally fair scheduler. We also showed that our model outperforms state-of-the-art massive MIMO schedulers in all scenarios, and particularly in mobility scenarios. We removed the need for raw channel matrices in training our DRL model by utilizing a user grouping algorithm based on the inter-user correlation matrix and thus significantly reduced the complexity of our model. We also provided guidelines as to how our scheduling model can be deployed in time-stringent 5G-NR systems.
REFERENCES

[1] L. Li, M. Pal, and Y. R. Yang, "Proportional fairness in multi-rate wireless LANs," in IEEE INFOCOM 2008 - The 27th Conference on Computer Communications, 2008, pp. 1004–1012.
[2] S. Huang, H. Yin, J. Wu, and V. C. M. Leung, "User selection for multiuser MIMO downlink with zero-forcing beamforming," IEEE Transactions on Vehicular Technology, vol. 62, no. 7, pp. 3084–3097, 2013.
[3] K. Ko and J. Lee, "Multiuser MIMO user selection based on chordal distance," IEEE Transactions on Communications, vol. 60, no. 3, pp. 649–654, 2012.
[4] N. Prasad, H. Zhang, H. Zhu, and S. Rangarajan, "Multiuser scheduling in the 3GPP LTE cellular uplink," IEEE Transactions on Mobile Computing, vol. 13, no. 1, pp. 130–145, 2014.
[5] C.-M. Chen, Q. Wang, A. Gaber, A. P. Guevara, and S. Pollin, "User scheduling and antenna topology in dense massive MIMO networks: An experimental study," IEEE Transactions on Wireless Communications, vol. 19, no. 9, pp. 6210–6223, 2020.
[6] A. Chassein, M. Goerigk, A. Kasperski, and P. Zieliński, "Approximating combinatorial optimization problems with the ordered weighted averaging criterion," European Journal of Operational Research, vol. 286, no. 3, pp. 828–838, 2020.
[7] R. Bellman, "A Markovian decision process," Journal of Mathematics and Mechanics, vol. 6, no. 5, pp. 679–684, 1957. [Online]. Available: http://www.jstor.org/stable/24900506
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
[9] A. Dargazany, "DRL: Deep reinforcement learning for intelligent robot control – concept, literature, and future," 2021.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[11] P. Lissa, C. Deane, M. Schukat, F. Seri, M. Keane, and E. Barrett, "Deep reinforcement learning for home energy management system control," Energy and AI, vol. 3, p. 100043, 2021.
[12] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, "Neural combinatorial optimization with reinforcement learning," arXiv preprint arXiv:1611.09940, 2016.
[13] K. Li, T. Zhang, R. Wang, Y. Wang, Y. Han, and L. Wang, "Deep reinforcement learning for combinatorial optimization: Covering salesman problems," IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 13142–13155, 2022.
[14] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[15] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," 2016.
[16] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, "Sample efficient actor-critic with experience replay," 2017.
[17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[18] L. Chen, F. Sun, K. Li, R. Chen, Y. Yang, and J. Wang, "Deep reinforcement learning for resource allocation in massive MIMO," in 2021 29th European Signal Processing Conference (EUSIPCO), 2021, pp. 1611–1615.
[19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015. [Online]. Available: https://arxiv.org/abs/1509.02971
[20] G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin, "Deep reinforcement learning in large discrete action spaces," 2015. [Online]. Available: https://arxiv.org/abs/1512.07679
[21] X. Guo, Z. Li, P. Liu, R. Yan, Y. Han, X. Hei, and G. Zhong, "A novel user selection massive MIMO scheduling algorithm via real time DDPG," in GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2020, pp. 1–6.
[22] H. Chen, Y. Liu, Z. Zheng, H. Wang, X. Liang, Y. Zhao, and J. Ren, "Joint user scheduling and transmit precoder selection based on DDPG for uplink multi-user MIMO systems," in 2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall). IEEE, 2021, pp. 1–5.
[23] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine, "Soft actor-critic algorithms and applications," 2018. [Online]. Available: https://arxiv.org/abs/1812.05905
[24] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018. [Online]. Available: https://arxiv.org/abs/1801.01290
[25] "5G NR physical channels and modulation (3GPP TS 38.211 version 16.2.0 release 16)," https://www.etsi.org/deliver/etsi_ts/138200_138299/138211/16.02.00_60/ts_138211v160200p.pdf, accessed: 2023-02-13.
[26] Y. Chen, R. Yao, H. Hassanieh, and R. Mittal, "Channel-aware 5G RAN slicing with customizable schedulers."
[27] V. Lau, "Proportional fair space-time scheduling for wireless communications," IEEE Transactions on Communications, vol. 53, no. 8, pp. 1353–1360, 2005.
[28] P. R. M., M. R., A. Kumar, and K. Kuchi, "Downlink resource allocation for 5G-NR massive MIMO systems," in 2021 National Conference on Communications (NCC), 2021, pp. 1–6.
[29] C. Blair, "Theory of linear and integer programming (Alexander Schrijver)," SIAM Review, vol. 30, no. 2, pp. 325–326, 1988. [Online]. Available: https://doi.org/10.1137/1030065
[30] H. Liu, H. Gao, S. Yang, and T. Lv, "Low-complexity downlink user selection for massive MIMO systems," IEEE Systems Journal, vol. 11, no. 2, pp. 1072–1083, 2017.
[31] Y. Chen, Y. Wu, Y. T. Hou, and W. Lou, "mCore: Achieving sub-millisecond scheduling for 5G MU-MIMO systems," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
[32] H. Yang, "User scheduling in massive MIMO," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[33] J. Shi, W. Wang, J. Wang, and X. Gao, "Machine learning assisted user-scheduling method for massive MIMO system," in 2018 10th International Conference on Wireless Communications and Signal Processing (WCSP), 2018, pp. 1–6.
[34] G. Bu and J. Jiang, "Reinforcement learning-based user scheduling and resource allocation for massive MU-MIMO system," in 2019 IEEE/CIC International Conference on Communications in China (ICCC), 2019, pp. 641–646.
[35] V. H. L. Lopes, C. V. Nahum, R. M. Dreifuerst, P. Batista, A. Klautau, K. V. Cardoso, and R. W. Heath, "Deep reinforcement learning-based scheduling for multiband massive MIMO," IEEE Access, vol. 10, pp. 125509–125525, 2022.
[36] C.-W. Huang, I. Althamary, Y.-C. Chou, H.-Y. Chen, and C.-F. Chou, "A DRL-based automated algorithm selection framework for cross-layer QoS-aware scheduling and antenna allocation in massive MIMO systems," IEEE Access, vol. 11, pp. 13243–13256, 2023.
[37] Z. Zhao, Y. Liang, and X. Jin, "Handling large-scale action space in deep Q network," in 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), 2018, pp. 93–96.
[38] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227–2240, 2014.
[39] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine, "Learning to walk via deep reinforcement learning," 2018. [Online]. Available: https://arxiv.org/abs/1812.11103
[40] P. Christodoulou, "Soft actor-critic for discrete action settings," 2019. [Online]. Available: https://arxiv.org/abs/1910.07207
[41] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," 2015. [Online]. Available: https://arxiv.org/abs/1509.02971
[42] B. Eysenbach and S. Levine, "Maximum entropy RL (provably) solves some robust RL problems," arXiv preprint arXiv:2103.06257, 2021.
[43] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[44] X. Hao, H. Mao, W. Wang, Y. Yang, D. Li, Y. Zheng, Z. Wang, and J. Hao, "Breaking the curse of dimensionality in multi-agent state space: A unified agent permutation framework," 2022. [Online]. Available: https://arxiv.org/abs/2203.05285
[45] R. Jain, D. Chiu, and W. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer systems," CoRR, vol. cs.NI/9809099, 1998. [Online]. Available: https://arxiv.org/abs/cs/9809099
[46] L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multi-agent reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[47] S. Jaeckel, L. Raschkowski, K. Börner, and L. Thiele, "QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials," IEEE Transactions on Antennas and Propagation, vol. 62, no. 6, pp. 3242–3256, 2014.
[48] R. A. Shafik, M. S. Rahman, A. R. Islam, and N. S. Ashraf, "On the error vector magnitude as a performance metric and comparative analysis," in 2006 International Conference on Emerging Technologies, 2006, pp. 27–31.
[49] "NVIDIA DGX Station A100," https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/nvidia-dgx-station-a100-datasheet.pdf, accessed: 2022-07-27.
[50] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[51] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[52] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," 2016.
[53] E. Björnson, E. G. Larsson, and T. L. Marzetta, "Massive MIMO: Ten myths and one critical question," IEEE Communications Magazine, vol. 54, no. 2, pp. 114–123, 2016.
[54] T. V. de Wiele, D. Warde-Farley, A. Mnih, and V. Mnih, "Q-learning in enormous action spaces via amortized approximate maximization," 2020.
[55] C. Wu, X. Yi, Y. Zhu, W. Wang, L. You, and X. Gao, "Channel prediction in high-mobility massive MIMO: From spatio-temporal autoregression to deep learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 1915–1930, 2021.
[56] Y. Han, S. Jin, C.-K. Wen, and X. Ma, "Channel estimation for extremely large-scale massive MIMO systems," IEEE Wireless Communications Letters, vol. 9, no. 5, pp. 633–637, 2020.
[57] C.-J. Chun, J.-M. Kang, and I.-M. Kim, "Deep learning-based channel estimation for massive MIMO systems," IEEE Wireless Communications Letters, vol. 8, no. 4, pp. 1228–1231, 2019.
[58] M. Nazari, A. Oroojlooy, L. V. Snyder, and M. Takác, "Deep reinforcement learning for solving the vehicle routing problem," arXiv preprint arXiv:1802.04240, 2018.
[59] R. Doost-Mohammady, O. Bejarano, L. Zhong, J. R. Cavallaro, E. Knightly, Z. M. Mao, W. W. Li, X. Chen, and A. Sabharwal, "RENEW: Programmable and observable massive MIMO networks," in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 1654–1658.
[60] "CUDA Toolkit Documentation," https://docs.nvidia.com/cuda/, accessed: 2022-08-01.