Portfolio Optimization using Reinforcement Learning: A Markov Decision Process Approach
The Need for Reinforcement Learning in Finance
Introduction
Portfolio optimization – allocating wealth among assets to maximize expected return for a given level of risk –
is a cornerstone of finance. Traditional approaches (e.g. Markowitz mean-variance) assume
static return distributions and often ignore market dynamics. In contrast, Reinforcement Learning
(RL) treats portfolio management as a sequential decision problem, modelling it as a Markov
Decision Process (MDP). In an MDP framework, the agent observes a state (market
conditions and current holdings), takes an action (rebalances the portfolio), and receives a
reward (e.g. profit or risk-adjusted return). Over time the agent learns a policy to maximize
cumulative return. Recent research shows that RL can adaptively adjust portfolios under
stochastic market movements. In particular, deep-RL methods (using neural networks) can
handle high-dimensional inputs and continuous action spaces relevant to multi-asset
portfolios. This work surveys RL methodologies (MDP, Q-learning, policy gradients, actor-
critic/DDPG) applied to stock and mutual fund portfolios and proposes a theoretical model
specific to stock-market portfolio optimization.
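To make the MDP formulation above concrete, the following minimal Python sketch defines a portfolio environment in which the state combines a look-back window of asset returns with the current holdings, the action is a target weight vector, and the reward is the portfolio return net of transaction costs. The window length, the proportional cost rate, and the long-only simplex projection are illustrative assumptions, not details drawn from the cited studies.

import numpy as np

class PortfolioEnv:
    """Minimal portfolio-MDP sketch (illustrative assumptions throughout):
    state  = recent asset returns + current weights,
    action = target portfolio weights,
    reward = net portfolio return after a turnover cost."""

    def __init__(self, returns, window=10, cost_rate=0.001):
        self.returns = returns            # (T, n_assets) array of per-period asset returns
        self.window = window              # look-back length included in the state (assumed)
        self.cost_rate = cost_rate        # proportional transaction cost (assumed)
        self.n_assets = returns.shape[1]
        self.reset()

    def reset(self):
        self.t = self.window
        self.weights = np.ones(self.n_assets) / self.n_assets   # start equal-weighted
        return self._state()

    def _state(self):
        # State: flattened look-back window of returns plus current holdings
        hist = self.returns[self.t - self.window:self.t].flatten()
        return np.concatenate([hist, self.weights])

    def step(self, action):
        # Action: target weights, projected onto the long-only simplex
        w = np.clip(action, 0.0, None)
        w = w / w.sum() if w.sum() > 0 else np.ones(self.n_assets) / self.n_assets
        cost = self.cost_rate * np.abs(w - self.weights).sum()   # turnover penalty
        reward = float(w @ self.returns[self.t]) - cost          # net one-period return
        self.weights = w
        self.t += 1
        done = self.t >= len(self.returns)
        return self._state(), reward, done

An agent interacting with such an environment learns a policy that maps the state vector to portfolio weights so as to maximize cumulative reward, which is exactly the sequential-decision view described above.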
Literature Review
Early attempts at applying RL in finance date back to the 1990s, but the approach gained traction with the
advent of deep learning. For example, Le et al. formulate stock trading as adjusting portfolio weights daily to
maximize profit under risk constraints. They emphasize including transaction costs and
volatility in the state and reward. More recent studies apply deep RL to portfolio selection.
Gao et al. discretize portfolio weights and use a convolutional DQN, but note that a
finite action set can limit performance. To handle continuous rebalancing, Deng et
al. propose a DDPG agent that directly outputs continuous portfolio weight vectors.
Empirically they report higher returns and Sharpe ratios than traditional strategies. Other
works use policy-gradient methods: e.g. PPO or actor-critic frameworks have been shown to
stabilize learning and yield good returns on stock data.
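To illustrate the continuous-action approach used by DDPG-style agents, the sketch below shows an actor network that maps a state vector directly to a vector of long-only portfolio weights. The layer sizes and the softmax projection onto the simplex are assumptions made for illustration, not the exact architectures of the cited papers.

import torch
import torch.nn as nn

class PortfolioActor(nn.Module):
    """DDPG-style actor sketch: maps a state vector to long-only portfolio weights.
    Hidden sizes and the softmax output layer are illustrative assumptions."""

    def __init__(self, state_dim, n_assets, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_assets),
        )

    def forward(self, state):
        # Softmax keeps the weights non-negative and summing to one (no short selling)
        return torch.softmax(self.net(state), dim=-1)

In DDPG the critic Q(s, a) scores these weight vectors and the actor is updated by ascending the critic's gradient with respect to the action; policy-gradient methods such as PPO instead sample actions from a parameterized distribution over weights.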
A number of studies demonstrate RL on real markets. For Indian equities, Chorasiya and
Kinger (2025) compare PPO, A2C, SAC and DDPG on an NSE stock portfolio; they report that
these DRL agents can build robust portfolios with high Sharpe ratios. Yue et al. (2022)
introduce an MDP-based model (SwanTrader) that integrates deep autoencoders; they
highlight that deep RL models can capture temporal patterns in turbulent markets. More
generally, Ndikum & Ndikum (2024) formalize portfolio MDPs and emphasize that an optimal
policy must maximize expected return over an infinite horizon. They discuss the Bellman
equation and value functions in the portfolio context, illustrating how RL decomposes the
investment task into state-value assessments.
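For reference, the decomposition they describe rests on the standard Bellman optimality equation, written here in generic MDP notation (the symbols are assumed here rather than quoted from the cited work): s is the state (market features and current holdings), a the rebalancing action, r(s, a) the one-period reward, s' the next state, and γ ∈ (0, 1) the discount factor.

V^{*}(s) = \max_{a}\, \mathbb{E}\!\left[\, r(s,a) + \gamma\, V^{*}(s') \;\middle|\; s, a \,\right],
\qquad
\pi^{*}(s) = \arg\max_{a}\, \mathbb{E}\!\left[\, r(s,a) + \gamma\, V^{*}(s') \;\middle|\; s, a \,\right]

Solving or approximating this fixed point is what Q-learning, DQN, and actor-critic methods do in the portfolio setting.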
In summary, the literature shows that RL methods ranging from Q-learning/DQN to policy-gradient
and actor-critic algorithms (including DDPG and PPO) have been successfully applied to stock portfolios.
These approaches automatically learn to trade without explicit price forecasting, instead
optimizing long-term return. This survey focuses on the theory and MDP modelling in such
applications, aiming to devise a clear framework for stock/mutual-fund portfolios.