2017, HAL (Le Centre pour la Communication Scientifique Directe)
We consider a model of game-theoretic learning based on online mirror descent (OMD) with asynchronous and delayed feedback information. Instead of focusing on specific games, we consider a broad class of continuous games defined by the general equilibrium stability notion, which we call λ-variational stability. Our first contribution is that, in this class of games, the actual sequence of play induced by OMD-based learning converges to Nash equilibria provided that the feedback delays faced by the players are synchronous and bounded. Subsequently, to tackle fully decentralized, asynchronous environments with (possibly) unbounded delays between actions and feedback, we propose a variant of OMD which we call delayed mirror descent (DMD), and which relies on the repeated leveraging of past information. With this modification, the algorithm converges to Nash equilibria with no feedback synchronicity assumptions and even when the delays grow superlinearly relative to the horizon of play.
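To make the update rule concrete, here is a minimal sketch of a single online mirror descent step with an entropic regularizer on the simplex, driven by delayed gradient feedback. The three-round delay, step size, and random gradients are illustrative assumptions, not the paper's DMD construction.

```python
import numpy as np

def omd_step_simplex(x, delayed_grad, step_size):
    """One online mirror descent step with an entropic regularizer
    (multiplicative weights) on the probability simplex.

    `delayed_grad` is whatever payoff gradient the player has received
    so far; under delayed feedback it may correspond to an earlier round."""
    logits = np.log(x) + step_size * delayed_grad  # ascent on payoffs
    logits -= logits.max()                         # numerical stability
    y = np.exp(logits)
    return y / y.sum()

# Toy usage: a single player whose feedback arrives three rounds late.
rng = np.random.default_rng(0)
x = np.ones(3) / 3
buffer = []                                # (arrival_round, gradient) pairs
for t in range(20):
    grad = rng.normal(size=3)              # stand-in for the game's payoff gradient
    buffer.append((t + 3, grad))           # feedback delayed by 3 rounds
    arrived = [g for (s, g) in buffer if s <= t]
    if arrived:
        x = omd_step_simplex(x, arrived[-1], step_size=0.1)
```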
2017
Online Mirror Descent (OMD) is an important and widely used class of adaptive learning algorithms that enjoys good regret performance guarantees. It is therefore natural to study the evolution of the joint action in a multi-agent decision process (typically modeled as a repeated game) where every agent employs an OMD algorithm. This well-motivated question has received much attention in the literature that lies at the intersection between learning and games. However, much of the existing literature has been focused on the time average of the joint iterates. In this paper, we tackle a harder problem that is of practical utility, particularly in the online decision making setting: the convergence of the last iterate when all the agents make decisions according to OMD. We introduce an equilibrium stability notion called variational stability (VS) and show that in variationally stable games, the last iterate of OMD converges to the set of Nash equilibria. We also extend the OMD learning dynamics to a more general setting where the exact gradient is not available and show that the last iterate (now random) of OMD converges to the set of Nash equilibria almost surely.
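For reference, one common formalization of variational stability (the paper's exact definition may differ in its constants) says that an equilibrium x* is variationally stable when the profile of individual payoff gradients v(x) points towards x*:

```latex
\[
  \langle v(x),\, x - x^{*} \rangle \;\le\; 0
  \quad \text{for all } x \text{ in a neighborhood of } x^{*},
  \qquad
  v(x) \;=\; \big( \nabla_{x_i} u_i(x_i; x_{-i}) \big)_{i=1}^{N},
\]
% with equality only when x = x^*; here u_i denotes player i's payoff function.
```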
2018
We consider a game-theoretical multi-agent learning problem where the feedback information can be lost and rewards are given by a broad class of games known as variationally stable games. We propose a simple variant of the online gradient descent algorithm, called reweighted online gradient descent (ROGD), and show that in variationally stable games, if each agent adopts reweighted online gradient descent learning dynamics, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily correlated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in their place. Finally, we further extend the framework to accommodate both asynchronous loss and stochastic rewards and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, we make meaningful progress towards the ...
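As an illustration of one plausible reweighting mechanism (inverse-probability weighting so the update is unbiased in expectation; this is an assumption for exposition, not necessarily the paper's exact ROGD rule):

```python
import numpy as np

def rogd_step(x, grad_or_none, arrival_prob, step_size, project):
    """Hypothetical reweighted gradient step: when feedback arrives
    (probability `arrival_prob`), divide it by that probability so the
    update is unbiased in expectation; otherwise keep the current action."""
    if grad_or_none is None:               # feedback lost this round
        return x
    reweighted = grad_or_none / arrival_prob
    return project(x + step_size * reweighted)

# Toy usage on a one-dimensional strategy clipped to [0, 1].
rng = np.random.default_rng(1)
project = lambda z: np.clip(z, 0.0, 1.0)
x, p_arrival = 0.5, 0.7
for t in range(100):
    grad = -(x - 0.3)                      # gradient of the payoff -(x - 0.3)^2 / 2
    received = grad if rng.random() < p_arrival else None
    x = rogd_step(x, received, p_arrival, step_size=0.05, project=project)
```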
2018
We examine the long-run behavior of multi-agent online learning in games that evolve over time. Specifically, we focus on a wide class of policies based on mirror descent, and we show that the induced sequence of play (a) converges to Nash equilibrium in time-varying games that stabilize in the long run to a strictly monotone limit; and (b) stays asymptotically close to the evolving equilibrium of the sequence of stage games (assuming they are strongly monotone). Our results apply to both gradient-based and payoff-based feedback (i.e., when players only get to observe the payoffs of their chosen actions).
2018
We consider a game-theoretical multi-agent learning problem where the feedback information can be lost during the learning process and rewards are given by a broad class of games known as variationally stable games. We propose a simple variant of the classical online gradient descent algorithm, called reweighted online gradient descent (ROGD), and show that in variationally stable games, if each agent adopts ROGD, then almost sure convergence to the set of Nash equilibria is guaranteed, even when the feedback loss is asynchronous and arbitrarily correlated among agents. We then extend the framework to deal with unknown feedback loss probabilities by using an estimator (constructed from past data) in their place. Finally, we further extend the framework to accommodate both asynchronous loss and stochastic rewards and establish that multi-agent ROGD learning still converges to the set of Nash equilibria in such settings. Together, these results contribute to the broad landscape of m...
arXiv (Cornell University), 2023
The behaviour of multi-agent learning in many player games has been shown to display complex dynamics outside of restrictive examples such as network zero-sum games. In addition, it has been shown that convergent behaviour is less likely to occur as the number of players increases. To make progress in resolving this problem, we study Q-Learning dynamics and determine a sufficient condition for the dynamics to converge to a unique equilibrium in any network game. We find that this condition depends on the nature of pairwise interactions and on the network structure, but is explicitly independent of the total number of agents in the game. We evaluate this result on a number of representative network games and show that, under suitable network conditions, stable learning dynamics can be achieved with an arbitrary number of agents.
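A minimal sketch of the kind of Q-learning dynamics studied in this setting: stateless Q-learning with Boltzmann (softmax) exploration, where each agent on a network aggregates pairwise payoffs from its neighbours. The ring topology, shared payoff matrix, temperature, and learning rate are illustrative assumptions.

```python
import numpy as np

def softmax(q, temperature):
    z = q / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical setup: n agents on a ring, each playing a 2-action pairwise
# game with its neighbours; each agent's reward averages over its neighbours.
rng = np.random.default_rng(2)
n, n_actions, alpha, temp = 6, 2, 0.1, 0.5
A = rng.normal(size=(n_actions, n_actions))        # shared pairwise payoff matrix
Q = np.zeros((n, n_actions))
neighbours = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

for t in range(500):
    policies = [softmax(Q[i], temp) for i in range(n)]
    actions = [rng.choice(n_actions, p=policies[i]) for i in range(n)]
    for i in range(n):
        reward = np.mean([A[actions[i], actions[j]] for j in neighbours[i]])
        Q[i, actions[i]] += alpha * (reward - Q[i, actions[i]])
```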
HAL (Le Centre pour la Communication Scientifique Directe), 2018
Reinforcement Learning (RL) for decentralized partially observable Markov decision processes (Dec-POMDPs) is lagging behind the spectacular breakthroughs of single-agent RL. That is because assumptions that hold in single-agent settings often break down in decentralized multi-agent systems. To tackle this issue, we investigate the foundations of policy gradient methods within the centralized training for decentralized control (CTDC) paradigm. In this paradigm, learning can be accomplished in a centralized manner while execution can still be independent. Using this insight, we establish the policy gradient theorem and compatible function approximations for decentralized multi-agent systems. The resulting actor-critic methods preserve decentralized control at the execution phase, but can also estimate the policy gradient from collective experiences guided by a centralized critic at the training phase. Experiments demonstrate that our policy gradient methods compare favorably against standard RL techniques in benchmarks from the literature.
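A toy sketch of the centralized-training, decentralized-execution structure on a one-shot 2x2 cooperative game: each actor updates from its own action only, while a shared critic (here just a scalar baseline) is trained from the joint outcome. The game and hyperparameters are illustrative, not the paper's benchmarks or its compatible function approximations.

```python
import numpy as np

rng = np.random.default_rng(3)
payoff = np.array([[1.0, 0.0],
                   [0.0, 0.8]])          # shared reward for joint action (a1, a2)
logits = [np.zeros(2), np.zeros(2)]      # decentralized actor parameters
V = 0.0                                  # centralized critic: a scalar baseline
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(2000):
    pis = [softmax(l) for l in logits]
    acts = [rng.choice(2, p=pi) for pi in pis]
    r = payoff[acts[0], acts[1]]
    advantage = r - V                    # the critic sees the joint outcome
    V += alpha_critic * (r - V)
    for i in range(2):                   # each actor uses only its own action
        grad_log = -pis[i]
        grad_log[acts[i]] += 1.0         # gradient of log-policy at the chosen action
        logits[i] += alpha_actor * advantage * grad_log
```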
Proceedings of the National Conference on Artificial …, 1998
Reinforcement learning can provide a robust and natural means for agents to learn how to coordinate their action choices in multiagent systems. We examine some of the factors that can influence the dynamics of the learning process in such a setting. We first distinguish reinforcement learners that are unaware of (or ignore) the presence of other agents from those that explicitly attempt to learn the value of joint actions and the strategies of their counterparts. We study (a simple form of) Q-learning in cooperative multiagent systems under these two perspectives, focusing on the influence of game structure and exploration strategies on convergence to (optimal and suboptimal) Nash equilibria. We then propose alternative optimistic exploration strategies that increase the likelihood of convergence to an optimal equilibrium.
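The two perspectives can be contrasted in a few lines: an independent learner (IL) keeps Q-values over its own actions only, while a joint-action learner (JAL) keeps Q-values over joint actions together with an empirical model of its partner. The payoff matrix, exploration scheme, and learning rate below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
payoff = np.array([[10.0, 0.0],
                   [0.0, 2.0]])        # cooperative game: both agents get this reward
alpha, eps = 0.1, 0.1

q_il = np.zeros(2)                     # IL: Q-values over own actions only
q_jal = np.zeros((2, 2))               # JAL: Q-values over (own action, partner action)
counts = np.ones(2)                    # JAL's empirical model of the partner's play

def eps_greedy(values):
    return rng.integers(2) if rng.random() < eps else int(np.argmax(values))

for t in range(2000):
    a_il = eps_greedy(q_il)
    partner_model = counts / counts.sum()
    a_jal = eps_greedy(q_jal @ partner_model)   # expected value under the partner model
    r = payoff[a_il, a_jal]
    q_il[a_il] += alpha * (r - q_il[a_il])
    q_jal[a_jal, a_il] += alpha * (r - q_jal[a_jal, a_il])
    counts[a_il] += 1
```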
Frontiers of Information Technology & Electronic Engineering, 2021
Multi-agent reinforcement learning (MARL) has long been a significant and everlasting research topic in both machine learning and control. With the recent development of (single-agent) deep RL, there is a resurgence of interest in developing new MARL algorithms, especially those that are backed by theoretical analysis. In this paper, we review some recent advances in a sub-area of this topic: decentralized MARL with networked agents. Specifically, multiple agents perform sequential decision-making in a common environment, without the coordination of any central controller. Instead, the agents are allowed to exchange information with their neighbors over a communication network. Such a setting finds broad applications in the control and operation of robots, unmanned vehicles, mobile sensor networks, and the smart grid. This review is built upon several of our research endeavors in this direction, together with some progress made by other researchers along this line. We hope this review will inspire more research efforts in this exciting yet challenging area.
PhD Thesis, 2017
It is estimated that, in the next decade, there will be tens of billions of interconnected devices in the world, each one sensing, streaming and processing data. In order to manage such a huge amount of data, traditional architectures—where a fusion center gathers all the data—may not satisfy performance specifications or cost constraints (e.g., data privacy, resilience, scalability or communication cost). Distributed computing is an interesting alternative that consists in moving the data processing to the devices—which become intelligent agents—so that the fusion center is completely avoided. This thesis proposes and evaluates algorithms for two complementary multiagent scenarios. First, we consider cooperative distributed algorithms, where the nodes interact with each other to solve a social problem. Even if each agent has access to only a small amount of data, it can approximate the performance of a centralized architecture through cooperation. In this context, we propose distributed component analysis methods—including principal component analysis, factor analysis and linear discriminant analysis—based on the consensus-averaging scheme. We also propose and analyze an off-policy reinforcement learning algorithm, where the agents explore the state set independently and share some intermediate results (not the samples) with their neighbors in order to evaluate a common target policy. Finally, we introduce a distributed implementation of the cross-entropy method for black-box global (nonconvex) optimization, where the objective is unknown to the agents. The second scenario consists in dynamic potential games. This is a class of state-based time-varying games, where the agents influence each other and compete for a shared resource, so that they have to find an equilibrium. These kinds of games can be formalized as multiobjective optimal control problems, which are generally difficult to solve. We extend previous analyses of these kinds of games and guarantee the existence of an equilibrium under mild conditions. In addition, we propose a framework for finding—or even learning, with reinforcement learning methods—an equilibrium strategy. We also study the applicability of this kind of game through a number of examples.
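The consensus-averaging building block mentioned above can be sketched in a few lines: each node repeatedly replaces its local value with a weighted average of its neighbours' values, and with a doubly stochastic weight matrix all nodes converge to the network-wide mean. The ring topology and uniform weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
x = rng.normal(size=n)                      # one local statistic per node
W = np.zeros((n, n))
for i in range(n):                          # ring topology, equal weights on self and neighbours
    W[i, i] = 1 / 3
    W[i, (i - 1) % n] = 1 / 3
    W[i, (i + 1) % n] = 1 / 3
for k in range(100):
    x = W @ x                               # one consensus-averaging round
print(np.allclose(x, x.mean()))             # all nodes now agree on the global average
```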
Computer and Information Sciences III, 2012
We consider a class of fully stochastic and fully distributed algorithms that we prove to learn equilibria in games. Specifically, we consider a family of stochastic distributed dynamics that we prove converge weakly (in the sense of weak convergence for probabilistic processes) towards their mean-field limit, i.e., an ordinary differential equation (ODE) in the general case. We then focus on a class of stochastic dynamics where this ODE turns out to be related to multipopulation replicator dynamics. Using known facts about the convergence of this ODE, we discuss the convergence of the initial stochastic dynamics: for general games, there might be non-convergence, but when convergence of the ODE holds, the considered stochastic algorithms converge towards Nash equilibria. For games admitting Lyapunov functions, which we call Lyapunov games, the stochastic dynamics converge. We prove that any ordinal potential game, and hence any potential game, is a Lyapunov game with a multiaffine Lyapunov function. For Lyapunov games with a multiaffine Lyapunov function, we prove that this Lyapunov function is a super-martingale over the stochastic dynamics. This provides a way to bound their convergence time via martingale arguments. This applies in particular to many classes of games that have been considered in the literature, including several load balancing game scenarios and congestion games.
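For concreteness, here is an Euler discretization of the two-population replicator ODE that appears as the mean-field limit, written out for a 2x2 bimatrix game; the payoff matrices and step size are illustrative.

```python
import numpy as np

# Two-population replicator dynamics for a bimatrix game (A, B):
# each strategy's share grows in proportion to its fitness advantage
# over the population average.
A = np.array([[3.0, 0.0], [5.0, 1.0]])   # row player's payoffs
B = A.T                                   # column player's payoffs (symmetric example)
x = np.array([0.6, 0.4])                  # row population mixture
y = np.array([0.3, 0.7])                  # column population mixture
dt = 0.01
for t in range(5000):
    fx, fy = A @ y, B.T @ x               # fitness of each pure strategy
    x = x + dt * x * (fx - x @ fx)        # Euler step of the replicator ODE
    y = y + dt * y * (fy - y @ fy)
```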
Cornell University - arXiv, 2022
The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in the artificial intelligence (AI) research community. However, many research endeavours have been focused on developing practical MARL algorithms whose effectiveness has been studied only empirically, thereby lacking theoretical guarantees. As recent studies have revealed, MARL methods often achieve performance that is unstable in terms of reward monotonicity or suboptimal at convergence. To resolve these issues, in this paper, we introduce a novel framework named Heterogeneous-Agent Mirror Learning (HAML) that provides a general template for MARL algorithmic designs. We prove that algorithms derived from the HAML template satisfy the desired properties of the monotonic improvement of the joint reward and the convergence to Nash equilibrium. We verify the practicality of HAML by proving that the current state-of-the-art cooperative MARL algorithms, HATRPO and HAPPO, are in fact HAML instances. Next, as a natural outcome of our theory, we propose HAML extensions of two well-known RL algorithms, HAA2C (for A2C) and HADDPG (for DDPG), and demonstrate their effectiveness against strong baselines on StarCraftII and Multi-Agent MuJoCo tasks.
Proceedings of the tenth ACM SIGEVO workshop on Foundations of genetic algorithms, 2009
One issue in multi-agent co-adaptive learning concerns convergence. When two (or more) agents play a game with different information and different payoffs, the general behaviour tends to be oscillation around a Nash equilibrium. Several algorithms have been proposed to force convergence to mixed-strategy Nash equilibria in imperfect-information games when the agents are aware of their opponent's strategy. We consider the effect on one such algorithm, the lagging anchor algorithm, when each agent must also infer the gradient information from observations, in the infinitesimal time-step limit. Use of an estimated gradient, either by opponent modelling or stochastic gradient ascent, destabilises the algorithm in a region of parameter space. There are two phases of behaviour. If the rate of estimation is low, the Nash equilibrium becomes unstable in the mean. If the rate is high, the Nash equilibrium is an attractive fixed point in the mean, but the uncertainty acts as narrow-band coloured noise, which causes dampened oscillations.
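A sketch of lagging-anchor-style dynamics on matching pennies (mixed Nash equilibrium at p = q = 0.5), in the exact-gradient case rather than the estimated-gradient regime analysed above: each strategy follows its payoff gradient plus a pull toward a slowly trailing anchor. The step size and anchor coupling are illustrative, not the parameter values studied in the paper.

```python
import numpy as np

eta, nu = 0.01, 1.0        # step size and anchor coupling (illustrative)
p, q = 0.9, 0.2            # P(heads) for the two players
p_bar, q_bar = p, q        # anchors start at the current strategies
for t in range(20000):
    grad_p = 4 * q - 2     # d/dp of the row payoff (2p - 1)(2q - 1)
    grad_q = -(4 * p - 2)  # column player plays the zero-sum opposite
    p += eta * (grad_p + nu * (p_bar - p))
    q += eta * (grad_q + nu * (q_bar - q))
    p_bar += eta * nu * (p - p_bar)       # anchors lag behind the strategies
    q_bar += eta * nu * (q - q_bar)
    p, q = np.clip(p, 0, 1), np.clip(q, 0, 1)
# p and q converge to the mixed equilibrium 0.5 instead of cycling around it.
```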
Automatica, 2012
Multi-agent systems arise in several domains of engineering and can be used to solve problems which are difficult for an individual agent to solve. Strategies for team decision problems, including optimal control, N-player games (H-infinity control and non-zero sum), and so on, are normally computed off-line by solving associated matrix equations such as the coupled Riccati equations or coupled Hamilton-Jacobi equations. However, with that approach, players cannot change their objectives online in real time without calling for a completely new off-line solution for the new strategies. Therefore, in this paper we bring together cooperative control, reinforcement learning, and game theory to present a multi-agent formulation for the online solution of team games. The notion of graphical games is developed for dynamical systems, where the dynamics and performance indices for each node depend only on local neighbor information. It is shown that standard definitions of Nash equilibrium are not sufficient for graphical games and a new definition of "Interactive Nash Equilibrium" is given. We give a cooperative policy iteration algorithm for graphical games that converges to the best response when the neighbors of each agent do not update their policies, and to the cooperative Nash equilibrium when all agents update their policies simultaneously. This is used to develop methods for online adaptive learning solutions of graphical games in real time, along with proofs of stability and convergence.
IEEE Transactions on Systems, Man, and Cybernetics, 1994
A multi-person discrete game where the payoff after each play is stochastic is considered. The distribution of the random payoff is unknown to the players and, furthermore, none of the players knows the strategies or the actual moves of the other players. A learning algorithm for the game based on a decentralized team of Learning Automata is presented. It is proved that all stable stationary points of the algorithm are Nash equilibria for the game. Two special cases of the game are also discussed, namely, the game with common payoff and the relaxation labelling problem. The former has applications such as pattern recognition and the latter is a problem widely studied in computer vision. For the two special cases it is shown that the algorithm always converges to a desirable solution.
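A minimal sketch of a decentralized team of learning automata with a linear reward-inaction style update on a common-payoff game with stochastic rewards; the payoff matrix, noise level, and learning rate are illustrative, and rewards are assumed to be normalized to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(6)
payoff = np.array([[1.0, 0.2],
                   [0.2, 0.6]])                  # common payoff (noisy observations below)
lam = 0.01                                       # learning-automaton step size
p1 = np.ones(2) / 2                              # each automaton keeps its own action probabilities
p2 = np.ones(2) / 2
for t in range(20000):
    a1 = rng.choice(2, p=p1)
    a2 = rng.choice(2, p=p2)
    r = np.clip(payoff[a1, a2] + 0.1 * rng.normal(), 0.0, 1.0)   # stochastic payoff in [0, 1]
    e1, e2 = np.eye(2)[a1], np.eye(2)[a2]
    p1 += lam * r * (e1 - p1)                    # each automaton updates using only its own action
    p2 += lam * r * (e2 - p2)
```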
IEEE Transactions on Automatic Control, 2021
Competitive non-cooperative online decision-making agents whose actions increase congestion of scarce resources constitute a model for widespread modern large-scale applications. To ensure sustainable resource behavior, we introduce a novel method to steer the agents toward a stable population state that fulfills the given coupled resource constraints. The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian. Assuming that the online learning agents have only noisy first-order utility feedback, we show that for a polynomially decaying step size/learning rate, the population dynamics will almost surely converge to a generalized Nash equilibrium. A particular consequence of the latter is the fulfillment of the resource constraints in the asymptotic limit. Moreover, we investigate the finite-time quality of the proposed algorithm by giving a nonasymptotic, time-decaying bound on the expected amount of resource constraint violation.
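In the spirit of the pricing scheme described above, here is a generic primal-dual sketch: agents take noisy gradient steps on their priced utilities with a polynomially decaying step size, while a resource price performs dual ascent on the constraint violation. All functional forms and constants are illustrative, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(7)
n_agents, capacity = 5, 2.0
x = np.zeros(n_agents)                   # each agent's resource usage, kept in [0, 1]
price = 0.0                              # common resource price (dual variable)
for t in range(1, 5001):
    step = 1.0 / t**0.6                  # polynomially decaying step size
    utility_grad = 1.0 - x               # concave utility u_i(x_i) = x_i - x_i**2 / 2
    noisy_grad = utility_grad + 0.1 * rng.normal(size=n_agents)
    x = np.clip(x + step * (noisy_grad - price), 0.0, 1.0)       # priced utility ascent
    price = max(0.0, price + step * (x.sum() - capacity))        # dual ascent on the load
```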
2007
In this work we propose a new paradigm for learning coordination in multi-agent systems. The approach is inspired by social interaction among people, in particular the fact that people tell each other what they think about their actions, and these opinions influence each other's behavior. We propose a model in which agents learn to coordinate their actions by giving opinions about the actions of other agents and by being influenced by the opinions other agents have about their own actions. We use the proposed paradigm to develop a modified version of the Q-learning algorithm. The new algorithm is tested and compared with independent learning (IL) and joint action learning (JAL) in a grid problem with two agents learning to coordinate. Our approach is shown to converge to an optimal equilibrium with higher probability than the IL and JAL Q-learning algorithms, especially when exploration increases. A further advantage of our algorithm is that, unlike JAL algorithms, it does not need to build a complete model of all joint actions.
2009
We propose a new concept for the analysis of games, the TASP, which gives a precise prediction about non-equilibrium play in games whose Nash equilibria are mixed and are unstable under fictitious play-like learning. We show that, when players learn using weighted stochastic fictitious play and so place greater weight on recent experience, the time average of play often converges in these "unstable" games, even while mixed strategies and beliefs continue to cycle.
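A minimal sketch of weighted stochastic fictitious play on matching pennies: beliefs are exponentially discounted (so recent experience weighs more) and actions are logit best replies; the recency weight and temperature are illustrative. The last line tracks the time average of play, the kind of quantity the TASP concerns.

```python
import numpy as np

rng = np.random.default_rng(8)
A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # row payoffs; the column player gets -A
rho, temp = 0.99, 0.1                      # recency weight and logit temperature

def logit_response(values, temp):
    z = values / temp
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

belief_about_col = np.ones(2) / 2
belief_about_row = np.ones(2) / 2
empirical = np.zeros(2)                    # time average of the row player's play
for t in range(1, 10001):
    pi_row = logit_response(A @ belief_about_col, temp)
    pi_col = logit_response(-(A.T @ belief_about_row), temp)
    a_row = rng.choice(2, p=pi_row)
    a_col = rng.choice(2, p=pi_col)
    belief_about_col = rho * belief_about_col + (1 - rho) * np.eye(2)[a_col]
    belief_about_row = rho * belief_about_row + (1 - rho) * np.eye(2)[a_row]
    empirical += (np.eye(2)[a_row] - empirical) / t   # running time average
```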
Reinforcement Learning, 2008
arXiv (Cornell University), 2024
When deployed in the world, a learning agent such as a recommender system or a chatbot often repeatedly interacts with another learning agent (such as a user) over time. In many such two-agent systems, each agent learns separately and the rewards of the two agents are not perfectly aligned. To better understand such cases, we examine the learning dynamics of the two-agent system and the implications for each agent's objective. We model these systems as Stackelberg games with decentralized learning and show that standard regret benchmarks (such as Stackelberg equilibrium payoffs) result in worst-case linear regret for at least one player. To better capture these systems, we construct a relaxed regret benchmark that is tolerant to small learning errors by agents. We show that standard learning algorithms fail to provide sublinear regret, and we develop algorithms to achieve near-optimal O(T^{2/3}) regret for both players with respect to these benchmarks. We further design relaxed environments under which faster learning (O(√T)) is possible. Altogether, our results take a step towards assessing how two-agent interactions in sequential and decentralized learning environments affect the utility of both agents.
Journal of Optimization Theory and Applications
In this paper, we propose non-model-based strategies for locally stable convergence to Nash equilibrium in quadratic noncooperative games where acquisition of information (of two different types) incurs delays. Two sets of results are introduced: (a) one, which we call the cooperative scenario, where each player employs knowledge of the functional form of his payoff and knowledge of the other players' actions, but with delays; and (b) the second one, which we term the noncooperative scenario, where the players have access only to their own payoff values, again with delay. Both approaches are based on the extremum seeking perspective, which has previously been reported for real-time optimization problems and exploits sinusoidal excitation signals to estimate the Gradient (first derivative) and Hessian (second derivative) of unknown quadratic functions. In order to compensate for distinct delays in the inputs of the players, we employ predictor feedback. We apply a small-gain analysis as well as averaging theory in infinite dimensions, due to the infinite-dimensional state of the time delays, in order to obtain local convergence results for the unknown quadratic payoffs to a small neighborhood of the Nash equilibrium. We quantify the size of these residual sets and corroborate the theoretical results numerically on an example of a two-player game with delays.
Keywords: Extremum seeking · Nash equilibrium · (Non)cooperative games · Time delays · Predictor feedback · Averaging in infinite dimensions
Mathematics Subject Classification: 91A10 · 34K33 · 35Q93 · 93D05 · 93C35 · 93C40
Abbreviations: ES, extremum seeking; ODE, ordinary differential equation; PDE, partial differential equation; FDE, functional differential equation; ISS, input-to-state stability
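A minimal extremum-seeking sketch for a two-player quadratic game, without the delays and predictor feedback that are the paper's focus: each player dithers its action with a sinusoid, demodulates its own measured payoff to estimate the gradient, and integrates the estimate. Gains, dither frequencies, and the payoff coefficients are illustrative.

```python
import numpy as np

dt, T = 1e-3, 60.0
a1, a2 = 0.5, 0.5                         # dither amplitudes
w1, w2 = 40.0, 47.0                       # distinct dither frequencies (rad/s)
k = 0.1                                   # integrator gain
th1_hat, th2_hat = 2.0, -2.0              # initial guesses; the Nash equilibrium of the
                                          # payoffs below is near (0.91, 0.91)
for n in range(int(T / dt)):
    t = n * dt
    th1 = th1_hat + a1 * np.sin(w1 * t)   # dithered actions
    th2 = th2_hat + a2 * np.sin(w2 * t)
    J1 = -(th1 - 1.0) ** 2 - 0.2 * th1 * th2   # measured quadratic payoffs (maximized)
    J2 = -(th2 - 1.0) ** 2 - 0.2 * th1 * th2
    # Demodulate each player's own payoff to estimate its gradient, then integrate.
    th1_hat += dt * k * (2.0 / a1) * np.sin(w1 * t) * J1
    th2_hat += dt * k * (2.0 / a2) * np.sin(w2 * t) * J2
```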