Reinforcement learning - Multi-Agent Reinforcement Learning
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article summarizes the chapter on multi-agent reinforcement learning in 《Deep Reinforcement Learning》. If there are any mistakes, corrections are welcome.
The setting of a multi-agent system
A multi-agent system consists of multiple agents that share the same environment and interact with one another. An agent's action changes the state of the environment and therefore affects the other agents.
There are four common relationships between agents:
- Fully cooperative: all agents share the same goal and receive the same reward after taking their actions.
- Fully competitive: one agent's gain implies another agent's loss.
- Mixed cooperation and competition: the agents are divided into groups; agents within a group share the same interests, and the groups compete with one another, as in 《Glory of Kings》.
- Self-interested: an agent's action changes the environment in a way that may benefit or harm the other agents; this relationship looks like a subset of the fully competitive relationship.
In a multi-agent system, some reinforcement-learning terms need to be redefined:
- Reward: the rewards given by the environment depend on the relationship between the agents. If the agents are in a cooperative relationship, they all receive the same reward from the environment; if they are in a competitive relationship, one agent receiving a positive reward implies that another agent receives a negative reward. The reward of the $i$-th agent at time $t$, $R_t^i$, is determined by the state $S_t$ and the actions of all agents $A_t=[A_t^1,A_t^2,\dots,A_t^n]$.
- Action-value function: because the action taken by one agent affects the other agents, the action-value function of the $i$-th agent, $Q_\pi^i(S_t,A_t)$, depends on the current state $S_t$ (which may be the state observed by a single agent or by all agents) as well as on the actions of all agents $A_t=[A_t^1,A_t^2,\dots,A_t^n]$ (and therefore also on the policies of all agents).
- State-value function: the state-value function is the expectation of the action-value function over the actions, and the actions depend on the policy functions, so the state-value function of the $i$-th agent, $V_\pi^i(S_t)$, also depends on the policies of all agents.
In a multi-agent system, the state-value function and the action-value function of a single agent are therefore affected by the other agents.
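In symbols, and using $\theta^j$ for the parameters of the $j$-th agent's policy (a restatement of the definitions above, not an additional assumption):

$$V_\pi^i(S_t)=\mathbb{E}_{A_t^1,\dots,A_t^n}\!\left[\,Q_\pi^i(S_t,A_t)\,\right],\qquad A_t^j\sim\pi\!\left(\cdot\mid S_t;\theta^j\right),$$

so changing any one agent's policy parameters $\theta^j$ changes both $Q_\pi^i$ and $V_\pi^i$ for every agent $i$.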
The multi-agent system under the cooperative setting
Under the multi-agent setting, a single agent may not be able to observe the global state. Let the local observation of the $i$-th agent be $O^i$; we can then take the collection of all agents' local observations to be the global state, i.e.

$$S=[O^1,O^2,\dots,O^m]$$
Under the cooperative setting, all agents receive the same reward from the environment. Let the reward of the $i$-th agent be $R^i$; then

$$R^1=R^2=\dots=R^m$$
In addition, the action-value function and the state-value function of every agent depend on the policy functions of all agents: $\pi(A^1\mid S;\theta^1),\ \pi(A^2\mid S;\theta^2),\ \dots,\ \pi(A^m\mid S;\theta^m)$.
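Since the rewards of all agents are identical under the cooperative setting, their returns, and therefore their value functions, coincide as well:

$$Q_\pi^1=Q_\pi^2=\dots=Q_\pi^m,\qquad V_\pi^1=V_\pi^2=\dots=V_\pi^m.$$

This is what allows a single shared value network in the MAC-A2C algorithm below.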
The objective function of policy learning
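A standard way to write this objective, consistent with the definitions above (the book's exact notation may differ), is that the $i$-th agent seeks to maximize its expected state value,

$$J^i(\theta^1,\dots,\theta^m)=\mathbb{E}_S\!\left[\,V_\pi^i(S)\,\right],$$

where agent $i$ can adjust only its own parameters $\theta^i$, even though $J^i$ depends on the parameters of all agents. Under the cooperative setting all agents share the same objective, $J^1=J^2=\dots=J^m$.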

Multi-agent policy learning under the cooperative setting: MAC-A2C
Because all agents share the same goal and receive the same reward from the environment, they share a single value network $V(s;w)$. The input of the value network is the state $s=[o^1,o^2,\dots,o^m]$, where $o^i$ denotes the observation of the $i$-th agent, and its output is a score for the state $s$.

Each agent has its own policy network; the policy network of the $i$-th agent is denoted $\pi(a^i\mid s;\theta^i)$. Because the agents are in a cooperative relationship, the input of each policy network is also the full state $s=[o^1,o^2,\dots,o^m]$.
The training procedure follows the standard A2C recipe: observe the state $s_t=[o_t^1,\dots,o_t^m]$, let each agent sample its action $a_t^i\sim\pi(\cdot\mid s_t;\theta^i)$, execute the actions to obtain the shared reward $r_t$ and next state $s_{t+1}$, compute the TD target $\hat y_t=r_t+\gamma\,V(s_{t+1};w)$ and the TD error $\delta_t=V(s_t;w)-\hat y_t$, update the shared value network by descending the gradient of $\delta_t^2$, and update each policy network by descending $\delta_t\,\nabla_{\theta^i}\ln\pi(a_t^i\mid s_t;\theta^i)$.
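A minimal PyTorch sketch of one such update (the network sizes, the transition format, and names such as `mac_a2c_step` are illustrative assumptions, not code from the book):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: m agents, each with a small discrete action space.
m, obs_dim, act_dim, gamma = 3, 4, 5, 0.99
state_dim = m * obs_dim                      # s = [o^1, ..., o^m]

# One shared value network V(s; w) and one policy network pi(a^i | s; theta^i) per agent.
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
               for _ in range(m)]
v_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
pi_opts = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in policy_nets]

def mac_a2c_step(obs, actions, reward, next_obs):
    """One cooperative MAC-A2C update from a transition.
    obs / next_obs: lists of m tensors of shape [obs_dim]; actions: list of m ints;
    reward: the shared scalar reward (identical for every agent)."""
    s, s_next = torch.cat(obs), torch.cat(next_obs)

    # TD target and TD error of the shared value network.
    with torch.no_grad():
        td_target = reward + gamma * value_net(s_next)
    delta = value_net(s) - td_target

    # Critic update: minimize the squared TD error.
    v_loss = delta.pow(2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Actor updates: every agent is driven by the same TD error delta,
    # since all agents receive the same reward and share the critic.
    for pi, a, opt in zip(policy_nets, actions, pi_opts):
        dist = torch.distributions.Categorical(logits=pi(s))
        pi_loss = (delta.detach() * dist.log_prob(torch.tensor(a))).mean()
        opt.zero_grad(); pi_loss.backward(); opt.step()
```

The point of the sketch is the structure rather than the hyperparameters: one critic fed the global state drives the policy gradient of every actor.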
After training, decisions are made as follows: the observations of all agents are gathered into $s=[o^1,\dots,o^m]$, and each agent $i$ samples its action from $\pi(a^i\mid s;\theta^i)$; the value network is only needed during training.
In the training process above, every inference of a policy network requires synchronizing the observations of all $m$ agents, and this synchronization is very time-consuming. In practice, the centralized training with decentralized execution trick can be used. Specifically, each agent's policy network makes decisions based only on that agent's own observation; that is, the $i$-th policy network changes from $\pi(a^i\mid s;\theta^i)$ to $\pi(a^i\mid o^i;\theta^i)$. Because the value network still computes from the global state $s$, information about the observations of the other agents is still passed to each agent through the policy gradient. Centralized training then proceeds as before, except that each policy network is fed only the local observation $o^i$, while the shared value network is still fed the full state $s$.
After the model has been trained, decentralized execution is used: the $i$-th agent observes its own $o^i$ and uses its policy network $\pi(a^i\mid o^i;\theta^i)$ to decide which action to take, as in the sketch below.
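A minimal sketch of decentralized execution (again with hypothetical sizes and names); the only change from the policies above is that each policy network now takes a single local observation of dimension `obs_dim` instead of the full state:

```python
import torch
import torch.nn as nn

m, obs_dim, act_dim = 3, 4, 5                # hypothetical sizes

# pi(a^i | o^i; theta^i): each policy network sees only its own observation.
local_policy_nets = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
                     for _ in range(m)]

def decide(local_obs):
    """local_obs: list of m tensors of shape [obs_dim], one per agent.
    No synchronization between agents is needed at decision time."""
    actions = []
    for pi, o in zip(local_policy_nets, local_obs):
        dist = torch.distributions.Categorical(logits=pi(o))
        actions.append(dist.sample().item())
    return actions
```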
Multi-agent policy learning under the non-cooperative setting: MAC-A2C
Under the cooperative setting, the agents share a common goal and can therefore share one value network. Under a non-cooperative setting, the agents have different goals, so each agent has its own policy network and its own value network. Let the observation of the $i$-th agent be $o^i$; its policy network is $\pi(a^i\mid s;\theta^i)$ and its value network is $v(s;w^i)$, where $s$ is the collection of the observations of all agents, i.e. $s=[o^1,o^2,\dots,o^m]$.
The training process of this version of MAC-A2C is as follows. Note that, unlike multi-agent reinforcement learning under the cooperative setting, different agents now receive different rewards from the environment, so each agent computes its own TD error from its own reward and its own value network, as sketched below.
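A minimal sketch of how the update changes in the non-cooperative setting (hypothetical sizes and names, not code from the book): each agent $i$ has its own value network $v(s;w^i)$ and its own reward $r^i$.

```python
import torch
import torch.nn as nn

m, obs_dim, act_dim, gamma = 3, 4, 5, 0.99   # hypothetical sizes
state_dim = m * obs_dim

# One policy network and one value network per agent.
policy_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
               for _ in range(m)]
value_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
              for _ in range(m)]
pi_opts = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in policy_nets]
v_opts = [torch.optim.Adam(v.parameters(), lr=1e-3) for v in value_nets]

def mac_a2c_step_noncoop(obs, actions, rewards, next_obs):
    """rewards: list of m scalars -- each agent has its own reward r^i."""
    s, s_next = torch.cat(obs), torch.cat(next_obs)
    for i in range(m):
        # Per-agent TD target and TD error, using agent i's own reward and critic.
        with torch.no_grad():
            td_target = rewards[i] + gamma * value_nets[i](s_next)
        delta = value_nets[i](s) - td_target

        v_loss = delta.pow(2).mean()
        v_opts[i].zero_grad(); v_loss.backward(); v_opts[i].step()

        dist = torch.distributions.Categorical(logits=policy_nets[i](s))
        pi_loss = (delta.detach() * dist.log_prob(torch.tensor(actions[i]))).mean()
        pi_opts[i].zero_grad(); pi_loss.backward(); pi_opts[i].step()
```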
After training, the decision process is the same as before: the observations are gathered into $s=[o^1,\dots,o^m]$, and each agent samples its action from its policy network $\pi(a^i\mid s;\theta^i)$.
As above, every inference of a policy network requires synchronizing the observations of the $m$ agents, which is very time-consuming. In practice, the centralized training with decentralized execution trick can again be used; the procedure is the same as in the previous section.
Nash equilibrium
Under a non-cooperative setting, the interests of the agents differ: one agent obtaining a larger state value may force another agent's state value to become smaller, so how do we judge whether the algorithm has converged? Under a non-cooperative setting, Nash equilibrium is the criterion for convergence. Specifically, in a multi-agent system, a Nash equilibrium is reached when, with all other agents keeping their policies fixed, no agent $i$ can make its expected return $J^i(\theta^1,\dots,\theta^m)$ larger by changing its own policy parameters $\theta^i$ alone (the policy gradient is zero); that is, no agent can find a policy better than its current one (its parameters can no longer be improved by gradient updates). This state of balance is called a Nash equilibrium.
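In symbols, with $J^i(\theta^1,\dots,\theta^m)$ denoting agent $i$'s expected return as above, the parameters $(\theta^1_\star,\dots,\theta^m_\star)$ form a Nash equilibrium if, for every agent $i$ and every alternative $\theta^i$,

$$J^i(\theta^1_\star,\dots,\theta^i_\star,\dots,\theta^m_\star)\;\ge\;J^i(\theta^1_\star,\dots,\theta^i,\dots,\theta^m_\star).$$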