Reinforcement learning - Multi-Agent Reinforcement Learning
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article summarizes the chapter on multi-agent reinforcement learning in 《Deep Reinforcement Learning》. If there are any mistakes, corrections are welcome.
The setting of a multi-agent system
A multi-agent system consists of multiple agents that share the same environment and interact with one another. An agent's action changes the state of the environment and therefore affects the other agents.
There are four common relationships between agents:
- Fully cooperative: all agents share the same goal and receive the same reward after taking their actions.
- Fully competitive: one agent's gain implies another agent's loss.
- Mixed cooperation and competition: the agents are divided into groups; agents within a group share the same interests, and the groups compete with one another, as in 《Glory of Kings》.
- Self-interested: an agent's action changes the environment in a way that may benefit or harm the other agents; this relationship looks like a subset of the fully competitive relationship.
In a multi-agent system, some reinforcement-learning terms need to be redefined:
- Reward: the rewards given by the environment depend on the relationship between the agents. If the agents are in a cooperative relationship, they all receive the same reward from the environment; if they are in a competitive relationship, one agent receiving a positive reward implies that another agent receives a negative reward. The reward of the $i$-th agent at time $t$, $R_t^i$, is determined by the state $S_t$ and the actions of all agents $A_t=[A_t^1,A_t^2,\dots,A_t^n]$.
- Action-value function: because the action taken by one agent affects the other agents, the action-value function of the $i$-th agent, $Q_\pi^i(S_t,A_t)$, depends on the current state $S_t$ (which may be the state observed by a single agent or by all agents) as well as on the actions of all agents $A_t=[A_t^1,A_t^2,\dots,A_t^n]$ (and therefore also on the policies of all agents).
- State-value function: the state-value function is the expectation of the action-value function over the actions, and the actions depend on the policy functions, so the state-value function of the $i$-th agent, $V_\pi^i(S_t)$, also depends on the policies of all agents.
In a multi-agent system, the state-value function and the action-value function of a single agent are therefore affected by the other agents.
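In symbols, and using $\theta^j$ for the parameters of the $j$-th agent's policy (a restatement of the definitions above, not an additional assumption):

$$V_\pi^i(S_t)=\mathbb{E}_{A_t^1,\dots,A_t^n}\!\left[\,Q_\pi^i(S_t,A_t)\,\right],\qquad A_t^j\sim\pi\!\left(\cdot\mid S_t;\theta^j\right),$$

so changing any one agent's policy parameters $\theta^j$ changes both $Q_\pi^i$ and $V_\pi^i$ for every agent $i$.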
The multi-agent system under the cooperative setting
Under the multi-agent setting, a single agent may not be able to observe the global state. Let the local observation of the $i$-th agent be $O^i$; we can then take the collection of all agents' local observations to be the global state, i.e.

$$S=[O^1,O^2,\dots,O^m]$$
Under the cooperative setting, all agents receive the same reward from the environment. Let the reward of the $i$-th agent be $R^i$; then

$$R^1=R^2=\dots=R^m$$
In addition, the action-value function and the state-value function of every agent depend on the policy functions of all agents: $\pi(A^1\mid S;\theta^1),\ \pi(A^2\mid S;\theta^2),\ \dots,\ \pi(A^m\mid S;\theta^m)$.
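Since the rewards of all agents are identical under the cooperative setting, their returns, and therefore their value functions, coincide as well:

$$Q_\pi^1=Q_\pi^2=\dots=Q_\pi^m,\qquad V_\pi^1=V_\pi^2=\dots=V_\pi^m.$$

This is what allows a single shared value network in the MAC-A2C algorithm below.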
The objective function of policy learning
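A standard way to write this objective, consistent with the definitions above (the book's exact notation may differ), is that the $i$-th agent seeks to maximize its expected state value,

$$J^i(\theta^1,\dots,\theta^m)=\mathbb{E}_S\!\left[\,V_\pi^i(S)\,\right],$$

where agent $i$ can adjust only its own parameters $\theta^i$, even though $J^i$ depends on the parameters of all agents. Under the cooperative setting all agents share the same objective, $J^1=J^2=\dots=J^m$.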

Multi-agent policy learning under the cooperative setting: MAC-A2C
Because all agents share the same goal and receive the same reward from the environment, they share a single value network $V(s;w)$. The input of the value network is the state $s=[o^1,o^2,\dots,o^m]$, where $o^i$ denotes the observation of the $i$-th agent, and its output is a score for the state $s$.

Each agent has its own policy network; the policy network of the $i$-th agent is denoted $\pi(a^i\mid s;\theta^i)$. Because the agents are in a cooperative relationship, the input of each policy network is also the full state $s=[o^1,o^2,\dots,o^m]$.
The training procedure follows the standard A2C recipe: observe the state $s_t=[o_t^1,\dots,o_t^m]$, let each agent sample its action $a_t^i\sim\pi(\cdot\mid s_t;\theta^i)$, execute the actions to obtain the shared reward $r_t$ and next state $s_{t+1}$, compute the TD target $\hat y_t=r_t+\gamma\,V(s_{t+1};w)$ and the TD error $\delta_t=V(s_t;w)-\hat y_t$, update the shared value network by descending the gradient of $\delta_t^2$, and update each policy network by descending $\delta_t\,\nabla_{\theta^i}\ln\pi(a_t^i\mid s_t;\theta^i)$.
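A minimal PyTorch sketch of one such update (the network sizes, the transition format, and names such as `mac_a2c_step` are illustrative assumptions, not code from the book):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: m agents, each with a small discrete action space.
m, obs_dim, act_dim, gamma = 3, 4, 5, 0.99
state_dim = m * obs_dim                      # s = [o^1, ..., o^m]

# One shared value network V(s; w) and one policy network pi(a^i | s; theta^i) per agent.
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
               for _ in range(m)]
v_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
pi_opts = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in policy_nets]

def mac_a2c_step(obs, actions, reward, next_obs):
    """One cooperative MAC-A2C update from a transition.
    obs / next_obs: lists of m tensors of shape [obs_dim]; actions: list of m ints;
    reward: the shared scalar reward (identical for every agent)."""
    s, s_next = torch.cat(obs), torch.cat(next_obs)

    # TD target and TD error of the shared value network.
    with torch.no_grad():
        td_target = reward + gamma * value_net(s_next)
    delta = value_net(s) - td_target

    # Critic update: minimize the squared TD error.
    v_loss = delta.pow(2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # Actor updates: every agent is driven by the same TD error delta,
    # since all agents receive the same reward and share the critic.
    for pi, a, opt in zip(policy_nets, actions, pi_opts):
        dist = torch.distributions.Categorical(logits=pi(s))
        pi_loss = (delta.detach() * dist.log_prob(torch.tensor(a))).mean()
        opt.zero_grad(); pi_loss.backward(); opt.step()
```

The point of the sketch is the structure rather than the hyperparameters: one critic fed the global state drives the policy gradient of every actor.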
After training, decisions are made as follows: the observations of all agents are gathered into $s=[o^1,\dots,o^m]$, and each agent $i$ samples its action from $\pi(a^i\mid s;\theta^i)$; the value network is only needed during training.
In the training process above, every inference of a policy network requires synchronizing the observations of all $m$ agents, and this synchronization is very time-consuming. In practice, the centralized training with decentralized execution trick can be used. Specifically, each agent's policy network makes decisions based only on that agent's own observation; that is, the $i$-th policy network changes from $\pi(a^i\mid s;\theta^i)$ to $\pi(a^i\mid o^i;\theta^i)$. Because the value network still computes from the global state $s$, information about the observations of the other agents is still passed to each agent through the policy gradient. Centralized training then proceeds as before, except that each policy network is fed only the local observation $o^i$, while the shared value network is still fed the full state $s$.
After the model has been trained, decentralized execution is used: the $i$-th agent observes its own $o^i$ and uses its policy network $\pi(a^i\mid o^i;\theta^i)$ to decide which action to take, as in the sketch below.
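A minimal sketch of decentralized execution (again with hypothetical sizes and names); the only change from the policies above is that each policy network now takes a single local observation of dimension `obs_dim` instead of the full state:

```python
import torch
import torch.nn as nn

m, obs_dim, act_dim = 3, 4, 5                # hypothetical sizes

# pi(a^i | o^i; theta^i): each policy network sees only its own observation.
local_policy_nets = [nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
                     for _ in range(m)]

def decide(local_obs):
    """local_obs: list of m tensors of shape [obs_dim], one per agent.
    No synchronization between agents is needed at decision time."""
    actions = []
    for pi, o in zip(local_policy_nets, local_obs):
        dist = torch.distributions.Categorical(logits=pi(o))
        actions.append(dist.sample().item())
    return actions
```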
Multi-agent policy learning under the non-cooperative setting: MAC-A2C
Under the cooperative setting, the agents share a common goal and can therefore share one value network. Under a non-cooperative setting, the agents have different goals, so each agent has its own policy network and its own value network. Let the observation of the $i$-th agent be $o^i$; its policy network is $\pi(a^i\mid s;\theta^i)$ and its value network is $v(s;w^i)$, where $s$ is the collection of the observations of all agents, i.e. $s=[o^1,o^2,\dots,o^m]$.
The training process of this version of MAC-A2C is as follows. Note that, unlike multi-agent reinforcement learning under the cooperative setting, different agents now receive different rewards from the environment, so each agent computes its own TD error from its own reward and its own value network, as sketched below.
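A minimal sketch of how the update changes in the non-cooperative setting (hypothetical sizes and names, not code from the book): each agent $i$ has its own value network $v(s;w^i)$ and its own reward $r^i$.

```python
import torch
import torch.nn as nn

m, obs_dim, act_dim, gamma = 3, 4, 5, 0.99   # hypothetical sizes
state_dim = m * obs_dim

# One policy network and one value network per agent.
policy_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
               for _ in range(m)]
value_nets = [nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
              for _ in range(m)]
pi_opts = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in policy_nets]
v_opts = [torch.optim.Adam(v.parameters(), lr=1e-3) for v in value_nets]

def mac_a2c_step_noncoop(obs, actions, rewards, next_obs):
    """rewards: list of m scalars -- each agent has its own reward r^i."""
    s, s_next = torch.cat(obs), torch.cat(next_obs)
    for i in range(m):
        # Per-agent TD target and TD error, using agent i's own reward and critic.
        with torch.no_grad():
            td_target = rewards[i] + gamma * value_nets[i](s_next)
        delta = value_nets[i](s) - td_target

        v_loss = delta.pow(2).mean()
        v_opts[i].zero_grad(); v_loss.backward(); v_opts[i].step()

        dist = torch.distributions.Categorical(logits=policy_nets[i](s))
        pi_loss = (delta.detach() * dist.log_prob(torch.tensor(actions[i]))).mean()
        pi_opts[i].zero_grad(); pi_loss.backward(); pi_opts[i].step()
```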
After training, the decision process is the same as before: the observations are gathered into $s=[o^1,\dots,o^m]$, and each agent samples its action from its policy network $\pi(a^i\mid s;\theta^i)$.
As above, every inference of a policy network requires synchronizing the observations of the $m$ agents, which is very time-consuming. In practice, the centralized training with decentralized execution trick can again be used; the procedure is the same as in the previous section.
Nash equilibrium
Under a non-cooperative setting, the interests of the agents differ: one agent obtaining a larger state value may force another agent's state value to become smaller, so how do we judge whether the algorithm has converged? Under a non-cooperative setting, Nash equilibrium is the criterion for convergence. Specifically, in a multi-agent system, a Nash equilibrium is reached when, with all other agents keeping their policies fixed, no agent $i$ can make its expected return $J^i(\theta^1,\dots,\theta^m)$ larger by changing its own policy parameters $\theta^i$ alone (the policy gradient is zero); that is, no agent can find a policy better than its current one (its parameters can no longer be improved by gradient updates). This state of balance is called a Nash equilibrium.
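In symbols, with $J^i(\theta^1,\dots,\theta^m)$ denoting agent $i$'s expected return as above, the parameters $(\theta^1_\star,\dots,\theta^m_\star)$ form a Nash equilibrium if, for every agent $i$ and every alternative $\theta^i$,

$$J^i(\theta^1_\star,\dots,\theta^i_\star,\dots,\theta^m_\star)\;\ge\;J^i(\theta^1_\star,\dots,\theta^i,\dots,\theta^m_\star).$$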