Reinforcement learning - Multi-Agent Reinforcement Learning
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article summarizes the chapter on multi-agent reinforcement learning in 《Deep Reinforcement Learning》. If there are any mistakes, please feel free to point them out.
The setting of a multi-agent system
A multi-agent system consists of multiple agents that share the same environment and interact with each other: the action of one agent changes the state of the environment and thereby affects the other agents.
There are four common relationships among the agents:
- Fully cooperative: the agents share the same goal and receive the same reward after taking their actions.
- Fully competitive: one agent's gain leads to another agent's loss.
- Mixed cooperative and competitive: the agents are divided into groups; agents within a group share the same interests, while the groups compete with one another, e.g. 《Glory of Kings》.
- Self-interested: an agent's action changes the environment and may benefit or harm the other agents; this relationship looks like a subset of the fully competitive relationship.
In a multi-agent system, some terms of reinforcement learning need to be redefined:
- Reward: the relationship among the agents determines the rewards given by the environment. If the agents are fully cooperative, they all receive the same reward; if they are fully competitive, one agent receiving a positive reward implies that another receives a negative reward. The reward of the $i$-th agent at time $t$, $R_t^i$, is determined by the state $S_t$ and by the actions of all agents, $A_t=[A_t^1, A_t^2, \dots, A_t^n]$.
- Action value function: because the action performed by one agent affects the other agents, the action value function of the $i$-th agent, $Q_\pi^i(S_t, A_t)$, depends on the current state $S_t$ (which may be the environment state observed by a single agent, or the states observed by all agents) and on the actions of all agents $A_t=[A_t^1, A_t^2, \dots, A_t^n]$ (and therefore on the policies of all agents).
- State value function: the state value function is the expectation of the action value function over the actions, and the actions depend on the policy functions, so the state value function of the $i$-th agent, $V_\pi^i(S_t)$, also depends on the policies of all agents.
In a multi-agent system, then, both the state value function and the action value function of a single agent are affected by the other agents.
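Written compactly (a sketch consistent with the definitions above; the return $U_t^i$ of the $i$-th agent and the joint policy $\pi=(\pi^1,\dots,\pi^n)$ are notation introduced here, not taken from the source):

$$
Q_\pi^i(S_t, A_t) = \mathbb{E}\left[\, U_t^i \mid S_t, A_t \,\right],
\qquad
V_\pi^i(S_t) = \mathbb{E}_{A_t \sim \pi}\left[\, Q_\pi^i(S_t, A_t) \,\right],
\qquad
A_t = [A_t^1, \dots, A_t^n].
$$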
Multi-agent systems under the cooperative setting
Under the multi-agent setting, a single agent may not be able to observe the global state. Let the observation of the $i$-th agent be $O^i$; we can then take the concatenation of the local observations of all agents as the global state, i.e.

$$S = [O^1, O^2, \dots, O^m]$$

Under the cooperative setting, all agents receive the same reward from the environment. Let the reward of the $i$-th agent be $R^i$; then

$$R^1 = R^2 = \dots = R^m$$

Besides, the action value function and the state value function of every agent depend on the policy functions of all agents, $\pi(A^1 \mid S; \theta^1), \pi(A^2 \mid S; \theta^2), \dots, \pi(A^m \mid S; \theta^m)$.
The objective function of policy learning
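Since all agents receive the same reward under the cooperative setting, they share the same return; a natural common objective (stated here as an assumption consistent with that setting rather than a quotation) is the expected state value, which every agent maximizes with respect to its own policy parameters:

$$
J(\theta^1, \theta^2, \dots, \theta^m) = \mathbb{E}_S\!\left[\, V_\pi(S) \,\right],
\qquad
\max_{\theta^i} \; J(\theta^1, \dots, \theta^m) \quad \text{for every agent } i.
$$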

MAC-A2C: multi-agent policy learning under the cooperative setting
Because the agents share the same goal and receive the same reward from the environment, they share a single value network $V(s; w)$. The input of the value network is the state $s = [o^1, o^2, \dots, o^m]$, where $o^i$ denotes the observation of the $i$-th agent; the output is a score of the state $s$.
Each agent has its own policy network; the policy network of the $i$-th agent is denoted $\pi(a^i \mid s; \theta^i)$. Because the agents are in a cooperative relationship, the input of each policy network is also the full state $s = [o^1, o^2, \dots, o^m]$, where $o^i$ denotes the observation of the $i$-th agent.
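A minimal PyTorch-style sketch of this architecture (the class names, layer sizes, and the discrete-action assumption are illustrative choices, not the book's code):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Shared value network V(s; w): scores the global state s = [o^1, ..., o^m]."""
    def __init__(self, obs_dim: int, num_agents: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * num_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)  # one scalar score per state

class PolicyNet(nn.Module):
    """Policy network pi(a^i | s; theta^i) of one agent, over discrete actions."""
    def __init__(self, obs_dim: int, num_agents: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * num_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(s))

num_agents, obs_dim, num_actions = 3, 8, 4               # illustrative sizes
value_net = ValueNet(obs_dim, num_agents)                 # one shared critic V(s; w)
policies = [PolicyNet(obs_dim, num_agents, num_actions)   # one actor per agent
            for _ in range(num_agents)]
```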
The training procedure is a standard actor-critic loop: all agents observe the global state $s_t$, each agent $i$ samples an action $a_t^i \sim \pi(\cdot \mid s_t; \theta^i)$, the joint action is executed, and the common reward $r_t$ and the next state $s_{t+1}$ are observed; the shared value network is then updated by TD learning, and each policy network is updated with the policy gradient weighted by the TD error (a sketch of one such update follows below).
After training, each agent $i$ makes decisions by sampling from its own policy network $\pi(a^i \mid s; \theta^i)$ given the global state $s$.
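Continuing the sketch above, one MAC-A2C update on a collected transition might look as follows (the function name, the `done` handling, and the learning rates are assumptions for illustration):

```python
gamma = 0.99                                              # discount factor (assumed)
value_opt   = torch.optim.Adam(value_net.parameters(), lr=1e-3)
policy_opts = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in policies]

def train_step(s, actions, r, s_next, done):
    """One MAC-A2C update on a transition collected with the current policies;
    all agents share the same reward r."""
    # Critic: TD learning on the shared reward.
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * value_net(s_next)
    td_error = td_target - value_net(s)
    value_loss = td_error.pow(2).mean()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Actors: every policy network is updated with the same TD error.
    for pi, a, opt in zip(policies, actions, policy_opts):
        policy_loss = -(td_error.detach() * pi(s).log_prob(a)).mean()
        opt.zero_grad(); policy_loss.backward(); opt.step()
```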
In the training procedure above, every inference of a policy network requires synchronizing the observations of all $m$ agents, and this synchronization is time-consuming. In practice one can use the trick of centralized training with decentralized execution: each agent's policy network makes decisions based only on that agent's own observation, i.e. the policy network of the $i$-th agent changes from $\pi(a^i \mid s; \theta^i)$ to $\pi(a^i \mid o^i; \theta^i)$. Because the value network still takes the global state $s$ as input, the information observed by the other agents is still propagated to each agent through the policy gradient during centralized training.
After the model is trained, decentralized execution is used: the $i$-th agent observes its own state $o^i$ and uses its policy network $\pi(a^i \mid o^i; \theta^i)$ to decide which action to perform.
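Continuing the sketch above, the decentralized policy network and the execution loop might look like this (the `DecentralizedPolicy` class and the `decentralized_act` helper are hypothetical names introduced here):

```python
class DecentralizedPolicy(nn.Module):
    """Policy network pi(a^i | o^i; theta^i): input is one agent's own observation."""
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, o: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(o))

def decentralized_act(dec_policies, observations):
    """Each agent samples its action from its own observation only; no
    synchronization of the other agents' observations is needed at decision time."""
    with torch.no_grad():
        return [pi(o).sample() for pi, o in zip(dec_policies, observations)]
```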
MAC-A2C: multi-agent policy learning under the non-cooperative setting
Under the cooperative setting the agents share a common goal, so they can share one value network. Under a non-cooperative setting the agents have different goals, so each agent has its own policy network and its own value network. Let the observation of the $i$-th agent be $o^i$; its policy network is $\pi(a^i \mid s; \theta^i)$ and its value network is $v(s; w^i)$, where $s$ is the concatenation of the observations of all agents, i.e. $s = [o^1, o^2, \dots, o^n]$.
The training procedure of MAC-A2C in this setting is analogous to the cooperative case, with one important difference: under a non-cooperative relationship, different agents receive different rewards from the environment, so each agent's value network is trained on that agent's own reward (see the sketch below).
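Continuing the earlier sketches, one update in the non-cooperative setting might look like this (per-agent critics, per-agent rewards; all names are illustrative):

```python
# Non-cooperative setting: one critic (and one critic optimizer) per agent.
value_nets = [ValueNet(obs_dim, num_agents) for _ in range(num_agents)]
value_opts = [torch.optim.Adam(v.parameters(), lr=1e-3) for v in value_nets]

def train_step_noncoop(s, actions, rewards, s_next, done):
    """rewards[i] is agent i's own reward; each agent gets its own TD error."""
    for i, (pi, v, a) in enumerate(zip(policies, value_nets, actions)):
        # Critic i: TD learning on agent i's own reward.
        with torch.no_grad():
            td_target = rewards[i] + gamma * (1.0 - done) * v(s_next)
        td_error = td_target - v(s)
        value_loss = td_error.pow(2).mean()
        value_opts[i].zero_grad(); value_loss.backward(); value_opts[i].step()

        # Actor i: policy gradient weighted by agent i's own TD error.
        policy_loss = -(td_error.detach() * pi(s).log_prob(a)).mean()
        policy_opts[i].zero_grad(); policy_loss.backward(); policy_opts[i].step()
```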
After training, each agent makes decisions with its own policy network.
As in the cooperative case, every policy-network inference in the training procedure above requires synchronizing the observations of all $m$ agents, which is time-consuming; in practice, the same trick of centralized training with decentralized execution described in the previous section can be applied here as well.
Nash equilibrium
Under a non-cooperative relationship the interests of the agents differ: one agent achieving a larger state value may force another agent's state value to become smaller, so how do we judge whether the algorithm has converged? In the non-cooperative setting, Nash equilibrium is the criterion for convergence. Concretely, a multi-agent system is in a Nash equilibrium when, with all other agents keeping their policies fixed, no agent $i$ can increase its expected return $J_i(\theta_1, \dots, \theta_m)$ by changing its own policy parameters $\theta_i$ alone (its policy gradient is zero); that is, no agent can find a policy better than its current one (its parameters can no longer be improved by gradient updates).
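Written as a condition (restating the above, with $\theta_i^\star$ denoting the equilibrium parameters):

$$
J_i(\theta_1^\star, \dots, \theta_{i-1}^\star, \theta_i, \theta_{i+1}^\star, \dots, \theta_m^\star)
\;\le\;
J_i(\theta_1^\star, \dots, \theta_m^\star)
\qquad \text{for every agent } i \text{ and every } \theta_i .
$$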