Reinforcement learning -- SARSA in value learning
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article introduces the SARSA algorithm, which must be used together with the policy-learning content that follows; it cannot be used on its own.
This article consists of my reading notes on the book Deep Reinforcement Learning. If there are mistakes, please point them out.
SARSA
The purpose of value-learning algorithms such as DQN is to fit the optimal action-value function and thereby control the agent's decisions. The SARSA algorithm, in contrast, fits the action-value function $Q_{\pi}(s_t,a_t)$, which evaluates how good a policy $\pi$ is. More concretely, SARSA is usually used in policy learning together with a policy network, forming the Actor-Critic algorithm: the policy network represents a policy $\pi$, takes the state as input, outputs a probability for each action, and is used to control the agent; the SARSA algorithm is used to train the value network, which evaluates the policy network's policy and helps it find a better policy. This article only summarizes the SARSA algorithm.
As shown in the figure below, the SARSA algorithm trains the value network in the Actor-Critic algorithm. The value network fits the action-value function $Q_{\pi}(s_t,a_t)$: its input is the state $s_t$, the size of its output is the number of actions, and each output value is the value of the corresponding action, i.e. $Q_{\pi}(s_t,a_t)$.

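To make the shapes concrete, here is a minimal PyTorch sketch of the two networks described above. The class names, the MLP layers, and the hidden size are my own illustrative assumptions, not the book's implementation.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic: maps a state to one value q(s, a; w) per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # output size = number of actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                # one value per action

class PolicyNetwork(nn.Module):
    """Actor: maps a state to a probability for each action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)   # action probabilities
```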
SARSA algorithm training process
Let the current weights of the value network be $w_{now}$; the output of the value network is $q(s_t,a_t;w_{now})$.
- Observe the current state $s_t$ and obtain the action $a_t$ for the agent to execute from the policy network.
- Use the value network to compute the value of $(s_t,a_t)$: $\hat q_t=q(s_t,a_t;w_{now})$.
- The agent executes action $a_t$; the environment returns the new state $s_{t+1}$ and the reward $r_t$.
- Feed the state $s_{t+1}$ into the policy network to obtain the new action $a_{t+1}$.
- Use the value network to compute the value of $(s_{t+1},a_{t+1})$: $\hat q_{t+1}=q(s_{t+1},a_{t+1};w_{now})$.
- The fitting target of the value network is the Bellman equation; the loss function is $\frac{1}{2}\big[\hat q_t-(r_t+\gamma \hat q_{t+1})\big]^2$. Use back-propagation to update the value network.
- Update the policy network.
As the weights of the policy network change, the policy changes. If the policy changes too frequently, the value network may be unable to keep up with fitting the action-value function; therefore, the value network is usually updated several times before the policy network is updated once.
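The update steps above can be written as one SARSA step for the value network. The sketch below is a minimal PyTorch example under the same illustrative assumptions as the previous snippet (the `ValueNetwork`/`PolicyNetwork` classes, the optimizer, and the hyperparameters are mine); the policy-network (actor) update is omitted.

```python
import torch

gamma = 0.99
value_net = ValueNetwork(state_dim=4, num_actions=2)
policy_net = PolicyNetwork(state_dim=4, num_actions=2)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def sarsa_step(s_t: torch.Tensor, a_t: int, r_t: float, s_next: torch.Tensor) -> int:
    """One TD update of the value network; returns the sampled next action a_{t+1}."""
    with torch.no_grad():
        probs = policy_net(s_next)                   # action probabilities from the actor
        a_next = torch.multinomial(probs, 1).item()  # a_{t+1}, used to form the TD target
        q_next = value_net(s_next)[a_next]           # \hat{q}_{t+1}
        td_target = r_t + gamma * q_next             # r_t + gamma * \hat{q}_{t+1}

    q_t = value_net(s_t)[a_t]                        # \hat{q}_t
    loss = 0.5 * (q_t - td_target) ** 2              # 1/2 [ \hat{q}_t - (r_t + gamma \hat{q}_{t+1}) ]^2
    optimizer.zero_grad()
    loss.backward()                                  # back-propagation
    optimizer.step()                                 # update the value network
    return a_next
```

In line with the note above, `sarsa_step` can be called on several transitions before the policy network is updated once.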
Because the fitting target of the value network is the Bellman equation, there is no overestimation caused by maximization, but bootstrapping is still present; a target network can be introduced to address it. For details, see Reinforcement Learning -- Value Learning DQN.
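As a sketch of that common target-network variant (the notation $w^{-}$ for the target-network weights is mine, not the article's): only the TD target changes, with $\hat q_{t+1}=q(s_{t+1},a_{t+1};w^{-})$ and loss $\frac{1}{2}\big[q(s_t,a_t;w_{now})-(r_t+\gamma\,\hat q_{t+1})\big]^2$, while $w^{-}$ is periodically synchronized with (or slowly averaged toward) $w_{now}$.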
Differences between the SARSA algorithm and DQN
- The purposes of the two differ: DQN fits the optimal action-value function in order to control the agent, whereas the SARSA algorithm fits the action-value function of a given policy in order to evaluate how good that policy is.
- DQN fits the optimal action-value function, so it is off-policy: the behavior policy that controls the agent and the target policy can be different, and an experience replay buffer can be used. The SARSA algorithm fits the action-value function to evaluate a policy, so it is on-policy: the behavior policy and the target policy must be the same, and an experience replay buffer cannot be used. The behavior policy is the policy that controls the agent's actions; the target policy is the policy the network is trying to fit.
- DQN is optimized with the Bellman optimality equation, whereas SARSA uses the Bellman equation.
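To make the last point concrete, the two TD targets can be written side by side (a sketch in the article's notation): DQN uses the Bellman optimality equation, $y_t=r_t+\gamma\,\max_{a} q(s_{t+1},a;w)$, while SARSA uses the Bellman equation, $y_t=r_t+\gamma\, q(s_{t+1},a_{t+1};w)$, where $a_{t+1}$ is sampled from the policy $\pi$.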