Reinforcement learning -- SARSA in value learning
2022-07-28 06:10:00 【Food to doubt life】
Preface
This article introduces the SARSA algorithm, which must be used together with the policy-learning content that follows; it cannot be used on its own.
This article consists of my reading notes on the book Deep Reinforcement Learning. If there are mistakes, please point them out.
SARSA
The purpose of value-learning algorithms such as DQN is to fit the optimal action-value function and thereby control the agent's decisions. The SARSA algorithm, in contrast, fits the action-value function $Q_{\pi}(s_t,a_t)$, which evaluates how good a policy $\pi$ is. More concretely, SARSA is usually used in policy learning together with a policy network, forming the Actor-Critic algorithm: the policy network represents a policy $\pi$, takes the state as input, outputs a probability for each action, and is used to control the agent; the SARSA algorithm is used to train the value network, which evaluates the policy network's policy and helps it find a better policy. This article only summarizes the SARSA algorithm.
As shown in the figure below, the SARSA algorithm trains the value network in the Actor-Critic algorithm. The value network fits the action-value function $Q_{\pi}(s_t,a_t)$: its input is the state $s_t$, the size of its output is the number of actions, and each output value is the value of the corresponding action, i.e. $Q_{\pi}(s_t,a_t)$.

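To make the shapes concrete, here is a minimal PyTorch sketch of the two networks described above. The class names, the MLP layers, and the hidden size are my own illustrative assumptions, not the book's implementation.

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Critic: maps a state to one value q(s, a; w) per action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # output size = number of actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                # one value per action

class PolicyNetwork(nn.Module):
    """Actor: maps a state to a probability for each action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)   # action probabilities
```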
SARSA algorithm training process
Let the current weights of the value network be $w_{now}$; the output of the value network is $q(s_t,a_t;w_{now})$.
- Observe the current state $s_t$ and obtain the action $a_t$ for the agent to execute from the policy network.
- Use the value network to compute the value of $(s_t,a_t)$: $\hat q_t=q(s_t,a_t;w_{now})$.
- The agent executes action $a_t$; the environment returns the new state $s_{t+1}$ and the reward $r_t$.
- Feed the state $s_{t+1}$ into the policy network to obtain the new action $a_{t+1}$.
- Use the value network to compute the value of $(s_{t+1},a_{t+1})$: $\hat q_{t+1}=q(s_{t+1},a_{t+1};w_{now})$.
- The fitting target of the value network is the Bellman equation; the loss function is $\frac{1}{2}\big[\hat q_t-(r_t+\gamma \hat q_{t+1})\big]^2$. Use back-propagation to update the value network.
- Update the policy network.
As the weights of the policy network change, the policy changes. If the policy changes too frequently, the value network may be unable to keep up with fitting the action-value function; therefore, the value network is usually updated several times before the policy network is updated once.
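The update steps above can be written as one SARSA step for the value network. The sketch below is a minimal PyTorch example under the same illustrative assumptions as the previous snippet (the `ValueNetwork`/`PolicyNetwork` classes, the optimizer, and the hyperparameters are mine); the policy-network (actor) update is omitted.

```python
import torch

gamma = 0.99
value_net = ValueNetwork(state_dim=4, num_actions=2)
policy_net = PolicyNetwork(state_dim=4, num_actions=2)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def sarsa_step(s_t: torch.Tensor, a_t: int, r_t: float, s_next: torch.Tensor) -> int:
    """One TD update of the value network; returns the sampled next action a_{t+1}."""
    with torch.no_grad():
        probs = policy_net(s_next)                   # action probabilities from the actor
        a_next = torch.multinomial(probs, 1).item()  # a_{t+1}, used to form the TD target
        q_next = value_net(s_next)[a_next]           # \hat{q}_{t+1}
        td_target = r_t + gamma * q_next             # r_t + gamma * \hat{q}_{t+1}

    q_t = value_net(s_t)[a_t]                        # \hat{q}_t
    loss = 0.5 * (q_t - td_target) ** 2              # 1/2 [ \hat{q}_t - (r_t + gamma \hat{q}_{t+1}) ]^2
    optimizer.zero_grad()
    loss.backward()                                  # back-propagation
    optimizer.step()                                 # update the value network
    return a_next
```

In line with the note above, `sarsa_step` can be called on several transitions before the policy network is updated once.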
Because the fitting target of the value network is the Bellman equation, there is no overestimation caused by maximization, but bootstrapping is still present; a target network can be introduced to address it. For details, see Reinforcement Learning -- Value Learning DQN.
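As a sketch of that common target-network variant (the notation $w^{-}$ for the target-network weights is mine, not the article's): only the TD target changes, with $\hat q_{t+1}=q(s_{t+1},a_{t+1};w^{-})$ and loss $\frac{1}{2}\big[q(s_t,a_t;w_{now})-(r_t+\gamma\,\hat q_{t+1})\big]^2$, while $w^{-}$ is periodically synchronized with (or slowly averaged toward) $w_{now}$.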
Differences between the SARSA algorithm and DQN
- The purposes of the two differ: DQN fits the optimal action-value function in order to control the agent, whereas the SARSA algorithm fits the action-value function of a given policy in order to evaluate how good that policy is.
- DQN fits the optimal action-value function, so it is off-policy: the behavior policy that controls the agent and the target policy can be different, and an experience replay buffer can be used. The SARSA algorithm fits the action-value function to evaluate a policy, so it is on-policy: the behavior policy and the target policy must be the same, and an experience replay buffer cannot be used. The behavior policy is the policy that controls the agent's actions; the target policy is the policy the network is trying to fit.
- DQN is optimized with the Bellman optimality equation, whereas SARSA uses the Bellman equation.
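To make the last point concrete, the two TD targets can be written side by side (a sketch in the article's notation): DQN uses the Bellman optimality equation, $y_t=r_t+\gamma\,\max_{a} q(s_{t+1},a;w)$, while SARSA uses the Bellman equation, $y_t=r_t+\gamma\, q(s_{t+1},a_{t+1};w)$, where $a_{t+1}$ is sampled from the policy $\pi$.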