[Morvan Reinforcement Learning] Video Notes (3) 1. What Is SARSA?

2022-07-24 09:17:00 【Your sister Xuan】

Section 7: What Is SARSA?

SARSA is an algorithm similar to Q-learning. Q-learning was introduced in the previous set of notes:
[Morvan Reinforcement Learning] Video Notes (2) 1. What Is Q-Learning?
Like Q-learning, SARSA uses a "Q table" and learns by updating that table.
[Figure: the SARSA update]
As the figure above shows, the SARSA update also has two parts: the target Q value (the "Q reality") and the estimated Q value. The estimated Q value is read directly from the Q table, but the target Q value is computed differently than in Q-learning.
First, we have a sequence S, A, R, S', A'. To compute the target value, we need to choose the next action A' for S'. A' is not the action with the largest Q value in the table; it is the action that will actually be taken next, that is, an action chosen with some randomness. Everything else is the same as in Q-learning: the difference between the target value and the estimated value is used to update the Q table.
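The update described above can be written as a single rule. This is the usual textbook form rather than something taken from the notes: the learning rate $\alpha$ and the discount factor $\gamma$ are assumed symbols that the text does not name.

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma\, Q(S', A') - Q(S, A) \right]
$$

Here $R + \gamma\, Q(S', A')$ plays the role of the target value, $Q(S, A)$ is the estimated value, and the bracketed term is the difference between them.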

Differences from Q-learning

Q-learning is off-policy: the policy used to update the Q values is different from the policy used to sample actions. SARSA is on-policy: both the update and the sampling use the same policy (generally $\epsilon$-greedy), so its updates carry more randomness. Below is the pseudocode for Q-learning (top) and SARSA (bottom):
[Figure: pseudocode for Q-learning (top) and SARSA (bottom)]
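Both algorithms rely on $\epsilon$-greedy action selection, which can be sketched as follows. This is a minimal illustration, not code from the video: the function name `epsilon_greedy`, its parameters, and the choice to store one state's Q values as a plain list are all assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from the Q values of a single state.

    q_values: list of Q values, one per action (hypothetical layout).
    epsilon:  probability of exploring instead of acting greedily.
    rng:      source of randomness; defaults to the random module.
    """
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the largest Q value
    # (the first one wins if several are tied).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` the choice is always greedy; larger values of `epsilon` make random actions more frequent.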

The two differ considerably in the update step. The Q-learning and SARSA procedures can be described as follows:

Q-learning: use $\epsilon$-greedy to pick the action $a$ for state $s$ $\rightarrow$ interact with the environment to obtain the reward $R$ and the next state $s'$ $\rightarrow$ directly take the largest Q value in $s'$, $\max_{a'} Q(s', a')$, to estimate the target value $\rightarrow$ update the parameters $\rightarrow$ move to the next state
SARSA: take the action $a$ chosen by $\epsilon$-greedy in the previous step $\rightarrow$ interact with the environment to obtain the reward $R$ and the next state $s'$ $\rightarrow$ use $\epsilon$-greedy to choose the next action $a'$ for $s'$ $\rightarrow$ update the parameters using $Q(s', a')$ $\rightarrow$ move to the next state and action
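The two update steps above can be sketched side by side in Python. This is an illustrative sketch under stated assumptions, not the implementation from the video: the Q table is assumed to be a list of lists indexed as `Q[state][action]`, the function names are hypothetical, and the learning rate `alpha` and discount factor `gamma` are assumed parameters with arbitrary defaults. Terminal states are not handled.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy update: the target uses the largest Q value in s_next,
    regardless of which action will actually be taken there."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: the target uses Q[s_next][a_next], where a_next
    is the action already chosen (e.g. by epsilon-greedy) for s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

The only difference is the target line: Q-learning needs no `a_next` argument because it always looks at the maximum, while SARSA must be told which action was actually sampled and then carries that action into the next step.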

When updating, Q-learning uses the action with the largest Q value, whereas SARSA directly uses the action that $\epsilon$-greedy actually samples for the next step. In other words, SARSA uses the value of the action it will really take, while Q-learning uses a greedily estimated "true value". The difference between Q-learning and SARSA also illustrates how on-policy and off-policy methods are alike and how they differ. (Q-learning is off-policy; SARSA is on-policy.)
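A small worked example may make the contrast concrete. All numbers are made up for illustration, and the discount factor $\gamma$ is an assumed symbol not named in the notes. Suppose the next state $s'$ has two actions with $Q(s', a_1) = 1$ and $Q(s', a_2) = 3$, the reward is $R = 0$, $\gamma = 0.9$, and $\epsilon$-greedy happens to sample the exploratory action $a' = a_1$:

$$
\text{Q-learning target} = R + \gamma \max_{a'} Q(s', a') = 0 + 0.9 \times 3 = 2.7
$$

$$
\text{SARSA target} = R + \gamma\, Q(s', a_1) = 0 + 0.9 \times 1 = 0.9
$$

If $\epsilon$-greedy had sampled the greedy action $a_2$ instead, the two targets would coincide.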

Previous: [Morvan Reinforcement Learning] Video Notes (2) 3. Maze Walking with Q-Learning
Next: [Morvan Reinforcement Learning] Video Notes (3) 2. Maze Walking with SARSA

Original site

Copyright notice
This article was written by [Your sister Xuan]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/204/202207221617232787.html