[Morvan Reinforcement Learning] Video Notes (III) 1. What is SARSA?
2022-07-24 09:17:00 【Your sister Xuan】
Section 7: What is SARSA?
SARSA is an algorithm similar to Q-learning. Q-learning was introduced in the previous notes:
[Morvan Reinforcement Learning] Video Notes (II) 1. What is Q-Learning?
Like Q-learning, SARSA uses a Q-table and learns by updating that table.
As shown in the figure above, the SARSA update also has two parts: the target ("actual") Q-value and the estimated Q-value. The estimated Q-value is read directly from the Q-table, but the target Q-value is computed differently than in Q-learning.
First, we have a sequence S, A, R, S', A'. To compute the target value, we need the next action A' to be taken in S'. A' is not simply the action with the largest Q-value in the table; it is the action the agent will actually take next, so it carries some randomness. Everything else is the same as in Q-learning: the difference between the target and the estimate is then used to update the Q-table.
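The update described above can be written as a single rule. This is the standard textbook form of the SARSA update; the learning rate $\alpha$ and the discount factor $\gamma$ are not named in the text above and are introduced here as the usual symbols.

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \, Q(S', A') - Q(S, A) \right]
$$

Here $R + \gamma \, Q(S', A')$ is the target ("actual") value and $Q(S, A)$ is the estimate, so the bracket is exactly the difference between the two that drives the update.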
Differences from Q-learning
Q-learning is off-policy: the policy used to update the Q-values differs from the policy used to sample actions. SARSA is on-policy: updating and sampling both use the same policy, typically $\epsilon$-greedy, so its updates carry more randomness. Here is the pseudocode for Q-learning (top) and SARSA (bottom):

As the pseudocode shows, the two differ mainly in the update step. The Q-learning and SARSA procedures are, respectively:
- Q-learning: use $\epsilon$-greedy to choose action $a$ in state $s$ $\rightarrow$ interact with the environment to obtain reward $R$ and next state $s'$ $\rightarrow$ take the largest value $\max_{a'} Q(s', a')$ as the estimate of the target $\rightarrow$ update the Q-value $\rightarrow$ move to the next state
- SARSA: take the action $a$ chosen by $\epsilon$-greedy in the previous step $\rightarrow$ interact with the environment to obtain reward $R$ and next state $s'$ $\rightarrow$ use $\epsilon$-greedy to choose the next action $a'$ in $s'$ $\rightarrow$ use $Q(s', a')$ to update the Q-value $\rightarrow$ move to the next state and action
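The two procedures above can be sketched side by side in Python. This is a minimal illustration rather than the code from the video series: the five-state corridor environment, the helper names (`step`, `choose`, `train`) and the hyperparameter values are all assumptions made for the example.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(s, a):
    """Toy environment (an assumption): returns (next_state, reward, done)."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done


def choose(Q, s, eps, rng):
    """Epsilon-greedy selection over Q[s], breaking ties at random."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])


def q_learning_episode(Q, rng, alpha=0.1, gamma=0.9, eps=0.1):
    s, done = 0, False
    while not done:
        a = choose(Q, s, eps, rng)            # behaviour policy: epsilon-greedy
        s_next, r, done = step(s, a)
        # Target uses the greedy value of s', whatever action is taken next.
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next                            # only the state is carried over


def sarsa_episode(Q, rng, alpha=0.1, gamma=0.9, eps=0.1):
    s, done = 0, False
    a = choose(Q, s, eps, rng)                # first action is chosen before the loop
    while not done:
        s_next, r, done = step(s, a)
        a_next = choose(Q, s_next, eps, rng)  # the action that will actually be taken
        target = r if done else r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])
        s, a = s_next, a_next                 # carry over both state and action


def train(episode_fn, episodes=200, seed=0):
    """Run one of the episode functions repeatedly on a fresh Q-table."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        episode_fn(Q, rng)
    return Q
```

Calling `train(q_learning_episode)` or `train(sarsa_episode)` returns the learned table as a list of `[left, right]` values per state. The only structural differences between the two functions are the `target` line and what is carried into the next iteration, which matches the two bullet points above.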
When updating, Q-learning uses the action with the maximum Q-value, whereas SARSA uses the action actually sampled by the next $\epsilon$-greedy step. In other words, SARSA builds its target from the action it will really take, while Q-learning's "target" is a greedy estimate. This difference between Q-learning and SARSA also illustrates the difference between on-policy and off-policy methods (Q-learning is off-policy; SARSA is on-policy).
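A small worked example makes the contrast concrete. The numbers below are made up for illustration, and the function names are hypothetical: suppose that in the next state the table holds the values 1.0 and 3.0 for two actions, and the $\epsilon$-greedy step happened to explore and picked the first one.

```python
def q_learning_target(r, gamma, q_next):
    """Off-policy target: reward plus discounted best value in the next state."""
    return r + gamma * max(q_next)


def sarsa_target(r, gamma, q_next, a_next):
    """On-policy target: reward plus discounted value of the action actually chosen."""
    return r + gamma * q_next[a_next]


gamma = 0.9            # discount factor (assumed value)
r = 0.0                # reward observed on the transition into s'
q_next = [1.0, 3.0]    # hypothetical Q(s', .) for two actions
a_next = 0             # action sampled by epsilon-greedy in s' (an exploratory pick)

print("Q-learning target:", q_learning_target(r, gamma, q_next))
print("SARSA target:     ", sarsa_target(r, gamma, q_next, a_next))
```

With these numbers the Q-learning target is 0.9 × 3.0 = 2.7, while the SARSA target is 0.9 × 1.0 = 0.9, because SARSA accounts for the exploratory action it is about to take. If the sampled action had been the greedy one, the two targets would coincide.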
Previous: [Morvan Reinforcement Learning] Video Notes (II) 3. Maze walking with the Q-learning algorithm
Next: [Morvan Reinforcement Learning] Video Notes (III) 2. Maze walking with SARSA