Reinforcement Learning Notes: Sutton Book Chapter 3 Exercise Solutions (Ex17~Ex29)
2022-07-06 15:17:00 【Slow ploughing of stupid cattle】
Contents
Exercise 3.17
What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s,a)$ in terms of the action values, $q_\pi(s',a')$, of possible successors to the state–action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
Solution:
As the backup diagram above shows, the transition from $(s,a)$ to each possible next state $s'$ is governed by the dynamics $p$, and the return along each branch consists of two parts: the immediate reward $r$ and the (discounted) state value of $s'$. This gives (we already derived this relationship in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

$$q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\big[r + \gamma v_\pi(s')\big] \tag{1}$$

Further, from the same kind of backup diagram (cf. Exercise 3.12), the state-value function can be expressed in terms of the action-value function:

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s,a) \tag{2}$$

Substituting (2) into (1) gives:

$$q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')\Big]$$

This is the Bellman equation for the action-value function!

Incidentally, since the state-value function and the action-value function can each be expressed in terms of the other, substituting one expression into the other to eliminate either function yields the Bellman equation for the remaining one. For the derivation of the Bellman equation for the state-value function, see Reinforcement Learning Notes: Policy, Value Function and Bellman Equation.
Exercise 3.18
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s,a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
Solution: As shown in the figure above, from state $s$ each action node is reached with probability $\pi(a \mid s)$. The action value at each action node is $q_\pi(s,a)$, and the state value of $s$ is the expectation of these action values under the policy:

$$v_\pi(s) = \mathbb{E}_\pi\big[q_\pi(S_t, A_t) \,\big|\, S_t = s\big] = \sum_a \pi(a \mid s)\, q_\pi(s,a)$$
Exercise 3.19
The value of an action, $q_\pi(s,a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s,a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s',r \mid s,a)$ defined by (3.2), such that no expected value notation appears in the equation.
Solution:
From $(s,a)$, the dynamics $p(s',r \mid s,a)$ determine the probability of reaching each branch shown in the diagram above. The return along each branch consists of the immediate reward $r$ plus the state value of the next state $s'$; the latter belongs to the next time step $t+1$, so viewed from time $t$ it carries the discount factor $\gamma$. The action value is therefore the expectation (the probability-weighted mean) of the branch returns:

$$q_\pi(s,a) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \,\big|\, S_t = s, A_t = a\big] = \sum_{s',r} p(s',r \mid s,a)\big[r + \gamma v_\pi(s')\big]$$
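The two identities in Exercises 3.18 and 3.19 can be checked numerically on a toy problem. Below is a minimal sketch using a hypothetical two-state MDP (all transition probabilities and rewards are invented for illustration):

```python
# Hypothetical two-state MDP (all numbers invented for illustration) used to
# check q_pi(s,a) = sum_{s',r} p(s',r|s,a)[r + gamma*v_pi(s')]  (Ex 3.19)
# and   v_pi(s)   = sum_a pi(a|s) q_pi(s,a)                     (Ex 3.18).
gamma = 0.9
states, actions = (0, 1), (0, 1)
# p[s][a] = list of (probability, next_state, reward) branches
p = {
    0: {0: [(1.0, 0, 1.0)], 1: [(0.5, 0, 0.0), (0.5, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, -1.0)]},
}
pi = {0: (0.5, 0.5), 1: (1.0, 0.0)}  # pi[s][a]

# Solve v_pi by repeatedly applying the Bellman expectation backup.
v = {s: 0.0 for s in states}
for _ in range(1000):
    v = {s: sum(pi[s][a] * sum(prob * (r + gamma * v[s2])
                               for prob, s2, r in p[s][a])
                for a in actions)
         for s in states}

# One-step backup from v_pi gives q_pi (Exercise 3.19) ...
q = {(s, a): sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[s][a])
     for s in states for a in actions}

# ... and averaging q_pi under the policy recovers v_pi (Exercise 3.18).
for s in states:
    assert abs(v[s] - sum(pi[s][a] * q[(s, a)] for a in actions)) < 1e-8
print("Bellman consistency checks passed")
```

The fixed-point iteration converges because the Bellman expectation backup is a $\gamma$-contraction; any small MDP of your own would work equally well here.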
Exercise 3.20
Draw or describe the optimal state-value function for the golf example.
Exercise 3.21
Draw or describe the contours of the optimal action-value function for putting, $q_*(s, \mathtt{putter})$, for the golf example.
Exercise 3.22
Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_\text{left}$ and $\pi_\text{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
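The comparison can be sketched numerically. Assuming the usual reading of the diagram (left yields reward +1 then 0 on the step back to the top; right yields 0 then +2), the returns from the top state are geometric series:

```python
# Returns from the top state of the continuing MDP in Exercise 3.22
# (assumed rewards: left -> +1 then 0; right -> 0 then +2, repeating).
def return_left(gamma):
    # G = 1 + 0*gamma + 1*gamma^2 + ... = 1 / (1 - gamma^2)
    return 1.0 / (1.0 - gamma**2)

def return_right(gamma):
    # G = 0 + 2*gamma + 0 + 2*gamma^3 + ... = 2*gamma / (1 - gamma^2)
    return 2.0 * gamma / (1.0 - gamma**2)

for gamma in (0.0, 0.9, 0.5):
    gl, gr = return_left(gamma), return_right(gamma)
    best = "left" if gl > gr else ("right" if gr > gl else "either (tie)")
    print(f"gamma={gamma}: G_left={gl:.3f}, G_right={gr:.3f} -> {best}")
```

Under these assumed rewards, $\pi_\text{left}$ wins at $\gamma = 0$, $\pi_\text{right}$ wins at $\gamma = 0.9$, and the two policies tie at $\gamma = 0.5$.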
Exercise 3.23
Give the Bellman equation for $q_*$ for the recycling robot.
Exercise 3.24
Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
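As a sketch of the computation: under the optimal policy the agent jumps from the best state A to A' (reward +10) and then walks back to A in four steps (reward 0), a repeating five-step cycle, so $v_*(A) = 10 + \gamma^5 v_*(A)$, i.e. $v_*(A) = 10/(1-\gamma^5)$ with $\gamma = 0.9$:

```python
# Exercise 3.24: optimal value of the best state A in the 5x5 gridworld.
# Under the optimal policy the agent repeats a 5-step cycle:
# jump A -> A' (reward +10), then 4 steps back up to A (reward 0).
gamma = 0.9
# v*(A) = 10 + gamma^5 * v*(A)  =>  v*(A) = 10 / (1 - gamma^5)
v_star_A = 10.0 / (1.0 - gamma**5)
print(f"{v_star_A:.3f}")  # 24.419
```

This agrees with the 24.4 reported in Figure 3.5 to one decimal place.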
Exercise 3.25
Give an equation for $v_*$ in terms of $q_*$.
Solution: $v_*$ is the optimal state-value function. By definition, it equals the largest of the optimal action values obtainable by taking some action $a$ in state $s$ and following the optimal policy thereafter:

$$v_*(s) = \max_a q_*(s,a)$$
Exercise 3.26
Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.
Solution: cf. Exercise 3.19.
The optimal action-value function backs up, for each next state $s'$, the optimal state-value function, so:

$$q_*(s,a) = \sum_{s',r} p(s',r \mid s,a)\big[r + \gamma v_*(s')\big]$$
Exercise 3.27
Give an equation for $\pi_*$ in terms of $q_*$.
Solution: A policy chooses an action from each state $s$; the optimal policy chooses an optimal action from every state $s$, denoted $a_*$. That is, $a_*$ is selected with probability 1 and every non-optimal action with probability 0. Note that a state $s$ may have more than one optimal action; in that case any one of them may be chosen, but the action values of all optimal actions must be equal.

First, the optimal action in state $s$ satisfies:

$$a_* = \arg\max_a q_*(s,a)$$

Second, the optimal policy can be written as (for simplicity, assuming a single optimal action in each state):

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \arg\max_{a'} q_*(s,a') \\ 0, & \text{otherwise} \end{cases}$$
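The argmax relation above can be sketched in code; the action values below are hypothetical, for a single state:

```python
# Sketch: extracting a deterministic optimal policy from q* by argmax.
# The action values are hypothetical numbers for one state.
q_star = {"left": 5.263, "right": 9.474}

def pi_star(a, q):
    # pi*(a|s) = 1 if a is the argmax action, else 0
    best = max(q, key=q.get)
    return 1.0 if a == best else 0.0

assert pi_star("right", q_star) == 1.0
assert pi_star("left", q_star) == 0.0
```

With ties (multiple optimal actions), `max` picks one of them arbitrarily, which is exactly the freedom the exercise describes.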
Exercise 3.28
Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.
Solution: Combining Exercises 3.26 and 3.27 (substituting the solution of 3.26 into that of 3.27) gives:

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \arg\max_{a'} \sum_{s',r} p(s',r \mid s,a')\big[r + \gamma v_*(s')\big] \\ 0, & \text{otherwise} \end{cases}$$
Exercise 3.29
Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).
Solution: The four equations all follow by the same substitution; details omitted.
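As a sketch of what that substitution yields, using $p(s' \mid s,a) = \sum_r p(s',r \mid s,a)$ and $r(s,a) = \sum_{s',r} r\, p(s',r \mid s,a)$, the four Bellman equations become:

```latex
\begin{align*}
v_\pi(s)   &= \sum_a \pi(a \mid s)\Big[r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v_\pi(s')\Big] \\
q_\pi(s,a) &= r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a') \\
v_*(s)     &= \max_a \Big[r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v_*(s')\Big] \\
q_*(s,a)   &= r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} q_*(s',a')
\end{align*}
```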
Back to the master table of contents: Reinforcement learning notes: master contents https://chenxiaoyuan.blog.csdn.net/article/details/121715424
For the first half of Chapter 3 of the Sutton RL book (2nd edition), see: Reinforcement Learning Notes: Sutton Book Chapter 3 Exercise Solutions (Ex1~Ex16) https://blog.csdn.net/chenxy_bwave/article/details/122522897