Reinforcement learning notes: Sutton book Chapter III exercise explanations (Ex17~Ex29)
2022-07-06 15:17:00 【Slow ploughing of stupid cattle】
Exercise 3.17
What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s,a)$ in terms of the action values, $q_\pi(s',a')$, of possible successors to the state–action pair $(s,a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
Explanation:
As the backup diagram above shows, starting from $(s,a)$, which successor state $s'$ is reached is determined by the dynamics $p$. The total return of each branch consists of two parts: the immediate reward $r$, and the state value of the successor state $s'$ (discounted, of course). This gives (we already obtained this relationship in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

$$q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr] \tag{1}$$

Further, again from the backup diagram (compare Exercise 3.12), the state-value function can be expressed in terms of the action-value function:

$$v_\pi(s') = \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a') \tag{2}$$

Substituting (2) into (1) gives:

$$q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a)\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')\Bigr]$$

This is the Bellman equation for the action-value function!

By the way, since the state-value function and the action-value function can each be expressed in terms of the other, substituting one of these mutual expressions into the other to eliminate one of the two functions yields the Bellman equation for the one that remains. For the derivation of the Bellman equation for the state-value function, see Reinforcement learning notes: Policies, value functions and the Bellman equation.
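To make this concrete, here is a minimal numerical sketch (not from the original post): it builds a small made-up MDP and policy, solves for $q_\pi$ exactly as a linear system, and checks that the Bellman equation derived above holds. All numbers, shapes, and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch on a hypothetical 2-state, 2-action MDP: verify the
# Bellman equation for q_pi. Everything here is made up for illustration.
rng = np.random.default_rng(0)
n_s, n_a, gamma = 2, 2, 0.9

p = rng.random((n_s, n_a, n_s))            # p[s, a, s']: transition probs
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a, n_s))            # r[s, a, s']: expected rewards
pi = rng.random((n_s, n_a))                # pi[s, a]: policy
pi /= pi.sum(axis=1, keepdims=True)

# Solve q = r_sa + gamma * P_pi q exactly (a linear system in q).
r_sa = (p * r).sum(axis=2).reshape(-1)                        # E[R | s, a]
P_pi = np.einsum('kas,sb->kasb', p, pi).reshape(n_s * n_a, n_s * n_a)
q = np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi, r_sa).reshape(n_s, n_a)

# Right-hand side of the Bellman equation derived above.
v = (pi * q).sum(axis=1)                                      # v_pi(s')
rhs = (p * (r + gamma * v)).sum(axis=2)
assert np.allclose(q, rhs)                                    # equation holds
```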
Exercise 3.18
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s,a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
Explanation: As shown in the figure above, from state $s$ each action node is reached with a probability determined by the policy $\pi(a \mid s)$. The action value of each action node is $q_\pi(s,a)$, and the value of state $s$ is the expectation of these action values. So we obtain the state-value function as the expectation of the action-value function:

$$v_\pi(s) = \mathbb{E}_\pi\bigl[q_\pi(s, A_t) \mid S_t = s\bigr] = \sum_a \pi(a \mid s)\, q_\pi(s,a)$$
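As a quick numerical illustration (all numbers made up, not from the original post):

```python
import numpy as np

# Sketch: v_pi(s) as the policy-weighted average of q_pi(s, .) for one
# state with three actions; the numbers are made up for illustration.
pi_s = np.array([0.2, 0.5, 0.3])    # pi(a|s), sums to 1
q_s = np.array([1.0, 2.0, -0.5])    # q_pi(s, a)
v_s = pi_s @ q_s                    # sum_a pi(a|s) * q_pi(s, a)
print(v_s)                          # 1.05
```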
Exercise 3.19
The value of an action, $q_\pi(s,a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s,a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s',r \mid s,a)$ defined by (3.2), such that no expected value notation appears in the equation.
Explanation:
Starting from $(s,a)$, the branches shown in the figure above are reached with probabilities given by $p$. The total return of each branch consists of the immediate reward $R_{t+1}$ plus the state value $v_\pi(S_{t+1})$ of the next state; the latter belongs to the next time step $t+1$, so viewed from time $t$ it must be multiplied by the discount factor $\gamma$. The action value is therefore the expectation (probability-weighted mean) of the branch returns:

$$q_\pi(s,a) = \mathbb{E}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\bigr] = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr]$$
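A small numerical sketch of this weighted mean (made-up probabilities, rewards, and values):

```python
import numpy as np

# Sketch: q_pi(s,a) = sum_{s',r} p(s',r|s,a) * (r + gamma * v_pi(s')) for a
# single (s,a) pair with three possible outcomes; all numbers are made up.
gamma = 0.9
outcomes = [          # (p(s',r|s,a), reward r, next state s')
    (0.5, 1.0, 0),
    (0.3, 0.0, 1),
    (0.2, 2.0, 1),
]
v = np.array([10.0, 5.0])           # assumed v_pi for the two next states
q_sa = sum(p * (r + gamma * v[s1]) for p, r, s1 in outcomes)
print(q_sa)                         # 7.65
```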
Exercise 3.20
Draw or describe the optimal state-value function for the golf example.
Exercise 3.21
Draw or describe the contours of the optimal action-value function for putting, $q_*(s, \texttt{putter})$, for the golf example.
Exercise 3.22
Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\text{left}}$ and $\pi_{\text{right}}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
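The post gives no worked answer here. As a sketch, under the usual reading of the diagram (an assumption on my part: choosing left earns rewards 1, 0, 1, 0, ... while choosing right earns 0, 2, 0, 2, ..., each choice being a two-step loop back to the top state), the top state's value under each deterministic policy can be compared:

```python
# Sketch for Exercise 3.22, assuming "left" yields rewards 1, 0, 1, 0, ...
# and "right" yields 0, 2, 0, 2, ... (two-step loops back to the top state).
for gamma in (0.0, 0.9, 0.5):
    v_left = 1 / (1 - gamma**2)           # 1 + gamma^2 + gamma^4 + ...
    v_right = 2 * gamma / (1 - gamma**2)  # 2*gamma + 2*gamma^3 + ...
    print(f"gamma={gamma}: left={v_left:.3f}, right={v_right:.3f}")
# Suggests: pi_left wins at gamma=0, pi_right wins at gamma=0.9,
# and the two policies tie at gamma=0.5.
```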
Exercise 3.23
Give the Bellman equation for $q_*$ for the recycling robot.
Exercise 3.24
Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
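The post leaves this one unanswered. A minimal sketch of the computation, assuming the standard 5×5 gridworld with $\gamma = 0.9$ in which the optimal policy takes the +10 jump from A to A' and walks back in four reward-0 steps (a five-step cycle, so by (3.8) $v_*(\text{A}) = \sum_{k=0}^{\infty} 10\,\gamma^{5k} = 10/(1-\gamma^5)$):

```python
# Sketch: v*(A) for Exercise 3.24, assuming a five-step optimal cycle that
# earns +10 once per cycle: v*(A) = 10 + gamma^5 * v*(A).
gamma = 0.9
v_star_A = 10 / (1 - gamma**5)
print(f"{v_star_A:.3f}")   # 24.419
```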
Exercise 3.25
Give an equation for $v_*$ in terms of $q_*$.
Explanation: $v_*$ is the optimal state-value function. By definition, it must equal the largest of the optimal action values obtainable by taking some action $a$ in state $s$ and then following the optimal policy. Hence:

$$v_*(s) = \max_a q_*(s,a)$$
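In code this is just a row-wise max (a tiny sketch with a made-up $q_*$ table):

```python
import numpy as np

# Sketch: v*(s) = max_a q*(s,a), using a made-up q* table (2 states, 3 actions).
q_star = np.array([[1.0, 4.0, 2.0],
                   [0.5, 0.2, 0.9]])
v_star = q_star.max(axis=1)
print(v_star)               # [4.  0.9]
```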
Exercise 3.26
Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.
Explanation: Compare Exercise 3.19. The optimal action value must be built from the optimal state value $v_*(s')$ of each possible next state $s'$, so:

$$q_*(s,a) = \sum_{s',r} p(s',r \mid s,a)\bigl[r + \gamma\, v_*(s')\bigr]$$
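A minimal sketch (made-up MDP): find $v_*$ by value iteration, then recover $q_*$ from $v_*$ and the dynamics exactly as in the equation above, and check consistency with Exercise 3.25.

```python
import numpy as np

# Sketch: on a tiny made-up MDP, find v* by value iteration, then recover
# q*(s,a) = sum_{s'} p(s'|s,a) * (r(s,a,s') + gamma * v*(s')).
rng = np.random.default_rng(1)
n_s, n_a, gamma = 3, 2, 0.9
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a, n_s))

v = np.zeros(n_s)
for _ in range(1000):                       # value iteration to convergence
    v = (p * (r + gamma * v)).sum(axis=2).max(axis=1)

q = (p * (r + gamma * v)).sum(axis=2)       # q* from v* and the dynamics
assert np.allclose(v, q.max(axis=1))        # consistent with Exercise 3.25
```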
Exercise 3.27
Give an equation for $\pi_*$ in terms of $q_*$.
Explanation: A policy chooses an action from a given state $s$. The optimal policy chooses an optimal action from every state $s$, denoted $a_*$; that is, it selects $a_*$ with probability 1 and every non-optimal action with probability 0. Note that in a given state $s$ there may be more than one optimal action; in that case any one of them can be selected, and the action values of all optimal actions are necessarily equal.

First, the optimal action in state $s$ satisfies:

$$a_* = \arg\max_a q_*(s,a)$$

Second, the optimal policy can then be expressed as (for simplicity, assuming a single optimal action in each state):

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \arg\max_{a'} q_*(s,a') \\ 0, & \text{otherwise} \end{cases}$$
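Equivalently, in code the optimal policy is the greedy (argmax) policy with respect to $q_*$ (a sketch with a made-up table):

```python
import numpy as np

# Sketch: pi* as the greedy (argmax) policy w.r.t. a made-up q* table.
q_star = np.array([[1.0, 4.0, 2.0],
                   [0.5, 0.2, 0.9]])
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), q_star.argmax(axis=1)] = 1.0
print(pi_star)   # probability 1 on the argmax action in each state
```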
Exercise 3.28
Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.
Explanation: Combining Exercises 3.26 and 3.27 (substituting the solution of 3.26 into the solution of 3.27) gives:

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \arg\max_{a'} \sum_{s',r} p(s',r \mid s,a')\bigl[r + \gamma\, v_*(s')\bigr] \\ 0, & \text{otherwise} \end{cases}$$
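The same substitution can be sketched in code: be greedy with respect to the one-step lookahead built from $v_*$ and the four-argument dynamics (reusing the made-up MDP from the Exercise 3.26 sketch above):

```python
import numpy as np

# Sketch: pi* directly from v* and the four-argument dynamics, by acting
# greedily w.r.t. the one-step lookahead sum_{s'} p * (r + gamma * v*).
rng = np.random.default_rng(1)
n_s, n_a, gamma = 3, 2, 0.9
p = rng.random((n_s, n_a, n_s))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_s, n_a, n_s))

v = np.zeros(n_s)
for _ in range(1000):                           # value iteration, as before
    v = (p * (r + gamma * v)).sum(axis=2).max(axis=1)

lookahead = (p * (r + gamma * v)).sum(axis=2)   # one-step lookahead = q*
print(lookahead.argmax(axis=1))                 # the greedy action per state
```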
Exercise 3.29
Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).
Explanation:
The four equations follow from the Bellman equations above by the same kind of substitution, using $p(s' \mid s,a) = \sum_r p(s',r \mid s,a)$ and $r(s,a) = \sum_r r \sum_{s'} p(s',r \mid s,a)$; the original post omits the details.
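For reference, a sketch of what the four rewritten equations look like (standard forms, worth checking against definitions (3.4) and (3.5) in the book):

```latex
\begin{align*}
v_\pi(s)   &= \sum_a \pi(a \mid s)\Bigl[r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v_\pi(s')\Bigr] \\
v_*(s)     &= \max_a \Bigl[r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, v_*(s')\Bigr] \\
q_\pi(s,a) &= r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a') \\
q_*(s,a)   &= r(s,a) + \gamma \sum_{s'} p(s' \mid s,a) \max_{a'} q_*(s',a')
\end{align*}
```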
Back to the table of contents: Reinforcement learning notes: table of contents https://chenxiaoyuan.blog.csdn.net/article/details/121715424
For the first half of Chapter 3 of Sutton's RL book (2nd edition), see: Reinforcement learning notes: Sutton-Book Chapter III problem solving (Ex1~Ex16) https://blog.csdn.net/chenxy_bwave/article/details/122522897