Reinforcement Learning Notes: Sutton Book Chapter 3 Exercise Solutions (Ex17~Ex29)
2022-07-06 15:17:00 【Slow ploughing of stupid cattle】
Exercise 3.17
What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

Solution:

As the backup diagram above shows, the probability of going from $(s, a)$ to each possible successor state $s'$ is determined by $p$. The total return along each branch has two parts: the immediate reward $r$, and the (discounted) state value of $s'$. Hence (this relationship was already derived in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right] \tag{1}$$
Further, from the same kind of backup diagram (cf. Exercise 3.12), the state value function can be expressed in terms of the action value function:

$$v_\pi(s') = \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \tag{2}$$
Substituting (2) into (1) gives:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right] \tag{3}$$
This is the Bellman equation for the action value function!
Incidentally, since the state value function and the action value function can each be expressed in terms of the other, substituting one of these mutual expressions into the other to eliminate one function yields the corresponding Bellman equation. For the derivation of the Bellman equation for the state value function, see Reinforcement Learning Notes: Policy, Value Functions and the Bellman Equation.
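As a sanity check on equation (3), here is a minimal sketch (not from the original post) that iterates the Bellman equation for $q_\pi$ to its fixed point on a made-up two-state MDP; the dynamics table `p` and the policy `pi` are illustration-only assumptions:

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers invented for illustration).
# p[(s, a)] lists (prob, s', r) triples, i.e. the four-argument
# p(s', r | s, a) flattened into a table.
n_states, n_actions, gamma = 2, 2, 0.9
p = {
    (0, 0): [(1.0, 0, 1.0)],
    (0, 1): [(0.5, 0, 0.0), (0.5, 1, 2.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 1.0)],
}
pi = np.full((n_states, n_actions), 0.5)  # pi(a|s): uniform random policy

# Iterate equation (3) until q_pi stops changing; the fixed point
# satisfies the Bellman equation for action values.
q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    q_new = np.zeros_like(q)
    for (s, a), outcomes in p.items():
        q_new[s, a] = sum(
            prob * (r + gamma * np.dot(pi[s2], q[s2]))  # inner sum over a'
            for prob, s2, r in outcomes
        )
    if np.max(np.abs(q_new - q)) < 1e-12:
        break
    q = q_new
print(q)
```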
Exercise 3.18
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
Solution: As shown in the figure above, from state $s$ each action node is reached with a probability determined by $\pi(a \mid s)$, and the action value at each action node is $q_\pi(s, a)$. The value of state $s$ is therefore the expectation of $q_\pi(s, a)$ over the actions. Recalling that for $Y = g(X)$,

$$\mathbb{E}[X] = \sum_x x\, p(x), \qquad \mathbb{E}[Y] = \sum_x g(x)\, p(x)$$

we obtain the state value function as the expectation of the action value function under the policy:

$$v_\pi(s) = \mathbb{E}_{a \sim \pi}\left[q_\pi(s, a)\right] = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$
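In code, this expectation is just a probability-weighted average over actions; a minimal sketch with made-up numbers:

```python
import numpy as np

q = np.array([[1.0, 3.0], [0.5, 2.0]])   # hypothetical q_pi(s, a) table
pi = np.array([[0.2, 0.8], [0.6, 0.4]])  # hypothetical pi(a | s)

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a): expectation of q under the policy
v = (pi * q).sum(axis=1)
print(v)  # [2.6 1.1]
```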
Exercise 3.19
The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.
Solution:

Starting from $(s, a)$, the branches shown in the figure above are reached with probabilities given by $p(s', r \mid s, a)$. The total return along branch $k$ consists of the immediate reward $R_{t+1}$ plus the state value $v_\pi(S_{t+1})$ of the next state; since the latter belongs to time $t+1$, it is multiplied by the discount factor when viewed from time $t$. The action value is therefore the expectation (probability-weighted mean) of the per-branch returns:

$$q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$
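A minimal sketch of this backup for a single state–action pair (the `p_sa` triples are illustration-only assumptions for the four-argument $p$):

```python
gamma = 0.9
v = {0: 1.0, 1: 2.0}  # hypothetical v_pi(s')

# Hypothetical p(s', r | s, a) for one state-action pair,
# listed as (prob, s', r) triples.
p_sa = [(0.7, 0, 1.0), (0.3, 1, 0.0)]

# q_pi(s, a) = sum over branches of p * (r + gamma * v(s'))
q_sa = sum(prob * (r + gamma * v[s2]) for prob, s2, r in p_sa)
print(q_sa)  # 0.7*(1 + 0.9*1) + 0.3*(0 + 0.9*2) = 1.87
```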
Exercise 3.20
Draw or describe the optimal state-value function for the golf example.
Exercise 3.21
Draw or describe the contours of the optimal action-value function for putting, $q_*(s, \texttt{putter})$, for the golf example.
Exercise 3.22
Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\text{left}}$ and $\pi_{\text{right}}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
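The post gives no worked answer here; as a minimal sketch, assuming the figure from the book (left yields reward +1 and then 0 on the step back to the top state; right yields 0 and then +2), the top state's value under each deterministic policy is a geometric series:

```python
# v_left(top)  = 1 + 0*g + 1*g^2 + 0*g^3 + ... = 1 / (1 - g^2)
# v_right(top) = 0 + 2*g + 0*g^2 + 2*g^3 + ... = 2g / (1 - g^2)
for g in (0.0, 0.9, 0.5):
    v_left, v_right = 1 / (1 - g**2), 2 * g / (1 - g**2)
    print(f"gamma={g}: left={v_left:.3f}, right={v_right:.3f}")
# gamma=0.0: left wins; gamma=0.9: right wins; gamma=0.5: exact tie
```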
Exercise 3.23
Give the Bellman equation for $q_*$ for the recycling robot.
Exercise 3.24
Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
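No solution is given in the post; as a sketch, assuming the optimal policy cycles from the best state A through A′ back to A in five steps (reward +10 followed by four zeros, repeating), the value is the geometric series $10/(1-\gamma^5)$:

```python
gamma = 0.9
v_best = 10 / (1 - gamma**5)  # 10 + 10*g^5 + 10*g^10 + ...
print(f"{v_best:.3f}")  # 24.419
```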
Exercise 3.25
Give an equation for $v_*$ in terms of $q_*$.
Solution:

$v_*$ is the optimal state value function. By definition, it equals the largest of the optimal action values obtained by taking some action $a$ in state $s$ and following the optimal policy thereafter:

$$v_*(s) = \max_a q_*(s, a)$$
Exercise 3.26
Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.
Solution: Cf. Exercise 3.19. The optimal action value function backs up the optimal state value of each possible next state $s'$, so:

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$
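The same one-step backup as the Exercise 3.19 sketch, with $v_*$ in place of $v_\pi$ (numbers illustration-only):

```python
gamma = 0.9
v_star = {0: 3.0, 1: 2.0}              # hypothetical v_*(s')
p_sa = [(0.7, 0, 1.0), (0.3, 1, 0.0)]  # hypothetical p(s', r | s, a)

q_star_sa = sum(prob * (r + gamma * v_star[s2]) for prob, s2, r in p_sa)
print(q_star_sa)  # 0.7*(1 + 2.7) + 0.3*(0 + 1.8) = 3.13
```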
Exercise 3.27
Give an equation for $\pi_*$ in terms of $q_*$.
Solution: A policy selects the action to take from a given state $s$; the optimal policy selects an optimal action from every state $s$, denoted $a_*$. That is, $a_*$ is chosen with probability 1, and every non-optimal action with probability 0. Note that in a given state $s$ there may be more than one optimal action; in that case any one of them can be chosen, but the action values of all optimal actions must be equal.

First, the optimal action in state $s$ satisfies:

$$a_* = \operatorname*{arg\,max}_{a} q_*(s, a)$$

Second, the optimal policy can then be expressed as (for simplicity, assuming a single optimal action in each state):

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \operatorname*{arg\,max}_{a'} q_*(s, a') \\ 0, & \text{otherwise} \end{cases}$$
Exercise 3.28
Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.
Solution: Combining 3.26 and 3.27 (substituting the solution of 3.26 into the solution of 3.27) gives:

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \operatorname*{arg\,max}_{a'} \sum_{s', r} p(s', r \mid s, a')\left[r + \gamma\, v_*(s')\right] \\ 0, & \text{otherwise} \end{cases}$$
Exercise 3.29
Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).
Solution:

For $v_\pi$, replacing the four-argument $p$ with the three-argument $p$ (3.4) and the two-argument $r$ (3.5) gives:

$$v_\pi(s) = \sum_a \pi(a \mid s)\left[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\right]$$

The other three equations follow in the same way and are omitted here.
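These forms rely on the reductions $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$ (3.4) and $r(s, a) = \sum_{s', r} r\, p(s', r \mid s, a)$ (3.5); a minimal sketch (illustration-only table) computing both reductions and then the $v_\pi$ backup above for a single state with one action:

```python
from collections import defaultdict

gamma = 0.9
v = {0: 1.0, 1: 2.0}                 # hypothetical v_pi(s')
p4 = [(0.7, 0, 1.0), (0.3, 1, 0.0)]  # hypothetical four-argument p(s', r | s, a)
pi_a_s = 1.0                         # assume pi(a|s) = 1 for this single action

# Two-argument r(s, a) and three-argument p(s' | s, a) from the four-argument p.
r_sa = sum(prob * r for prob, _, r in p4)
p3 = defaultdict(float)
for prob, s2, _ in p4:
    p3[s2] += prob

v_s = pi_a_s * (r_sa + gamma * sum(p3[s2] * v[s2] for s2 in p3))
print(r_sa, dict(p3), v_s)  # 0.7 {0: 0.7, 1: 0.3} 1.87
```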
Back to the master table of contents: Reinforcement Learning Notes: Table of Contents
https://chenxiaoyuan.blog.csdn.net/article/details/121715424
For the first half of Chapter 3 of Sutton's RL book (2nd edition), see: Reinforcement Learning Notes: Sutton Book Chapter 3 Exercise Solutions (Ex1~Ex16)
https://blog.csdn.net/chenxy_bwave/article/details/122522897