Reinforcement Learning Notes: Sutton Book Chapter 3 Exercise Solutions (Ex17~Ex29)
2022-07-06 15:17:00 【Slow ploughing of stupid cattle】
Exercise 3.17
What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

Solution:

As the backup diagram above shows, the probability of going from $(s, a)$ to each possible successor state $s'$ is determined by $p$. The total return along each branch has two parts: the immediate reward $r$, and the (discounted) state value of $s'$. Hence (this relationship was already derived in Exercise 3.13 [https://blog.csdn.net/chenxy_bwave/article/details/122522897]):

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right] \tag{1}$$
Further, from the same kind of backup diagram (cf. Exercise 3.12), the state value function can be expressed in terms of the action value function:

$$v_\pi(s') = \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \tag{2}$$
Substituting (2) into (1) gives:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\right] \tag{3}$$
This is the Bellman equation for the action value function!
Incidentally, since the state value function and the action value function can each be expressed in terms of the other, substituting one of these mutual expressions into the other to eliminate one function yields the corresponding Bellman equation. For the derivation of the Bellman equation for the state value function, see Reinforcement Learning Notes: Policy, Value Functions and the Bellman Equation.
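As a sanity check on equation (3), here is a minimal sketch (not from the original post) that iterates the Bellman equation for $q_\pi$ to its fixed point on a made-up two-state MDP; the dynamics table `p` and the policy `pi` are illustration-only assumptions:

```python
import numpy as np

# Toy 2-state, 2-action MDP (all numbers invented for illustration).
# p[(s, a)] lists (prob, s', r) triples, i.e. the four-argument
# p(s', r | s, a) flattened into a table.
n_states, n_actions, gamma = 2, 2, 0.9
p = {
    (0, 0): [(1.0, 0, 1.0)],
    (0, 1): [(0.5, 0, 0.0), (0.5, 1, 2.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 1.0)],
}
pi = np.full((n_states, n_actions), 0.5)  # pi(a|s): uniform random policy

# Iterate equation (3) until q_pi stops changing; the fixed point
# satisfies the Bellman equation for action values.
q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    q_new = np.zeros_like(q)
    for (s, a), outcomes in p.items():
        q_new[s, a] = sum(
            prob * (r + gamma * np.dot(pi[s2], q[s2]))  # inner sum over a'
            for prob, s2, r in outcomes
        )
    if np.max(np.abs(q_new - q)) < 1e-12:
        break
    q = q_new
print(q)
```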
Exercise 3.18
The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
Solution: As shown in the figure above, from state $s$ each action node is reached with a probability determined by $\pi(a \mid s)$, and the action value at each action node is $q_\pi(s, a)$. The value of state $s$ is therefore the expectation of $q_\pi(s, a)$ over the actions. Recalling that for $Y = g(X)$,

$$\mathbb{E}[X] = \sum_x x\, p(x), \qquad \mathbb{E}[Y] = \sum_x g(x)\, p(x)$$

we obtain the state value function as the expectation of the action value function under the policy:

$$v_\pi(s) = \mathbb{E}_{a \sim \pi}\left[q_\pi(s, a)\right] = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$
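In code, this expectation is just a probability-weighted average over actions; a minimal sketch with made-up numbers:

```python
import numpy as np

q = np.array([[1.0, 3.0], [0.5, 2.0]])   # hypothetical q_pi(s, a) table
pi = np.array([[0.2, 0.8], [0.6, 0.4]])  # hypothetical pi(a | s)

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a): expectation of q under the policy
v = (pi * q).sum(axis=1)
print(v)  # [2.6 1.1]
```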
Exercise 3.19
The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.
Solution:

Starting from $(s, a)$, the branches shown in the figure above are reached with probabilities given by $p(s', r \mid s, a)$. The total return along branch $k$ consists of the immediate reward $R_{t+1}$ plus the state value $v_\pi(S_{t+1})$ of the next state; since the latter belongs to time $t+1$, it is multiplied by the discount factor when viewed from time $t$. The action value is therefore the expectation (probability-weighted mean) of the per-branch returns:

$$q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$
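A minimal sketch of this backup for a single state–action pair (the `p_sa` triples are illustration-only assumptions for the four-argument $p$):

```python
gamma = 0.9
v = {0: 1.0, 1: 2.0}  # hypothetical v_pi(s')

# Hypothetical p(s', r | s, a) for one state-action pair,
# listed as (prob, s', r) triples.
p_sa = [(0.7, 0, 1.0), (0.3, 1, 0.0)]

# q_pi(s, a) = sum over branches of p * (r + gamma * v(s'))
q_sa = sum(prob * (r + gamma * v[s2]) for prob, s2, r in p_sa)
print(q_sa)  # 0.7*(1 + 0.9*1) + 0.3*(0 + 0.9*2) = 1.87
```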
Exercise 3.20
Draw or describe the optimal state-value function for the golf example.
Exercise 3.21
Draw or describe the contours of the optimal action-value function for putting, $q_*(s, \texttt{putter})$, for the golf example.
Exercise 3.22
Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{\text{left}}$ and $\pi_{\text{right}}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
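The post gives no worked answer here; as a minimal sketch, assuming the figure from the book (left yields reward +1 and then 0 on the step back to the top state; right yields 0 and then +2), the top state's value under each deterministic policy is a geometric series:

```python
# v_left(top)  = 1 + 0*g + 1*g^2 + 0*g^3 + ... = 1 / (1 - g^2)
# v_right(top) = 0 + 2*g + 0*g^2 + 2*g^3 + ... = 2g / (1 - g^2)
for g in (0.0, 0.9, 0.5):
    v_left, v_right = 1 / (1 - g**2), 2 * g / (1 - g**2)
    print(f"gamma={g}: left={v_left:.3f}, right={v_right:.3f}")
# gamma=0.0: left wins; gamma=0.9: right wins; gamma=0.5: exact tie
```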
Exercise 3.23
Give the Bellman equation for $q_*$ for the recycling robot.
Exercise 3.24
Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.
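No solution is given in the post; as a sketch, assuming the optimal policy cycles from the best state A through A′ back to A in five steps (reward +10 followed by four zeros, repeating), the value is the geometric series $10/(1-\gamma^5)$:

```python
gamma = 0.9
v_best = 10 / (1 - gamma**5)  # 10 + 10*g^5 + 10*g^10 + ...
print(f"{v_best:.3f}")  # 24.419
```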
Exercise 3.25
Give an equation for $v_*$ in terms of $q_*$.
Solution:

$v_*$ is the optimal state value function. By definition, it equals the largest of the optimal action values obtained by taking some action $a$ in state $s$ and following the optimal policy thereafter:

$$v_*(s) = \max_a q_*(s, a)$$
Exercise 3.26
Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.
Solution: Cf. Exercise 3.19. The optimal action value function backs up the optimal state value of each possible next state $s'$, so:

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$
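The same one-step backup as the Exercise 3.19 sketch, with $v_*$ in place of $v_\pi$ (numbers illustration-only):

```python
gamma = 0.9
v_star = {0: 3.0, 1: 2.0}              # hypothetical v_*(s')
p_sa = [(0.7, 0, 1.0), (0.3, 1, 0.0)]  # hypothetical p(s', r | s, a)

q_star_sa = sum(prob * (r + gamma * v_star[s2]) for prob, s2, r in p_sa)
print(q_star_sa)  # 0.7*(1 + 2.7) + 0.3*(0 + 1.8) = 3.13
```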
Exercise 3.27
Give an equation for $\pi_*$ in terms of $q_*$.
Solution: A policy selects the action to take from a given state $s$; the optimal policy selects an optimal action from every state $s$, denoted $a_*$. That is, $a_*$ is chosen with probability 1, and every non-optimal action with probability 0. Note that in a given state $s$ there may be more than one optimal action; in that case any one of them can be chosen, but the action values of all optimal actions must be equal.

First, the optimal action in state $s$ satisfies:

$$a_* = \operatorname*{arg\,max}_{a} q_*(s, a)$$

Second, the optimal policy can then be expressed as (for simplicity, assuming a single optimal action in each state):

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \operatorname*{arg\,max}_{a'} q_*(s, a') \\ 0, & \text{otherwise} \end{cases}$$
Exercise 3.28
Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.
Solution: Combining 3.26 and 3.27 (substituting the solution of 3.26 into the solution of 3.27) gives:

$$\pi_*(a \mid s) = \begin{cases} 1, & a = \operatorname*{arg\,max}_{a'} \sum_{s', r} p(s', r \mid s, a')\left[r + \gamma\, v_*(s')\right] \\ 0, & \text{otherwise} \end{cases}$$
Exercise 3.29
Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).
Solution:

For $v_\pi$, replacing the four-argument $p$ with the three-argument $p$ (3.4) and the two-argument $r$ (3.5) gives:

$$v_\pi(s) = \sum_a \pi(a \mid s)\left[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\right]$$

The other three equations follow in the same way and are omitted here.
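These forms rely on the reductions $p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$ (3.4) and $r(s, a) = \sum_{s', r} r\, p(s', r \mid s, a)$ (3.5); a minimal sketch (illustration-only table) computing both reductions and then the $v_\pi$ backup above for a single state with one action:

```python
from collections import defaultdict

gamma = 0.9
v = {0: 1.0, 1: 2.0}                 # hypothetical v_pi(s')
p4 = [(0.7, 0, 1.0), (0.3, 1, 0.0)]  # hypothetical four-argument p(s', r | s, a)
pi_a_s = 1.0                         # assume pi(a|s) = 1 for this single action

# Two-argument r(s, a) and three-argument p(s' | s, a) from the four-argument p.
r_sa = sum(prob * r for prob, _, r in p4)
p3 = defaultdict(float)
for prob, s2, _ in p4:
    p3[s2] += prob

v_s = pi_a_s * (r_sa + gamma * sum(p3[s2] * v[s2] for s2 in p3))
print(r_sa, dict(p3), v_s)  # 0.7 {0: 0.7, 1: 0.3} 1.87
```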
Back to the master table of contents: Reinforcement Learning Notes: Table of Contents
https://chenxiaoyuan.blog.csdn.net/article/details/121715424
For the first half of Chapter 3 of Sutton's RL book (2nd edition), see: Reinforcement Learning Notes: Sutton Book Chapter 3 Exercise Solutions (Ex1~Ex16)
https://blog.csdn.net/chenxy_bwave/article/details/122522897