[reinforcement learning notes] V value and Q value in reinforcement learning
2022-06-09 03:19:00 【Allenpandas】
1. Background knowledge
In a Markov chain, when an agent in state $S$ chooses action $A$, it transitions to another state $S'$; at the same time, the environment gives the agent a reward $R$.

Rewards can be positive or negative. A positive reward means we encourage the agent to keep behaving this way in this state; a negative reward means we do not want the agent to act this way. In reinforcement learning, the reward $R$ is used as the signal that guides the agent's learning, and we want the agent to obtain as much reward as possible.

More often than not, however, the immediate reward $R$ alone cannot measure how good an action is; we must take a long-term view of the problem. We also need to account for future rewards when evaluating the current state, and only then make a decision.
2. Understanding the $V$ value and the $Q$ value
$V$ value: the value used to evaluate a state; we call it the $V$ value. It represents the expected total reward the agent will receive from this state until the final state.

$Q$ value: the value used to evaluate an action; we call it the $Q$ value. It represents the expected total reward the agent will receive after choosing this action, until the final state.
3. Introduction to the $V$ value
$V$ value definition: the value used to evaluate a state. It represents the expected total reward the agent will receive from this state until the final state.

$V$ value calculation: compute the expectation of the total reward from the current state $S$ to the final state. Put simply: starting from a given state and following policy $\pi$ until the final state, the average of the total reward finally obtained (the reward expectation) is the $V$ value.
【Example】 In the example below, two actions can be taken from state $s_0$: $a_1$ and $a_2$. Starting from state $s_0$ and executing action $a_1$, the total reward to the final state is $R_1 = +10$; starting from state $s_0$ and executing action $a_2$, the total reward to the final state is $R_2 = +20$.
Assumption 1: In state $s_0$, the probability of executing action $a_1$ is 40% and the probability of executing action $a_2$ is 60%. Then the reward expectation from state $s_0$ to the final state is:
$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 40\% \cdot 10 + 60\% \cdot 20 \\
&= 16
\end{aligned}
$$
where $p(a_1|s_0)$ denotes the probability of choosing action $a_1$ in state $s_0$, and $p(a_2|s_0)$ denotes the probability of choosing action $a_2$ in state $s_0$.
Assumption 2: In state $s_0$, the probability of executing action $a_1$ is 50% and the probability of executing action $a_2$ is 50%. Then the reward expectation $\overline{R}$ from state $s_0$ to the final state is:
$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 50\% \cdot 10 + 50\% \cdot 20 \\
&= 15
\end{aligned}
$$
Assumption 3: In state $s_0$, the probability of executing action $a_1$ is 60% and the probability of executing action $a_2$ is 40%. Then the reward expectation $\overline{R}$ from state $s_0$ to the final state is:
$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 60\% \cdot 10 + 40\% \cdot 20 \\
&= 14
\end{aligned}
$$
The three assumptions above show that different policies $\pi$ lead to different $V$ values. In other words, the $V$ value depends directly on the policy $\pi$.
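The calculation above is just a policy-weighted average. A minimal sketch in Python, assuming each action's total reward to the final state is already known (the +10 and +20 from the example):

```python
# Reward expectation of s0 under a policy: V(s0) = sum_a p(a|s0) * R_a.
# Action names and per-action returns follow the example above.

returns = {"a1": 10.0, "a2": 20.0}  # total reward to the final state for each action

def v_value(policy, returns):
    """Policy-weighted average of the per-action total rewards."""
    return sum(policy[a] * returns[a] for a in policy)

for name, policy in [
    ("assumption 1", {"a1": 0.4, "a2": 0.6}),
    ("assumption 2", {"a1": 0.5, "a2": 0.5}),
    ("assumption 3", {"a1": 0.6, "a2": 0.4}),
]:
    print(name, v_value(policy, returns))  # prints 16.0, 15.0, 14.0
```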
4. Introduction to the $Q$ value
$Q$ value definition: the value used to evaluate an action. It represents the expected total reward the agent will receive after choosing this action, until the final state.

$Q$ value calculation: compute the expectation of the total reward from taking action $A$ until the final state. Put simply: starting from a given action and continuing until the final state, the average of the total reward finally obtained (the reward expectation) is the $Q$ value.

Note: unlike the $V$ value, the $Q$ value is not directly related to the policy $\pi$; it is related to the environment's state transition probabilities (which are unknown to us, and which we can neither learn nor change).
【Example】 In the example below, after taking action $a_1$ the agent may jump to state $s_1$, from which the total reward to the final state is +10; to state $s_2$, from which the total reward to the final state is +20; or to state $s_3$, from which the total reward to the final state is +5.
$$
\begin{aligned}
Q(s_0, a_1) = \overline{R}(s_0, a_1) &= p(s_1|s_0,a_1) \cdot R_1 + p(s_2|s_0,a_1) \cdot R_2 + p(s_3|s_0,a_1) \cdot R_3 \\
&= p(s_1|s_0,a_1) \cdot 10 + p(s_2|s_0,a_1) \cdot 20 + p(s_3|s_0,a_1) \cdot 5
\end{aligned}
$$
where:
$p(s_1|s_0,a_1)$ denotes the probability of jumping to state $s_1$ after choosing action $a_1$ in state $s_0$;
$p(s_2|s_0,a_1)$ denotes the probability of jumping to state $s_2$ after choosing action $a_1$ in state $s_0$;
$p(s_3|s_0,a_1)$ denotes the probability of jumping to state $s_3$ after choosing action $a_1$ in state $s_0$.
Note: the state transition probabilities $p(s_1|s_0,a_1)$, $p(s_2|s_0,a_1)$ and $p(s_3|s_0,a_1)$ are determined by the system; we can neither learn nor change them.
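A minimal numerical sketch of this weighted sum. The transition probabilities below are made-up values for illustration (in practice they are fixed by the environment); the per-state returns are the +10, +20 and +5 from the example:

```python
# Q(s0, a1) = sum over next states s' of p(s'|s0, a1) * R_{s'}.

transition = {"s1": 0.2, "s2": 0.5, "s3": 0.3}  # p(s'|s0, a1), assumed values
returns = {"s1": 10.0, "s2": 20.0, "s3": 5.0}   # total reward to the final state from s'

def q_value(transition, returns):
    """Transition-probability-weighted average of the per-state total rewards."""
    return sum(transition[s] * returns[s] for s in transition)

print(q_value(transition, returns))  # 0.2*10 + 0.5*20 + 0.3*5 = 13.5
```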
5. Computing the $V$ value from the $Q$ value
The $V$ value represents the expected total reward the agent receives from a state until the final state. The $V$ value of a state is the expectation, under policy $\pi$, of the $Q$ values of all actions available in that state.
$$
\begin{aligned}
V_\pi(s_0) &= p(a_1|s_0) \cdot q(s_0,a_1) + p(a_2|s_0) \cdot q(s_0,a_2) \\
&= \sum_{a\in A} \pi(a|s_0) \cdot q_{\pi}(s_0, a)
\end{aligned}
$$
where:
$p(a_1|s_0)$ denotes the probability of choosing action $a_1$ in state $s_0$;
$q(s_0,a_1)$ denotes the $Q$ value (the reward expectation obtained) after choosing action $a_1$ in state $s_0$;
$p(a_2|s_0)$ denotes the probability of choosing action $a_2$ in state $s_0$;
$q(s_0,a_2)$ denotes the $Q$ value (the reward expectation obtained) after choosing action $a_2$ in state $s_0$;
$\pi(a|s_0)$ denotes the probability that policy $\pi$ takes action $a \in A$, $A = (a_1, a_2, a_3, \dots, a_n)$, in state $s_0$;
$q_{\pi}(s_0, a)$ denotes the $Q$ value (the reward expectation obtained) of taking action $a \in A$, $A = (a_1, a_2, a_3, \dots, a_n)$, in state $s_0$.
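A minimal sketch of this formula, assuming the $Q$ value of each action in $s_0$ is already known (the numbers and the policy are illustrative only):

```python
# V_pi(s0) = sum_a pi(a|s0) * q_pi(s0, a)

q_values = {"a1": 13.5, "a2": 8.0}   # q_pi(s0, a), assumed
policy = {"a1": 0.4, "a2": 0.6}      # pi(a|s0), assumed

def v_from_q(policy, q_values):
    """Policy-weighted expectation of the action values."""
    return sum(policy[a] * q_values[a] for a in policy)

print(v_from_q(policy, q_values))  # 0.4*13.5 + 0.6*8.0 = 10.2
```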
6. Computing the $Q$ value from the $V$ value
Define $q_\pi(s_0, a_1)$ as the $Q$ value of taking action $a_1$ in state $s_0$ under policy $\pi$. After taking $a_1$, the agent lands in some successor state, collects the immediate reward, and from then on earns the discounted value of that successor state:
$$
\begin{aligned}
q_\pi(s_0, a_1) &= p(s_1|s_0,a_1) \cdot [\, r_1 + \gamma\, v_\pi(s_1) \,] + p(s_2|s_0,a_1) \cdot [\, r_2 + \gamma\, v_\pi(s_2) \,] + p(s_3|s_0,a_1) \cdot [\, r_3 + \gamma\, v_\pi(s_3) \,] \\
&= R_{s_0}^{a_1} + \gamma \sum_{s'} P_{s_0 s'}^{a_1} \cdot v_\pi(s')
\end{aligned}
$$
where:
$R_{s_0}^{a_1}$ denotes the expected immediate reward for taking action $a_1$ in state $s_0$, i.e. $\sum_{s'} p(s'|s_0,a_1)\, r_{s'}$;
$\gamma$ is the discount factor;
$P_{s_0 s'}^{a_1}$ denotes the state transition probability of jumping to the new state $s'$ after taking action $a_1$ in state $s_0$;
$v_\pi(s')$ denotes the $V$ value of the new state $s'$.
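A minimal sketch of this one-step backup, with assumed transition probabilities, immediate rewards, successor $V$ values and discount factor:

```python
# q_pi(s0, a1) = sum_{s'} P(s'|s0, a1) * (r_{s'} + gamma * v_pi(s'))
#              = R_{s0}^{a1} + gamma * sum_{s'} P(s'|s0, a1) * v_pi(s')

gamma = 0.9
transition = {"s1": 0.2, "s2": 0.5, "s3": 0.3}  # P_{s0 s'}^{a1}, assumed
rewards = {"s1": 1.0, "s2": 2.0, "s3": 0.5}     # immediate reward r_{s'}, assumed
v_values = {"s1": 10.0, "s2": 20.0, "s3": 5.0}  # v_pi(s'), assumed

def q_from_v(transition, rewards, v_values, gamma):
    """Expected immediate reward plus discounted successor V values."""
    return sum(p * (rewards[s] + gamma * v_values[s]) for s, p in transition.items())

print(q_from_v(transition, rewards, v_values, gamma))  # 2.0 + 10.0 + 1.5 = 13.5
```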
7. Computing the $V$ value from the $V$ value
More often, we need to compute $V$ values from other $V$ values; more precisely, we compute the $V$ value of a state $s$ from the $V$ values of its successor states $s'$.

We already have:
$$
\begin{aligned}
V_\pi(s_0) &= p(a_1|s_0) \cdot q(s_0,a_1) + p(a_2|s_0) \cdot q(s_0,a_2) \\
&= \sum_{a\in A} \pi(a|s_0) \cdot q_{\pi}(s_0, a)
\end{aligned}
$$
$$
q_\pi(s_0, a_1) = R_{s_0}^{a_1} + \gamma \sum_{s'} P_{s_0 s'}^{a_1} \cdot v_\pi(s'), \qquad
q_\pi(s_0, a_2) = R_{s_0}^{a_2} + \gamma \sum_{s'} P_{s_0 s'}^{a_2} \cdot v_\pi(s')
$$
Substituting the second expression into the first therefore gives:
$$
V_\pi(s) = \sum_{a\in A} \pi(a|s) \cdot \Big[ R_{s}^{a} + \gamma \sum_{s'\in S} P_{ss'}^{a} \cdot v_\pi(s') \Big]
$$
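This is the Bellman expectation equation, and applying it repeatedly as an update rule is the basis of iterative policy evaluation. A minimal sketch on a tiny, made-up two-state MDP (all names, transitions and rewards below are illustrative, not from the original post):

```python
# Iterative policy evaluation using
#   V(s) <- sum_a pi(a|s) * [ R_s^a + gamma * sum_{s'} P(s'|s,a) * V(s') ]

# P[s][a] = list of (next_state, transition_probability, immediate_reward)
P = {
    "s0": {"a1": [("s1", 1.0, 1.0)],
           "a2": [("s1", 0.5, 0.0), ("s0", 0.5, 2.0)]},
    "s1": {"a1": [("s1", 1.0, 0.0)],
           "a2": [("s0", 1.0, 1.0)]},
}
pi = {"s0": {"a1": 0.5, "a2": 0.5},   # pi(a|s), assumed uniform policy
      "s1": {"a1": 0.5, "a2": 0.5}}
gamma = 0.9

V = {s: 0.0 for s in P}               # arbitrary initialization of V_pi
for _ in range(200):                  # sweep until approximately converged
    V = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
print(V)  # approximate V_pi for each state
```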