
[reinforcement learning notes] V value and Q value in reinforcement learning

2022-06-09 03:19:00 Allenpandas

1. Background knowledge

In a Markov decision process: when an agent is in state $S$ and chooses action $A$, it transitions to another state $S'$; at the same time, the environment gives the agent a reward $R$.

Rewards can be positive or negative. A positive reward means we encourage the agent to keep behaving this way in that state; a negative reward means we do not want the agent to do so. In reinforcement learning, we use the reward $R$ to guide the agent's learning, expecting the agent to collect as much reward as possible.

More often, however, we cannot judge the quality of an action by the immediate reward $R$ alone; we have to take a long-term view. We should fold future rewards back into the current state before making a decision.
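To make this long-term view concrete, here is a minimal Python sketch (the reward sequence and discount factor are made-up illustration values, not from the article) that folds a sequence of future rewards back into the current step as a single discounted return:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    g = 0.0
    # walk backwards so each step folds in its discounted future
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 5.0]))  # 1 + 0.9*0 + 0.81*5 = 5.05
```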

2. Understanding the $V$ value and the $Q$ value

  • $V$ value: evaluates the value of a state. It represents the expectation of the total reward the agent will obtain from this state until the final state.

  • $Q$ value: evaluates the value of an action. It represents the expectation of the total reward the agent will obtain after choosing this action, until the final state.

3. Introduction to the $V$ value

$V$ value definition: the $V$ value evaluates the value of a state. It represents the expectation of the total reward the agent will obtain from this state until the final state.

$V$ value calculation: compute the expectation of the total reward from the current state $S$ to the final state. In plain terms: starting from a certain state and following the policy $\pi$ until the final state, the average of the total rewards obtained (the reward expectation) is the $V$ value.

[Example] In the figure below, two actions can be taken from state $s_0$, namely $a_1$ and $a_2$. Starting from state $s_0$ and executing action $a_1$, the total reward until the final state is $R_1 = +10$; starting from state $s_0$ and executing action $a_2$, the total reward until the final state is $R_2 = +20$.
[Figure: starting from state $s_0$, action $a_1$ leads to a total reward of +10 and action $a_2$ to a total reward of +20]

Assumption 1: in state $s_0$, the probability of executing action $a_1$ is 40% and the probability of executing action $a_2$ is 60%. Then the reward expectation from state $s_0$ to the final state is:

$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 40\% \cdot 10 + 60\% \cdot 20 \\
&= 16
\end{aligned}
$$

where $p(a_1|s_0)$ denotes the probability of choosing action $a_1$ in state $s_0$, and $p(a_2|s_0)$ denotes the probability of choosing action $a_2$ in state $s_0$.

Assumption 2: in state $s_0$, the probability of executing action $a_1$ is 50% and the probability of executing action $a_2$ is 50%. Then the reward expectation $\overline{R}$ from state $s_0$ to the final state is:

$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 50\% \cdot 10 + 50\% \cdot 20 \\
&= 15
\end{aligned}
$$

Assumption 3: in state $s_0$, the probability of executing action $a_1$ is 60% and the probability of executing action $a_2$ is 40%. Then the reward expectation $\overline{R}$ from state $s_0$ to the final state is:

$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 60\% \cdot 10 + 40\% \cdot 20 \\
&= 14
\end{aligned}
$$

From the three assumptions above we can see that different policies $\pi$ lead to different $V$ values. In other words, the $V$ value depends directly on the policy $\pi$.
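As a quick check of the three assumptions above, a minimal Python sketch (the totals +10 and +20 and the three action distributions come from the example; the function name is our own):

```python
def state_value(action_probs, total_rewards):
    """V value of a state: expectation of the total reward over the
    policy's action distribution, V = sum_a p(a|s) * R_a."""
    return sum(p * r for p, r in zip(action_probs, total_rewards))

rewards = [10, 20]  # R_1 for a_1, R_2 for a_2 (from the example)
for probs in ([0.4, 0.6], [0.5, 0.5], [0.6, 0.4]):
    print(probs, "->", state_value(probs, rewards))  # 16.0, 15.0, 14.0
```

The three policies give 16, 15, and 14, matching the calculations above.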

4. Introduction to the $Q$ value

$Q$ value definition: the $Q$ value evaluates the value of an action. It represents the expectation of the total reward the agent will obtain after choosing this action, until the final state.

$Q$ value calculation: compute the expectation of the total reward from taking action $A$ until the final state. In plain terms: starting from a certain action and continuing until the final state, the average of the total rewards obtained (the reward expectation) is the $Q$ value.

Note: unlike the $V$ value, the $Q$ value is not directly tied to the policy $\pi$; it is tied to the environment's state transition probabilities (which are unknown to us, and which we can neither learn nor change).

[Example] In the figure below, after taking action $a_1$: if the agent jumps to state $s_1$, the total reward until the final state is +10; if it jumps to state $s_2$, the total reward until the final state is +20; if it jumps to state $s_3$, the total reward until the final state is +5.
[Figure: taking action $a_1$ in state $s_0$ may lead to $s_1$ (total reward +10), $s_2$ (total reward +20), or $s_3$ (total reward +5)]

$$
\begin{aligned}
Q(s_0, a_1) = \overline{R}(s_0, a_1) &= p(s_1|s_0,a_1) \cdot R_1 + p(s_2|s_0,a_1) \cdot R_2 + p(s_3|s_0,a_1) \cdot R_3 \\
&= p(s_1|s_0,a_1) \cdot 10 + p(s_2|s_0,a_1) \cdot 20 + p(s_3|s_0,a_1) \cdot 5
\end{aligned}
$$

where:
$p(s_1|s_0,a_1)$ denotes the probability of jumping to state $s_1$ after choosing action $a_1$ in state $s_0$;
$p(s_2|s_0,a_1)$ denotes the probability of jumping to state $s_2$ after choosing action $a_1$ in state $s_0$;
$p(s_3|s_0,a_1)$ denotes the probability of jumping to state $s_3$ after choosing action $a_1$ in state $s_0$.

Note: the state transition probabilities $p(s_1|s_0,a_1)$, $p(s_2|s_0,a_1)$, $p(s_3|s_0,a_1)$ are determined by the environment; we can neither learn them nor change them.
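A minimal Python sketch of this computation (the totals +10, +20, +5 come from the example; the transition probabilities below are made up for illustration, since the article leaves them unspecified):

```python
def action_value(transition_probs, total_rewards):
    """Q value of an action: expectation of the total reward over the
    environment's transition distribution, Q = sum_s' p(s'|s,a) * R_s'."""
    return sum(p * r for p, r in zip(transition_probs, total_rewards))

rewards = [10, 20, 5]    # totals reaching the final state via s1, s2, s3
probs = [0.2, 0.5, 0.3]  # hypothetical p(s'|s0, a1); not given in the article
print(action_value(probs, rewards))  # 0.2*10 + 0.5*20 + 0.3*5 = 13.5
```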

5. Computing the $V$ value from the $Q$ value

The $V$ value represents the expectation of the total reward the agent obtains from this state until the final state. The $V$ value of a state is therefore the expectation, under the policy $\pi$, of the $Q$ values of all the actions available in that state.
[Figure: state $s_0$ branches into actions $a_1$ and $a_2$, chosen with probabilities $p(a_1|s_0)$ and $p(a_2|s_0)$]

$$
\begin{aligned}
V_\pi(s_0) &= p(a_1|s_0) \cdot q(s_0,a_1) + p(a_2|s_0) \cdot q(s_0,a_2) \\
&= \sum_{a\in A} \pi(a|s_0) \cdot q_{\pi}(s_0, a)
\end{aligned}
$$

where:
$p(a_1|s_0)$ denotes the probability of choosing action $a_1$ in state $s_0$;
$q(s_0,a_1)$ denotes the $Q$ value (expected reward) after choosing action $a_1$ in state $s_0$;
$p(a_2|s_0)$ denotes the probability of choosing action $a_2$ in state $s_0$;
$q(s_0,a_2)$ denotes the $Q$ value (expected reward) after choosing action $a_2$ in state $s_0$;
$\pi(a|s_0)$ denotes the probability that policy $\pi$ takes action $a \in A$, $A=(a_1, a_2, a_3, \dots, a_n)$, in state $s_0$;
$q_{\pi}(s_0, a)$ denotes the $Q$ value (expected reward) corresponding to taking action $a \in A$ in state $s_0$.
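A minimal Python sketch of this formula (the action names and numeric Q values are hypothetical illustration values):

```python
def v_from_q(policy, q_values):
    """V_pi(s) = sum_a pi(a|s) * q_pi(s, a)."""
    return sum(policy[a] * q_values[a] for a in policy)

# hypothetical policy and Q values for state s0
policy_s0 = {"a1": 0.4, "a2": 0.6}
q_s0 = {"a1": 10.0, "a2": 20.0}
print(v_from_q(policy_s0, q_s0))  # 0.4*10 + 0.6*20 = 16.0
```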

6. Computing the $Q$ value from the $V$ value

Define $q_\pi(s_0, a_1)$ as the $Q$ value of taking action $a_1$ in state $s_0$ under policy $\pi$.
[Figure: taking action $a_1$ in state $s_0$ branches into successor states $s_1$, $s_2$, $s_3$ with rewards $r_1$, $r_2$, $r_3$]

$$
\begin{aligned}
q_\pi(s_0, a_1) &= p(s_1|s_0,a_1) \cdot [\, r_1 + \gamma\, v_\pi(s_1) \,] + p(s_2|s_0,a_1) \cdot [\, r_2 + \gamma\, v_\pi(s_2) \,] + p(s_3|s_0,a_1) \cdot [\, r_3 + \gamma\, v_\pi(s_3) \,] \\
&= R_{s_0}^{a_1} + \gamma \sum_{s'} P_{s_0 s'}^{a_1} \cdot v_\pi(s')
\end{aligned}
$$

where:
$R_{s_0}^{a_1}$ denotes the (expected) immediate reward obtained when taking action $a_1$ in state $s_0$ and jumping to a new state;
$\gamma$ is the discount factor;
$P_{s_0 s'}^{a_1}$ denotes the state transition probability of jumping to the new state $s'$ when taking action $a_1$ in state $s_0$;
$v_\pi(s')$ denotes the $V$ value of the new state $s'$.
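A minimal Python sketch of this formula (the transition probabilities, immediate rewards, successor V values, and $\gamma$ are all hypothetical illustration values):

```python
def q_from_v(transitions, gamma=0.9):
    """q_pi(s, a) = sum_s' p(s'|s,a) * (r + gamma * v_pi(s'))
                  = R_s^a + gamma * sum_s' P_{ss'}^a * v_pi(s')."""
    return sum(p * (r + gamma * v) for p, r, v in transitions)

# hypothetical (probability, immediate reward, V of successor) triples for (s0, a1)
transitions_s0_a1 = [(0.2, 1.0, 10.0), (0.5, 0.0, 20.0), (0.3, -1.0, 5.0)]
print(q_from_v(transitions_s0_a1))  # 0.2*(1+9) + 0.5*(0+18) + 0.3*(-1+4.5) = 12.05
```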

7. Computing the $V$ value from the $V$ value

More often, we need to compute a $V$ value from other $V$ values. More precisely, we compute the $V$ value of the current state $s$ from the $V$ values of its successor states $s'$.

We already know:

$$
\begin{aligned}
V_\pi(s_0) &= p(a_1|s_0) \cdot q(s_0,a_1) + p(a_2|s_0) \cdot q(s_0,a_2) \\
&= \sum_{a\in A} \pi(a|s_0) \cdot q_{\pi}(s_0, a)
\end{aligned}
$$

$$
\begin{aligned}
q_\pi(s_0, a_1) &= p(s_1|s_0,a_1) \cdot [\, r_1 + \gamma\, v_\pi(s_1) \,] + p(s_2|s_0,a_1) \cdot [\, r_2 + \gamma\, v_\pi(s_2) \,] + p(s_3|s_0,a_1) \cdot [\, r_3 + \gamma\, v_\pi(s_3) \,] \\
&= R_{s_0}^{a_1} + \gamma \sum_{s'} P_{s_0 s'}^{a_1} \cdot v_\pi(s') \\
q_\pi(s_0, a_2) &= R_{s_0}^{a_2} + \gamma \sum_{s'} P_{s_0 s'}^{a_2} \cdot v_\pi(s')
\end{aligned}
$$

Therefore, substituting the expression for $q_\pi(s, a)$ into the formula for $V_\pi(s)$ gives:

$$
V_\pi(s) = \sum_{a\in A} \pi(a|s) \cdot \Big[\, R_{s}^{a} + \gamma \sum_{s'\in S} P_{ss'}^{a} \cdot v_\pi(s') \,\Big]
$$
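This is the Bellman expectation equation for $V_\pi$, and it can be applied repeatedly to evaluate a policy. A minimal Python sketch of iterative policy evaluation on a tiny made-up MDP (all states, actions, transitions, rewards, and $\gamma$ below are hypothetical, not from the article):

```python
# Tiny hypothetical MDP: transitions[s][a] is a list of (prob, reward, next_state).
transitions = {
    "s0": {"a1": [(1.0, 0.0, "s1")], "a2": [(1.0, 1.0, "s2")]},
    "s1": {"a1": [(1.0, 10.0, "end")]},
    "s2": {"a1": [(1.0, 5.0, "end")]},
    "end": {},
}
policy = {  # pi(a|s): uniform where there is a choice, empty at the terminal state
    "s0": {"a1": 0.5, "a2": 0.5},
    "s1": {"a1": 1.0},
    "s2": {"a1": 1.0},
    "end": {},
}

def policy_evaluation(transitions, policy, gamma=0.9, iters=100):
    """Repeatedly apply V(s) = sum_a pi(a|s) * sum_s' p(s'|s,a) * (r + gamma * V(s'))."""
    v = {s: 0.0 for s in transitions}
    for _ in range(iters):
        v = {
            s: sum(
                pa * sum(p * (r + gamma * v[s2]) for p, r, s2 in transitions[s][a])
                for a, pa in policy[s].items()
            )
            for s in transitions
        }
    return v

print(policy_evaluation(transitions, policy))  # V converges after a few sweeps
```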

