[reinforcement learning notes] V value and Q value in reinforcement learning
2022-06-09 03:19:00 【Allenpandas】
1. Background knowledge
In a Markov chain, when an agent in state $S$ chooses action $A$, it transitions to another state $S'$; at the same time, the environment gives the agent a reward $R$.

Rewards can be positive or negative. A positive reward means we encourage the agent to keep behaving this way in this state; a negative reward means we do not want the agent to act this way. In reinforcement learning, the reward $R$ is used as the signal that guides the agent's learning, and we want the agent to obtain as much reward as possible.

More often than not, however, the immediate reward $R$ alone cannot measure how good an action is; we must take a long-term view of the problem. We also need to account for future rewards when evaluating the current state, and only then make a decision.
2. Understanding the $V$ value and the $Q$ value
$V$ value: the value used to evaluate a state; we call it the $V$ value. It represents the expected total reward the agent will receive from this state until the final state.

$Q$ value: the value used to evaluate an action; we call it the $Q$ value. It represents the expected total reward the agent will receive after choosing this action, until the final state.
3. Introduction to the $V$ value
$V$ value definition: the value used to evaluate a state. It represents the expected total reward the agent will receive from this state until the final state.

$V$ value calculation: compute the expectation of the total reward from the current state $S$ to the final state. Put simply: starting from a given state and following policy $\pi$ until the final state, the average of the total reward finally obtained (the reward expectation) is the $V$ value.
【Example】 In the example below, two actions can be taken from state $s_0$: $a_1$ and $a_2$. Starting from state $s_0$ and executing action $a_1$, the total reward to the final state is $R_1 = +10$; starting from state $s_0$ and executing action $a_2$, the total reward to the final state is $R_2 = +20$.
Assumption 1: In state $s_0$, the probability of executing action $a_1$ is 40% and the probability of executing action $a_2$ is 60%. Then the reward expectation from state $s_0$ to the final state is:
$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 40\% \cdot 10 + 60\% \cdot 20 \\
&= 16
\end{aligned}
$$
where $p(a_1|s_0)$ denotes the probability of choosing action $a_1$ in state $s_0$, and $p(a_2|s_0)$ denotes the probability of choosing action $a_2$ in state $s_0$.
Assumption 2: In state $s_0$, the probability of executing action $a_1$ is 50% and the probability of executing action $a_2$ is 50%. Then the reward expectation $\overline{R}$ from state $s_0$ to the final state is:
$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 50\% \cdot 10 + 50\% \cdot 20 \\
&= 15
\end{aligned}
$$
Assumption 3: In state $s_0$, the probability of executing action $a_1$ is 60% and the probability of executing action $a_2$ is 40%. Then the reward expectation $\overline{R}$ from state $s_0$ to the final state is:
$$
\begin{aligned}
V = \overline{R} &= p(a_1|s_0) \cdot R_1 + p(a_2|s_0) \cdot R_2 \\
&= 60\% \cdot 10 + 40\% \cdot 20 \\
&= 14
\end{aligned}
$$
The three assumptions above show that different policies $\pi$ lead to different $V$ values. In other words, the $V$ value depends directly on the policy $\pi$.
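The calculation above is just a policy-weighted average. A minimal sketch in Python, assuming each action's total reward to the final state is already known (the +10 and +20 from the example):

```python
# Reward expectation of s0 under a policy: V(s0) = sum_a p(a|s0) * R_a.
# Action names and per-action returns follow the example above.

returns = {"a1": 10.0, "a2": 20.0}  # total reward to the final state for each action

def v_value(policy, returns):
    """Policy-weighted average of the per-action total rewards."""
    return sum(policy[a] * returns[a] for a in policy)

for name, policy in [
    ("assumption 1", {"a1": 0.4, "a2": 0.6}),
    ("assumption 2", {"a1": 0.5, "a2": 0.5}),
    ("assumption 3", {"a1": 0.6, "a2": 0.4}),
]:
    print(name, v_value(policy, returns))  # prints 16.0, 15.0, 14.0
```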
4. Introduction to the $Q$ value
$Q$ value definition: the value used to evaluate an action. It represents the expected total reward the agent will receive after choosing this action, until the final state.

$Q$ value calculation: compute the expectation of the total reward from taking action $A$ until the final state. Put simply: starting from a given action and continuing until the final state, the average of the total reward finally obtained (the reward expectation) is the $Q$ value.

Note: unlike the $V$ value, the $Q$ value is not directly related to the policy $\pi$; it is related to the environment's state transition probabilities (which are unknown to us, and which we can neither learn nor change).
【Example】 In the example below, after taking action $a_1$ the agent may jump to state $s_1$, from which the total reward to the final state is +10; to state $s_2$, from which the total reward to the final state is +20; or to state $s_3$, from which the total reward to the final state is +5.
$$
\begin{aligned}
Q(s_0, a_1) = \overline{R}(s_0, a_1) &= p(s_1|s_0,a_1) \cdot R_1 + p(s_2|s_0,a_1) \cdot R_2 + p(s_3|s_0,a_1) \cdot R_3 \\
&= p(s_1|s_0,a_1) \cdot 10 + p(s_2|s_0,a_1) \cdot 20 + p(s_3|s_0,a_1) \cdot 5
\end{aligned}
$$
where:
$p(s_1|s_0,a_1)$ denotes the probability of jumping to state $s_1$ after choosing action $a_1$ in state $s_0$;
$p(s_2|s_0,a_1)$ denotes the probability of jumping to state $s_2$ after choosing action $a_1$ in state $s_0$;
$p(s_3|s_0,a_1)$ denotes the probability of jumping to state $s_3$ after choosing action $a_1$ in state $s_0$.
Note: the state transition probabilities $p(s_1|s_0,a_1)$, $p(s_2|s_0,a_1)$ and $p(s_3|s_0,a_1)$ are determined by the system; we can neither learn nor change them.
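A minimal numerical sketch of this weighted sum. The transition probabilities below are made-up values for illustration (in practice they are fixed by the environment); the per-state returns are the +10, +20 and +5 from the example:

```python
# Q(s0, a1) = sum over next states s' of p(s'|s0, a1) * R_{s'}.

transition = {"s1": 0.2, "s2": 0.5, "s3": 0.3}  # p(s'|s0, a1), assumed values
returns = {"s1": 10.0, "s2": 20.0, "s3": 5.0}   # total reward to the final state from s'

def q_value(transition, returns):
    """Transition-probability-weighted average of the per-state total rewards."""
    return sum(transition[s] * returns[s] for s in transition)

print(q_value(transition, returns))  # 0.2*10 + 0.5*20 + 0.3*5 = 13.5
```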
5. Computing the $V$ value from the $Q$ value
The $V$ value represents the expected total reward the agent receives from a state until the final state. The $V$ value of a state is the expectation, under policy $\pi$, of the $Q$ values of all actions available in that state.
$$
\begin{aligned}
V_\pi(s_0) &= p(a_1|s_0) \cdot q(s_0,a_1) + p(a_2|s_0) \cdot q(s_0,a_2) \\
&= \sum_{a\in A} \pi(a|s_0) \cdot q_{\pi}(s_0, a)
\end{aligned}
$$
where:
$p(a_1|s_0)$ denotes the probability of choosing action $a_1$ in state $s_0$;
$q(s_0,a_1)$ denotes the $Q$ value (the reward expectation obtained) after choosing action $a_1$ in state $s_0$;
$p(a_2|s_0)$ denotes the probability of choosing action $a_2$ in state $s_0$;
$q(s_0,a_2)$ denotes the $Q$ value (the reward expectation obtained) after choosing action $a_2$ in state $s_0$;
$\pi(a|s_0)$ denotes the probability that policy $\pi$ takes action $a \in A$, $A = (a_1, a_2, a_3, \dots, a_n)$, in state $s_0$;
$q_{\pi}(s_0, a)$ denotes the $Q$ value (the reward expectation obtained) of taking action $a \in A$, $A = (a_1, a_2, a_3, \dots, a_n)$, in state $s_0$.
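A minimal sketch of this formula, assuming the $Q$ value of each action in $s_0$ is already known (the numbers and the policy are illustrative only):

```python
# V_pi(s0) = sum_a pi(a|s0) * q_pi(s0, a)

q_values = {"a1": 13.5, "a2": 8.0}   # q_pi(s0, a), assumed
policy = {"a1": 0.4, "a2": 0.6}      # pi(a|s0), assumed

def v_from_q(policy, q_values):
    """Policy-weighted expectation of the action values."""
    return sum(policy[a] * q_values[a] for a in policy)

print(v_from_q(policy, q_values))  # 0.4*13.5 + 0.6*8.0 = 10.2
```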
6. Computing the $Q$ value from the $V$ value
Define $q_\pi(s_0, a_1)$ as the $Q$ value of taking action $a_1$ in state $s_0$ under policy $\pi$. After taking $a_1$, the agent lands in some successor state, collects the immediate reward, and from then on earns the discounted value of that successor state:
$$
\begin{aligned}
q_\pi(s_0, a_1) &= p(s_1|s_0,a_1) \cdot [\, r_1 + \gamma\, v_\pi(s_1) \,] + p(s_2|s_0,a_1) \cdot [\, r_2 + \gamma\, v_\pi(s_2) \,] + p(s_3|s_0,a_1) \cdot [\, r_3 + \gamma\, v_\pi(s_3) \,] \\
&= R_{s_0}^{a_1} + \gamma \sum_{s'} P_{s_0 s'}^{a_1} \cdot v_\pi(s')
\end{aligned}
$$
where:
$R_{s_0}^{a_1}$ denotes the expected immediate reward for taking action $a_1$ in state $s_0$, i.e. $\sum_{s'} p(s'|s_0,a_1)\, r_{s'}$;
$\gamma$ is the discount factor;
$P_{s_0 s'}^{a_1}$ denotes the state transition probability of jumping to the new state $s'$ after taking action $a_1$ in state $s_0$;
$v_\pi(s')$ denotes the $V$ value of the new state $s'$.
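A minimal sketch of this one-step backup, with assumed transition probabilities, immediate rewards, successor $V$ values and discount factor:

```python
# q_pi(s0, a1) = sum_{s'} P(s'|s0, a1) * (r_{s'} + gamma * v_pi(s'))
#              = R_{s0}^{a1} + gamma * sum_{s'} P(s'|s0, a1) * v_pi(s')

gamma = 0.9
transition = {"s1": 0.2, "s2": 0.5, "s3": 0.3}  # P_{s0 s'}^{a1}, assumed
rewards = {"s1": 1.0, "s2": 2.0, "s3": 0.5}     # immediate reward r_{s'}, assumed
v_values = {"s1": 10.0, "s2": 20.0, "s3": 5.0}  # v_pi(s'), assumed

def q_from_v(transition, rewards, v_values, gamma):
    """Expected immediate reward plus discounted successor V values."""
    return sum(p * (rewards[s] + gamma * v_values[s]) for s, p in transition.items())

print(q_from_v(transition, rewards, v_values, gamma))  # 2.0 + 10.0 + 1.5 = 13.5
```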
7. Computing the $V$ value from the $V$ value
More often, we need to compute $V$ values from other $V$ values; more precisely, we compute the $V$ value of a state $s$ from the $V$ values of its successor states $s'$.

We already have:
$$
\begin{aligned}
V_\pi(s_0) &= p(a_1|s_0) \cdot q(s_0,a_1) + p(a_2|s_0) \cdot q(s_0,a_2) \\
&= \sum_{a\in A} \pi(a|s_0) \cdot q_{\pi}(s_0, a)
\end{aligned}
$$
$$
q_\pi(s_0, a_1) = R_{s_0}^{a_1} + \gamma \sum_{s'} P_{s_0 s'}^{a_1} \cdot v_\pi(s'), \qquad
q_\pi(s_0, a_2) = R_{s_0}^{a_2} + \gamma \sum_{s'} P_{s_0 s'}^{a_2} \cdot v_\pi(s')
$$
Substituting the second expression into the first therefore gives:
$$
V_\pi(s) = \sum_{a\in A} \pi(a|s) \cdot \Big[ R_{s}^{a} + \gamma \sum_{s'\in S} P_{ss'}^{a} \cdot v_\pi(s') \Big]
$$
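This is the Bellman expectation equation, and applying it repeatedly as an update rule is the basis of iterative policy evaluation. A minimal sketch on a tiny, made-up two-state MDP (all names, transitions and rewards below are illustrative, not from the original post):

```python
# Iterative policy evaluation using
#   V(s) <- sum_a pi(a|s) * [ R_s^a + gamma * sum_{s'} P(s'|s,a) * V(s') ]

# P[s][a] = list of (next_state, transition_probability, immediate_reward)
P = {
    "s0": {"a1": [("s1", 1.0, 1.0)],
           "a2": [("s1", 0.5, 0.0), ("s0", 0.5, 2.0)]},
    "s1": {"a1": [("s1", 1.0, 0.0)],
           "a2": [("s0", 1.0, 1.0)]},
}
pi = {"s0": {"a1": 0.5, "a2": 0.5},   # pi(a|s), assumed uniform policy
      "s1": {"a1": 0.5, "a2": 0.5}}
gamma = 0.9

V = {s: 0.0 for s in P}               # arbitrary initialization of V_pi
for _ in range(200):                  # sweep until approximately converged
    V = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }
print(V)  # approximate V_pi for each state
```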