Reinforcement learning - Basic Concepts
2022-07-28 06:09:00 【Food to doubt life】
Preface
All of the concepts in this article are taken from the book 《Deep Reinforcement Learning》; if there are any mistakes, please feel free to point them out.
Basic concepts
Probability theory
- A random variable represents an uncertain quantity, usually written as an uppercase letter; its value depends on the outcome of a random event.
- After an experiment, the value taken by a random variable is called an observation (observed value), usually written as a lowercase letter.
- The probability of a discrete random variable is given by its probability mass function.
- The probability of a continuous random variable is obtained by integrating its probability density function (see the sketch after this list).
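A small Python sketch of the two cases; the fair-die and uniform-distribution examples below are my own illustrations, not taken from the article:

```python
# Discrete case: P(X = x) is read off the probability mass function (fair six-sided die).
pmf = {x: 1 / 6 for x in range(1, 7)}
p_even = sum(pmf[x] for x in (2, 4, 6))          # P(X is even) = 0.5

# Continuous case: P(a <= Y <= b) is the integral of the probability density function.
# Y ~ Uniform(0, 1), so pdf(y) = 1 on [0, 1]; integrate numerically with a midpoint sum.
def integrate(pdf, a, b, n=100_000):
    dy = (b - a) / n
    return sum(pdf(a + (i + 0.5) * dy) for i in range(n)) * dy

p_interval = integrate(lambda y: 1.0, 0.2, 0.7)  # ≈ 0.5

print(p_even, p_interval)
```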
Monte Carlo
- In short, Monte Carlo means using observed values to compute an approximation of a target quantity; the more observations are used, the more accurate the result. For example, if a random variable $A$ has expectation $E(A)$, we can run $m$ experiments, obtain $m$ observations of $A$, and take the average of these $m$ observations as an approximation of $E(A)$; the larger $m$ is, the closer the approximation is to $E(A)$ (see the sketch below).
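A minimal sketch of this idea in Python; the choice of a Uniform(0, 1) random variable is my own example:

```python
import random

def monte_carlo_mean(sample_fn, m):
    """Approximate E(A) by averaging m observations of the random variable A."""
    return sum(sample_fn() for _ in range(m)) / m

# A ~ Uniform(0, 1), so the true expectation is E(A) = 0.5;
# the larger m is, the closer the estimate tends to be to 0.5.
for m in (10, 1_000, 100_000):
    print(m, monte_carlo_mean(random.random, m))
```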
Basic concepts of reinforcement learning
Goal of reinforcement learning: to find a decision rule (a policy) that makes the system obtain the maximum cumulative reward.
State: a summary of the current environment. In the game of Go, for example, the positions of all the pieces on the current board constitute the state; the state is the only basis for making decisions.
State space: the set of all possible states; it can be finite or infinite.
Action: the decision that is made. In the Super Mario game, for example, Mario can only move left, move right, or jump up, so an action is one of these three.
Action space: the set of all possible actions. In the Super Mario example, the action space is {up, left, right}.
Agent: the subject that performs the action; whoever performs the action is the agent. In the Super Mario example, Mario is the agent.
Reward: a value returned by the environment to the agent after the agent performs an action. For example, a pupil (the agent) finishes the homework (the action), and the parents let the pupil play Honor of Kings for an hour (the reward). The reward depends on the current state $s_t$ and the action $a$ performed by the agent, and in some cases also on the state $s_{t+1}$ at the next moment.
Environment: whatever generates the new state is the environment.
State transition: given a state $s$, after the agent performs an action $a$, the environment produces the state of the next moment through the state transition function.
Interaction between the agent and the environment: observe the current state $s$; the AI uses the policy function to compute the probability of every action, samples an action at random according to these probabilities, and has the agent execute it; after the agent performs the action, the environment generates a new state according to the state transition function and feeds a reward back to the agent, as sketched below.
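A minimal sketch of one round of this interaction loop in Python; the `env` and `policy` objects and their methods are hypothetical placeholders, loosely following a Gym-style interface:

```python
import random

def run_episode(env, policy, max_steps=1000):
    """Run one episode of agent-environment interaction and return the total reward."""
    state = env.reset()                          # observe the initial state s
    total_reward = 0.0
    for _ in range(max_steps):
        probs = policy(state)                    # probability of every action in state s
        action = random.choices(range(len(probs)), weights=probs)[0]  # random sampling
        state, reward, done = env.step(action)   # environment applies the state transition...
        total_reward += reward                   # ...and feeds back a reward
        if done:
            break
    return total_reward
```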
Return: the sum of all rewards from the current moment until the end, also called the cumulative reward. Let the return at time $t$ be the random variable $U_t$ and the reward at time $t$ be $R_t$; then
$$U_t = R_t + R_{t+1} + R_{t+2} + \cdots$$
Discounted return: let the discount rate be $\gamma \in [0,1]$; then the discounted return is
$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$
The discount rate is a hyperparameter. The randomness of the return $U_t$ comes from the action at time $t$ and from the actions and states after time $t$. A small numeric sketch of the discounted return follows.
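A minimal sketch, assuming a short list of rewards observed from time $t$ until the episode ends (the numbers are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """U_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards from time t onward, with discount rate gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```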
Commonly used functions and notation
Action value function: its mathematical expression is
$$Q_{\pi}(s_t,a_t)=E_{S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t,A_t=a_t]$$
It is the expected return (i.e. the average over the remaining randomness) obtained after the agent, under policy $\pi$, takes action $a_t$ in state $s_t$. Its value depends on the policy $\pi$, the state $s_t$ at time $t$, and the action $a_t$.
Optimal action value function: its mathematical expression is
$$Q_{*}(s_t,a_t)=\max_{\pi} Q_{\pi}(s_t,a_t)$$
When the policy $\pi$ is optimal, the action value function becomes the optimal action value function; its value depends only on the state $s_t$ and the action $a_t$ at time $t$.
State transition function: the environment uses the state transition function to generate the new state; it is usually a conditional probability density function. For example, when an AI plays chess against a human, the board state after the AI's move depends on where the human places a piece, and that human move is random. Let the current state of the agent be $S$ and the action be $A$; then the state transition function is
$$p(s' \mid s,a)=P(S'=s' \mid S=s,A=a)$$
Policy function: it makes decisions based on the observed state and thereby controls the agent. Let the state be $S$ and the action be $A$; the policy function is the conditional probability density function
$$\pi(a \mid s)=P(A=a \mid S=s)$$
That is, it gives the probability of taking action $a$ given the current state. The goal of reinforcement learning is to learn the policy function, and how the reward is defined has a large impact on how well reinforcement learning works.
State value function: it measures how good the current state is; the larger the expected future return, the better the current state. Its mathematical expression is
$$V_{\pi}(s_t)=E_{A_t,S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t]$$
Its relationship with the action value function is (here $A_t \sim \pi(\cdot \mid s_t)$ means that $A_t$ follows the distribution $\pi(\cdot \mid s_t)$):
$$\begin{aligned} V_{\pi}(s_t)&=E_{A_t,S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t]\\ &=E_{A_t \sim \pi(\cdot \mid s_t)}\big[E_{S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t,A_t]\big]\\ &=E_{A_t \sim \pi(\cdot \mid s_t)}[Q_{\pi}(s_t,A_t)] \end{aligned}$$
The state value function depends on the current state and gives the expectation of the future return; the action value function depends on the current state and action and gives the expectation of the future return.
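For a discrete action space the last identity is just a weighted average, $V_{\pi}(s_t)=\sum_a \pi(a \mid s_t)\,Q_{\pi}(s_t,a)$. A minimal numeric sketch; the probabilities and Q values below are made up for illustration:

```python
def state_value(policy_probs, q_values):
    """V_pi(s) = sum over actions a of pi(a|s) * Q_pi(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

# Hypothetical numbers for a single state s with three possible actions.
pi_given_s = [0.2, 0.5, 0.3]     # pi(a|s) for a = 0, 1, 2
q_given_s = [1.0, 3.0, -2.0]     # Q_pi(s, a) for a = 0, 1, 2
print(state_value(pi_given_s, q_given_s))   # 0.2*1 + 0.5*3 + 0.3*(-2) = 1.1
```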
Value learning and policy learning
- Value learning: the goal of reinforcement learning is to learn the optimal action value function or the optimal state value function, and then use it to control the agent's actions.
- Policy learning: the goal of reinforcement learning is to learn the policy function, and then use the policy function to control the agent's actions (a sketch contrasting the two follows this list).
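A minimal sketch of the two ways of controlling the agent; the Q-table, the policy probabilities, and the state encoding are all hypothetical:

```python
import random

# Value learning: control the agent with a learned optimal action value function,
# i.e. pick the action with the largest Q*(s, a) in the current state.
def act_by_value(q_star, state):
    return max(range(len(q_star[state])), key=lambda a: q_star[state][a])

# Policy learning: control the agent with a learned policy function,
# i.e. sample an action from pi(.|s).
def act_by_policy(pi, state):
    probs = pi[state]
    return random.choices(range(len(probs)), weights=probs)[0]

q_star = {"s0": [0.1, 0.7, 0.2]}   # hypothetical Q*(s0, a) for three actions
pi = {"s0": [0.1, 0.6, 0.3]}       # hypothetical pi(a|s0)
print(act_by_value(q_star, "s0"), act_by_policy(pi, "s0"))
```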