Reinforcement learning - Basic Concepts
2022-07-28 06:09:00 【Food to doubt life】
Preface
All concepts in this article are taken from 《Deep Reinforcement Learning》. If there are mistakes, corrections are welcome.
Basic concepts
Probability theory
- A random variable is an uncertain quantity whose value depends on the outcome of a random event; it is usually written with an uppercase letter.
- After an experiment, the value taken by a random variable is called an observation (observed value), usually written with a lowercase letter.
- For a discrete random variable, probabilities are given by its probability mass function (PMF).
- For a continuous random variable, probabilities are obtained by integrating its probability density function (PDF).
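A minimal sketch of the two cases, using a fair six-sided die and a standard normal distribution as illustrative examples (not taken from the book):

```python
import numpy as np

# Discrete case: a fair six-sided die.
# The PMF assigns 1/6 to each face, so P(X <= 2) is a finite sum.
pmf = {face: 1 / 6 for face in range(1, 7)}
p_le_2 = sum(p for face, p in pmf.items() if face <= 2)
print(p_le_2)  # 1/3

# Continuous case: standard normal distribution.
# P(-1 <= X <= 1) is the integral of the PDF over [-1, 1],
# approximated here with a simple Riemann sum.
x = np.linspace(-1.0, 1.0, 10001)
dx = x[1] - x[0]
pdf = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
print(np.sum(pdf) * dx)  # about 0.68
```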
Monte Carlo
- In short, Monte Carlo methods use observed values to approximate a target quantity; the more observations are used, the more accurate the result. For example, if a random variable $A$ has expectation $E(A)$, we can run $m$ experiments, obtain $m$ observations of $A$, and take their average as an approximation of $E(A)$; the larger $m$ is, the closer the approximation gets to $E(A)$.
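A minimal sketch of this idea, assuming for illustration that $A$ is a standard normal random variable, so the true expectation is 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_mean(m):
    """Approximate E(A) by averaging m observations of A.

    Here A is assumed to be a standard normal random variable,
    so the true expectation is 0.
    """
    observations = rng.standard_normal(m)
    return observations.mean()

for m in (10, 1_000, 100_000):
    print(m, monte_carlo_mean(m))  # the approximation improves as m grows
```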
Basic concepts of reinforcement learning
The goal of reinforcement learning: to find a decision rule (a policy) that lets the system obtain the maximum cumulative reward.
State: a summary of the current environment. In the game of Go, for example, the positions of all pieces on the current board constitute the state. The state is the only basis for making decisions.
State space: the set of all possible states. It can be finite or infinite.
Action: the decision that is made. In the Super Mario game, for example, Mario can only move left, move right, or jump up, so an action is one of those three.
Action space: the set of all possible actions. In the Super Mario example, the action space is {up, left, right}.
Agent: the subject that performs actions; whoever performs the action is the agent. In the Super Mario example, Mario is the agent.
Reward: a value the environment returns to the agent after the agent performs an action. For example, a pupil (the agent) finishes homework (the action), and his parents let him play Honor of Kings for an hour (the reward). The reward depends on the current state $s_t$ and the action $a$ performed by the agent; in some cases it also depends on the state $s_{t+1}$ at the next moment.
Environment: whatever generates new states is the environment.
State transition: given a state $s$, the agent performs an action $a$, and the environment produces the state at the next moment through the state transition function.
Agent-environment interaction: the agent observes the current state $s$; the policy function computes a probability for each action; an action is sampled according to these probabilities and executed by the agent; after the action is executed, the environment generates a new state according to the state transition function and returns a reward to the agent.
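A minimal sketch of this interaction loop. The `ToyEnv` class, its `reset`/`step` interface, and the uniform random policy are illustrative assumptions, not something defined in the book:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Hypothetical 1-D grid environment, used only to illustrate the loop."""
    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = move left, action 1 = move right; the right end terminates the episode
        self.state = max(0, min(self.size - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def policy(state, n_actions=2):
    """A uniform random policy pi(a|s), used here as a placeholder."""
    return np.full(n_actions, 1.0 / n_actions)

env = ToyEnv()
state = env.reset()
done = False
while not done:
    probs = policy(state)                      # pi(.|s): probability of each action
    action = rng.choice(len(probs), p=probs)   # sample an action from the policy
    state, reward, done = env.step(action)     # the environment transitions and returns a reward
```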
Return: the sum of all rewards from the current moment until the end, also called the cumulative reward. Let the return at time $t$ be the random variable $U_t$ and the reward at time $t$ be $R_t$; then

$$U_t = R_t + R_{t+1} + R_{t+2} + \cdots$$

Discounted return: let the discount rate be $\gamma \in [0,1]$; then the discounted return is

$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$

The discount rate is a hyperparameter. The randomness of the return $U_t$ comes from the action at time $t$ and from the actions and states after time $t$.
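A small sketch that computes the discounted return of a finite episode from a list of rewards (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    u = 0.0
    for reward in reversed(rewards):  # accumulate from the end of the episode backwards
        u = reward + gamma * u
    return u

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```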
Common function symbols
Action value function: its mathematical expression is

$$Q_{\pi}(s_t, a_t) = E_{S_{t+1}, A_{t+1}, \ldots, S_n, A_n}\left[U_t \mid S_t = s_t, A_t = a_t\right]$$

It means: under policy $\pi$, the expected return (that is, the average over the remaining randomness) after the agent takes action $a_t$ in state $s_t$. Its value depends on the policy $\pi$, the state $s_t$ at time $t$, and the action $a_t$.

Optimal action value function: its mathematical expression is

$$Q_{*}(s_t, a_t) = \max_{\pi} Q_{\pi}(s_t, a_t)$$

When the policy $\pi$ is optimal, the action value function becomes the optimal action value function; its value depends only on the state $s_t$ and the action $a_t$ at time $t$.

State transition function: the environment uses the state transition function to generate new states. It is usually a conditional probability density function. For example, when an AI plays chess against a human, the state of the board after the AI moves depends on where the human places the next piece, and that move is random. Let the agent's current state be $S$ and its action be $A$; then the state transition function is

$$P(s' \mid s, a) = P(S' = s' \mid S = s, A = a)$$

Policy function: makes decisions based on the observed state, thereby controlling the agent. Let the state be $S$ and the action be $A$; the policy function is the conditional probability density function

$$\pi(a \mid s) = P(A = a \mid S = s)$$

that is, the probability of taking action $a$ given the current state. The goal of reinforcement learning is to learn the policy function, and how the reward is defined has a large impact on how well reinforcement learning works.

State value function: measures how good the current state is; the larger the expected future return, the better the current state. Its mathematical expression is

$$V_{\pi}(s_t) = E_{A_t, S_{t+1}, A_{t+1}, \ldots, S_n, A_n}\left[U_t \mid S_t = s_t\right]$$

Its relationship to the action value function (where $A_t \sim \pi(\cdot \mid s_t)$ means $A_t$ follows the distribution $\pi(\cdot \mid s_t)$) is

$$\begin{aligned} V_{\pi}(s_t) &= E_{A_t, S_{t+1}, A_{t+1}, \ldots, S_n, A_n}\left[U_t \mid S_t = s_t\right]\\ &= E_{A_t \sim \pi(\cdot \mid s_t)}\left[E_{S_{t+1}, A_{t+1}, \ldots, S_n, A_n}\left[U_t \mid S_t = s_t, A_t\right]\right]\\ &= E_{A_t \sim \pi(\cdot \mid s_t)}\left[Q_{\pi}(s_t, A_t)\right] \end{aligned}$$

The state value function depends on the current state and gives the expected future return; the action value function depends on the current state and action and gives the expected future return.
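A tiny numerical sketch of the last identity, $V_\pi(s) = E_{A \sim \pi(\cdot \mid s)}[Q_\pi(s, A)]$, for a discrete action space; the Q values and policy probabilities below are made-up numbers:

```python
import numpy as np

# Hypothetical values for a single state s with 3 possible actions.
q_values = np.array([1.0, 2.5, 0.5])   # Q_pi(s, a) for a = 0, 1, 2
pi_probs = np.array([0.2, 0.5, 0.3])   # pi(a | s), must sum to 1

# V_pi(s) is the expectation of Q_pi(s, A) with A sampled from pi(.|s).
v = np.dot(pi_probs, q_values)
print(v)  # 0.2*1.0 + 0.5*2.5 + 0.3*0.5 = 1.6
```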
Value learning and policy learning
- Value learning: the goal of reinforcement learning is to learn the optimal action value function or the optimal state value function, and then use it to control the agent's actions (as sketched after this list).
- Policy learning: the goal of reinforcement learning is to learn the policy function, and then use the policy function to control the agent's actions.
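A minimal sketch contrasting the two control styles for a single state; the Q* values and policy probabilities are made-up numbers, not learned quantities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned quantities for one state with 3 actions.
q_star = np.array([0.1, 1.2, 0.7])     # optimal action value function Q*(s, .)
pi_probs = np.array([0.1, 0.6, 0.3])   # learned policy pi(. | s)

# Value learning: act greedily with respect to Q*.
action_value_based = int(np.argmax(q_star))

# Policy learning: sample an action from the learned policy.
action_policy_based = int(rng.choice(len(pi_probs), p=pi_probs))

print(action_value_based, action_policy_based)
```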