Reinforcement learning - Basic Concepts
2022-07-28 06:09:00 【Food to doubt life】
Preface
All concepts in this article are taken from *Deep Reinforcement Learning*. If there are any mistakes, corrections are welcome.
Basic concepts
Probability theory
- A random variable represents an uncertain quantity and is usually written in uppercase; its value depends on the outcome of a random event
- In an experiment, the value taken by a random variable is called an observation and is usually written in lowercase
- For a discrete random variable, probabilities are given by its probability mass function
- For a continuous random variable, probabilities are obtained by integrating its probability density function
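As a quick numerical illustration of the last two points (a minimal sketch using scipy.stats; the particular distributions are just example choices, not from the book):

```python
from scipy.stats import binom, norm

# Discrete random variable: P(X = 3) for X ~ Binomial(n=10, p=0.5),
# read directly from the probability mass function.
p_discrete = binom.pmf(3, n=10, p=0.5)

# Continuous random variable: P(0 <= Y <= 1) for Y ~ N(0, 1),
# obtained by integrating the probability density function
# (here via the cumulative distribution function).
p_continuous = norm.cdf(1.0) - norm.cdf(0.0)

print(p_discrete, p_continuous)
```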
Monte Carlo
- In short, use observations to compute an approximation of a target quantity; the more observations are used, the more accurate the result. For example, if a random variable $A$ has expectation $E(A)$, we can run $m$ experiments to obtain $m$ observations of $A$ and take their average as an approximation of $E(A)$; the larger $m$ is, the closer the approximation gets to $E(A)$.
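A minimal sketch of this averaging idea (the uniform distribution below is an arbitrary example, chosen only so that the true expectation is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend A ~ Uniform(0, 10), so the true expectation is E(A) = 5.
m = 100_000                        # number of experiments / observations
observations = rng.uniform(0, 10, size=m)

# Monte Carlo estimate: the average of the m observations.
estimate = observations.mean()
print(estimate)                    # close to 5 for large m
```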
Basic concepts of reinforcement learning
The goal of reinforcement learning: to find a decision rule (a policy) that lets the system obtain the maximum cumulative reward.
State: a summary of the current environment. In the game of Go, for example, the positions of all pieces on the current board form the state. The state is the only basis for making a decision.
State space: the set of all possible states. The state space can be either infinite or finite.
Action: the decision that is made. In the Super Mario game, for example, Mario can only move left, move right, or jump up, so an action is one of these three.
Action space: the set of all possible actions. In the Super Mario example, the action space is {up, left, right}.
Agent: the subject that performs the action; whoever performs the action is the agent. In the Super Mario example, Mario is the agent.
Reward: a value returned by the environment to the agent after the agent performs an action. For example, a pupil (the agent) finishes homework (the action), and the parents let the pupil play an hour of Honor of Kings (the reward). The reward depends on the current state $s_t$ and the action $a$ performed by the agent, and in some cases also on the state at the next moment $s_{t+1}$.
Environment: whatever generates the new state is the environment.
State transition: given a state $s$, the agent performs an action $a$, and the environment produces the state of the next moment through the state transition function.
Agent-environment interaction: observe the current state $s$; the policy function computes the probability of every action; an action is then sampled at random according to these probabilities and performed by the agent; after the agent performs the action, the environment generates a new state according to the state transition function and returns a reward to the agent, as sketched below.
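A minimal sketch of this loop (the reset/step environment interface and the uniform random policy are assumptions made here for illustration, not code from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state, n_actions=3):
    """Placeholder policy: returns a probability for every action.
    A real policy would depend on the observed state."""
    return np.ones(n_actions) / n_actions

def interact(env, n_steps=100):
    state = env.reset()                            # observe the initial state s
    for t in range(n_steps):
        probs = policy(state)                      # probabilities of all actions
        action = rng.choice(len(probs), p=probs)   # random sampling of an action
        state, reward, done = env.step(action)     # environment: new state + reward
        if done:
            break
```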
Return: the sum of all rewards from the current moment until the end, also called the cumulative reward. Let the return at time $t$ be the random variable $U_t$ and the reward at time $t$ be $R_t$. Then

$$U_t = R_t + R_{t+1} + R_{t+2} + \cdots$$

Discounted return: let the discount rate be $\gamma \in [0,1]$. Then the discounted return is

$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$

The discount rate is a hyperparameter. The randomness of the return $U_t$ comes from the action at time $t$ and from the actions and states after time $t$.
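A small sketch of computing the discounted return for a finite episode (the reward values below are arbitrary examples):

```python
def discounted_return(rewards, gamma=0.99):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...
    computed backwards over a finite reward sequence."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

print(discounted_return([1.0, 0.0, 2.0, 5.0], gamma=0.9))
```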
Common function symbols
Action-value function: its mathematical expression is

$$Q_{\pi}(s_t,a_t)=E_{S_{t+1},A_{t+1},\dots,S_n,A_n}\left[U_t \mid S_t=s_t,A_t=a_t\right]$$

It is the expected return after the agent, following policy $\pi$, takes action $a_t$ in state $s_t$ (an expectation, i.e. an average over the remaining randomness). Its value depends on the policy $\pi$ as well as the state $s_t$ and action $a_t$ at time $t$.

Optimal action-value function: its mathematical expression is

$$Q_{*}(s_t,a_t)=\max_{\pi} Q_{\pi}(s_t,a_t)$$

When the policy $\pi$ is optimal, the action-value function is the optimal action-value function; its value depends only on the state $s_t$ and action $a_t$ at time $t$.

State transition function: the environment uses the state transition function to generate the new state. The state transition function is usually a conditional probability density function. For example, when an AI plays chess against a human, the state of the board after the AI moves depends on where the human places the next piece, and the human's move is random. Let the agent's current state be $S$ and its action be $A$; then the state transition function is

$$P(s' \mid s,a)=P(S'=s' \mid S=s,A=a)$$

Policy function: makes decisions based on the observed state and thereby controls the agent. Let the state be $S$ and the action be $A$; the policy function is the conditional probability density function

$$\pi(a \mid s)=P(A=a \mid S=s)$$

that is, the probability of taking action $a$ given the current state. The goal of reinforcement learning is to learn the policy function, and the definition of the reward has a large impact on how well reinforcement learning works.

State-value function: used to measure how good the current state is; the larger the future return, the better the current state. Its mathematical expression is

$$V_{\pi}(s_t)=E_{A_t,S_{t+1},A_{t+1},\dots,S_n,A_n}\left[U_t \mid S_t=s_t\right]$$

Its relationship to the action-value function is

$$\begin{aligned} V_{\pi}(s_t)&=E_{A_t,S_{t+1},A_{t+1},\dots,S_n,A_n}\left[U_t \mid S_t=s_t\right]\\ &=E_{A_t \sim \pi(\cdot \mid s_t)}\left[E_{S_{t+1},A_{t+1},\dots,S_n,A_n}\left[U_t \mid S_t=s_t,A_t\right]\right]\\ &=E_{A_t \sim \pi(\cdot \mid s_t)}\left[Q_{\pi}(s_t,A_t)\right] \end{aligned}$$

The state-value function depends on the current state and gives the expectation of the future return; the action-value function depends on the current state and action and gives the expectation of the future return.
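A tiny numerical check of the last identity, $V_{\pi}(s_t)=E_{A_t \sim \pi(\cdot \mid s_t)}[Q_{\pi}(s_t,A_t)]$, for a single state (the policy and Q values below are hypothetical):

```python
import numpy as np

# Hypothetical policy pi(a|s) over 3 actions in some state s,
# and hypothetical action values Q_pi(s, a) for that same state.
pi_a_given_s = np.array([0.2, 0.5, 0.3])
q_s = np.array([1.0, 3.0, -2.0])

# V_pi(s) is the expectation of Q_pi(s, A) with A drawn from pi(.|s).
v_s = np.sum(pi_a_given_s * q_s)
print(v_s)   # 0.2*1 + 0.5*3 + 0.3*(-2) = 1.1
```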
Value learning and policy learning
- Value learning: the goal of reinforcement learning is to learn the optimal action-value function or the optimal state-value function, and then use it to control the agent's actions
- Policy learning: the goal of reinforcement learning is to learn the policy function, and then use the policy function to control the agent's actions
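A minimal sketch of how the two approaches control the agent differently (the Q table and policy table below are hypothetical placeholders standing in for learned functions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3

# Hypothetical learned action-value table Q(s, a) and policy table pi(a|s).
Q = rng.normal(size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

state = 2

# Value learning: act greedily with respect to the action-value function.
action_value_based = int(np.argmax(Q[state]))

# Policy learning: sample an action from the policy function.
action_policy_based = int(rng.choice(n_actions, p=pi[state]))

print(action_value_based, action_policy_based)
```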