Reinforcement learning - Basic Concepts
2022-07-28 06:09:00 【Food to doubt life】
Preface
All of the concepts in this article are taken from the book 《Deep Reinforcement Learning》; if there are any mistakes, please feel free to point them out.
Basic concepts
Probability theory
- A random variable represents an uncertain quantity, usually written as an uppercase letter; its value depends on the outcome of a random event.
- After an experiment, the value taken by a random variable is called an observation (observed value), usually written as a lowercase letter.
- The probability of a discrete random variable is given by its probability mass function.
- The probability of a continuous random variable is obtained by integrating its probability density function (see the sketch after this list).
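A small Python sketch of the two cases; the fair-die and uniform-distribution examples below are my own illustrations, not taken from the article:

```python
# Discrete case: P(X = x) is read off the probability mass function (fair six-sided die).
pmf = {x: 1 / 6 for x in range(1, 7)}
p_even = sum(pmf[x] for x in (2, 4, 6))          # P(X is even) = 0.5

# Continuous case: P(a <= Y <= b) is the integral of the probability density function.
# Y ~ Uniform(0, 1), so pdf(y) = 1 on [0, 1]; integrate numerically with a midpoint sum.
def integrate(pdf, a, b, n=100_000):
    dy = (b - a) / n
    return sum(pdf(a + (i + 0.5) * dy) for i in range(n)) * dy

p_interval = integrate(lambda y: 1.0, 0.2, 0.7)  # ≈ 0.5

print(p_even, p_interval)
```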
Monte Carlo
- In short, Monte Carlo means using observed values to compute an approximation of a target quantity; the more observations are used, the more accurate the result. For example, if a random variable $A$ has expectation $E(A)$, we can run $m$ experiments, obtain $m$ observations of $A$, and take the average of these $m$ observations as an approximation of $E(A)$; the larger $m$ is, the closer the approximation is to $E(A)$ (see the sketch below).
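A minimal sketch of this idea in Python; the choice of a Uniform(0, 1) random variable is my own example:

```python
import random

def monte_carlo_mean(sample_fn, m):
    """Approximate E(A) by averaging m observations of the random variable A."""
    return sum(sample_fn() for _ in range(m)) / m

# A ~ Uniform(0, 1), so the true expectation is E(A) = 0.5;
# the larger m is, the closer the estimate tends to be to 0.5.
for m in (10, 1_000, 100_000):
    print(m, monte_carlo_mean(random.random, m))
```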
Basic concepts of reinforcement learning
Goal of reinforcement learning: to find a decision rule (a policy) that makes the system obtain the maximum cumulative reward.
State: a summary of the current environment. In the game of Go, for example, the positions of all the pieces on the current board constitute the state; the state is the only basis for making decisions.
State space: the set of all possible states; it can be finite or infinite.
Action: the decision that is made. In the Super Mario game, for example, Mario can only move left, move right, or jump up, so an action is one of these three.
Action space: the set of all possible actions. In the Super Mario example, the action space is {up, left, right}.
Agent: the subject that performs the action; whoever performs the action is the agent. In the Super Mario example, Mario is the agent.
Reward: a value returned by the environment to the agent after the agent performs an action. For example, a pupil (the agent) finishes the homework (the action), and the parents let the pupil play Honor of Kings for an hour (the reward). The reward depends on the current state $s_t$ and the action $a$ performed by the agent, and in some cases also on the state $s_{t+1}$ at the next moment.
Environment: whatever generates the new state is the environment.
State transition: given a state $s$, after the agent performs an action $a$, the environment produces the state of the next moment through the state transition function.
Interaction between the agent and the environment: observe the current state $s$; the AI uses the policy function to compute the probability of every action, samples an action at random according to these probabilities, and has the agent execute it; after the agent performs the action, the environment generates a new state according to the state transition function and feeds a reward back to the agent, as sketched below.
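A minimal sketch of one round of this interaction loop in Python; the `env` and `policy` objects and their methods are hypothetical placeholders, loosely following a Gym-style interface:

```python
import random

def run_episode(env, policy, max_steps=1000):
    """Run one episode of agent-environment interaction and return the total reward."""
    state = env.reset()                          # observe the initial state s
    total_reward = 0.0
    for _ in range(max_steps):
        probs = policy(state)                    # probability of every action in state s
        action = random.choices(range(len(probs)), weights=probs)[0]  # random sampling
        state, reward, done = env.step(action)   # environment applies the state transition...
        total_reward += reward                   # ...and feeds back a reward
        if done:
            break
    return total_reward
```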
Return: the sum of all rewards from the current moment until the end, also called the cumulative reward. Let the return at time $t$ be the random variable $U_t$ and the reward at time $t$ be $R_t$; then
$$U_t = R_t + R_{t+1} + R_{t+2} + \cdots$$
Discounted return: let the discount rate be $\gamma \in [0,1]$; then the discounted return is
$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$$
The discount rate is a hyperparameter. The randomness of the return $U_t$ comes from the action at time $t$ and from the actions and states after time $t$. A small numeric sketch of the discounted return follows.
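A minimal sketch, assuming a short list of rewards observed from time $t$ until the episode ends (the numbers are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """U_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards from time t onward, with discount rate gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0, 1.0], gamma=0.9))  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```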
Commonly used functions and notation
Action value function: its mathematical expression is
$$Q_{\pi}(s_t,a_t)=E_{S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t,A_t=a_t]$$
It is the expected return (i.e. the average over the remaining randomness) obtained after the agent, under policy $\pi$, takes action $a_t$ in state $s_t$. Its value depends on the policy $\pi$, the state $s_t$ at time $t$, and the action $a_t$.
Optimal action value function: its mathematical expression is
$$Q_{*}(s_t,a_t)=\max_{\pi} Q_{\pi}(s_t,a_t)$$
When the policy $\pi$ is optimal, the action value function becomes the optimal action value function; its value depends only on the state $s_t$ and the action $a_t$ at time $t$.
State transition function: the environment uses the state transition function to generate the new state; it is usually a conditional probability density function. For example, when an AI plays chess against a human, the board state after the AI's move depends on where the human places a piece, and that human move is random. Let the current state of the agent be $S$ and the action be $A$; then the state transition function is
$$p(s' \mid s,a)=P(S'=s' \mid S=s,A=a)$$
Policy function: it makes decisions based on the observed state and thereby controls the agent. Let the state be $S$ and the action be $A$; the policy function is the conditional probability density function
$$\pi(a \mid s)=P(A=a \mid S=s)$$
That is, it gives the probability of taking action $a$ given the current state. The goal of reinforcement learning is to learn the policy function, and how the reward is defined has a large impact on how well reinforcement learning works.
State value function: it measures how good the current state is; the larger the expected future return, the better the current state. Its mathematical expression is
$$V_{\pi}(s_t)=E_{A_t,S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t]$$
Its relationship with the action value function is (here $A_t \sim \pi(\cdot \mid s_t)$ means that $A_t$ follows the distribution $\pi(\cdot \mid s_t)$):
$$\begin{aligned} V_{\pi}(s_t)&=E_{A_t,S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t]\\ &=E_{A_t \sim \pi(\cdot \mid s_t)}\big[E_{S_{t+1},A_{t+1},\ldots,S_n,A_n}[U_t \mid S_t=s_t,A_t]\big]\\ &=E_{A_t \sim \pi(\cdot \mid s_t)}[Q_{\pi}(s_t,A_t)] \end{aligned}$$
The state value function depends on the current state and gives the expectation of the future return; the action value function depends on the current state and action and gives the expectation of the future return.
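For a discrete action space the last identity is just a weighted average, $V_{\pi}(s_t)=\sum_a \pi(a \mid s_t)\,Q_{\pi}(s_t,a)$. A minimal numeric sketch; the probabilities and Q values below are made up for illustration:

```python
def state_value(policy_probs, q_values):
    """V_pi(s) = sum over actions a of pi(a|s) * Q_pi(s, a)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

# Hypothetical numbers for a single state s with three possible actions.
pi_given_s = [0.2, 0.5, 0.3]     # pi(a|s) for a = 0, 1, 2
q_given_s = [1.0, 3.0, -2.0]     # Q_pi(s, a) for a = 0, 1, 2
print(state_value(pi_given_s, q_given_s))   # 0.2*1 + 0.5*3 + 0.3*(-2) = 1.1
```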
Value learning and policy learning
- Value learning: the goal of reinforcement learning is to learn the optimal action value function or the optimal state value function, and then use it to control the agent's actions.
- Policy learning: the goal of reinforcement learning is to learn the policy function, and then use the policy function to control the agent's actions (a sketch contrasting the two follows this list).
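A minimal sketch of the two ways of controlling the agent; the Q-table, the policy probabilities, and the state encoding are all hypothetical:

```python
import random

# Value learning: control the agent with a learned optimal action value function,
# i.e. pick the action with the largest Q*(s, a) in the current state.
def act_by_value(q_star, state):
    return max(range(len(q_star[state])), key=lambda a: q_star[state][a])

# Policy learning: control the agent with a learned policy function,
# i.e. sample an action from pi(.|s).
def act_by_policy(pi, state):
    probs = pi[state]
    return random.choices(range(len(probs)), weights=probs)[0]

q_star = {"s0": [0.1, 0.7, 0.2]}   # hypothetical Q*(s0, a) for three actions
pi = {"s0": [0.1, 0.6, 0.3]}       # hypothetical pi(a|s0)
print(act_by_value(q_star, "s0"), act_by_policy(pi, "s0"))
```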