6.6 RL: MDP and Reward Function
2022-06-12 11:40:00 【The metamorphosis of chicken with vegetables】
As a newcomer who wants to step into the black hole of DRL, after reading up on DRL I want to use A3C, a policy-gradient-based DRL algorithm, to do some work so that I can publish a few papers in my own field. Before studying A3C, let's first get to know AC. The learning path: RL -- AC -- AC-PID -- A3C -- A3C-PID.
Introduction
a. Reinforcement learning is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal.
Reinforcement learning is learning what to do -- how to build a mapping from states to actions (how to make decisions) -- so as to maximize the return.
b. The two most important features of RL: trial-and-error search and delayed reward.
c. Comparison with supervised learning and unsupervised learning
Supervised learning tries to learn from a labeled training set whose labels are provided in advance; the goal is for the system to generalize correctly to data outside the training set.
Unsupervised learning looks for latent structure in unlabeled data sets.
Reinforcement learning tries to maximize a reward signal rather than to find latent data structure.
d. The reinforcement learning quadruple
E = <S, A, P, R>
S: the current state
A: the set of actions that can be taken
P: the state transition probabilities
R: the reward function
The overall process: for the current state S, an action is selected from the action set A and applied to S, so that S moves to another state according to the transition probability P, and the environment then gives feedback on the action according to the reward function R. A minimal sketch of this loop follows.
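Below is a minimal Python sketch of this <S, A, P, R> interaction loop. The two-state MDP, its transition probabilities, and the +1 reward for reaching s1 are made-up assumptions for illustration only.

```python
import random

states = ["s0", "s1"]           # S: set of states
actions = ["left", "right"]     # A: set of actions

# P: transition probabilities, P[(s, a)] -> list of (next_state, probability)
P = {
    ("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
    ("s0", "right"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "left"):  [("s0", 0.8), ("s1", 0.2)],
    ("s1", "right"): [("s0", 0.1), ("s1", 0.9)],
}

# R: reward function, here +1 whenever the next state is s1 (an assumption)
def reward(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0

def step(s, a):
    """Sample s' from P(. | s, a) and return (s', r)."""
    next_states, probs = zip(*P[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, reward(s, a, s_next)

# Interaction loop: pick an action from A, the state moves according to P,
# and the environment feeds back a reward according to R.
s = "s0"
for t in range(5):
    a = random.choice(actions)   # a placeholder random policy
    s_next, r = step(s, a)
    print(t, s, a, s_next, r)
    s = s_next
```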
1 Markov Decision Process (MDP)
A stochastic process used to describe the interaction between a controlled agent (the control object) and its environment.
1.1 The four key elements of an MDP: state (state), action (action), immediate reward (reward), and transition probability (transition model).
At some moment t, the controlled agent observes the state $s_t$ of the environment and performs the action $a_t$. Under the joint effect of $s_t$ and $a_t$, the system moves to the next state $s_{t+1}$, and at the same time the environment feeds back the immediate reward $r_{t+1}$ for this step.
We use $r_t$ to denote the immediate reward at time t and R to denote the set of all reward values. The transition probability is the conditional distribution of the next state given the state and action at the current time, written $P(s_{t+1} \mid s_t, a_t)$.
The state transition probability reflects the Markov property of an MDP: the state of the system at time t+1 depends only on the state and action at time t, and is independent of the history before time t.
The control objective of an MDP is described as finding the optimal policy function (a policy is the conditional probability of selecting an action in a given state, or a mapping from states to actions) that maximizes or minimizes the expected cumulative reward of the whole process, i.e.:
$$\pi^{*} = \arg\max_{\pi}\, \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1}\Big]$$
Combining this with the MDP optimization objective, we can define the value function and the action-value function. The value function represents the expected cumulative reward obtained by starting from state s and following policy $\pi$ until the end of the process (i.e., the expected cumulative reward obtained after running a whole episode with the policy):
$$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\Big]$$
Similarly, the action-value function represents the expected cumulative reward obtained by starting from state s, performing action a, and then following policy $\pi$ until the end of the process:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s,\, a_t = a\Big]$$
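To make the definition concrete, here is a minimal Monte Carlo sketch (an illustration, not the post's own code): $V^{\pi}(s)$ is estimated by running whole episodes under a fixed policy and averaging the observed discounted returns. The toy two-action environment and its numbers are assumptions.

```python
import random

gamma = 0.9

def policy(s):
    """pi(a|s): a fixed random policy over two actions (placeholder)."""
    return random.choice(["left", "right"])

def env_step(s, a):
    """Toy environment (assumption): 'right' reaches the goal 80% of the time."""
    if a == "right" and random.random() < 0.8:
        return "goal", 1.0, True
    return "s0", 0.0, False

def episode_return(s0, max_steps=50):
    """Run one episode under the policy; return its discounted return G."""
    s, g, discount = s0, 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = env_step(s, a)
        g += discount * r
        discount *= gamma
        if done:
            break
    return g

# V^pi(s0) is approximated by the average return over many episodes
n = 1000
v_hat = sum(episode_return("s0") for _ in range(n)) / n
print("Monte Carlo estimate of V(s0):", v_hat)
```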
The bootstrapping idea in dynamic programming can be understood simply as using the value estimate at the next time step to correct the value estimate at the current time step.
The TD algorithm combines the ideas of the MC (Monte Carlo) method and dynamic programming. Compared with dynamic programming, TD does not need to know the transition probabilities of the system, so it is better suited to practical control problems. Compared with MC, TD is an incremental learning method: it only needs the state information of the current and the next time step, instead of waiting until the end of an episode, to compute the value-function update, so it can also be applied when the horizon of the MDP is infinite.
Since $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of $V(s_t)$, the difference $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.
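A minimal tabular TD(0) sketch of this update, under assumed step-size and discount values; the transition used at the end is only a placeholder:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """One TD(0) update: move V(s) toward the bootstrapped target
    r + gamma * V(s'), i.e. by alpha times the TD error delta."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]          # TD error: r + gamma*V(s') - V(s)
    V[s] += alpha * delta
    return delta

V = defaultdict(float)             # tabular value estimates, initialized to 0
# placeholder transition: from "s0", reward 1.0, to "s1"
print("TD error:", td0_update(V, "s0", 1.0, "s1"))
print("updated V(s0):", V["s0"])
```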
2 Reward function in reinforcement learning
2.1 The essence of the reward function
Whatever "goal" or "purpose" the agent has, it boils down to: maximizing the expected value of the cumulative sum of the scalar reward signals the agent receives (called the "return").
The reward function is the bridge between people and the algorithm: people translate the desired task and goal into a reward function according to a special syntax, the reinforcement learning algorithm "compiles" it, and it finally runs in the interaction between the agent and the environment. The quality of this compiler and the quality of the reward function together determine the performance of the policy.
2.2 Mainline rewards
Mainline events: the main tasks and goals of reinforcement learning can be divided into ① achieving a qualitative goal, e.g. the agent reaching the end point in a two-dimensional navigation task, winning a game of chess, clearing a game level; ② optimizing a quantitative objective, e.g. maximizing investment return or minimizing power consumption.
Mainline rewards: ① For qualitative tasks, no matter how complex the task is, as long as we can judge whether it has been completed, a positive reward can be given to the agent directly when the qualitative goal is reached. ② For quantitative tasks, the objective quantity itself, or some transformation of it, can be returned as the reward. Both cases are sketched below.
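As an illustration of ① and ②, the sketch below shows one mainline reward for a qualitative navigation goal and one for a quantitative power-consumption goal. The goal position, the tolerance, and the power reading are hypothetical names, not from the original post.

```python
import math

GOAL = (5.0, 5.0)   # hypothetical goal position for the navigation task
TOL = 0.1           # hypothetical distance tolerance for "reached the goal"

def mainline_reward_qualitative(position):
    """Qualitative goal: +1 only when the agent reaches the goal."""
    return 1.0 if math.dist(position, GOAL) < TOL else 0.0

def mainline_reward_quantitative(power_consumption_watts):
    """Quantitative goal: minimize power consumption, so return the
    measured quantity itself (with a sign flip) as the reward."""
    return -power_consumption_watts

print(mainline_reward_qualitative((5.0, 5.05)))   # 1.0, goal reached
print(mainline_reward_quantitative(37.5))         # -37.5
```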
2.3 The sparse reward problem
If only mainline rewards are used, the result is often a sparse reward problem. The problem can be described as follows: the feedback signal is sparse, so in the early stage of training it is hard to form local knowledge and hard to give local guidance, which leads to blind exploration; in the late stage of training only one-sided guidance can be given, which leads to one-sided exploitation. Sample efficiency is low, the algorithm may even fail to converge, and learning becomes difficult.
Sparse rewards hurt the sample efficiency of reinforcement learning.
Solutions to the sparse reward problem:
① improve the probability of generating, and the efficiency of using, effective transitions;
② use GA or evolutionary algorithms instead of DRL methods;
③ improve the design of the reward function itself (e.g. reward shaping, sketched after this list).
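For ③, one common improvement is potential-based reward shaping: add a dense auxiliary term to the sparse mainline reward so that every step toward the goal yields a small learning signal. The sketch below assumes the same hypothetical navigation task as above; it illustrates the idea rather than the post's own method.

```python
import math

GOAL = (5.0, 5.0)   # hypothetical goal position
gamma = 0.99

def potential(position):
    """Phi(s): negative distance to the goal (closer means higher potential)."""
    return -math.dist(position, GOAL)

def shaped_reward(mainline_r, position, next_position):
    """Sparse mainline reward plus the potential-based shaping term
    F(s, s') = gamma * Phi(s') - Phi(s)."""
    return mainline_r + gamma * potential(next_position) - potential(position)

# A step that moves the agent closer to the goal receives a small positive
# bonus even while the mainline reward is still 0.
print(shaped_reward(0.0, (0.0, 0.0), (1.0, 1.0)))
```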
2.4 The dilemma of reward function design
1 Mainline rewards specifically encourage the occurrence of mainline events: a reward is given only when the qualitative goal is achieved or the quantitative objective is improved. Optimizing the return is therefore equivalent to promoting the qualitative goal or pushing the quantitative objective to its extremum, which is usually unbiased, but the rewards are sparse.
2 Adding supplementary rewards gives the agent more guidance, but at the same time maximizing the return may drift away from the intended goal, which leads to abnormal agent behavior. The abnormal behavior can be divided into:
① Rashness
Undesired behavior carries no penalty, or the penalty is too small, so the agent either never learns to avoid the event or, after weighing the pros and cons, chooses to bear the penalty in exchange for a larger return.
② Greed
Wireheading refers to the problem that an unreasonably designed action space lets the agent learn special actions that alter its own perception and processing of environmental information, so as to obtain excess reward or covertly avoid punishment. Reward hacking refers to the situation where the reward function rewards a sub-goal one-sidedly without checks and balances, so the agent may repeatedly manufacture local gains while ignoring the original goal.
③ Cowardice
There are many supplementary penalty terms and their absolute values are too large relative to the mainline reward. At the beginning of training the agent receives a lot of negative feedback, which discourages it from further exploring the mainline events and obtaining rewards, so it falls into a local optimum.
Summary
Designing a reward function is the process of adding auxiliary rewards on top of the mainline reward. On one hand this strengthens the guidance given to the agent and promotes convergence of the algorithm; on the other hand it also introduces goal bias, and reducing that bias is costly.
The purpose of the reward function is to specify the task objective for the agent; this is also known as the optimal reward problem.