
6.6 RL: MDP and Reward Function


As a complete beginner who wants to step into the black hole of DRL, after reading up on DRL I want to use A3C, a policy-gradient-based DRL algorithm, to do some work and publish a few papers in my own field. Before studying A3C, let's first learn about AC. The learning path: RL -- AC -- AC-PID -- A3C -- A3C-PID.

Introduction

a. Reinforcement learning is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal.

In other words, reinforcement learning is learning what to do -- how to build a mapping from states to actions (how to make decisions) -- so as to maximize the return.

b. The two most important features of RL: trial-and-error search and delayed reward.

c. Comparison with supervised learning and unsupervised learning

Supervised learning tries to learn from a labeled training set, where the labels are provided in advance. The learning goal is for the system to generalize correctly to data outside the training set.

Unsupervised learning looks for latent structure in unlabeled datasets.

Reinforcement learning tries to maximize a reward signal, rather than looking for latent structure in the data.

d. The reinforcement learning four-tuple

E=<S,A,P,R>

S: the current state

A: the set of all actions that can be taken

P: the state transition probability for each (state, action) pair

R: the reward function

The overall process: given the current state S, an action is selected from the action set A and applied in S, so that S transfers to another state according to the transition probability P; the environment then gives feedback on the action according to the reward function R.
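To make the four-tuple concrete, here is a minimal Python sketch of this interaction loop. The toy states, actions, probabilities and rewards are invented for illustration, not taken from the original post.

```python
import random

# A tiny, hypothetical MDP with states S, actions A,
# transition probabilities P and reward function R.
S = ["s0", "s1", "s2"]            # state set
A = ["left", "right"]             # action set
P = {                             # P[(s, a)] -> list of (next_state, probability)
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s2", 0.9), ("s1", 0.1)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s2", "right"): [("s2", 1.0)],
    ("s2", "left"):  [("s1", 1.0)],
}
R = lambda s, a, s_next: 1.0 if s_next == "s2" else 0.0   # reward function

def step(s, a):
    """Sample the next state from P and return (next_state, reward)."""
    next_states, probs = zip(*P[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R(s, a, s_next)

# Interaction loop: choose an action, apply it, observe the transition and reward.
s = "s0"
for t in range(5):
    a = random.choice(A)          # a (random) policy picks an action
    s_next, r = step(s, a)
    print(t, s, a, s_next, r)
    s = s_next
```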


1 Markov Decision Process (MDP)

An MDP is a stochastic process used to describe the interaction between a controlled object (the agent) and its environment.

The four key elements of an MDP: state, action, immediate reward, and transition probability (transition model).

At some time t, the controlled object and the environment are in state s_t, and the object performs action a_t; under the joint effect of s_t and a_t, the system moves to the next state s_{t+1} and at the same time obtains from the environment the immediate reward r_t for this step.

We write r_t = r[s_t, a_t, s_{t+1}] for the immediate reward at time t, and R for the set of all rewards. The transition probability is the conditional probability distribution of the next state given the state and action at the current time, written p[s_{t+1}|s_t, a_t]. It expresses the defining property of an MDP: the state of the system at time t+1 depends only on the state and action at time t, and is independent of the history before time t.
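Written out explicitly, this Markov property says that conditioning on the full history adds nothing beyond the most recent state and action:

p[s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0] = p[s_{t+1} \mid s_t, a_t]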

The control objective of an MDP is described as finding the optimal policy function (a policy function gives the conditional probability of selecting each action in a given state, i.e. a state-to-action mapping) such that the expected cumulative reward of the whole process is maximized or minimized, namely:
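In the discounted setting used in the rest of this section, a standard way to write this objective (with discount factor \gamma; maximization shown, minimization is analogous) is:

\max_{\pi} \; J(\pi) = E_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \Big]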

Combining this with the MDP optimization objective, we can define a value function and an action-value function. The value function is the expected cumulative reward obtained by starting from state s and following policy \pi until the end of the process (that is, the expected cumulative reward obtained by running a whole episode under the policy).

Similarly, the action-value function is the expected cumulative reward obtained by starting from state s, performing action a, and then following policy \pi until the end of the process (that is, the expected cumulative reward of starting from (s, a) and executing policy \pi until the end of the process).
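In symbols, the standard definitions of these two functions, using the same discounted return as above, are:

V^{\pi}(s) = E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s \Big], \qquad Q^{\pi}(s, a) = E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a \Big]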

The bootstrapping idea in dynamic programming can be simply understood as updating the value estimate at the current time step using the value estimate at the next time step.
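Concretely, the bootstrapped backup that dynamic programming performs is the Bellman expectation equation, which requires the transition probabilities p to be known:

V^{\pi}(s_t) = \sum_{a_t} \pi(a_t \mid s_t) \sum_{s_{t+1}} p[s_{t+1} \mid s_t, a_t] \big( r[s_t, a_t, s_{t+1}] + \gamma V^{\pi}(s_{t+1}) \big)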

The TD algorithm combines the ideas of the Monte Carlo (MC) method and dynamic programming. Compared with dynamic programming, TD does not need to know the system's transition probabilities, which makes it better suited to practical control problems. Compared with MC, TD is an incremental learning method: it only needs the state information of the current and the next time step, instead of having to wait until an episode ends before the value function can be updated, so it can also be applied when the MDP horizon is infinite.

Since r_t + \gamma V^{\pi}(s_{t+1}) is an estimate of V^{\pi}(s_t), the difference r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) is the TD error.
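As an illustration, here is a minimal tabular TD(0) sketch built around this TD error. The environment interface env_step, the policy function, the learning rate alpha and the discount gamma are assumptions for the example, not something specified in the text.

```python
from collections import defaultdict

def td0_evaluation(env_step, policy, start_state, episodes=100, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation.

    env_step(s, a) -> (s_next, reward, done)   # assumed environment interface
    policy(s)      -> a                         # the policy being evaluated
    """
    V = defaultdict(float)                      # value estimates, initialized to 0
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            # TD error: r + gamma * V(s') - V(s); no bootstrap at terminal states
            td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += alpha * td_error            # incremental update, no transition model needed
            s = s_next
    return V
```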

2  Reward function in reinforcement learning

2.1 The essence of the reward function

The "goals" or "purposes" an agent has can all be reduced to: maximizing the expected value of the cumulative sum of the scalar reward signals it receives (called the "return").

The reward function is a bridge between people and algorithms: people translate the desired tasks and goals into a reward function according to a particular syntax, the reinforcement learning algorithm "compiles" it, and the result finally runs in the interaction between the agent and the environment. The performance of this compiler and the quality of the reward function jointly determine the performance of the learned policy.

2.2 Main-line rewards

Main-line events: the main tasks and objectives of reinforcement learning can be divided into ① the achievement of a qualitative objective, e.g. the agent reaching the end point in a two-dimensional navigation task, winning a chess game, or clearing a game level; ② the optimization of a quantitative objective, e.g. maximizing investment return or minimizing power consumption.

Main-line rewards: ① for qualitative tasks, no matter how complex the task is, as long as completion can be judged, a positive reward can be given to the agent directly when the qualitative objective is achieved; ② for quantitative tasks, the quantity itself, or some transformation of it, can be returned as the reward.
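As a sketch of the two cases, the following main-line rewards might be used; the task, threshold and function names are invented for illustration and are not from the original post.

```python
import numpy as np

def qualitative_mainline_reward(agent_pos, goal_pos, tol=0.1):
    """Case ①: reward only when the qualitative goal (reaching the end point) is achieved."""
    reached = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos)) < tol
    return 1.0 if reached else 0.0          # sparse: zero everywhere except at success

def quantitative_mainline_reward(profit):
    """Case ②: return the quantity itself (or some transformation, e.g. its increment)."""
    return float(profit)                    # e.g. investment return at this step
```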

2.3 The sparse reward problem

If only main-line rewards are used, a sparse reward problem often arises. The problem can be described as follows: the feedback signal is sparse, so in the early stage of training it is hard to form local knowledge and hard to give local guidance, leading to blind exploration; in the late stage of training only one-sided guidance can be given, leading to one-sided exploitation. The result is low sample efficiency or even failure to converge, making learning difficult.

Sparse rewards hurt the sample efficiency of reinforcement learning.

Approaches to the sparse reward problem:

① Improve the probability of occurrence and the utilization efficiency of informative transitions;

② Use genetic algorithms (GA) or evolutionary algorithms instead of DRL methods;

③ Improve the design of the reward function itself (one common instance, potential-based reward shaping, is sketched below).
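A minimal sketch of approach ③ using potential-based reward shaping, a standard technique for densifying a sparse main-line reward; the potential function (negative distance to the goal) and the coefficients are assumptions for this example.

```python
import numpy as np

def shaped_reward(mainline_reward, s, s_next, goal_pos, gamma=0.99, k=1.0):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).

    Phi is chosen here as negative distance to the goal (an assumption for this example),
    so the shaping term gives dense local guidance without changing the optimal policy.
    """
    phi = lambda state: -k * np.linalg.norm(np.asarray(state) - np.asarray(goal_pos))
    return mainline_reward + gamma * phi(s_next) - phi(s)
```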

2.4 The dilemma of reward function design

1. Main-line rewards specifically encourage the occurrence of main-line events: a reward is given only when the qualitative objective is achieved or the quantitative objective is improved. Optimizing the return is therefore equivalent to promoting the achievement of the qualitative objective or optimizing the quantitative objective; this is usually unbiased, but the rewards are sparse.

2. Adding supplementary rewards gives the agent more guidance, but at the same time the goal actually achieved by maximizing the return is shifted away from the intended one, which can cause abnormal behavior in the agent. The abnormal behavior can be divided into:

① Rashness

There is no penalty for the undesired behavior, or the penalty is too small, so the agent either fails to learn to avoid the event, or, after weighing the pros and cons, chooses to accept the penalty for this event in exchange for a larger return.

② Greed

Wireheading refers to the problem where, because the action space is set up unreasonably, the agent learns to alter how environmental information is perceived and processed by performing special actions, so as to obtain excess rewards or covertly avoid punishment. Reward hacking refers to the situation where the reward function one-sidedly rewards a sub-goal without checks and balances, so the agent may repeatedly manufacture local gains while ignoring the original goal.

③ Cowardice

There are many supplementary penalty terms, and their absolute values are too large relative to the main-line reward.

At the beginning of training the agent receives a large amount of negative feedback, which keeps it from further exploring main-line events and obtaining the main-line reward, so it falls into a local optimum.

Summary

The process of designing a reward function is the process of adding auxiliary rewards on top of the main-line reward. On the one hand this strengthens the guidance given to the agent and promotes convergence of the algorithm; on the other hand it also introduces bias in the objective, which then has to be reduced at a high cost.

The purpose of the reward function is to specify the task objective to the agent; the question of which reward function best conveys that objective is also known as the optimal reward problem.
