
6.6 RL: MDP and Reward Function


As a complete beginner who wants to step into the black hole of DRL, after reading up on DRL I want to use A3C, a policy-gradient-based DRL algorithm, to do some work and publish a few papers in my own field. Before studying A3C, let's first learn about AC. The learning path: RL -- AC -- AC-PID -- A3C -- A3C-PID.

Introduction

a. Reinforcement learning is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal.

In other words, reinforcement learning is learning what to do -- how to build a mapping from states to actions (how to make decisions) -- so as to maximize the return.

b. The two most important features of RL: trial-and-error search and delayed reward.

c. Comparison with supervised learning and unsupervised learning

Supervised learning tries to learn from a labeled training set, where the labels are provided in advance. The learning goal is for the system to generalize correctly to data outside the training set.

Unsupervised learning looks for latent structure in unlabeled datasets.

Reinforcement learning tries to maximize a reward signal, rather than looking for latent structure in the data.

d. The reinforcement learning four-tuple

E=<S,A,P,R>

S: the current state

A: the set of all actions that can be taken

P: the state transition probability for each (state, action) pair

R: the reward function

The overall process: given the current state S, an action is selected from the action set A and applied in S, so that S transfers to another state according to the transition probability P; the environment then gives feedback on the action according to the reward function R.
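To make the four-tuple concrete, here is a minimal Python sketch of this interaction loop. The toy states, actions, probabilities and rewards are invented for illustration, not taken from the original post.

```python
import random

# A tiny, hypothetical MDP with states S, actions A,
# transition probabilities P and reward function R.
S = ["s0", "s1", "s2"]            # state set
A = ["left", "right"]             # action set
P = {                             # P[(s, a)] -> list of (next_state, probability)
    ("s0", "right"): [("s1", 0.8), ("s0", 0.2)],
    ("s0", "left"):  [("s0", 1.0)],
    ("s1", "right"): [("s2", 0.9), ("s1", 0.1)],
    ("s1", "left"):  [("s0", 1.0)],
    ("s2", "right"): [("s2", 1.0)],
    ("s2", "left"):  [("s1", 1.0)],
}
R = lambda s, a, s_next: 1.0 if s_next == "s2" else 0.0   # reward function

def step(s, a):
    """Sample the next state from P and return (next_state, reward)."""
    next_states, probs = zip(*P[(s, a)])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R(s, a, s_next)

# Interaction loop: choose an action, apply it, observe the transition and reward.
s = "s0"
for t in range(5):
    a = random.choice(A)          # a (random) policy picks an action
    s_next, r = step(s, a)
    print(t, s, a, s_next, r)
    s = s_next
```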


1 Markov Decision Process (MDP)

An MDP is a stochastic process used to describe the interaction between a controlled object (the agent) and its environment.

The four key elements of an MDP: state, action, immediate reward, and transition probability (transition model).

At some time t, the controlled object and the environment are in state s_t, and the object performs action a_t; under the joint effect of s_t and a_t, the system moves to the next state s_{t+1} and at the same time obtains from the environment the immediate reward r_t for this step.

We write r_t = r[s_t, a_t, s_{t+1}] for the immediate reward at time t, and R for the set of all rewards. The transition probability is the conditional probability distribution of the next state given the state and action at the current time, written p[s_{t+1}|s_t, a_t]. It expresses the defining property of an MDP: the state of the system at time t+1 depends only on the state and action at time t, and is independent of the history before time t.
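Written out explicitly, this Markov property says that conditioning on the full history adds nothing beyond the most recent state and action:

p[s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0] = p[s_{t+1} \mid s_t, a_t]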

The control objective of an MDP is described as finding the optimal policy function (a policy function gives the conditional probability of selecting each action in a given state, i.e. a state-to-action mapping) such that the expected cumulative reward of the whole process is maximized or minimized, namely:
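In the discounted setting used in the rest of this section, a standard way to write this objective (with discount factor \gamma; maximization shown, minimization is analogous) is:

\max_{\pi} \; J(\pi) = E_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \Big]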

Combining this with the MDP optimization objective, we can define a value function and an action-value function. The value function is the expected cumulative reward obtained by starting from state s and following policy \pi until the end of the process (that is, the expected cumulative reward obtained by running a whole episode under the policy).

Similarly, the action-value function is the expected cumulative reward obtained by starting from state s, performing action a, and then following policy \pi until the end of the process (that is, the expected cumulative reward of starting from (s, a) and executing policy \pi until the end of the process).
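In symbols, the standard definitions of these two functions, using the same discounted return as above, are:

V^{\pi}(s) = E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s \Big], \qquad Q^{\pi}(s, a) = E_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a \Big]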

The bootstrapping idea in dynamic programming can be simply understood as updating the value estimate at the current time step using the value estimate at the next time step.
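Concretely, the bootstrapped backup that dynamic programming performs is the Bellman expectation equation, which requires the transition probabilities p to be known:

V^{\pi}(s_t) = \sum_{a_t} \pi(a_t \mid s_t) \sum_{s_{t+1}} p[s_{t+1} \mid s_t, a_t] \big( r[s_t, a_t, s_{t+1}] + \gamma V^{\pi}(s_{t+1}) \big)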

The TD algorithm combines the ideas of the Monte Carlo (MC) method and dynamic programming. Compared with dynamic programming, TD does not need to know the system's transition probabilities, which makes it better suited to practical control problems. Compared with MC, TD is an incremental learning method: it only needs the state information of the current and the next time step, instead of having to wait until an episode ends before the value function can be updated, so it can also be applied when the MDP horizon is infinite.

Since r_t + \gamma V^{\pi}(s_{t+1}) is an estimate of V^{\pi}(s_t), the difference r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) is the TD error.
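As an illustration, here is a minimal tabular TD(0) sketch built around this TD error. The environment interface env_step, the policy function, the learning rate alpha and the discount gamma are assumptions for the example, not something specified in the text.

```python
from collections import defaultdict

def td0_evaluation(env_step, policy, start_state, episodes=100, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation.

    env_step(s, a) -> (s_next, reward, done)   # assumed environment interface
    policy(s)      -> a                         # the policy being evaluated
    """
    V = defaultdict(float)                      # value estimates, initialized to 0
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            # TD error: r + gamma * V(s') - V(s); no bootstrap at terminal states
            td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += alpha * td_error            # incremental update, no transition model needed
            s = s_next
    return V
```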

2  Reward function in reinforcement learning

2.1 The essence of the reward function

The "goals" or "purposes" an agent has can all be reduced to: maximizing the expected value of the cumulative sum of the scalar reward signals it receives (called the "return").

The reward function is a bridge between people and algorithms: people translate the desired tasks and goals into a reward function according to a particular syntax, the reinforcement learning algorithm "compiles" it, and the result finally runs in the interaction between the agent and the environment. The performance of this compiler and the quality of the reward function jointly determine the performance of the learned policy.

2.2 Main-line rewards

Main-line events: the main tasks and objectives of reinforcement learning can be divided into ① the achievement of a qualitative objective, e.g. the agent reaching the end point in a two-dimensional navigation task, winning a chess game, or clearing a game level; ② the optimization of a quantitative objective, e.g. maximizing investment return or minimizing power consumption.

Main-line rewards: ① for qualitative tasks, no matter how complex the task is, as long as completion can be judged, a positive reward can be given to the agent directly when the qualitative objective is achieved; ② for quantitative tasks, the quantity itself, or some transformation of it, can be returned as the reward.
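As a sketch of the two cases, the following main-line rewards might be used; the task, threshold and function names are invented for illustration and are not from the original post.

```python
import numpy as np

def qualitative_mainline_reward(agent_pos, goal_pos, tol=0.1):
    """Case ①: reward only when the qualitative goal (reaching the end point) is achieved."""
    reached = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos)) < tol
    return 1.0 if reached else 0.0          # sparse: zero everywhere except at success

def quantitative_mainline_reward(profit):
    """Case ②: return the quantity itself (or some transformation, e.g. its increment)."""
    return float(profit)                    # e.g. investment return at this step
```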

2.3 The sparse reward problem

If only main-line rewards are used, a sparse reward problem often arises. The problem can be described as follows: the feedback signal is sparse, so in the early stage of training it is hard to form local knowledge and hard to give local guidance, leading to blind exploration; in the late stage of training only one-sided guidance can be given, leading to one-sided exploitation. The result is low sample efficiency or even failure to converge, making learning difficult.

Sparse rewards hurt the sample efficiency of reinforcement learning.

Approaches to the sparse reward problem:

① Improve the probability of occurrence and the utilization efficiency of informative transitions;

② Use genetic algorithms (GA) or evolutionary algorithms instead of DRL methods;

③ Improve the design of the reward function itself (one common instance, potential-based reward shaping, is sketched below).
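A minimal sketch of approach ③ using potential-based reward shaping, a standard technique for densifying a sparse main-line reward; the potential function (negative distance to the goal) and the coefficients are assumptions for this example.

```python
import numpy as np

def shaped_reward(mainline_reward, s, s_next, goal_pos, gamma=0.99, k=1.0):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).

    Phi is chosen here as negative distance to the goal (an assumption for this example),
    so the shaping term gives dense local guidance without changing the optimal policy.
    """
    phi = lambda state: -k * np.linalg.norm(np.asarray(state) - np.asarray(goal_pos))
    return mainline_reward + gamma * phi(s_next) - phi(s)
```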

2.4 The dilemma of reward function design

1. Main-line rewards specifically encourage the occurrence of main-line events: a reward is given only when the qualitative objective is achieved or the quantitative objective is improved. Optimizing the return is therefore equivalent to promoting the achievement of the qualitative objective or optimizing the quantitative objective; this is usually unbiased, but the rewards are sparse.

2. Adding supplementary rewards gives the agent more guidance, but at the same time the goal actually achieved by maximizing the return is shifted away from the intended one, which can cause abnormal behavior in the agent. The abnormal behavior can be divided into:

① Rashness

There is no penalty for the undesired behavior, or the penalty is too small, so the agent either fails to learn to avoid the event, or, after weighing the pros and cons, chooses to accept the penalty for this event in exchange for a larger return.

② Greed

Wireheading refers to the problem where, because the action space is set up unreasonably, the agent learns to alter how environmental information is perceived and processed by performing special actions, so as to obtain excess rewards or covertly avoid punishment. Reward hacking refers to the situation where the reward function one-sidedly rewards a sub-goal without checks and balances, so the agent may repeatedly manufacture local gains while ignoring the original goal.

③ Cowardice

There are many supplementary penalty terms, and their absolute values are too large relative to the main-line reward.

At the beginning of training the agent receives a large amount of negative feedback, which keeps it from further exploring main-line events and obtaining the main-line reward, so it falls into a local optimum.

Summary

The process of designing a reward function is the process of adding auxiliary rewards on top of the main-line reward. On the one hand this strengthens the guidance given to the agent and promotes convergence of the algorithm; on the other hand it also introduces bias in the objective, which then has to be reduced at a high cost.

The purpose of the reward function is to specify the task objective to the agent; the question of which reward function best conveys that objective is also known as the optimal reward problem.
