
Reinforcement learning series (I): basic principles and concepts

2022-07-06 13:41:00 zhugby

Catalog

I. What is reinforcement learning?

II. The structure of reinforcement learning

Layer 1

Layer 2

Layer 3

III. Value functions

1) Policy function

2) Value function

3) Conversion between Q and V

4) Q-value update: the Bellman equation

IV. Characteristics of reinforcement learning

V. Advantages of reinforcement learning


I. What is reinforcement learning?

        In recent years, reinforcement learning has become very popular in academia, and most people have heard the term at least in passing. So what is reinforcement learning? Reinforcement learning is a branch of machine learning in which an agent learns to achieve a goal through interaction with its environment. Take the TSP as an example: the agent is a traveling salesman who observes changes in the environment (the map of customer locations to visit) and, based on the current state (the node he is currently at), takes an action (the next node to visit). Each action changes the environment, the agent receives a new state, and then chooses a new action, and the cycle continues. The choice of action in a given state is governed by a policy, which assigns a selection probability to each action. Since the goal in the TSP is usually to minimize the tour length, the reward can be set to the negative of the distance between two nodes, and the training goal is then to maximize the total reward along the path.
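The TSP example above can be sketched as a minimal RL environment. This is an illustrative sketch only (the post gives no code); the class name `TSPEnv` and the `reset`/`step` interface are assumptions, loosely following the common Gym-style convention. State = current city, action = next unvisited city, reward = negative travel distance, so maximizing total reward minimizes tour length.

```python
import math

class TSPEnv:
    def __init__(self, coords):
        self.coords = coords          # list of (x, y) customer locations

    def reset(self):
        self.current = 0              # agent starts at city 0
        self.unvisited = set(range(1, len(self.coords)))
        return self.current           # initial state

    def step(self, action):
        # Reward is the negative travel distance, so maximizing the
        # total reward minimizes the total tour length.
        (x1, y1), (x2, y2) = self.coords[self.current], self.coords[action]
        reward = -math.hypot(x2 - x1, y2 - y1)
        self.current = action
        self.unvisited.discard(action)
        done = not self.unvisited     # episode ends when all cities are visited
        return self.current, reward, done

env = TSPEnv([(0, 0), (3, 4), (3, 0)])
state = env.reset()
state, reward, done = env.step(1)     # travel from city 0 to city 1
print(reward)                         # -5.0 (distance from (0,0) to (3,4))
```

Each call to `step` plays one round of the state/action/reward loop described above.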

II. The structure of reinforcement learning

        Let us briefly sort out the basic elements of reinforcement learning and the relationships between them. Reworking the commonly seen diagram, the structure of reinforcement learning can be divided into three layers:

Layer 1:

                agent: the subject that performs actions

                environment: the environment of reinforcement learning

                goal: the objective of reinforcement learning

                Reinforcement learning is the process by which an agent learns to achieve a goal through interaction with its environment.

Layer 2:

                state: the agent's current state

                action: the action/behavior performed

                reward: the real-time reward received for performing an action

                The cycle of states and actions forms the main body of reinforcement learning.

Note that reward and goal are not the same concept: a reward is the real-time payoff obtained after performing a particular action, while the goal is the ultimate objective of reinforcement learning (usually, to maximize the sum of rewards). The goal determines how the reward is designed.
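The state/action/reward cycle and the reward-versus-goal distinction can be sketched as a generic episode loop. The `reset`/`step` interface and the toy `WalkEnv` below are assumptions for illustration, not from the post; the point is that the quantity being maximized is the *sum* of real-time rewards, not any single one.

```python
def run_episode(env, policy):
    state = env.reset()
    total_reward = 0.0                # the goal is to maximize this sum,
    done = False                      # not any single real-time reward
    while not done:
        action = policy(state)        # agent picks an action from the state
        state, reward, done = env.step(action)
        total_reward += reward        # accumulate real-time rewards
    return total_reward

# A toy 1-D environment: the agent walks right until it reaches position 3.
class WalkEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action            # action is +1 (step right) here
        reward = -1.0                 # each step costs 1, so fewer steps = better
        done = self.pos >= 3
        return self.pos, reward, done

print(run_episode(WalkEnv(), lambda s: 1))   # -3.0: three steps of reward -1
```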

Layer 3:

        The core elements: two functions, the value function (Value function) and the policy function (Policy function). The next section covers them in detail.

III. Value functions

1) Policy function:

       The policy decides which action should be chosen in a given state; in other words, the state is the policy's input and the action is its output. The policy assigns a probability to each action, e.g. π(a1|s1) = 0.3 means that in state s1 the probability of choosing action a1 is 0.3. The policy depends only on the current state, not on earlier states, so the whole process is a Markov decision process. The core of reinforcement learning, and its training objective, is to choose a suitable policy π that maximizes the sum of rewards Σ R.
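A stochastic policy of the kind just described can be sketched as a probability table plus a sampler. The state/action names and probabilities below are illustrative, not from the post; only the π(a1|s1) = 0.3 figure is taken from the text.

```python
import random

# For each state, the policy assigns a probability to every action,
# e.g. pi(a1|s1) = 0.3 as in the text.
policy = {
    "s1": {"a1": 0.3, "a2": 0.7},     # probabilities must sum to 1 per state
    "s2": {"a1": 0.9, "a2": 0.1},
}

def sample_action(state):
    # Draw one action according to the policy's probabilities; the choice
    # depends only on the current state (the Markov property).
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))            # "a1" roughly 30% of the time
```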

2) Value function:

         There are two kinds of value functions: the state value function V (state value function) and the state-action value function Q (state-action value function). A Q-value evaluates an action: it is the expected total reward the agent will receive after taking that action until the final state. A V-value evaluates a state: it is the expected total reward from that state to the final state. The higher the value, the higher the average reward obtainable from the current state to a final state, so the agent prefers actions with high value.

        In general, the state value function V is defined with respect to a specific policy, because computing the expected reward depends on the probability of choosing each action. The state-action value function Q appears, on the surface, to be independent of the policy: it depends on the state-transition probabilities, and in reinforcement learning the transition function is generally fixed. Note that Q-values and V-values can be converted into each other.

3) Conversion between Q and V

         The V-value of a state is the expectation, under the policy, of the Q-values of all actions available in that state. Expressed as:

                V_π(s) = Σ_a π(a|s) · Q_π(s, a)

          The Q-value of an action is the real-time reward plus the expectation of the V-values of the new states s' reached after the action is executed (γ is the discount factor). Expressed as:

                Q_π(s, a) = R(s, a) + γ · Σ_{s'} P(s'|s, a) · V_π(s')

        In summary, the value of a state s is the expectation of the Q-values of its child action nodes, and the value of an action a is the expectation of the V-values of its child state nodes. Under different policies, the computed expectations Q and V differ, so the value function can be used to evaluate the quality of a policy.
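The two conversion formulas can be checked numerically. All numbers below are toy values for illustration (they are not from the post), and γ = 0.9 is an assumed discount factor.

```python
gamma = 0.9                           # assumed discount factor

# V(s) = sum over actions a of pi(a|s) * Q(s, a)
def v_from_q(pi_s, q_s):
    return sum(pi_s[a] * q_s[a] for a in pi_s)

# Q(s, a) = R(s, a) + gamma * sum over s' of P(s'|s, a) * V(s')
def q_from_v(reward, trans_probs, v):
    return reward + gamma * sum(p * v[s2] for s2, p in trans_probs.items())

pi_s = {"a1": 0.3, "a2": 0.7}         # policy probabilities in state s
q_s = {"a1": 10.0, "a2": 2.0}         # Q-values of the two actions
print(v_from_q(pi_s, q_s))            # 0.3*10 + 0.7*2 = 4.4

v = {"s1": 4.4, "s2": 1.0}            # V-values of successor states
print(q_from_v(1.0, {"s1": 0.5, "s2": 0.5}, v))  # 1 + 0.9*(0.5*4.4 + 0.5*1.0)
```

Changing the policy probabilities in `pi_s` changes the computed V, which is exactly why the value function can rank policies.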

4) Q-value update: the Bellman equation

         The reinforcement learning process is a process of continually updating Q-values, via the Bellman equation:

                Q(s, a) ← Q(s, a) + α · [ R + γ · max_{a'} Q(s', a') - Q(s, a) ]

            Here R + γ · max_{a'} Q(s', a') stands for the "realistic" (target) value of Q; γ is the discount factor, lying in [0, 1), which expresses how far-sighted the model is (the smaller γ is, the more the present reward matters relative to the future); α is the learning rate; Q(s, a) on the right-hand side is the current estimate of Q. The updated Q equals the estimate plus the learning rate α times the difference between the realistic value and the estimate.
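This update rule can be sketched as one tabular Q-learning step. The values and state/action names are toy illustrations, not from the post; α and γ are assumed.

```python
from collections import defaultdict

alpha = 0.1                           # learning rate
gamma = 0.9                           # discount factor, in [0, 1)
Q = defaultdict(float)                # Q-table, unseen entries default to 0.0

def q_update(s, a, r, s_next, actions_next):
    # "Realistic" value: real-time reward plus the discounted best Q-value
    # available in the next state.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions_next)
    # Move the old estimate toward the target by a fraction alpha.
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_update("s1", "a1", r=5.0, s_next="s2", actions_next=["a1", "a2"])
print(Q[("s1", "a1")])                # 0 + 0.1 * (5.0 + 0.9*0 - 0) = 0.5
```

Repeating such updates over many interactions makes the estimates converge toward the realistic values.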

IV. Characteristics of reinforcement learning:

1. Trial-and-error learning

       The agent learns from trial and error, with good behavior receiving higher rewards.

2. Delayed reward

      When choosing an action, the agent does not look at the real-time reward alone, but at Q, the expected total reward from that action to the final state, and picks the action with the highest Q-value.
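The delayed-reward idea can be made concrete by computing a discounted return over a whole reward sequence. The reward sequences and γ = 0.9 below are toy assumptions for illustration.

```python
gamma = 0.9

def discounted_return(rewards):
    # Work backwards through time: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# An action with no immediate reward can still be best overall:
print(discounted_return([0.0, 0.0, 10.0]))   # 0 + 0.9*0 + 0.81*10 ≈ 8.1
print(discounted_return([1.0, 0.0, 0.0]))    # only 1.0
```

Judged by real-time reward alone the second sequence looks better; judged by total return, the first wins, which is what "delayed reward" means.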

V. Advantages of reinforcement learning:

1. For problems that are hard to model, or modeled inaccurately, RL can learn the optimal policy through the agent's continuous interaction with the environment;

2. For high-dimensional problems that traditional methods struggle with, RL provides approximation algorithms, including value-function approximation and direct policy search;

3. For dynamic and stochastic problems that are otherwise hard to solve, RL can incorporate random factors into the agent-environment interaction and the state transitions;

4. Compared with supervised learning, RL overcomes the need for large annotated data sets, solves quickly, and is suitable for real large-scale problems;

5. Learning is transferable, generalization ability is strong, and RL is robust to unknowns and disturbances.

         For the OR (operations research) field, many combinatorial optimization problems (TSP, VRP, MVC, etc.) can be cast as sequential decision / Markov decision problems, and "action selection" in reinforcement learning has characteristics similar to the natural moves of these problems. Moreover, RL's "offline training, online solving" paradigm makes online, real-time solution of combinatorial optimization possible. In recent years many new RL methods for solving combinatorial optimization problems have emerged, offering a new perspective for operations research optimization and becoming a major research hotspot in the OR field.

         Limited by my own level, the content above is a summary, based on my own understanding, of what I learned from the literature, Zhihu, and Bilibili videos. Corrections are welcome!

Original source: https://yzsam.com/2022/187/202207060916545430.html

Copyright notice: this article was created by [zhugby]; please include a link to the original when reposting.