Reinforcement learning series (I): basic principles and concepts
2022-07-06 13:41:00 【zhugby】
Catalog
I. What is reinforcement learning?
II. Structure of reinforcement learning
III. Value functions
1) Policy function
2) Value function
3) Conversion between Q and V
4) Q-value update: the Bellman equation
IV. Characteristics of reinforcement learning
V. Advantages of reinforcement learning
I. What is reinforcement learning?
Reinforcement learning has been very popular in academia in recent years, and most people have heard the term at least in passing. So what is reinforcement learning? Reinforcement learning is a branch of machine learning in which an agent learns to achieve a goal through interaction with an environment. Take the TSP as an example: the agent is a traveling salesman who observes the environment (the map of customer locations to visit) and, based on the current state (where he is now), takes an action (the next node to visit). Each action changes the environment, the agent receives a new state, and then chooses the next action, and so on. The basis for choosing an action in the current state is the policy, which assigns a selection probability to each action. Since the goal of the TSP is usually to minimize the total path length, the reward can be set to the negative of the distance between two nodes, and the training goal is to maximize the total reward along the path.
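As a rough illustration of this formulation (a minimal sketch for this post, not the author's code; the class name TSPEnv, the random coordinates, and the random policy are all assumptions), the Python snippet below models the TSP interaction loop with state = current city plus unvisited set, action = next city, reward = negative travel distance:

import math
import random

class TSPEnv:
    """Minimal TSP 'environment': state = (current city, unvisited cities),
    action = next city to visit, reward = negative travel distance."""

    def __init__(self, coords):
        self.coords = coords            # list of (x, y) customer locations
        self.reset()

    def reset(self):
        self.current = 0                # start at city 0
        self.unvisited = set(range(1, len(self.coords)))
        return self.current, frozenset(self.unvisited)

    def step(self, action):
        # reward is the negative distance, so maximizing total reward
        # means minimizing total tour length
        (x1, y1), (x2, y2) = self.coords[self.current], self.coords[action]
        reward = -math.hypot(x2 - x1, y2 - y1)
        self.current = action
        self.unvisited.discard(action)
        done = not self.unvisited
        return (self.current, frozenset(self.unvisited)), reward, done

# a trivial policy for demonstration: always move to a random unvisited city
random.seed(0)
env = TSPEnv([(random.random(), random.random()) for _ in range(6)])
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice(list(env.unvisited))
    state, reward, done = env.step(action)
    total_reward += reward
print("total reward (negative tour length):", round(total_reward, 3))

A trained policy would replace the random choice above; the environment and reward definition stay the same.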

II. Structure of reinforcement learning
Let's briefly sort out the basic elements of reinforcement learning and how they relate. Rather than the usual agent-environment diagram, the structure of reinforcement learning can be divided into three layers as follows:

First layer:
agent: the subject that takes the actions
environment: the environment the agent interacts with
goal: the objective of reinforcement learning
Reinforcement learning is the process in which the agent learns to achieve a goal while interacting with the environment.
Second layer:
state: the agent's current state
action: the action/behavior performed
reward: the immediate reward received for performing an action
The loop of states and actions constitutes the main body of reinforcement learning.
Note that reward and goal are not the same concept: the reward is the immediate feedback obtained after performing a particular action, while the goal is the ultimate objective of reinforcement learning (generally, maximizing the sum of rewards); it is the goal that determines how the reward is designed.
Third layer:
This is also the core layer, made up of two functions: the value function (Value function) and the policy function (Policy function). The next section covers them in detail.
III. Value functions
1) Policy function:
The policy decides which action should be selected in a given state; in other words, the state is the input to the policy and the action is its output. The policy assigns a probability to each action, e.g. π(a1|s1) = 0.3 means that in state s1 the probability of choosing action a1 is 0.3. The policy depends only on the current state, not on earlier states, so the whole process is a Markov decision process. The core of reinforcement learning, and its training objective, is to find an appropriate policy that maximizes the sum of rewards.
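As a toy illustration (the states, actions, and probabilities below are made up for this example, not taken from the post), a stochastic policy can be stored as a table mapping each state to action probabilities and then sampled from:

import random

# hypothetical policy table: pi[state][action] = selection probability
pi = {
    "s1": {"a1": 0.3, "a2": 0.7},
    "s2": {"a1": 0.9, "a2": 0.1},
}

def sample_action(state):
    """Sample an action according to pi(a|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

random.seed(0)
print(sample_action("s1"))  # "a1" with probability 0.3, "a2" with probability 0.7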
2) Value function:
There are two kinds of value functions: the state value function V (state value function) and the state-action value function Q (state-action value function). The Q value evaluates the value of an action: it is the expected total reward the agent obtains from taking that action until the final state. The V value evaluates the value of a state: it is the expected total reward from that state until the final state. The higher the value, the higher the average reward obtainable from the current state to the final state, so the agent prefers actions with high value.
Generally speaking, the state value function V is defined with respect to a specific policy, because computing the expected reward depends on the probability of choosing each action. The state-action value function Q appears, on the surface, to be independent of the policy; it depends on the state transition probabilities, and in reinforcement learning the state transition function is generally fixed. Note that Q values and V values can be converted into each other.
3) Conversion between Q and V
The V value of a state is the expectation, under the policy, of the Q values of all actions available in that state, expressed as:

V_\pi(s) = \sum_a \pi(a \mid s) \, Q_\pi(s, a)

The Q value of an action is the immediate reward plus the (discounted) expectation of the V values of the new states reached after the action is executed, expressed as:

Q_\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_\pi(s')
To sum up, the value of a state s is the expectation of the Q values of its child actions, and the value of an action a is the expectation of the V values of its child states (plus the immediate reward). Under different policies the computed expectations of Q and V differ, so the value function can be used to evaluate how good a policy is.
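To make the conversion concrete, here is a small sketch in which the numbers for π, R, P, γ, Q, and V are all assumptions chosen for illustration; it computes V from Q under a policy, and Q from V using the transition probabilities:

gamma = 0.9  # assumed discount factor

# hypothetical policy, Q/V values, rewards and transition probabilities
pi = {"s1": {"a1": 0.3, "a2": 0.7}}                  # pi(a|s)
Q  = {("s1", "a1"): 1.0, ("s1", "a2"): 2.0}          # Q(s, a)
R  = {("s1", "a1"): 0.5}                             # immediate reward R(s, a)
P  = {("s1", "a1"): {"s1": 0.2, "s2": 0.8}}          # P(s'|s, a)
V  = {"s1": 1.7, "s2": 3.0}                          # V(s')

# V(s) = sum_a pi(a|s) * Q(s, a)
v_s1 = sum(prob * Q[("s1", a)] for a, prob in pi["s1"].items())

# Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s')
q_s1_a1 = R[("s1", "a1")] + gamma * sum(p * V[s2] for s2, p in P[("s1", "a1")].items())

print(v_s1)     # 0.3*1.0 + 0.7*2.0 = 1.7
print(q_s1_a1)  # 0.5 + 0.9*(0.2*1.7 + 0.8*3.0) = 2.966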
4) Q-value update: the Bellman equation
The process of reinforcement learning is the process of continually updating the Q values, i.e. the Bellman equation (written here in its Q-learning update form):

Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]

where r + \gamma \max_{a'} Q(s', a') stands for the "realistic" (target) value of Q; \gamma is the discount factor, located in [0, 1), which indicates how far-sighted the model is (the smaller \gamma is, the more the present reward matters relative to the future); \alpha is the learning rate; and Q(s, a) is the current estimate of Q. The updated Q equals the Q estimate plus the learning rate multiplied by the difference between the realistic value and the estimated value.
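A minimal tabular sketch of this update (the state/action names and parameter values are assumptions for illustration, not from the post):

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the target
    r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(Q.get((next_state, a), 0.0) for a in actions)
    estimate = Q.get((state, action), 0.0)
    Q[(state, action)] = estimate + alpha * (target - estimate)

# toy usage with hypothetical states and actions
Q = {}
q_learning_update(Q, "s1", "a1", reward=1.0, next_state="s2", actions=["a1", "a2"])
print(Q)  # {('s1', 'a1'): 0.1}, i.e. 0 + 0.1 * (1.0 + 0.9*0 - 0)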
IV. Characteristics of reinforcement learning
1. Trial and error (trial-and-error learning)
The agent learns from trial and error; good behavior receives higher rewards.
2. Delayed reward
When choosing an action, the agent does not look only at the immediate reward; instead it considers Q, the expected total reward from that action to the final state, and chooses the action with the highest Q value.
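A minimal sketch of this idea, greedy selection over assumed Q values (the numbers are hypothetical): action a2 may bring a lower immediate reward, but its expected total (delayed) reward is higher, so it is chosen.

Q = {("s1", "a1"): 1.2, ("s1", "a2"): 3.4}

def greedy_action(state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action("s1", ["a1", "a2"]))  # "a2"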
V. Advantages of reinforcement learning
1. For problems that are hard to model or whose models are inaccurate, RL can learn the optimal policy through the agent's continuous interaction with the environment;
2. For high-dimensional problems that traditional methods struggle with, RL provides approximation algorithms, including value-function approximation and direct policy search;
3. For dynamic and stochastic problems that are hard to solve, RL can incorporate random factors into the agent-environment interaction and the state transitions;
4. Compared with supervised learning, it removes the need for large labeled data sets and solves quickly, making it suitable for real, large-scale problems;
5. The learned policy is transferable and generalizes well; it is robust to unknowns and disturbances.
For the OR field, many combinatorial optimization problems (TSP, VRP, MVC, etc.) can be cast as sequential decision / Markov decision problems, which are naturally similar in character to "action selection" in reinforcement learning. Moreover, RL's "train offline, solve online" paradigm makes online, real-time solution of combinatorial optimization possible. In recent years many new methods for solving combinatorial optimization problems with reinforcement learning have emerged, providing a new perspective for operations research and optimization and becoming a major research hotspot in the OR field.
Limited by my own level, the content above is a summary based on my own understanding after going through the literature, Zhihu, and Bilibili videos. Corrections are welcome!