Curiosity mechanism in reinforcement learning
2022-06-27 09:27:00 【Drunk heart】
Reference links:
https://www.leiphone.com/category/ai/TmJCRLNVeuXoh2mv.html
https://tianjuewudi.gitee.io/2021/12/02/qiang-hua-xue-xi-zhong-de-hao-qi-xin-jiang-li-ji-zhi/#!
https://cloud.tencent.com/developer/news/339809
https://cloud.tencent.com/developer/article/1367441
https://blog.csdn.net/triplemeng/article/details/84912694
https://blog.csdn.net/wzduang/article/details/112533271
Reference papers:
Curiosity-driven Exploration by Self-supervised Prediction
Large-Scale Study of Curiosity-Driven Learning
Curiosity-driven Exploration for Mapless Navigation with Deep Reinforcement Learning
1. Background
In recent years we have seen a great deal of innovation in deep reinforcement learning: from DeepMind's deep Q-learning framework in 2014 to OpenAI Five, the Dota 2-playing bots released by OpenAI in 2018, we live in an exciting and promising era.
Today we will look at one of the most exciting and promising strategies in deep reinforcement learning: curiosity-driven learning.
Reinforcement learning is built on reward mechanisms: every goal can be described as maximizing reward. The problem is that external rewards (the rewards given by the environment) are hand-coded functions designed by humans, and this does not scale.
The idea of curiosity-driven learning is to build agents with an intrinsic reward function (a reward generated by the agent itself). Such an agent becomes a self-learner: it is both the student and its own source of feedback.
2. Several practical cases
Imagine you are trying to learn how to find the item you want to buy in a maze-like supermarket. You look around but just cannot find it. With no "carrot" and no "stick" at each step, that is, with no feedback at all, you may doubt at every step whether you are heading in the right direction. How do you avoid walking in circles? You can rely on your curiosity. While wandering through the supermarket you keep making predictions: "I am in the meat section now, so I guess the area around the corner is the seafood section." If the prediction turns out to be wrong, you are surprised ("oh, I didn't expect that") and receive a satisfying curiosity reward. That makes you more willing to explore new places in the future, just to see whether your expectations match reality.
When searching for something in a maze, you only get a reward when you actually find it; the rest of the time is spent in long stretches of exploration. To keep the agent from wandering aimlessly through the maze, we can make the reward denser by defining an additional reward function.
On a sunny weekend afternoon, a three-year-old child, for whom the usual motivations in life (college, a job, a house, a family, and so on) are still far out of reach, can nevertheless play happily on a playground. As a human agent, her behavior is driven by curiosity, what psychologists call intrinsic motivation.
3. Why the curiosity mechanism was born
At present, reinforcement learning faces two main problems:
- Sparse rewards: there is a large time gap between an action and its feedback (reward). If every action were rewarded, the agent would learn quickly thanks to the immediate feedback. Conversely, when rewards are sparse, it is difficult for the agent to learn anything from them.
- Designing dense, well-defined external rewards is hard, and such hand-crafted rewards usually do not scale.
4. Curiosity mechanism
The curiosity mechanism encourages the agent to take actions that reduce the uncertainty in its ability to predict the consequences of its own behavior. Curiosity is essentially an intrinsic reward: the error the agent makes when predicting the consequence of its own action in the current state. Measuring this error requires an environment dynamics model that predicts the next state given the current state and action. The agent's curiosity reward is produced by an Intrinsic Curiosity Module (ICM).
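To make the idea concrete, here is a minimal sketch of its naive form, assuming a PyTorch setup with a small vector observation and discrete actions; the class and function names are illustrative, and this is not the ICM itself (that follows below). A learned dynamics model predicts the next state, and its prediction error is used as the intrinsic reward.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and a one-hot encoded action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1))

def curiosity_reward(model, state, action_onehot, next_state, eta=1.0):
    """Intrinsic reward = prediction error of the dynamics model on the real next state."""
    with torch.no_grad():
        pred_next = model(state, action_onehot)
    return eta * 0.5 * (pred_next - next_state).pow(2).sum(dim=-1)
```

Predicting raw observations like this is exactly what the feature-space construction below improves on: in rich environments the raw prediction error is dominated by details the agent neither controls nor is affected by.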
A good feature space allows the agent to predict the next state from the current state and action. Curiosity is defined as the error between the next state $s_{t+1}$ predicted from the current state $s_t$ and action $a_t$, and the actual next state $s_{t+1}^*$. Because the external environment contains random noise, the agent may treat uncontrollable environmental noise as a source of curiosity, which distorts its exploration behavior. We therefore need to construct a good feature space according to the following three rules:
- model the things the agent can control;
- model the things the agent cannot control but that can affect the agent;
- do not model the things the agent can neither control nor be affected by.
A compact, stable feature space that retains sufficient information about the observations is used to build the ICM module, so that an agent equipped with the ICM can filter out what it can neither control nor be affected by, while capturing everything that can affect it. The ICM consists of two neural networks: an inverse model that learns the feature representations of the current state and the next state, and a forward dynamics model that produces the predicted feature representation of the next state.
![figure](https://gitee.com/zuiyixin/blog-graph/raw/master/202206011124497.png)
Following the feature-space rules above, we only want to predict those changes in the environment that could be caused by the agent's actions or that could affect the agent, and ignore the rest. That means we do not make predictions in the raw sensory space; instead, we transform the sensory input into a feature vector that represents only the information relevant to the actions performed by the agent.
To learn this feature space we use self-supervision: a neural network is trained on the agent's inverse dynamics task, predicting the agent's action $\hat{a}_t$ given the current state $s_t$ and the next state $s_{t+1}$.
![figure](https://gitee.com/zuiyixin/blog-graph/raw/master/202206012103026.png)
We then use this feature space to train a forward dynamics model: given the feature representation of the current state $\phi(s_t)$ and the action $a_t$, the model predicts the feature representation of the next state.
The prediction error of the forward dynamics model (the difference between the predicted and the actual feature representation of the next state) is given to the agent as an intrinsic reward that encourages its curiosity.
$$\text{Curiosity} = \phi_{predict}(s_{t+1}) - \phi(s_{t+1})$$
So, in the ICM we have two models:
1. Inverse model (blue): encodes the states $s_t$ and $s_{t+1}$ into feature vectors $\phi(s_t)$ and $\phi(s_{t+1})$ that are trained to predict the action $a_t$.
$$\hat{a}_t = g(s_t, s_{t+1}; \theta_I)$$
where $\hat{a}_t$ is the predicted action, $g$ is the learned function of the inverse model, and $\theta_I$ are the parameters of the inverse model. The corresponding inverse loss is:
$$\min_{\theta_I} L_I(\hat{a}_t, a_t)$$
2. Forward model (red): takes $\phi(s_t)$ and $a_t$ as input and predicts the feature representation of the next state, $\hat{\phi}(s_{t+1})$.
$$\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t; \theta_F)$$
where $\hat{\phi}(s_{t+1})$ is the predicted feature representation of the next state, $f$ is the learned function of the forward model, and $\theta_F$ are the parameters of the forward model. The corresponding forward loss is:
$$L_F\big(\phi(s_t), \hat{\phi}(s_{t+1})\big) = \frac{1}{2}\big\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\big\|_2^2$$
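Putting the two models together, a minimal PyTorch sketch of an ICM-style module might look as follows. The architecture, layer sizes, and stop-gradient choices are illustrative assumptions rather than the exact implementation from the paper, and discrete actions are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    """Intrinsic Curiosity Module sketch: encoder phi, inverse model g, forward model f."""
    def __init__(self, state_dim, n_actions, feat_dim=64, hidden=128):
        super().__init__()
        self.n_actions = n_actions
        self.encoder = nn.Sequential(                    # phi(s)
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        self.inverse = nn.Sequential(                    # g(phi(s_t), phi(s_{t+1})) -> action logits
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
        self.forward_model = nn.Sequential(              # f(phi(s_t), a_t) -> predicted phi(s_{t+1})
            nn.Linear(feat_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, s_t, a_t, s_next):
        # a_t: LongTensor of action indices, shape (batch,)
        phi_t, phi_next = self.encoder(s_t), self.encoder(s_next)
        a_onehot = F.one_hot(a_t, self.n_actions).float()

        # Inverse model: predict a_t from the two feature vectors; L_I is a cross-entropy loss.
        logits = self.inverse(torch.cat([phi_t, phi_next], dim=-1))
        L_I = F.cross_entropy(logits, a_t)

        # Forward model: predict phi(s_{t+1}). The stop-gradients keep the forward loss
        # from shaping the encoder (a common choice; implementations vary on this detail).
        phi_pred = self.forward_model(torch.cat([phi_t.detach(), a_onehot], dim=-1))
        L_F = 0.5 * (phi_pred - phi_next.detach()).pow(2).sum(dim=-1).mean()

        # Intrinsic reward r_t^i = eta/2 * ||phi_pred - phi(s_{t+1})||^2 (eta applied by the caller).
        r_int = 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1).detach()
        return r_int, L_I, L_F
```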
Mathematically, then, curiosity is the difference between the predicted feature vector of the next state and the actual feature vector of the next state:
$$r_t^i = \frac{\eta}{2}\big\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\big\|_2^2$$
where $r_t^i$ is the intrinsic reward at time step $t$, $\eta$ is a scaling factor, and $\hat{\phi}(s_{t+1}) - \phi(s_{t+1})$ is the difference between the predicted and the actual next-state features. Finally, the overall optimization problem of the module combines the policy objective with the inverse and forward losses:
$$\min_{\theta_P, \theta_I, \theta_F}\left[-\lambda\,\mathbb{E}_{\pi(s_t;\theta_P)}\Big[\sum_t r_t\Big] + (1-\beta)L_I + \beta L_F\right]$$
where $\theta_P$, $\theta_I$, $\theta_F$ are the parameters of the policy, the inverse model, and the forward model respectively; $\lambda$ weighs the importance of the policy gradient loss against that of learning the intrinsic reward signal; $\beta$ is a scalar between 0 and 1 that balances the inverse-model loss against the forward-model loss; $L_I$ is the inverse-model loss and $L_F$ is the forward-model loss.
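A joint training step that assembles this objective could look roughly like the sketch below, reusing the hypothetical `ICM` class above. `policy_loss_fn` is a placeholder for whatever loss the underlying RL algorithm (A3C, PPO, ...) computes from the curiosity-augmented rewards; it stands in for the $-\mathbb{E}_{\pi}[\sum_t r_t]$ term.

```python
def curiosity_training_step(icm, optimizer, policy_loss_fn,
                            states, actions, next_states, extrinsic_rewards,
                            lam=0.1, beta=0.2):
    """One joint update of policy and ICM parameters (lam/beta values are just examples)."""
    r_int, L_I, L_F = icm(states, actions, next_states)
    rewards = extrinsic_rewards + r_int             # dense reward = sparse extrinsic + curiosity
    policy_loss = policy_loss_fn(rewards)           # placeholder for the RL algorithm's loss

    total_loss = lam * policy_loss + (1.0 - beta) * L_I + beta * L_F
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```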
Let's recap:
- Because hand-crafted external rewards do not scale and are often sparse, we want the agent to generate intrinsic rewards.
- To do this we create curiosity: the error the agent makes when predicting the result of its own action in its current state.
- Curiosity pushes the agent toward transitions with high prediction error (which is higher in areas the agent has spent little time in, or in areas with complex dynamics), and therefore toward better exploration of the environment.
- But because predicting the next raw frame is too hard, we instead predict the next state in **a better feature representation that keeps only the elements the agent can control or be affected by.** In practice, it pays to design this feature space carefully for the task at hand.
- To generate curiosity, we use an Intrinsic Curiosity Module consisting of two models: an inverse model that learns the feature representations of the state and the next state, and a forward model that produces the predicted feature representation of the next state.
- Curiosity equals the difference between the forward model's predicted next-state features and the actual next-state features.
5. Related knowledge
5.1 Random Network Distillation
Random Network Distillation, abbreviated RND, shares the same goal as the curiosity approach: it also tackles sparse rewards by giving the agent an intrinsic exploration bonus. The core of both methods is to give high rewards to states that have not yet been explored; they differ in how they judge whether a state has been explored.
Unlike the curiosity module, RND uses two networks with identical architectures. Each takes a state $s_t$ as input and produces a feature vector, effectively performing feature extraction. One network, the target network, is randomly initialized and its parameters are never updated afterwards. The other, the predictor network, is continuously trained and adjusted; its learning objective is that, for the same state $s_t$, its output should be as close as possible to the output of the target network.
Therefore, the larger the discrepancy between the outputs of the two networks, the stronger the evidence that this state has not been explored before, and the larger the reward we should give. Conversely, when the two outputs agree, the state has been encountered many times and learned well, so a smaller reward should be given. The reward function is:
$$r_t^i = \big\|\hat{f}(s_{t+1}) - f(s_{t+1})\big\|_2^2$$
where $f$ is the output of the target network and $\hat{f}$ is the predictor's output.
![figure](https://gitee.com/zuiyixin/blog-graph/raw/master/202206012215841.png)
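A minimal RND sketch (in PyTorch, with illustrative MLPs; the original work uses convolutional networks over image observations and normalizes both observations and bonuses) might look like this:

```python
import torch
import torch.nn as nn

def make_embedding_net(obs_dim, out_dim=64, hidden=128):
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class RND(nn.Module):
    """Random Network Distillation: a frozen random target network and a trained predictor;
    the prediction error on the next observation is the exploration bonus."""
    def __init__(self, obs_dim, out_dim=64):
        super().__init__()
        self.target = make_embedding_net(obs_dim, out_dim)
        self.predictor = make_embedding_net(obs_dim, out_dim)
        for p in self.target.parameters():              # the target is never updated
            p.requires_grad_(False)

    def forward(self, next_obs):
        with torch.no_grad():
            f_target = self.target(next_obs)            # f(s_{t+1})
        f_pred = self.predictor(next_obs)               # f_hat(s_{t+1})
        bonus = (f_pred - f_target).pow(2).sum(dim=-1).detach()   # r_t^i per sample
        loss = (f_pred - f_target).pow(2).sum(dim=-1).mean()      # trains the predictor only
        return bonus, loss
```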
5.2 Surprise-based curiosity
The curiosity described so far is surprise-based: the ICM method builds a predictive model of the world's dynamics and rewards the agent when the model fails to make a good prediction, a signal that marks "surprise" or "novelty". Note that exploring places never visited before is not, by itself, a direct component of the ICM curiosity mechanism.
For ICM, visiting new places is merely one way of collecting more "surprise"; the actual objective is to maximize the total reward obtained. It turns out that in some environments there are other ways to generate "self-surprise", and they lead to unintended behavior.
In "Large-Scale Study of Curiosity-Driven Learning", the authors of the ICM method, together with researchers from OpenAI, showed a potential risk of surprise-maximizing reinforcement learning: the agent can learn to indulge in procrastination and never do anything useful for the task at hand.
To understand why, consider a common thought experiment known as the "noisy TV problem". An agent is placed in a maze and tasked with finding a highly valuable item (much like the item in the supermarket example earlier in this article).
The environment also contains a TV, and the agent holds its remote control. There is a limited number of channels (each showing a different program), and every button press switches to a random channel. How will the agent behave in such an environment?
For surprise-based curiosity, changing the channel yields a large reward, because each change is unpredictable and surprising. Crucially, even after the agent has cycled through all available channels once, the channel that comes next is random, so every new change is still a surprise: the agent keeps predicting which program will appear after the press, the prediction is likely to be wrong, and surprise follows.
Even if the agent has already seen every program on every channel, the random switching remains unpredictable. An agent that keeps harvesting such surprise will therefore end up staying in front of the TV forever instead of looking for the valuable item, which amounts to a kind of procrastination. So how should "curiosity" be defined to avoid this behavior?
5.3 Episodic curiosity
In "Episodic Curiosity through Reachability", we explored a memory-based "episodic curiosity" model, and it turns out that this model is much less prone to the "self-indulgent" instant gratification described above. Why?
Take the same experiment as an example: after the agent has been switching TV channels for a while, every program will eventually end up in its memory. The TV then loses its appeal: even though the sequence of programs on the screen is random and unpredictable, all of those programs are already in memory.
**This is the key difference from the surprise-based methods above: our method does not even try to predict the future.** Instead, the agent examines its past experience to check whether it has already seen something like the current observation. It is therefore not lured by the "instant gratification" offered by the noisy TV and has to explore the world beyond the TV to collect more reward.
How do we judge whether what the agent currently sees matches something already in memory? Checking for an exact match is fairly pointless: in the real world identical scenes are extremely rare. For example, even if the agent returns to the same room, it will observe the room from an angle different from the scene stored in memory.
Instead of checking for an exact match in the agent's memory, we use a trained deep neural network to measure the similarity between two experiences. To train this network, we have it guess whether two observations were made close together in time; observations that are close in time are likely to belong to different parts of the same experience of the agent.
![figure](https://gitee.com/zuiyixin/blog-graph/raw/master/202206012229034.jpeg)
Whether an observation counts as new or old is decided by a "reachability" graph. In practice this graph is not available, so we train a neural network estimator to estimate the number of steps separating two observations.
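A rough sketch of this idea, assuming a simple PyTorch comparator trained on observation pairs labeled by their temporal distance; the actual Episodic Curiosity method additionally maintains an episodic memory buffer with its own training and bookkeeping details, and all names below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReachabilityNet(nn.Module):
    """Predicts the probability that two observations are 'close' (within k steps)."""
    def __init__(self, obs_dim, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, emb_dim))
        self.compare = nn.Sequential(nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))

    def forward(self, obs_a, obs_b):
        pair = torch.cat([self.embed(obs_a), self.embed(obs_b)], dim=-1)
        return torch.sigmoid(self.compare(pair)).squeeze(-1)

def reachability_loss(net, obs_a, obs_b, step_gap, k=5):
    """Self-supervised labels: pairs from the same trajectory within k steps count as 'close'."""
    labels = (step_gap <= k).float()
    return F.binary_cross_entropy(net(obs_a, obs_b), labels)

def novelty_bonus(net, obs, memory, threshold=0.5, bonus=1.0):
    """Grant a bonus only if the observation is not reachable from anything in memory."""
    if len(memory) == 0:
        return bonus
    batch = obs.unsqueeze(0).expand(len(memory), -1)
    reachable = net(batch, torch.stack(memory))
    return bonus if reachable.max().item() < threshold else 0.0
```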