How to safely eat apples at the edge of a cliff? DeepMind & OpenAI give their answer with 3D safe reinforcement learning
2022-07-05 01:15:00 【QbitAl】
Xing Zao, from Aofeisi
QbitAI | Official account QbitAI
This time, DeepMind and OpenAI have jointly shown off their work on a safe reinforcement learning model.
They took ReQueST, a safe RL model built for two-dimensional settings, and extended it to more practical 3D scenes.
Note that ReQueST was originally used only for 2D tasks such as navigation and 2D racing, where it learns from safe trajectories provided by humans how to keep the agent from harming itself.

△ Figure: the original ReQueST 2D navigation task (avoid the red area) and racing task
In practical 3D environments, however, the problem is more complex. For example, a robot performing a task has to avoid obstacles as it works, and a self-driving car has to avoid driving into a ditch.
So the question is: can ReQueST, which was designed for 2D tasks, work in a complex 3D environment? And can the quality and quantity of the safe trajectory data that humans provide in a 3D environment meet the needs of training?
To answer these two questions, DeepMind and OpenAI came up with a more complex dynamics model and a reward model that incorporates human feedback, successfully migrating ReQueST to a 3D environment and taking a step toward real applications.
Safety improved as well: in the experiments, the number of unsafe behaviors by the agent dropped to one tenth of the baseline's.
How can we get an intuitive feel for this? Let's take a look in a simulated 3D environment.

In the scene above, there is a cliff on the upper left side of the room. The agent has to wait until the green lights on both sides of the room go out, then try to eat the three apples.
One of the apples sits behind a gate that opens only when a button is pressed.
In the video shown, the agent presses the button, opens the gate, and eats the locked-away apple, all in one smooth sequence.

Let's see how it does this.
How the 3D version of the safe reinforcement learning model is trained
Building on ReQueST, what DeepMind and OpenAI had to solve was a dynamics model and a reward model suited to 3D scenes.
Let's first look at the roles these two play in the overall process.
The figure below shows the new model's training process for the apple-eating task.

The light blue boxes represent the steps that involve the dynamics model. Starting from the top row, humans provide some safe trajectories that avoid the red danger areas.
The dynamics model is trained on these trajectories and is then used to generate some random trajectories.
Moving to the bottom row, humans go through these random trajectories and give feedback in the form of reward sketches. The reward sketches are then used to start training the reward model, and both models are continually optimized.
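The data flow in the figure can be summarized as a short Python sketch. It only illustrates the order of the steps: `run_pipeline` and the toy helpers (`toy_fit_dynamics`, `toy_sketch`, `toy_fit_reward`) are hypothetical names, observations are reduced to single floats, and the "models" are trivial stand-ins rather than the networks DeepMind and OpenAI used.

```python
import random
from typing import Callable, List, Sequence, Tuple

Trajectory = List[float]    # toy stand-in: one float per observation
RewardSketch = List[float]  # one human-assigned reward value per step


def run_pipeline(
    safe_trajectories: Sequence[Trajectory],
    fit_dynamics: Callable[[Sequence[Trajectory]], Callable[[int], Trajectory]],
    ask_human_for_sketch: Callable[[Trajectory], RewardSketch],
    fit_reward: Callable[[Sequence[Trajectory], Sequence[RewardSketch]], Callable[[float], float]],
    num_rollouts: int = 4,
) -> Tuple[Callable[[int], Trajectory], Callable[[float], float]]:
    # Top row: human-provided safe trajectories train the dynamics model.
    sample_rollout = fit_dynamics(safe_trajectories)
    # The trained dynamics model generates random trajectories.
    rollouts = [sample_rollout(seed) for seed in range(num_rollouts)]
    # Bottom row: a human annotates every generated trajectory with a reward sketch.
    sketches = [ask_human_for_sketch(t) for t in rollouts]
    # The reward sketches supervise the reward model.
    reward_fn = fit_reward(rollouts, sketches)
    return sample_rollout, reward_fn


# --- Toy stand-ins (all hypothetical) so the sketch can be executed ---

def toy_fit_dynamics(trajectories: Sequence[Trajectory]) -> Callable[[int], Trajectory]:
    flat = [x for t in trajectories for x in t]
    mean = sum(flat) / len(flat)

    def sample_rollout(seed: int) -> Trajectory:
        rng = random.Random(seed)  # seeded, so the same seed gives the same rollout
        return [mean + rng.uniform(-1.0, 1.0) for _ in range(5)]

    return sample_rollout


def toy_sketch(trajectory: Trajectory) -> RewardSketch:
    # Pretend the human rewards positive observations and penalizes the rest.
    return [1.0 if x > 0 else -1.0 for x in trajectory]


def toy_fit_reward(trajectories, sketches) -> Callable[[float], float]:
    pairs = [(x, r) for t, s in zip(trajectories, sketches) for x, r in zip(t, s)]

    def reward_fn(x: float) -> float:
        # Nearest-neighbour lookup over the labelled observations.
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return reward_fn


sample_rollout, reward_fn = run_pipeline(
    [[0.0, 0.5], [-0.5, 0.2]], toy_fit_dynamics, toy_sketch, toy_fit_reward
)
```

The article says both models keep being optimized; this sketch shows a single pass only, and repeating the call with fresh trajectories would be one way to approximate that loop.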
Next, let's look at each of these two models.
The dynamics model DeepMind and OpenAI used this time is an LSTM that predicts future image observations from the action sequence and past image observations.
The model is similar to the one in ReQueST, except that the encoder network and the deconvolutional decoder network are somewhat larger. It is trained with a mean squared error loss between the real image observations and the predicted ones.
Most importantly, this loss is computed over predictions several steps into the future from every step, so the dynamics model stays consistent over long rollouts.
The resulting training curves are shown in the figure below. The horizontal axis is the number of steps, the vertical axis is the loss, and curves of different colors correspond to trajectory counts of different orders of magnitude:

For the reward model, DeepMind and OpenAI trained an 11-layer residual convolutional network with 2.2 million parameters.
The input is a 96x72 RGB image, the output is a scalar reward prediction, and the loss is again mean squared error.
In this network, the reward sketches from human feedback also play a very important role.
A reward sketch is, simply put, a reward value assigned by hand.
As shown in the figure below, the upper part is the sketch given by a human. When an apple is present in the predicted observations in the lower part, the reward value is 1; if the apple fades out of view, the reward becomes -1.

These sketches are then used to tune the reward model network.
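To make the reward-sketch objective concrete, here is a minimal Python sketch of the mean squared error between a reward model's per-frame predictions and a human sketch. Everything in it is an illustrative assumption: `toy_reward_model` is a hypothetical stand-in for the 11-layer residual network, the frames are solid-color images, and treating 96 as the width and 72 as the height is a guess at the order of the quoted resolution.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[float, float, float]  # (r, g, b), each in [0, 1]
Frame = List[List[Pixel]]           # rows of pixels

WIDTH, HEIGHT = 96, 72  # resolution quoted above; width/height order is an assumption


def solid_frame(rgb: Pixel) -> Frame:
    """Build a HEIGHT x WIDTH frame filled with a single color."""
    return [[rgb for _ in range(WIDTH)] for _ in range(HEIGHT)]


def toy_reward_model(frame: Frame) -> float:
    # Hypothetical stand-in for the CNN: maps mean red intensity in [0, 1]
    # to a scalar reward in [-1, 1].
    red = sum(px[0] for row in frame for px in row) / (WIDTH * HEIGHT)
    return 2.0 * red - 1.0


def sketch_loss(
    frames: Sequence[Frame],
    sketch: Sequence[float],
    reward_model: Callable[[Frame], float],
) -> float:
    """Mean squared error between predicted rewards and the human reward sketch."""
    if not frames or len(frames) != len(sketch):
        raise ValueError("need one sketch value per frame")
    preds = [reward_model(f) for f in frames]
    return sum((p - r) ** 2 for p, r in zip(preds, sketch)) / len(sketch)


# Two frames: "apple in view" (all red) and "apple out of view" (all black).
frames = [solid_frame((1.0, 0.0, 0.0)), solid_frame((0.0, 0.0, 0.0))]
matching_loss = sketch_loss(frames, [1.0, -1.0], toy_reward_model)
flipped_loss = sketch_loss(frames, [-1.0, 1.0], toy_reward_model)
```

A sketch that agrees with the model gives zero loss, while a flipped sketch gives a large one. In training, that gap is the signal that would push the network toward the human's annotations.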
How effective is the 3D version of the safe reinforcement learning model
Next, let's see how the new model compares with other models and the baseline.
The results are shown in the figure below; the different difficulty levels correspond to different scene sizes.
The left side of the figure shows the number of times the agent fell off the cliff, and the right side shows the number of apples eaten.

Note that in the legend, ReQueST (ours) denotes the results when the training set also contains erroneous paths provided by humans.
ReQueST (safe-only) denotes the results when the training set contains only safe paths.
In addition, ReQueST (sparse) is the result of training without reward sketches.
As the figure shows, although the model-free baseline ate all the apples, it did so at a heavy cost in safety.
The ReQueST agent, by contrast, eats two of the three apples on average and falls off the cliff only one tenth as often as the baseline, an outstanding result.
Looking at the difference between reward models, ReQueST trained with reward sketches and ReQueST trained with sparse labels perform very differently.
ReQueST trained with sparse labels does not manage to eat even one apple on average.
It seems DeepMind and OpenAI have indeed made improvements on both of these points.
Reference links:
[1]https://www.arxiv-vanity.com/papers/2201.08102/
[2]https://deepmind.com/blog/article/learning-human-objectives-by-evaluating-hypothetical-behaviours
