[Paper Notes] Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research
2022-07-05 05:06:00 【See deer (Python version)】
Abstract
First, it introduces a set of challenging continuous-control tasks based on existing robot hardware (integrated into OpenAI Gym).
All tasks have sparse binary rewards and follow a multi-goal reinforcement learning (RL) framework, in which the agent is told what to do via an additional goal input.
The second part of the paper proposes a set of concrete research directions for improving RL algorithms, most of them related to multi-goal RL and Hindsight Experience Replay (HER).
1 Environments
1.1 Fetch environments
The Fetch environments are based on a 7-DoF robotic arm with a two-fingered parallel gripper.
An additional reaching task was added, and the pick & place task is slightly different from earlier work.
In all Fetch tasks, the goal is 3-dimensional and describes the desired position of the object (or of the end effector, in the reaching task).
Rewards are sparse and binary: the agent receives a reward of 0 if the object is at the target location (within a tolerance of 5 cm) and −1 otherwise.
Actions are 4-dimensional: three dimensions specify the desired gripper movement in Cartesian coordinates, and the last dimension controls opening and closing of the gripper.
Before returning control to the agent, the same action is applied for 20 simulator steps (each with timestep δ = 0.002 s), so the agent acts at a frequency of f = 25 Hz (20 × 0.002 s = 0.04 s per agent step).
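A minimal sketch of this frame-skipping control loop, using the step count and timestep from the text; the `sim` object and its methods are hypothetical placeholders, not the actual MuJoCo/Gym API:

```python
SIM_DT = 0.002      # simulator timestep in seconds (from the text)
N_SUBSTEPS = 20     # simulator steps per agent action (from the text)

# 20 * 0.002 s = 0.04 s of simulated time per agent step -> 25 Hz control.
CONTROL_DT = N_SUBSTEPS * SIM_DT
ACTION_FREQUENCY_HZ = 1.0 / CONTROL_DT   # 25.0

def env_step(sim, action):
    """Apply the same action for N_SUBSTEPS simulator steps before
    returning control to the agent (hypothetical simulator interface)."""
    for _ in range(N_SUBSTEPS):
        sim.apply_action(action)
        sim.advance(SIM_DT)
    return sim.get_observation()
```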
Observations include the Cartesian position of the gripper, its linear velocity, and the positions and linear velocities of the gripper's fingers.
If an object is present, the observations also include the object's Cartesian position, its rotation as Euler angles, its linear and angular velocities, and its position and linear velocity relative to the gripper.
Reaching (FetchReach)
The task is to move the gripper to a target position. This task is very easy to learn and therefore serves as a sanity-check benchmark to confirm that a new idea works at all.
Pushing (FetchPush)
A box is placed on a table in front of the robot, and the task is to move it to a target location on the table. The robot's fingers are locked to prevent grasping. The learned behavior is usually a mixture of pushing and rolling.
Sliding (FetchSlide)
A puck is placed on a long, slippery table, and the target position is outside the robot's reach, so the robot must hit the puck with enough force that it slides and then comes to rest at the target position due to friction.
Pick & Place (FetchPickAndPlace)
The task is to grasp a box and move it to a target location, which may be on the table surface or in the air above it.
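These four tasks are exposed as standard Gym environments; a minimal sketch of instantiating them (the environment IDs follow the Gym robotics naming, but the version suffixes may differ between releases, and MuJoCo plus the robotics extras must be installed):

```python
import gym

# The four Fetch tasks described above (version suffixes may vary).
env_ids = [
    "FetchReach-v1",
    "FetchPush-v1",
    "FetchSlide-v1",
    "FetchPickAndPlace-v1",
]

for env_id in env_ids:
    env = gym.make(env_id)
    print(env_id, env.action_space)   # Box(4,): dx, dy, dz, gripper open/close
    env.close()
```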
1.2 Hand environments
These environments are based on the Shadow Dexterous Hand, an anthropomorphic robotic hand with 24 degrees of freedom. Of its 24 joints, 20 can be controlled independently, while the remaining ones are coupled joints.
| Item | Description |
|---|---|
| Rewards | The agent obtains a reward of 0 if the goal has been achieved (within some task-specific tolerance) and −1 otherwise. |
| Actions | 20-dimensional; absolute position control is used for all non-coupled joints of the hand. |
| Observations | The 24 positions and velocities of the robot's joints. If an object is being manipulated, its Cartesian position and rotation represented by a quaternion (hence 7-dimensional) as well as its linear and angular velocities are also included. |
Reaching (HandReach)
A simple task: the goal is 15-dimensional and contains the target Cartesian position of each fingertip of the hand. The goal is considered achieved if the mean distance between the fingertips and their desired positions is less than 1 cm.
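A minimal sketch of this success criterion as described above (interpreting the 15-dimensional goal as 5 fingertip positions; this follows the wording of the text rather than any particular library implementation):

```python
import numpy as np

def handreach_success(achieved_goal, desired_goal, threshold=0.01):
    """Success if the mean fingertip-to-target distance is below 1 cm."""
    achieved = np.asarray(achieved_goal).reshape(5, 3)   # 5 fingertips, xyz
    desired = np.asarray(desired_goal).reshape(5, 3)
    mean_dist = np.linalg.norm(achieved - desired, axis=-1).mean()
    return mean_dist < threshold
```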
Block manipulation (HandManipulateBlock)
In the block manipulation task, a block is placed on the palm of the hand. The task is to manipulate the block so that a target pose is achieved.
HandManipulateBlockRotateZ
Random target rotation around the z axis of the block. No target position.
HandManipulateBlockRotateParallel
Random target rotation around the z axis of the block and axis-aligned target rotations for the x and y axes. No target position.
HandManipulateBlockRotateXYZ
Random target rotation for all axes of the block. No target position.
HandManipulateBlockFull
Random target rotation for all axes of the block. Random target position.
The goal is considered achieved if the distance between the block's position and its desired position is less than 1 cm (relevant only for the Full variant) and the difference in rotation is less than 0.1 rad.
Egg manipulation (HandManipulateEgg)
The goal here is similar to the block task, but an egg-shaped object is used instead of a block.
The geometry of the object makes a significant difference in problem difficulty, and the egg is probably the easiest of these objects.
The goal is again 7-dimensional and includes the target position (in Cartesian coordinates) and the target rotation (as a quaternion).
HandManipulateEggRotate
Random target rotation for all axes of the egg. No target position.
HandManipulateEggFull
Random target rotation for all axes of the egg. Random target position.
The goal is considered achieved if the distance between the egg's position and its desired position is less than 1 cm (relevant only for the Full variant) and the difference in rotation is less than 0.1 rad.
Pen manipulation (HandManipulatePen)
Another manipulation task, this time with a pen instead of a block or an egg.
Grasping the pen is difficult because it easily falls out of the hand and can collide with and get stuck between other fingers.
HandManipulatePenRotate
Random target rotation around the x and y axes of the pen, with no target rotation around the z axis. No target position.
HandManipulatePenFull
Random target rotation around the x and y axes of the pen, with no target rotation around the z axis. Random target position.
The goal is considered achieved if the distance between the pen's position and its desired position is less than 5 cm (relevant only for the Full variant) and the difference in rotation, ignoring the z axis, is less than 0.1 rad.
1.3 Multi-goal environment interface
Goal-aware observation space
It requires the observation space to be of type gym.spaces.Dict, with the following entries:
observation
The actual observation of the environment, e.g. the state of the robot.
desired_goal
The goal that the agent has to achieve.
achieved_goal
The goal that the agent has currently achieved. In FetchReach, this is the position of the robot's end effector. Ideally, this will eventually be identical to desired_goal.
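A minimal sketch of what this dictionary-based interface looks like in practice, using FetchReach as an example (the version suffix may vary, and newer Gym/Gymnasium releases change the reset/step signatures):

```python
import gym

env = gym.make("FetchReach-v1")
obs = env.reset()

# obs is a dict with the three keys described above.
print(obs["observation"].shape)   # robot state
print(obs["desired_goal"])        # target end-effector position (3-d)
print(obs["achieved_goal"])       # current end-effector position (3-d)

obs, reward, done, info = env.step(env.action_space.sample())
```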
Exposed reward function
Second, the reward function is exposed so that rewards can be recomputed with a different goal. This is a necessary requirement for algorithms that substitute goals, such as HER.
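A minimal sketch of recomputing a transition's reward against a substituted goal via compute_reward(achieved_goal, desired_goal, info), the method exposed by Gym's goal-based environments (environment ID and version suffix as above):

```python
import gym

env = gym.make("FetchPush-v1")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())

# Reward with respect to the episode's original desired goal ...
r_original = env.compute_reward(obs["achieved_goal"], obs["desired_goal"], info)

# ... and the same transition re-evaluated against a substituted goal,
# here the goal that was actually achieved (as HER does), which yields 0.
r_hindsight = env.compute_reward(obs["achieved_goal"], obs["achieved_goal"], info)
print(r_original, r_hindsight)
```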
Compatibility with standard RL algorithms
A simple wrapper is included that converts the new dictionary-based goal observation space into the more common array representation.
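A hand-rolled sketch of such a wrapper (Gym has shipped similar wrappers under different names across versions, so this is written out explicitly rather than relying on a specific built-in):

```python
import gym
import numpy as np

class DictToArrayWrapper(gym.ObservationWrapper):
    """Concatenate selected dict entries into one flat array so that
    standard RL code expecting a Box observation space can be used."""

    def __init__(self, env, keys=("observation", "desired_goal")):
        super().__init__(env)
        self.keys = keys
        size = sum(int(np.prod(env.observation_space.spaces[k].shape)) for k in keys)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(size,), dtype=np.float32
        )

    def observation(self, obs):
        return np.concatenate([np.ravel(obs[k]) for k in self.keys]).astype(np.float32)
```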
1.4 Benchmark results
The following configurations are compared:

- DDPG+HER with sparse rewards
- DDPG+HER with dense rewards
- DDPG with sparse rewards
- DDPG with dense rewards
Performance is evaluated after each epoch by performing 10 deterministic test rollouts per MPI worker; the test success rate is computed by averaging over rollouts and MPI workers.
In all cases, each experiment is repeated with 5 different random seeds, and the results are reported as the median test success rate together with the interquartile range.
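A minimal sketch of this reporting step, assuming `success_rates` holds hypothetical per-seed test success rates for one epoch (5 seeds):

```python
import numpy as np

# Hypothetical per-seed test success rates for one epoch (5 random seeds).
success_rates = np.array([0.82, 0.75, 0.91, 0.70, 0.85])

median = np.median(success_rates)
q25, q75 = np.percentile(success_rates, [25, 75])
print(f"median = {median:.2f}, interquartile range = [{q25:.2f}, {q75:.2f}]")
```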
In all environments other than the easy FetchReach task, DDPG+HER significantly outperforms all other configurations.
DDPG+HER performs best when the reward structure is sparse, but it is also able to learn successfully from dense rewards.
For vanilla DDPG, it is usually easier to learn from dense rewards, while sparse rewards are more challenging.
As before, when using HER the sparse reward structure clearly works better than the dense one.
HER is able to learn partially successful policies in all environments, although HandManipulatePen is especially challenging.
A likely explanation:

- Learning with sparse rewards is simpler, because the critic only has to distinguish successful states from unsuccessful ones.
- A dense reward, in contrast, pushes the policy towards actions that move directly towards the desired goal (see the sketch below).
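For concreteness, a minimal sketch contrasting the two reward structures for a position-based goal, using the 5 cm tolerance quoted for the Fetch tasks (a simplification for illustration, not the library's exact implementation):

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    """0 if the goal is reached (within tolerance), -1 otherwise."""
    d = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
    return -(d > tolerance).astype(np.float32)

def dense_reward(achieved_goal, desired_goal):
    """Negative distance to the goal: rewards moving directly towards it."""
    d = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal), axis=-1)
    return -d
```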
2 Request for Research
Automatic hindsight goal generation
It would be useful to learn which goals are most valuable to substitute during experience replay.
The main question is how to determine which goals are most valuable to replay. One option is to train a generator that maximizes the Bellman error.
Unbiased HER
HER changes the joint distribution of replayed transitions in an unprincipled way.
In theory, this could make training impossible in extremely stochastic environments, although this has not been observed in practice.
Consider an environment with a special action that takes the agent to a random state, after which the episode ends.
In hindsight, such an action looks perfect if we replay with the goal that the agent happened to achieve afterwards.
How can this problem be avoided? One possible approach is to use importance sampling to remove the sampling bias, but this is likely to cause prohibitively high variance in the gradient.
HER+HRL
A possible extension of this work is to replace not only goals but also higher-level actions: for example, if the higher level asked the lower level to reach state A but some other state B was reached instead, the episode could be replayed with the higher-level action replaced by B.
This would allow the higher level to learn even while the low-level policy is still poor, but it is not very principled and may make training unstable.
Richer value functions
UVFAs extend value functions to multiple goals, and TDMs extend them to different time horizons.
Both of these innovations can make training easier, even though the function being learned is more complicated.
Faster information propagation
Most state-of-the-art off-policy RL algorithms use target networks to stabilize training.
However, this comes at the price of limiting the maximum learning speed of the algorithm, because each target-network update propagates return information backwards by only one step in time (if one-step bootstrapping is used).
We noticed that, in the early stages of training, the learning speed of DDPG+HER is often proportional to the frequency of target-network updates, but updating the target network too frequently or with too large a step makes training unstable and leads to worse final performance.
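For reference, a minimal sketch of the soft (Polyak-averaged) target-network update being discussed; the coefficient `tau` plays the role of the update frequency/magnitude trade-off mentioned above (parameters are shown as plain NumPy arrays for illustration):

```python
import numpy as np

def polyak_update(online_params, target_params, tau=0.005):
    """Soft update: theta_target <- tau * theta_online + (1 - tau) * theta_target.

    A small tau stabilizes training but slows the backwards propagation of
    return information; a large tau speeds learning up but can destabilize it.
    """
    for name, theta in online_params.items():
        target_params[name] = tau * theta + (1.0 - tau) * target_params[name]
    return target_params
```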
HER + multi-step returns
The data generated by HER is extremely off-policy, so multi-step returns cannot be used directly unless some correction factor, such as importance sampling, is applied.
Although there are many approaches for dealing with off-policy data, it is unclear how well they perform when the training data is this far from the current policy.
Using multi-step returns could nevertheless be beneficial, because bootstrapping less frequently can lead to less biased gradients.
Moreover, it speeds up the backwards propagation of return information in time, which, according to the experiments, is often the limiting factor in DDPG+HER training (compare the previous paragraph).
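For reference, a minimal sketch of an n-step return with bootstrapping from a (target) value estimate; as noted above, applying this to HER's highly off-policy data would require an additional correction (variable names are illustrative):

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.98, n=3):
    """R_t = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n})."""
    rewards = np.asarray(rewards[:n], dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards) + gamma ** len(rewards) * bootstrap_value)

# Example: three sparse -1 rewards followed by bootstrapping from the target critic.
print(n_step_return([-1.0, -1.0, -1.0], bootstrap_value=-5.0))
```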
On-policy HER
Rauber et al. presented some preliminary results with vanilla policy gradients, but this approach needs to be tested on more challenging environments such as the ones proposed in this report. One possible option would be to use techniques similar to those employed in IPG.
Combine HER with recent improvements in RL
RL with very frequent actions
In the continuous control domain, performance tends towards zero as the action frequency goes to infinity. This is caused by two factors:

- Inconsistent exploration, and the need to bootstrap more times to propagate return information backwards in time. How can a sample-efficient RL algorithm be designed that maintains its performance even as the action frequency goes to infinity? The exploration issue could be addressed with parameter noise, and faster information propagation could be achieved with multi-step returns.
- Another approach could be an adaptive and learnable frame skip.
Appendix A
Appendix B