[Hierarchical Reinforcement Learning] HAC paper and code
2022-07-27 20:41:00 【Xiaoshuai acridine】
Paper title: Learning Multi-Level Hierarchies with Hindsight
Authors: Andrew Levy, George Konidaris, Robert Platt, Kate Saenko
Published at: ICLR 2019
Google Scholar citations (at time of writing): 134
Paper link: https://arxiv.org/abs/1712.00948
Like HIRO, this paper tackles the non-stationarity problem that arises when policies at different levels of a hierarchy are learned together, but it does so in a completely different way. Hierarchical reinforcement learning decomposes a task into multiple subtasks and therefore uses samples more efficiently. However, in a hierarchy the transition function seen by an upper level depends on the policy of the level below it. When all levels are trained simultaneously, the lower-level policy keeps being updated, so the upper level's effective transition function keeps changing as well. In such a non-stationary environment it is hard for the agent to learn an optimal policy; this is the non-stationarity problem faced by hierarchical reinforcement learning.
To deal with this non-stationarity, HIRO uses off-policy correction: it relabels the proposed subgoals so that they remain consistent with the lower-level policy at different points in time. This paper instead uses a hindsight approach ("hindsight" meaning after the fact) and adds several kinds of hindsight transitions. The underlying assumption is that once all lower-level policies have converged to optimal or near-optimal behaviour, learning the policies at every level simultaneously becomes achievable, so each level is trained as if the levels below it were already optimal.

The paper's main contribution is three kinds of transitions, hindsight action transitions, hindsight goal transitions, and subgoal testing transitions, which replace or supplement the ordinary transitions stored at each level in order to resolve the non-stationarity problem.
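To make the later discussion concrete, each level in HAC learns from goal-conditioned transitions. A minimal sketch of such a tuple (the field names are my own choice for illustration, not taken from the paper or the linked repository):

```python
from collections import namedtuple

# One goal-conditioned transition at a single level of the hierarchy.
# At every level above the lowest, `action` holds the subgoal proposed to the
# level below; at the lowest level it is a primitive environment action.
Transition = namedtuple(
    "Transition",
    ["state", "goal", "action", "reward", "next_state", "done"],
)
```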
Good existing summaries (blog posts, in Chinese):
Zhihu, XuanAxuan: https://www.zhihu.com/question/520764541/answer/2439785957
Zhihu, zhaoyingnan: https://zhuanlan.zhihu.com/p/91055669
Summary:
1. Hindsight action transitions simply replace the action in the transition (i.e., the subgoal proposed to the level below) with the state that was actually reached. This trains the current level as if the lower-level policy were already optimal or near-optimal, because after this modification the trajectory looks as though it went exactly where the proposed goal pointed (see the sketch after this list).
2. Hindsight goal transitions extend Hindsight Experience Replay to the hierarchical setting and guarantee that every level receives the sparse reward at least once. Simply put, after a sequence of transitions, one of the achieved next states in that sequence is chosen as the current level's goal, which ensures that at least one transition in the sequence gets a meaningful (successful) reward.
3. Subgoal testing transitions. Although hindsight action transitions and hindsight goal transitions let the agent learn policies at all levels in parallel under sparse rewards, the definition of hindsight actions means that the hindsight action at level i can only be a state that level i-1 actually reached within H actions. As a result, a level only learns Q-values for subgoal actions near the current state and ignores subgoals that would need more than H actions to reach, which biases the Q-value estimates. Subgoal testing transitions play a role opposite to that of hindsight action transitions: hindsight action transitions learn the current level's policy under the assumption that the lower level is optimal, whereas subgoal testing transitions let the current level find out whether a given subgoal state is actually reachable under the current lower-level policy.
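A rough Python sketch (my own, not the authors' code) of how these three transition types could be constructed for one level, using the Transition tuple above. The horizon H, the penalty value, and the helpers achieved() and goal_reached() are placeholder assumptions for illustration:

```python
import numpy as np

H = 10           # maximum number of actions per subgoal attempt (assumed value)
PENALTY = -H     # subgoal-testing penalty, in the spirit of the paper

def achieved(state):
    # Placeholder mapping from a state to the goal space (identity here).
    return np.asarray(state, dtype=float)

def goal_reached(state, goal, tol=0.05):
    # Placeholder success test: the achieved goal is within `tol` of the target.
    return np.linalg.norm(achieved(state) - np.asarray(goal, dtype=float)) < tol

def hindsight_action_transition(state, goal, next_state):
    # Replace the proposed subgoal (this level's "action") with the subgoal the lower
    # level actually achieved, as if the lower-level policy were already optimal.
    reward = 0.0 if goal_reached(next_state, goal) else -1.0
    return Transition(state, goal, achieved(next_state), reward, next_state, reward == 0.0)

def hindsight_goal_transitions(trajectory):
    # HER-style relabelling: pretend an achieved state was the goal all along, so at
    # least one transition in the sequence earns the sparse success reward.
    new_goal = achieved(trajectory[-1].next_state)
    relabelled = []
    for t in trajectory:
        r = 0.0 if goal_reached(t.next_state, new_goal) else -1.0
        relabelled.append(Transition(t.state, new_goal, t.action, r, t.next_state, r == 0.0))
    return relabelled

def subgoal_testing_transition(state, goal, proposed_subgoal, next_state):
    # Stored only when the lower level acted greedily (no exploration noise) and still
    # missed the proposed subgoal: the large penalty tells the critic it is unreachable.
    if goal_reached(next_state, proposed_subgoal):
        return None
    return Transition(state, goal, proposed_subgoal, PENALTY, next_state, True)
```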

With subgoal testing transitions added, the critic no longer ignores the value of subgoals that cannot be reached (they are explicitly penalized), every subgoal level can still be trained at the same time, and the Q-values still favour subgoals that the lower levels can actually achieve.
Questions:
1. Why are hindsight action transitions effective for the top level of the agent? Zhao's blog explains it this way: although the reward of such a transition is still -1, it is still useful for the top level. Through these transitions, the high-level policy can learn how to propose its own subgoals, because the time scale is the same. At the same time, these transitions do not have to take the non-stationarity problem into account.
Xuan's blog puts it this way: although these hindsight actions do not earn the sparse reward, they still help train the upper levels of the agent, because they help the high-level policy find goals that the original actions can actually achieve. Moreover, these transitions are not affected by changes or exploration in the lower-level policy.
Algorithm pseudocode:
The algorithm can be summarized as follows: the three different kinds of transitions are put into the experience replay buffer for training, so the procedure is essentially one for generating these different kinds of samples.
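Continuing the sketch above, a much-simplified single-level rollout shows where each transition type is generated and appended to that level's replay buffer. The names high_policy, run_low_level, the buffer list, and the testing probability are stand-ins I am assuming, not the paper's or the repository's API:

```python
import random

def rollout_level(state, goal, high_policy, run_low_level, buffer, test_prob=0.3):
    """One attempt sequence of an upper level: at most H subgoal proposals."""
    trajectory = []
    for _ in range(H):
        subgoal = high_policy(state, goal)
        # Occasionally "test" the subgoal: the levels below act greedily, without noise.
        testing = random.random() < test_prob
        next_state = run_low_level(state, subgoal, greedy=testing)

        if testing:
            t = subgoal_testing_transition(state, goal, subgoal, next_state)
            if t is not None:          # only penalize subgoals that were missed
                buffer.append(t)

        # Hindsight action transition: act as if the achieved state had been proposed.
        t = hindsight_action_transition(state, goal, next_state)
        trajectory.append(t)
        buffer.append(t)

        state = next_state
        if goal_reached(state, goal):
            break

    # After the attempt sequence, add the HER-style hindsight goal transitions.
    buffer.extend(hindsight_goal_transitions(trajectory))
    return state
```

Training then reduces to sampling mini-batches from each level's buffer and running an off-policy actor-critic update (DDPG in the paper) for every level.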
Simulation results:

1. Hierarchical agents outperform non-hierarchical (flat) agents.
2. Both three-level and two-level hierarchies perform well.
3. HAC outperforms HIRO.
HAC implementation code (PyTorch version): https://github.com/nikhilbarhate99/Hierarchical-Actor-Critic-HAC-PyTorch