[hierarchical reinforcement learning] HAC paper and code
2022-07-27 20:41:00 【Xiaoshuai acridine】
Paper title: Learning Multi-Level Hierarchies with Hindsight
Authors: Andrew Levy, George Konidaris, Robert Platt, Kate Saenko
Venue: ICLR 2019
Google Scholar citations (at the time of writing): 134
Paper link: https://arxiv.org/abs/1712.00948
Like HIRO, this paper addresses the non-stationarity problem that arises when policies at different levels of a hierarchy are learned at the same time, but it does so in a completely different way. Hierarchical reinforcement learning decomposes a task into multiple subtasks and therefore uses samples more efficiently. However, in a hierarchy the transition function seen by an upper level depends on the policy of the level below it: when all levels are trained simultaneously, the lower-level policy is constantly updated, so the upper level's effective transition function keeps changing as well. In such a non-stationary environment it is hard for the agent to learn an optimal policy; this is the non-stationarity problem faced by hierarchical reinforcement learning.
To solve this problem, HIRO uses an off-policy correction: it relabels the proposed subgoals so that they fit the lower-level policy at different points in time. This paper instead uses a hindsight approach ("hindsight" meaning after the fact), adding several kinds of hindsight transitions to the replay buffer. The key observation is that if every level is trained as if the lower-level policies had already converged to optimal or near-optimal behavior, then the policies at all levels can be learned simultaneously.
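For background, here is a minimal sketch of hindsight relabeling in the style of Hindsight Experience Replay (HER), which HAC builds on. The dictionary keys, the "final"-state relabeling strategy, the distance threshold, and the sparse 0/-1 reward convention are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def her_relabel(episode, tol=0.05):
    """Hindsight relabeling (HER-style): replace the original goal of each
    transition with a state actually reached later in the episode, so the
    sparse reward of 0 is observed at least once."""
    relabeled = []
    final_achieved = np.asarray(episode[-1]["next_state"])  # "final" strategy
    for t in episode:
        next_state = np.asarray(t["next_state"])
        reward = 0.0 if np.linalg.norm(next_state - final_achieved) < tol else -1.0
        relabeled.append((t["state"], t["action"], reward, t["next_state"], final_achieved))
    return relabeled
```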

The paper's main contribution is three kinds of transitions, Hindsight Action Transitions, Hindsight Goal Transitions, and Subgoal Testing Transitions, which augment the original transitions in order to solve the non-stationarity problem.
Good existing write-ups:
Zhihu, XuanAxuan: https://www.zhihu.com/question/520764541/answer/2439785957
Zhihu, zhaoyingnan: https://zhuanlan.zhihu.com/p/91055669
Summary:
1. Hindsight action transitions simply replace the action in a transition (i.e., the subgoal proposed to the level below) with the state that was actually reached. This implements the assumption that the lower-level policy is optimal or near-optimal, because after the modification the trajectory looks as if the lower level went exactly where the proposed goal told it to go.
2. Hindsight goal transitions extend Hindsight Experience Replay to the hierarchical setting and guarantee that each level can obtain the sparse reward after a sequence of actions. Simply put, after a sequence of transitions, one of the achieved next states is chosen as the goal of the current level, which guarantees that at least one transition in the sequence receives a useful reward.
3. Subgoal testing transitions: although hindsight action transitions and hindsight goal transitions let the agent learn policies at all levels in parallel under sparse rewards, the definition of a hindsight action restricts a level-i hindsight action to states that level i-1 can reach within H actions. A level therefore only learns the Q-values of subgoal actions near the current state and ignores subgoal actions that would need more than H actions to reach, which biases the Q-value estimates. Subgoal testing transitions play the opposite role to hindsight action transitions: hindsight action transitions learn the current level's policy under the assumption that the lower level is optimal, while subgoal testing transitions let the current level learn whether a subgoal state is actually reachable under the current lower-level policy.

With subgoal testing transitions added, the critic no longer ignores the value of unreachable subgoals, all subgoal levels can still learn simultaneously, and the Q-values still favor subgoals that the lower levels can actually reach. A hedged sketch of how the three transition types could be constructed is given below.
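The sketch below shows how the three transition types could be built for one step of a non-primitive level. The tuple layout, the reached() threshold, and the per-step hindsight-goal relabeling are simplifications (the paper relabels goals over a whole sequence of transitions); the penalty of -H follows the paper's subgoal-testing penalty, but the horizon H = 10 is an assumed value:

```python
import numpy as np

H = 10                # lower-level horizon: max attempts per subgoal (assumed value)
SUBGOAL_PENALTY = -H  # subgoal-testing penalty of -H, as in the paper

def reached(a, b, tol=0.05):
    """Sparse success test: are two states within a small distance threshold?"""
    return np.linalg.norm(np.asarray(a) - np.asarray(b)) < tol

def make_transitions(state, subgoal, achieved_state, goal, testing):
    """Build HAC's transition types for one step of a non-primitive level.
    Each transition is a (state, action, reward, next_state, goal) tuple."""
    ts = []
    # 1) Hindsight action transition: replace the proposed subgoal (the action)
    #    with the state actually reached, as if the lower level were optimal.
    r = 0.0 if reached(achieved_state, goal) else -1.0
    ts.append((state, achieved_state, r, achieved_state, goal))
    # 2) Hindsight goal transition: relabel the goal with the achieved state so
    #    the sparse reward of 0 is observed at least once per sequence.
    ts.append((state, achieved_state, 0.0, achieved_state, achieved_state))
    # 3) Subgoal testing transition: if the lower level ran deterministically
    #    ("testing") and missed the proposed subgoal, keep the ORIGINAL action
    #    and apply the penalty, pushing down Q-values of unreachable subgoals.
    if testing and not reached(achieved_state, subgoal):
        ts.append((state, subgoal, SUBGOAL_PENALTY, achieved_state, goal))
    return ts
```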
Questions:
1. Why are hindsight action transitions useful to the top level of the agent? Zhao's blog describes it this way: even though the reward of such a transition is still -1, it is still useful to the top level, because through these transitions the high-level policy can learn how to propose its own subgoals (the time scale is the same), and these transitions do not suffer from the non-stationarity problem.
Xuan's blog puts it this way: although these hindsight actions cannot obtain the sparse reward, they still help train the upper levels of the agent, because they help the high-level policy find goals that the original actions can actually achieve. Moreover, these transitions are unaffected by changes or exploration in the lower-level policy.
Algorithm pseudocode:
In short, the algorithm generates the three different kinds of transitions and puts them into the experience replay buffer for training, so the flow of the algorithm is really the process of generating these different samples. A minimal sketch of the nested control flow follows.
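The sketch below paraphrases the nested HAC control flow, assuming one policy and one replay buffer per level and reusing make_transitions() and reached() from the sketch above. The sample()/env.step() interfaces and the 0.3 testing probability are illustrative assumptions (the paper tests subgoals with a fixed probability, and exploration noise should be switched off during a test; both details are simplified here):

```python
import random

def train_level(i, state, goal, env, policies, buffers, H=10, test_prob=0.3):
    """Run level i for at most H attempts toward `goal`; return the final state.
    Level 0 executes primitive actions; higher levels propose subgoals."""
    for _ in range(H):
        action = policies[i].sample(state, goal)          # a subgoal if i > 0
        testing = i > 0 and random.random() < test_prob   # deterministic subgoal test
        if i > 0:
            # recurse: the level below tries to reach the proposed subgoal
            next_state = train_level(i - 1, state, action, env,
                                     policies, buffers, H, test_prob)
            # store hindsight action / hindsight goal / subgoal testing transitions
            buffers[i].extend(make_transitions(state, action, next_state, goal, testing))
        else:
            next_state = env.step(action)                 # primitive action (assumed API)
            # level 0 stores the ordinary transition; its hindsight goal
            # relabeling is omitted here for brevity
            r = 0.0 if reached(next_state, goal) else -1.0
            buffers[i].append((state, action, r, next_state, goal))
        state = next_state
        if reached(state, goal):
            break
    return state
```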
Simulation results:

1. The hierarchical agents outperform the non-hierarchical baseline.
2. Both the three-level and the two-level agents perform well.
3. HAC outperforms HIRO.
HAC implementation code (PyTorch version): https://github.com/nikhilbarhate99/Hierarchical-Actor-Critic-HAC-PyTorch