[Hierarchical Reinforcement Learning] HAC paper and code
2022-07-27 20:41:00 【Xiaoshuai acridine】
Paper title: Learning Multi-Level Hierarchies with Hindsight
Authors: Andrew Levy, George Konidaris, Robert Platt, Kate Saenko
Published at: ICLR 2019
Google Scholar citations (at time of writing): 134
Paper link: https://arxiv.org/abs/1712.00948
Like HIRO, this paper tackles the non-stationarity problem that arises when policies at different levels of a hierarchy are learned together, but it does so in a completely different way. Hierarchical reinforcement learning decomposes a task into multiple subtasks and therefore uses samples more efficiently. However, in a hierarchy the transition function seen by an upper level depends on the policy of the level below it. When all levels are trained simultaneously, the lower-level policy keeps being updated, so the upper level's effective transition function keeps changing as well. In such a non-stationary environment it is hard for the agent to learn an optimal policy; this is the non-stationarity problem faced by hierarchical reinforcement learning.
To deal with this non-stationarity, HIRO uses off-policy correction: it relabels the proposed subgoals so that they remain consistent with the lower-level policy at different points in time. This paper instead uses a hindsight approach ("hindsight" meaning after the fact) and adds several kinds of hindsight transitions. The underlying assumption is that once all lower-level policies have converged to optimal or near-optimal behaviour, learning the policies at every level simultaneously becomes achievable, so each level is trained as if the levels below it were already optimal.

The paper's main contribution is three kinds of transitions, hindsight action transitions, hindsight goal transitions, and subgoal testing transitions, which replace or supplement the ordinary transitions stored at each level in order to resolve the non-stationarity problem.
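To make the later discussion concrete, each level in HAC learns from goal-conditioned transitions. A minimal sketch of such a tuple (the field names are my own choice for illustration, not taken from the paper or the linked repository):

```python
from collections import namedtuple

# One goal-conditioned transition at a single level of the hierarchy.
# At every level above the lowest, `action` holds the subgoal proposed to the
# level below; at the lowest level it is a primitive environment action.
Transition = namedtuple(
    "Transition",
    ["state", "goal", "action", "reward", "next_state", "done"],
)
```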
Good existing summaries (blog posts, in Chinese):
Zhihu, XuanAxuan: https://www.zhihu.com/question/520764541/answer/2439785957
Zhihu, zhaoyingnan: https://zhuanlan.zhihu.com/p/91055669
Summary:
1. Hindsight action transitions simply replace the action in the transition (i.e., the subgoal proposed to the level below) with the state that was actually reached. This trains the current level as if the lower-level policy were already optimal or near-optimal, because after this modification the trajectory looks as though it went exactly where the proposed goal pointed (see the sketch after this list).
2. Hindsight goal transitions extend Hindsight Experience Replay to the hierarchical setting and guarantee that every level receives the sparse reward at least once. Simply put, after a sequence of transitions, one of the achieved next states in that sequence is chosen as the current level's goal, which ensures that at least one transition in the sequence gets a meaningful (successful) reward.
3. Subgoal testing transitions. Although hindsight action transitions and hindsight goal transitions let the agent learn policies at all levels in parallel under sparse rewards, the definition of hindsight actions means that the hindsight action at level i can only be a state that level i-1 actually reached within H actions. As a result, a level only learns Q-values for subgoal actions near the current state and ignores subgoals that would need more than H actions to reach, which biases the Q-value estimates. Subgoal testing transitions play a role opposite to that of hindsight action transitions: hindsight action transitions learn the current level's policy under the assumption that the lower level is optimal, whereas subgoal testing transitions let the current level find out whether a given subgoal state is actually reachable under the current lower-level policy.
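A rough Python sketch (my own, not the authors' code) of how these three transition types could be constructed for one level, using the Transition tuple above. The horizon H, the penalty value, and the helpers achieved() and goal_reached() are placeholder assumptions for illustration:

```python
import numpy as np

H = 10           # maximum number of actions per subgoal attempt (assumed value)
PENALTY = -H     # subgoal-testing penalty, in the spirit of the paper

def achieved(state):
    # Placeholder mapping from a state to the goal space (identity here).
    return np.asarray(state, dtype=float)

def goal_reached(state, goal, tol=0.05):
    # Placeholder success test: the achieved goal is within `tol` of the target.
    return np.linalg.norm(achieved(state) - np.asarray(goal, dtype=float)) < tol

def hindsight_action_transition(state, goal, next_state):
    # Replace the proposed subgoal (this level's "action") with the subgoal the lower
    # level actually achieved, as if the lower-level policy were already optimal.
    reward = 0.0 if goal_reached(next_state, goal) else -1.0
    return Transition(state, goal, achieved(next_state), reward, next_state, reward == 0.0)

def hindsight_goal_transitions(trajectory):
    # HER-style relabelling: pretend an achieved state was the goal all along, so at
    # least one transition in the sequence earns the sparse success reward.
    new_goal = achieved(trajectory[-1].next_state)
    relabelled = []
    for t in trajectory:
        r = 0.0 if goal_reached(t.next_state, new_goal) else -1.0
        relabelled.append(Transition(t.state, new_goal, t.action, r, t.next_state, r == 0.0))
    return relabelled

def subgoal_testing_transition(state, goal, proposed_subgoal, next_state):
    # Stored only when the lower level acted greedily (no exploration noise) and still
    # missed the proposed subgoal: the large penalty tells the critic it is unreachable.
    if goal_reached(next_state, proposed_subgoal):
        return None
    return Transition(state, goal, proposed_subgoal, PENALTY, next_state, True)
```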

With subgoal testing transitions added, the critic no longer ignores the value of subgoals that cannot be reached (they are explicitly penalized), every subgoal level can still be trained at the same time, and the Q-values still favour subgoals that the lower levels can actually achieve.
Questions:
1. Why are hindsight action transitions effective for the top level of the agent? Zhao's blog explains it this way: although the reward of such a transition is still -1, it is still useful for the top level. Through these transitions, the high-level policy can learn how to propose its own subgoals, because the time scale is the same. At the same time, these transitions do not have to take the non-stationarity problem into account.
Xuan's blog puts it this way: although these hindsight actions do not earn the sparse reward, they still help train the upper levels of the agent, because they help the high-level policy find goals that the original actions can actually achieve. Moreover, these transitions are not affected by changes or exploration in the lower-level policy.
Algorithm pseudocode:
The algorithm can be summarized as follows: the three different kinds of transitions are put into the experience replay buffer for training, so the procedure is essentially one for generating these different kinds of samples.
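Continuing the sketch above, a much-simplified single-level rollout shows where each transition type is generated and appended to that level's replay buffer. The names high_policy, run_low_level, the buffer list, and the testing probability are stand-ins I am assuming, not the paper's or the repository's API:

```python
import random

def rollout_level(state, goal, high_policy, run_low_level, buffer, test_prob=0.3):
    """One attempt sequence of an upper level: at most H subgoal proposals."""
    trajectory = []
    for _ in range(H):
        subgoal = high_policy(state, goal)
        # Occasionally "test" the subgoal: the levels below act greedily, without noise.
        testing = random.random() < test_prob
        next_state = run_low_level(state, subgoal, greedy=testing)

        if testing:
            t = subgoal_testing_transition(state, goal, subgoal, next_state)
            if t is not None:          # only penalize subgoals that were missed
                buffer.append(t)

        # Hindsight action transition: act as if the achieved state had been proposed.
        t = hindsight_action_transition(state, goal, next_state)
        trajectory.append(t)
        buffer.append(t)

        state = next_state
        if goal_reached(state, goal):
            break

    # After the attempt sequence, add the HER-style hindsight goal transitions.
    buffer.extend(hindsight_goal_transitions(trajectory))
    return state
```

Training then reduces to sampling mini-batches from each level's buffer and running an off-policy actor-critic update (DDPG in the paper) for every level.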
Simulation results:

1. Hierarchical agents outperform non-hierarchical (flat) agents.
2. Both three-level and two-level hierarchies perform well.
3. HAC outperforms HIRO.
HAC implementation code (PyTorch version): https://github.com/nikhilbarhate99/Hierarchical-Actor-Critic-HAC-PyTorch