【HIRO: Hierarchical Reinforcement Learning】Data-Efficient Hierarchical Reinforcement Learning
2022-08-01 01:36:00 【little handsome acridine】
Paper Title: Data-Efficient Hierarchical Reinforcement Learning
Authors: Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, Sergey Levine
Published at: NeurIPS 2018
Summary
Hierarchical reinforcement learning (HRL) is a promising approach for extending traditional reinforcement learning (RL) methods to solve more complex tasks. Most current HRL methods require careful task-specific design and on-policy training, which makes them difficult to apply to real-world scenarios. In this paper, we study how to develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in that they can be used with a modest number of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme in which the lower-level controller is supervised with goals that are learned and proposed automatically by the higher-level controller. To improve efficiency, we propose to use off-policy experience for both higher- and lower-level training. This poses a considerable challenge, since changes in the lower-level behavior change the action space for the higher-level policy, and we introduce an off-policy correction to address this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher- and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We call the resulting HRL agent HIRO and find that it is generally applicable and sample-efficient. Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach outperforms the previous state-of-the-art by a large margin.
Algorithmic Framework
Intrinsic Rewards
The high-level policy produces a goal g_t indicating a desired relative change in the state observation. That is, at step t, the high-level policy produces a goal g_t indicating that it wants the lower-level agent to take actions that yield an observation s_{t+c} close to s_t + g_t. Although some state dimensions are more natural as goal subspaces, we choose this more general goal representation so that the approach has broad applicability without the need to manually design a goal space, primitives, or controllable dimensions. This makes the method general and applicable to new problem settings.
In order for the goal to keep referring to the same absolute target position as the state changes, the goal transition model h is defined as

h(s_t, g_t, s_{t+1}) = s_t + g_t − s_{t+1}
The intrinsic reward is defined as a parameterized reward function based on the distance between the current observation and the goal observation:

r(s_t, g_t, a_t, s_{t+1}) = −‖s_t + g_t − s_{t+1}‖_2

This reward function rewards the low-level policy for taking actions that yield observations close to the desired value s_t + g_t.
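Both quantities are simple to compute; a minimal NumPy sketch (function and variable names are my own, not taken from the official HIRO code) might look like:

```python
import numpy as np

def goal_transition(s_t, g_t, s_next):
    """Fixed goal transition h: keep the goal pointing at the same
    absolute target position s_t + g_t as the state changes."""
    return s_t + g_t - s_next

def intrinsic_reward(s_t, g_t, s_next):
    """Negative L2 distance between the reached state and the desired
    state s_t + g_t; this is the low-level policy's reward."""
    return -np.linalg.norm(s_t + g_t - s_next)
```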
The lower-level policy can be trained with standard methods by simply incorporating g_t as an additional input into the value and policy models. For example, in DDPG the low-level critic is trained by minimizing the TD error

(Q_θ(s_t, g_t, a_t) − r(s_t, g_t, a_t, s_{t+1}) − γ Q_θ(s_{t+1}, g_{t+1}, μ^lo(s_{t+1}, g_{t+1})))²,  where g_{t+1} = h(s_t, g_t, s_{t+1}).

The policy (actor) μ^lo_φ is updated, as in DDPG, by taking gradient steps that maximize Q_θ(s_t, g_t, μ^lo_φ(s_t, g_t)).
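As a rough illustration, here is a PyTorch sketch of these goal-conditioned DDPG updates. It assumes ordinary MLP modules that take the goal as an extra input; the names and structure are mine, not from the linked repository.

```python
import torch
import torch.nn.functional as F

# critic(s, g, a) -> Q value; actor(s, g) -> action. Both are plain MLPs.
def update_low_level(critic, critic_target, actor, actor_target,
                     critic_opt, actor_opt, batch, gamma=0.99):
    # g_next = h(s, g, s_next); r is the intrinsic reward defined above
    s, g, a, r, s_next, g_next = batch

    # Critic: minimize the TD error of the goal-conditioned Q function
    with torch.no_grad():
        target_q = r + gamma * critic_target(s_next, g_next,
                                             actor_target(s_next, g_next))
    critic_loss = F.mse_loss(critic(s, g, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend Q, i.e. minimize -Q(s, g, actor(s, g))
    actor_loss = -critic(s, g, actor(s, g)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```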
Off-Policy Corrections for Higher-Level Training
What is the non-stationarity problem?
Because the low-level policy keeps changing during training, an old high-level transition (s_t, g_t, Σr, s_{t+c}) no longer describes what would happen if the same goal were issued now, so it cannot be reused off-policy as-is. Here my understanding is that the correction relabels the goal in the stored transition: g_t is replaced with a goal under which the current low-level policy would most likely have produced the observed low-level actions, so that when the same state is encountered, the result passed to the upper layer stays consistent with the result that was passed to the upper layer when this state was encountered before.
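A minimal sketch of this relabeling step (NumPy pseudocode under my own naming, following the paper's description of sampling candidate goals around s_{t+c} − s_t and keeping the one that best explains the stored low-level actions; the candidate standard deviation is an assumption here):

```python
import numpy as np

def relabel_goal(states, actions, orig_goal, low_policy, goal_transition,
                 num_candidates=8, std_scale=0.5):
    """Off-policy correction: pick the goal that maximizes the likelihood
    of the stored low-level actions under the *current* low-level policy."""
    s0, s_c = states[0], states[-1]
    # Candidate goals: the original goal, the observed change s_{t+c} - s_t,
    # and Gaussian samples centered on that observed change.
    candidates = [orig_goal, s_c - s0]
    candidates += [np.random.normal(s_c - s0, std_scale * np.abs(s_c - s0) + 1e-6)
                   for _ in range(num_candidates)]

    def log_prob(goal):
        # For a deterministic low-level policy, the (unnormalized) log-probability
        # of the stored actions is the negative squared error to the policy's output.
        g, total = goal, 0.0
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            total -= np.sum((a - low_policy(s, g)) ** 2)
            g = goal_transition(s, g, s_next)  # h(s, g, s') = s + g - s'
        return total

    return max(candidates, key=log_prob)
```

The relabeled goal is then stored in place of g_t in the high-level transition, so the high-level replay buffer remains consistent with the current low-level behavior.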
Reference:
https://zhuanlan.zhihu.com/p/86602304
HIRO's pytorch code: https://github.com/watakandai/hiro_pytorch