Paper notes: Universal Value Function Approximators
2022-06-28 19:23:00 【UQI-LIUWJ】
PMLR 2015
1 Introduction
This paper proposes UVFAs (universal value function approximators), which estimate the expected return from both a state (an input that other value functions also take) and a goal (an input that other value functions do not take).
The challenge in learning a UVFA is that the agent generally sees only a small fraction of the (s, g) combinations; it cannot traverse every state-goal pair. If we train with supervised learning, the data is likely to be too sparse for a good fit, which makes this a difficult regression problem.
Here the UVFA uses an approach similar to matrix factorization. The data is treated as a sparse matrix in which each row is an observed state s and each column is an observed goal g. The matrix is then factorized into state embeddings Φ(s) and goal embeddings φ(g).
→ We can then learn nonlinear mappings from state to Φ(s) and from goal to φ(g).
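To make the sparse-matrix view concrete, here is a minimal sketch (not the paper's code) that fits one embedding per state row and one per goal column using only the observed entries. The toy table, the mask, and the name `factorize_sparse` are assumptions made for illustration, and plain gradient descent stands in for whichever factorization method is actually used.

```python
import numpy as np

def factorize_sparse(values, mask, rank=1, lr=0.02, steps=5000, seed=0):
    """Fit values[i, j] ~= state_emb[i] @ goal_emb[j] on observed entries only."""
    rng = np.random.default_rng(seed)
    n_states, n_goals = values.shape
    state_emb = 0.1 * rng.standard_normal((n_states, rank))
    goal_emb = 0.1 * rng.standard_normal((n_goals, rank))
    for _ in range(steps):
        # Residual on observed (s, g) pairs; unobserved entries contribute zero.
        err = mask * (state_emb @ goal_emb.T - values)
        grad_state = err @ goal_emb
        grad_goal = err.T @ state_emb
        state_emb -= lr * grad_state
        goal_emb -= lr * grad_goal
    return state_emb, goal_emb

# Hypothetical toy table: rows are states, columns are goals, exactly rank 1.
table = np.outer([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5])
mask = np.ones_like(table)
mask[0, 2] = 0.0   # pretend this (s, g) pair was never observed
mask[3, 0] = 0.0   # and this one

state_emb, goal_emb = factorize_sparse(table, mask)
pred = state_emb @ goal_emb.T
print(np.round(pred, 3))
```

The two masked entries are never used in the fit, yet the low-rank structure shared across rows and columns lets the product of the embeddings fill them in.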
2 Model

A two-stream architecture can capture the structure shared between states and goals.
- In many cases, a goal can be defined in terms of a state or a combination of states, so Φ and φ should have features in common.
- In this paper, the MLPs for Φ and φ share the parameters of their first few layers, so features common to states and goals can be learned.
- → partially symmetric architecture
- In some cases, the UVFA itself may be symmetric

- For example, a UVFA that computes the distance between state s and goal g
- In that case we can set Φ = φ and let h be a symmetric operator (such as the dot product)
- → symmetric architecture
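The two variants can be sketched as a forward pass. This is only an illustration under stated assumptions: goals are given in the same input space as states, each stream is a single shared tanh layer followed by its own linear head, h is the dot product, and all sizes and weight names (`W_shared`, `W_state`, `W_goal`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hidden_dim, emb_dim = 4, 8, 3   # hypothetical sizes

W_shared = rng.standard_normal((in_dim, hidden_dim))  # first layer, used by both streams
W_state = rng.standard_normal((hidden_dim, emb_dim))  # head used only for states
W_goal = rng.standard_normal((hidden_dim, emb_dim))   # head used only for goals

def embed(x, W_head):
    """Shared first layer followed by a stream-specific linear head."""
    return np.tanh(x @ W_shared) @ W_head

def uvfa(s, g, symmetric=False):
    """h(Phi(s), phi(g)) with h = dot product.

    symmetric=False: partially symmetric (shared first layer, separate heads).
    symmetric=True:  Phi = phi, so swapping s and g leaves the value unchanged.
    """
    state_embedding = embed(s, W_state)
    goal_embedding = embed(g, W_state if symmetric else W_goal)
    return float(state_embedding @ goal_embedding)

s = rng.standard_normal(in_dim)
g = rng.standard_normal(in_dim)
print(uvfa(s, g), uvfa(g, s))                                   # generally differ
print(uvfa(s, g, symmetric=True), uvfa(g, s, symmetric=True))   # equal
```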
2.1 Supervised learning of UVFAs
2.1.1 End-to-end learning
Implemented with a suitable loss function (such as MSE) plus gradient descent.
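One way to write this objective down, as a sketch rather than the paper's exact formulation: here D denotes the set of observed (s, g) pairs, θ collects all parameters of Φ, φ and h, and α is a step size. These three symbols are assumptions introduced for this note.

```latex
\mathcal{L}(\theta)
  = \frac{1}{|D|} \sum_{(s,g) \in D}
    \Bigl( V^*_g(s) - h\bigl(\Phi(s), \varphi(g)\bigr) \Bigr)^2,
\qquad
\theta \leftarrow \theta - \alpha \, \nabla_\theta \mathcal{L}(\theta)
```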
2.1.2 Two-stage learning
- stage 1: put the values V*_g(s) in a matrix whose rows correspond to states and whose columns correspond to goals. Factorize the matrix to obtain a target embedding for each state and a target embedding for each goal (right half of the third panel of Figure 1)
- stage 2: with these target embeddings as ground truth, learn Φ(s) and φ(g) (left half of the third panel of Figure 1)
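The two stages can be sketched on a toy problem. Everything below is an assumption for illustration: the value table is fully observed and exactly rank 2, a truncated SVD stands in for the factorization step, and linear least squares on hypothetical state and goal features stands in for training the two embedding networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_goals, feat_dim, rank = 6, 5, 4, 2   # hypothetical sizes

# Hypothetical input features and a toy value table of exact rank 2.
state_feats = rng.standard_normal((n_states, feat_dim))
goal_feats = rng.standard_normal((n_goals, feat_dim))
table = (state_feats @ rng.standard_normal((feat_dim, rank))) @ (
    goal_feats @ rng.standard_normal((feat_dim, rank))
).T

# Stage 1: factorize the table into target embeddings (rows: states, columns: goals).
U, sing, Vt = np.linalg.svd(table, full_matrices=False)
state_targets = U[:, :rank] * np.sqrt(sing[:rank])
goal_targets = Vt[:rank].T * np.sqrt(sing[:rank])

# Stage 2: regress from features to the target embeddings.
# Linear least squares is a stand-in for training the networks Phi and phi.
A_fit, *_ = np.linalg.lstsq(state_feats, state_targets, rcond=None)
B_fit, *_ = np.linalg.lstsq(goal_feats, goal_targets, rcond=None)

# The learned maps now produce embeddings from features alone.
pred = (state_feats @ A_fit) @ (goal_feats @ B_fit).T
print(np.max(np.abs(pred - table)))
```

Because stage 2 maps features to embeddings, the fitted maps can also be applied to states or goals that were not rows or columns of the original table.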
2.2 Reinforcement learning of UVFAs
In reinforcement learning there is no longer a ground-truth V*_g(s), so the Q-values have to be obtained some other way.
The paper uses a Horde-style architecture to produce the corresponding Q-values. I have not read the Horde paper, but with bootstrapping (TD) the result is similar (TD is slightly less stable).

(Note: the paper still does not say exactly how the goals are obtained.)
(By step 10, once the Q-values have been computed, nothing that follows involves reinforcement learning: the remaining steps are matrix factorization plus training the two embedding networks.)
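As a rough picture of the first half of that pipeline, the sketch below learns one tabular Q-function per goal by bootstrapped updates and stacks the resulting values into a state-by-goal table. It is a simplified stand-in, not the paper's algorithm or the Horde architecture: the deterministic 5-state chain, the reward of 1 on entering the goal, and the function name `q_values_for_goal` are all assumptions.

```python
import numpy as np

def q_values_for_goal(goal, n_states=5, gamma=0.9, sweeps=50):
    """Bootstrapped Q updates on a deterministic chain; reward 1 on entering `goal`."""
    q = np.zeros((n_states, 2))                 # actions: 0 = left, 1 = right
    for _ in range(sweeps):
        for s in range(n_states):
            if s == goal:
                continue                        # the goal is terminal, Q stays 0
            for a, step in enumerate((-1, 1)):
                s_next = min(max(s + step, 0), n_states - 1)
                reward = 1.0 if s_next == goal else 0.0
                bootstrap = 0.0 if s_next == goal else gamma * q[s_next].max()
                # A step size of 1 is enough because the dynamics are deterministic.
                q[s, a] = reward + bootstrap
    return q

n_states = 5
# Rows are states, columns are goals; each entry is max_a Q_g(s, a).
value_table = np.column_stack(
    [q_values_for_goal(g, n_states).max(axis=1) for g in range(n_states)]
)
print(np.round(value_table, 3))
```

From here on the table plays the role that V*_g(s) plays in the supervised setting: it is factorized, and the two embedding networks are trained on the result.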