Paper notes: Universal Value Function Approximators
2022-06-28 19:23:00 【UQI-LIUWJ】
PMLR 2015
1 Introduction
This paper proposes UVFA (universal value function approximators), which estimate the expected return from both a state (an input that ordinary value functions also take) and a goal (an input that ordinary value functions lack).
The challenge in learning a UVFA is that the agent generally sees only a small fraction of the possible (s, g) combinations; it cannot traverse every state-goal pair. Trained naively with supervised learning, the model is likely to fit poorly for lack of data, which makes this a difficult regression problem.
UVFA therefore uses an approach similar to matrix factorization: the data is treated as a sparse matrix in which each row is an observed state s and each column is an observed goal g. The matrix is then factorized into state embeddings Φ(s) and goal embeddings φ(g).
-> This lets us learn the nonlinear mappings from state to Φ(s) and from goal to φ(g).
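As a rough illustration of the factorization view, the sparse table and its low-rank approximation can be written as below. The index notation (s_i, g_j), the embedding dimension n, and the use of a dot product to combine the two embeddings are assumptions made for this sketch, not notation taken from these notes.

```latex
M_{ij} = V^{*}_{g_j}(s_i), \qquad
M_{ij} \approx \Phi(s_i)^{\top} \varphi(g_j), \qquad
\Phi(s_i),\ \varphi(g_j) \in \mathbb{R}^{n}
```

Only the observed entries of M constrain the fit; the hope is that the learned embeddings generalize to unseen (s, g) pairs.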
2 Model

The two-stream architecture can capture the structure that states and goals have in common.
- In many cases a goal can be defined in terms of a state or a combination of states, so Φ and φ should have features to share.
- The paper shares the parameters of the first few layers of the MLPs Φ and φ, so features common to states and goals can be learned.
- -> partially symmetric architecture
- In some cases the UVFA itself may be symmetric.
- For example, a UVFA that computes the distance between state s and goal g.
- In that case we can set Φ = φ and let h, the operator that combines the two embeddings, be symmetric (such as a dot product).
- -> symmetric architecture
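The architectures described above can be sketched in a few lines of NumPy. This is only a minimal forward pass under stated assumptions: the layer sizes, the ReLU nonlinearity, the random weights, and the function names (`phi_state`, `phi_goal`, `value`, `value_symmetric`) are all hypothetical, and states and goals are assumed to live in the same feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical sizes: states and goals share one 4-d input space.
in_dim, hidden, emb = 4, 16, 3

W_shared = rng.normal(size=(in_dim, hidden))  # first layer, shared by both streams
W_state = rng.normal(size=(hidden, emb))      # state-only output layer (stream for Φ)
W_goal = rng.normal(size=(hidden, emb))       # goal-only output layer (stream for φ)

def phi_state(s):
    """State embedding Φ(s)."""
    return relu(s @ W_shared) @ W_state

def phi_goal(g):
    """Goal embedding φ(g); reuses the shared first layer."""
    return relu(g @ W_shared) @ W_goal

def value(s, g):
    """Partially symmetric variant: h is a dot product of the two embeddings."""
    return np.sum(phi_state(s) * phi_goal(g), axis=-1)

def value_symmetric(s, g):
    """Symmetric variant: Φ = φ, so swapping s and g leaves the value unchanged."""
    return np.sum(phi_state(s) * phi_state(g), axis=-1)

s = rng.normal(size=(5, in_dim))  # batch of 5 states
g = rng.normal(size=(5, in_dim))  # batch of 5 goals
v = value(s, g)
```

With Φ = φ and a dot product for h, `value_symmetric(s, g)` equals `value_symmetric(g, s)` by construction, which is the property the distance example relies on.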
2.1 Supervised learning of UVFA
2.1.1 End-to-end learning
Train with a suitable loss function (such as MSE) and gradient descent.
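A minimal sketch of end-to-end training, assuming the simplest possible model: one embedding vector per state and per goal (in place of the MLPs), combined by a dot product. The synthetic value table `M`, the observation mask, the learning rate, and the step count are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_goals, rank = 20, 15, 2

# Synthetic low-rank "ground-truth" value table (illustration only).
M = rng.normal(size=(n_states, rank)) @ rng.normal(size=(rank, n_goals))

# Only a subset of (s, g) pairs is observed, as in the sparse-matrix view.
mask = rng.random((n_states, n_goals)) < 0.6

S = 0.1 * rng.normal(size=(n_states, rank))  # state embeddings (stand-in for Φ)
G = 0.1 * rng.normal(size=(n_goals, rank))   # goal embeddings (stand-in for φ)

def mse(S, G):
    """Mean squared error over the observed entries only."""
    err = (S @ G.T - M)[mask]
    return float(np.mean(err ** 2))

loss_before = mse(S, G)
lr = 0.01
for _ in range(2000):
    # Residuals on observed entries; unobserved entries contribute nothing.
    err = np.where(mask, S @ G.T - M, 0.0)
    # Gradients of 0.5 * (sum of squared observed residuals).
    grad_S = err @ G
    grad_G = err.T @ S
    S -= lr * grad_S
    G -= lr * grad_G
loss_after = mse(S, G)
```

Comparing `loss_before` with `loss_after` shows whether the fit on observed pairs improved; it says nothing about held-out pairs, which is where the regression problem becomes hard.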
2.1.2 Two-stage learning
- Stage 1: arrange the values V*_g(s) in a matrix whose rows are states and whose columns are goals. Factorize the matrix to obtain a target embedding for each state and for each goal. [Figure 1, third panel, right half]
- Stage 2: using those target embeddings as ground truth, learn Φ(s) and φ(g). [Figure 1, third panel, left half]
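The two stages can be sketched as follows. Several simplifications are assumed here and are not from the paper: the value table is fully observed so a truncated SVD can be used for stage 1, linear least-squares regression stands in for the two MLPs in stage 2, and the features `X_s`, `X_g` and the table `M` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_goals, feat, rank = 30, 20, 6, 3

# Hypothetical raw features for states and goals.
X_s = rng.normal(size=(n_states, feat))
X_g = rng.normal(size=(n_goals, feat))

# Synthetic value table that is bilinear in the features (illustration only).
A = rng.normal(size=(feat, rank))
B = rng.normal(size=(feat, rank))
M = (X_s @ A) @ (X_g @ B).T

# Stage 1: rank-n factorization of the table via truncated SVD.
U, sing, Vt = np.linalg.svd(M, full_matrices=False)
target_s = U[:, :rank] * np.sqrt(sing[:rank])  # one target embedding per state (row)
target_g = Vt[:rank].T * np.sqrt(sing[:rank])  # one target embedding per goal (column)

# Stage 2: regress from raw features to the target embeddings.
W_s, *_ = np.linalg.lstsq(X_s, target_s, rcond=None)
W_g, *_ = np.linalg.lstsq(X_g, target_g, rcond=None)

# Reconstruct the table from the learned mappings.
M_hat = (X_s @ W_s) @ (X_g @ W_g).T
max_err = float(np.max(np.abs(M_hat - M)))
```

Because the synthetic table is exactly bilinear in the features, the linear regressors can reproduce the target embeddings; with real data the second stage would need nonlinear networks and the fit would be approximate.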
2.2 Reinforcement learning of UVFA
In reinforcement learning there is no ground-truth V*_g(s), so the Q-values have to be obtained some other way.
The paper uses a Horde architecture to produce the corresponding Q-values. I have not read the Horde paper, but the results are similar when bootstrapping (TD) is used instead [TD is slightly less stable].
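For reference, the bootstrapped (TD) alternative mentioned above can be written as a standard Q-learning-style update with the goal added as an extra argument. The notation is assumed for this sketch: α is a step size, and r_g and γ_g denote a goal-specific reward and discount.

```latex
Q(s_t, a_t, g) \leftarrow Q(s_t, a_t, g)
  + \alpha \Big[ r_g + \gamma_g \max_{a'} Q(s_{t+1}, a', g) - Q(s_t, a_t, g) \Big]
```

The target on the right depends on the current estimate of Q itself, which is one plausible reason for the instability noted above.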

[Note: the paper still does not say how these goals are actually obtained.]
[By step 10 the Q-values have been computed and the rest no longer involves reinforcement learning: the remaining steps are matrix factorization plus training the two embedding networks.]