Paper notes: Universal Value Function Approximators
2022-06-28 19:23:00 【UQI-LIUWJ】
PMLR 2015
1 Introduction
This paper proposes the UVFA (universal value function approximator), which estimates the expected return from a state (an input that other value functions also take) and a goal (an input that other value functions do not take).
The challenge in learning a UVFA is that the agent generally sees only a small fraction of the (s, g) combinations; it cannot visit every state-goal pair. If we train with plain supervised learning, the data is likely to be insufficient and the fit poor, which makes this a difficult regression problem.
To address this, UVFA uses an approach similar to matrix factorization. The data is treated as a sparse matrix in which each row is an observed state s and each column is an observed goal g. The matrix is then factorized into state embeddings Φ(s) and goal embeddings φ(g).
→ We can then learn nonlinear mappings from state to Φ(s) and from goal to φ(g).
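The factorization idea can be written compactly. The formula below is a sketch under two assumptions that are not stated in these notes: the two embeddings are combined by a dot product, and both have the same dimension n (the rank of the factorization). M denotes the sparse state-by-goal matrix of observed values.

```latex
M_{s,g} \;\approx\; \Phi(s)^{\top} \varphi(g),
\qquad \Phi(s),\ \varphi(g) \in \mathbb{R}^{n}
```

Only the observed entries of M constrain the embeddings; the unobserved (s, g) pairs are then predicted from the learned Φ(s) and φ(g).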
2 Model

The two-stream architecture can learn the structure shared between states and goals.
- In many cases a goal can be defined as a state or as a combination of states, so Φ and φ should have some features in common.
- In this paper, Φ and φ are MLPs whose first few layers share parameters, so the features common to states and goals can be learned.
- → partially symmetric architecture
- In some cases the UVFA itself may be symmetric in s and g.
- An example is a UVFA that computes the distance between state s and goal g.
- In that case we can set Φ = φ and choose h, the function that combines the two embeddings into a value, to be a symmetric operator (such as the dot product).
- → symmetric architecture
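The architectures listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the class name `TwoStreamUVFA`, the layer sizes, the tanh activations, and the choice of the dot product for h are all assumptions, and states and goals are assumed to share the same input dimension so that the first layer can be shared.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Random weights and zero bias for one dense layer.
    return rng.normal(0.0, 0.5, size=(n_in, n_out)), np.zeros(n_out)

def dense(x, layer):
    W, b = layer
    return np.tanh(x @ W + b)

class TwoStreamUVFA:
    """V(s, g) = h(Phi(s), phi(g)), with h assumed to be the dot product."""

    def __init__(self, dim_in, hidden, embed, symmetric=False):
        # The first layer is shared by both streams (partially symmetric).
        self.shared = init_layer(dim_in, hidden)
        self.state_head = init_layer(hidden, embed)
        # Fully symmetric case: the goal stream reuses the state head, so Phi == phi.
        self.goal_head = self.state_head if symmetric else init_layer(hidden, embed)

    def embed_state(self, s):
        return dense(dense(s, self.shared), self.state_head)

    def embed_goal(self, g):
        return dense(dense(g, self.shared), self.goal_head)

    def value(self, s, g):
        # h: dot product of the two embeddings, one value per (s, g) row pair.
        return np.sum(self.embed_state(s) * self.embed_goal(g), axis=-1)
```

With `symmetric=True` the two streams are identical, so swapping the state and goal arguments leaves the value unchanged; with the default setting only the first layer is shared.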
2.1 Supervised learning of UVFA
2.1.1 End-to-end learning
Train the whole network with a suitable loss function (such as MSE) and gradient descent.
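As a rough sketch of end-to-end training, the code below fits a tiny two-stream model by full-batch gradient descent on the MSE. Everything specific here is an assumption made for illustration: the function name `train_end_to_end`, the single tanh layer per stream, the dot-product combination, and the hyperparameters. The gradients are written out by hand so the block needs only NumPy.

```python
import numpy as np

def train_end_to_end(S, G, y, embed_dim=3, lr=0.02, steps=500, seed=0):
    """Fit V(s, g) = tanh(s A) . tanh(g B) to targets y by minimising the MSE."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 0.5, size=(S.shape[1], embed_dim))  # state-stream weights
    B = rng.normal(0.0, 0.5, size=(G.shape[1], embed_dim))  # goal-stream weights
    losses = []
    for _ in range(steps):
        Phi = np.tanh(S @ A)               # state embeddings
        phi = np.tanh(G @ B)               # goal embeddings
        pred = np.sum(Phi * phi, axis=1)   # dot product per (s, g) pair
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the MSE through the dot product and the tanh layers.
        d_pred = 2.0 * err[:, None] / len(y)
        grad_A = S.T @ (d_pred * phi * (1.0 - Phi ** 2))
        grad_B = G.T @ (d_pred * Phi * (1.0 - phi ** 2))
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B, losses
```

Each row of `S` and `G` is one observed (state, goal) pair and `y` holds the corresponding target value; `losses` records the MSE at every step so the fit can be inspected.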
2.1.2 Two-stage learning
- Stage 1: put the values V*_g(s) into a matrix whose rows are states and whose columns are goals. Factorize the matrix to obtain the target embeddings Φ̂_s and φ̂_g. 【Figure 1, right half of the third panel】
- Stage 2: treat Φ̂_s and φ̂_g as ground truth and learn Φ(s) and φ(g) by regression. 【Figure 1, left half of the third panel】
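The two stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the value matrix `M` is assumed to be fully observed so that a truncated SVD can stand in for the factorization step, and a linear least-squares fit stands in for training the two embedding networks. The function name `two_stage_uvfa` and the feature matrices are hypothetical.

```python
import numpy as np

def two_stage_uvfa(M, state_feats, goal_feats, rank):
    # Stage 1: rank-n factorization M ~= Phi_hat @ phi_hat.T via truncated SVD.
    U, sing, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(sing[:rank])
    Phi_hat = U[:, :rank] * root   # target state embeddings, one row per state
    phi_hat = Vt[:rank].T * root   # target goal embeddings, one row per goal
    # Stage 2: regress from input features onto the target embeddings.
    W_state, *_ = np.linalg.lstsq(state_feats, Phi_hat, rcond=None)
    W_goal, *_ = np.linalg.lstsq(goal_feats, phi_hat, rcond=None)
    return W_state, W_goal

# Synthetic example: a value matrix that is exactly rank 2 in the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))   # 10 states, 4 features each
Y = rng.normal(size=(8, 4))    # 8 goals, 4 features each
M = (X @ rng.normal(size=(4, 2))) @ (Y @ rng.normal(size=(4, 2))).T
W_state, W_goal = two_stage_uvfa(M, X, Y, rank=2)
M_rebuilt = (X @ W_state) @ (Y @ W_goal).T
```

In this synthetic case the learned embedding functions reproduce `M` because the matrix was built to be low-rank in the features. With a sparse matrix of observed values, stage 1 would need a factorization method that handles missing entries.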
2.2 Reinforcement learning of UVFA
In reinforcement learning there is no ground-truth V*_g(s), so the Q-values have to be obtained some other way.
The paper uses a Horde-style architecture to produce the corresponding Q-values. I have not read the Horde paper, but using bootstrapping (TD) gives similar results 【TD is slightly less stable】

【Note: the paper still does not say exactly how the goals are obtained】
【By step 10 of the algorithm the Q-values have been computed, and the rest has nothing to do with reinforcement learning: the remaining steps are matrix factorization plus training the two embedding networks】
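To make the bootstrapping (TD) option concrete, here is a small tabular sketch in which several goals learn their Q-values off-policy from one shared stream of transitions. It is an illustration under assumptions and not the paper's algorithm: the environment is a hypothetical deterministic chain, each goal gets a pseudo-reward of 1 on arrival and stops bootstrapping there, and `learn_goal_q_tables` is a made-up name.

```python
import numpy as np

def learn_goal_q_tables(n_states, goals, gamma=0.9, alpha=0.5, steps=20000, seed=0):
    """Q-learning for every goal at once on a chain with actions 0=left, 1=right."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((len(goals), n_states, 2))   # Q[goal index, state, action]
    s = 0
    for _ in range(steps):
        a = rng.integers(2)                    # random behaviour policy
        s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        for i, g in enumerate(goals):          # each goal learns from the same transition
            reached = s_next == g
            reward = 1.0 if reached else 0.0   # goal-specific pseudo-reward
            bootstrap = 0.0 if reached else gamma * Q[i, s_next].max()
            Q[i, s, a] += alpha * (reward + bootstrap - Q[i, s, a])
        s = s_next
    return Q

Q = learn_goal_q_tables(n_states=5, goals=[0, 4])
```

Taking the maximum of `Q` over actions gives one value per (state, goal) pair, which is the kind of table the later matrix factorization and embedding-training steps operate on.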