Paper notes: Universal Value Function Approximators
2022-06-28 19:23:00 【UQI-LIUWJ】
PMLR 2015
1 Introduction
This paper proposes UVFA (universal value function approximators): a value function that takes both a state (which an ordinary value function also has) and a goal (which an ordinary value function lacks) as input and estimates the expected return.
The challenge in learning a UVFA is that the agent usually sees only a small fraction of the (s, g) combinations; it cannot visit every state-goal pair. Trained naively with supervised learning, the model is likely to fit poorly for lack of data, which makes this a hard regression problem.
UVFA therefore uses an approach similar to matrix factorization. The data is viewed as a sparse matrix in which each row is an observed state s and each column is an observed goal g. The matrix is then factorized into a state embedding Φ(s) and a goal embedding φ(g).
-> We can then learn the nonlinear mappings from state to Φ(s) and from goal to φ(g).
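In symbols, the factorization view can be written roughly as follows. Here M is a name introduced only for this note (the sparse state-by-goal table of values), θ stands for all learned parameters, and h is the function that combines the two embeddings; the dot product in the middle expression is just one possible choice of h.

```latex
M_{s,g} = V^{*}_{g}(s), \qquad
M_{s,g} \approx \Phi(s)^{\top} \varphi(g), \qquad
V(s, g; \theta) = h\big(\Phi(s), \varphi(g)\big)
```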
2 Model

A two-stream architecture can capture the structure that states and goals have in common.
- In many cases a goal can be defined in terms of a state or a combination of states, so Φ and φ should have features they can share.
- In the paper, Φ and φ are MLPs whose first few layers share parameters, so the features common to states and goals can be learned.
- -> partially symmetric architecture
- In some cases the UVFA itself may be symmetric.
- For example, a UVFA that computes the distance between state s and goal g.
- In that case we can set Φ = φ and let h, the function that combines the two embeddings, be a symmetric operator (such as the dot product).
- -> symmetric architecture
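The two variants above can be sketched in a few lines of NumPy. This is only an illustrative forward pass, not the paper's network: the class name `TwoStreamUVFA`, the layer sizes, the tanh activations and the assumption that states and goals share the same input dimension are all choices made for this sketch.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    # Hypothetical initialisation: small random weights, zero bias.
    return rng.normal(0.0, 0.5, size=(n_in, n_out)), np.zeros(n_out)

def dense(x, layer):
    weights, bias = layer
    return np.tanh(x @ weights + bias)

class TwoStreamUVFA:
    """Two embedding streams that share their first layer (assumed sizes)."""

    def __init__(self, dim_in, dim_hidden, dim_embed, symmetric=False, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = init_layer(rng, dim_in, dim_hidden)      # used by both streams
        self.state_head = init_layer(rng, dim_hidden, dim_embed)
        # Symmetric variant: the goal stream reuses the state stream entirely.
        self.goal_head = (self.state_head if symmetric
                          else init_layer(rng, dim_hidden, dim_embed))

    def embed_state(self, s):
        return dense(dense(s, self.shared), self.state_head)

    def embed_goal(self, g):
        return dense(dense(g, self.shared), self.goal_head)

    def value(self, s, g):
        # h is taken to be the dot product, which is a symmetric operator.
        return np.sum(self.embed_state(s) * self.embed_goal(g), axis=-1)

s = np.array([0.2, -1.0, 0.5])
g = np.array([1.0, 0.3, -0.7])
partial = TwoStreamUVFA(3, 8, 4)                  # partially symmetric
symmetric = TwoStreamUVFA(3, 8, 4, symmetric=True)
print(partial.value(s, g), symmetric.value(s, g))
```

With `symmetric=True` both streams are the same function, so swapping s and g leaves the value unchanged; in the partially symmetric model only the first layer is shared and the two heads can differ.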
2.1 Learning a UVFA with supervised learning
2.1.1 End-to-end learning
Train with a suitable loss function (such as MSE) and gradient descent.
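As a rough illustration of the end-to-end route, the sketch below fits a dot-product model to a handful of observed (state, goal, value) triples by minimising MSE with plain gradient descent. It is a simplification: each state and goal gets its own embedding row instead of an MLP, and the function name `train_end_to_end`, the toy data and the hyperparameters are assumptions made here.

```python
import numpy as np

def train_end_to_end(triples, n_states, n_goals, rank=2, lr=0.05,
                     epochs=2000, seed=0):
    """Gradient descent on MSE for V(s, g) = S[s] . G[g] (lookup-table embeddings)."""
    rng = np.random.default_rng(seed)
    S = rng.normal(0.0, 0.1, size=(n_states, rank))   # state embeddings
    G = rng.normal(0.0, 0.1, size=(n_goals, rank))    # goal embeddings
    s_idx, g_idx, v = (np.array(col) for col in zip(*triples))
    losses = []
    for _ in range(epochs):
        pred = np.sum(S[s_idx] * G[g_idx], axis=1)
        err = pred - v
        losses.append(float(np.mean(err ** 2)))
        # Accumulate MSE gradients per embedding row; indices may repeat.
        grad_S = np.zeros_like(S)
        grad_G = np.zeros_like(G)
        np.add.at(grad_S, s_idx, 2.0 * err[:, None] * G[g_idx] / len(v))
        np.add.at(grad_G, g_idx, 2.0 * err[:, None] * S[s_idx] / len(v))
        S -= lr * grad_S
        G -= lr * grad_G
    return S, G, losses

# Toy data: 3 states, 2 goals, with the pair (state 1, goal 1) never observed.
observed = [(0, 0, 1.0), (0, 1, 0.5), (1, 0, 2.0), (2, 1, 1.5), (2, 0, 3.0)]
S, G, losses = train_end_to_end(observed, n_states=3, n_goals=2)
print("first loss:", losses[0], "last loss:", losses[-1])
print("prediction for the unseen pair (1, 1):", S[1] @ G[1])
```

The unseen pair still gets a prediction because its state and goal embeddings were each trained through other entries, which is the sense in which the factorized form helps with sparse (s, g) coverage.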
2.1.2 Two-stage learning
- Stage 1: arrange the values V*_g(s) in a matrix whose rows are states and whose columns are goals. Factorize the matrix to obtain a target embedding for each state and for each goal [Figure 1, right half of the third panel].
- Stage 2: using these target embeddings as ground truth, learn Φ(s) and φ(g) [Figure 1, left half of the third panel].
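The two stages can be sketched as follows. Several things here are assumptions for illustration: the value table is fully observed and exactly low-rank, stage 1 uses a truncated SVD (a table with missing entries would need a matrix-completion method instead), and stage 2 uses linear least squares in place of the embedding networks. The helper names `factorize` and `fit_linear` and the random feature vectors are hypothetical.

```python
import numpy as np

def factorize(M, rank):
    """Stage 1: truncated SVD, splitting singular values evenly between factors."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(sigma[:rank])
    return U[:, :rank] * root, Vt[:rank].T * root   # state targets, goal targets

def fit_linear(features, targets):
    """Stage 2 stand-in: least-squares map from raw features to target embeddings."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights

rng = np.random.default_rng(0)
state_feats = rng.normal(size=(6, 3))   # 6 states described by 3 raw features
goal_feats = rng.normal(size=(4, 3))    # 4 goals described by 3 raw features
M = state_feats @ goal_feats.T          # toy value table: rows = states, cols = goals

target_s, target_g = factorize(M, rank=3)
W_s = fit_linear(state_feats, target_s)   # plays the role of Φ
W_g = fit_linear(goal_feats, target_g)    # plays the role of φ

pred = (state_feats @ W_s) @ (goal_feats @ W_g).T
new_state = rng.normal(size=(1, 3))       # a state that was not in the table
new_pred = (new_state @ W_s) @ (goal_feats @ W_g).T
print(np.abs(pred - M).max())
```

Because the toy table is exactly bilinear in the features, the learned maps reproduce it and also extend to the new state; with real data both stages would only be approximate.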
2.2 Learning a UVFA with reinforcement learning
In reinforcement learning there is no ground-truth V*_g(s), so the Q-values have to be obtained some other way.
The paper uses a Horde architecture to produce the Q-values. I have not read that paper, but using bootstrapping (TD) instead gives similar results [TD is slightly less stable].

[Note: the paper still does not say exactly how the goals are obtained.]
[By step 10 of the algorithm the Q-values have been computed and what remains no longer involves reinforcement learning: the following steps are matrix factorization plus training the two embedding networks.]
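For the bootstrapping (TD) alternative mentioned above, a minimal tabular sketch is given below. It is not the Horde architecture and not the paper's algorithm: the chain environment, the reward of 1 for reaching the goal, the random behaviour policy and the function name `goal_q_learning` are all assumptions made to show how a table of goal-conditioned Q-values could be filled before any factorization step.

```python
import numpy as np

def goal_q_learning(n_states=5, gamma=0.9, lr=0.5, episodes=2000, seed=0):
    """Q-learning on a 1-D chain, with a separate Q[s, a] slice per goal g."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, 2, n_states))   # Q[s, a, g]; a: 0 = left, 1 = right
    for _ in range(episodes):
        s = int(rng.integers(n_states))
        g = int(rng.integers(n_states))
        for _ in range(20):                 # cap on episode length
            if s == g:
                break
            a = int(rng.integers(2))        # random exploration
            step = 1 if a == 1 else -1
            s_next = min(max(s + step, 0), n_states - 1)
            # Assumed reward: 1 on reaching the goal, which ends the episode.
            if s_next == g:
                target = 1.0
            else:
                target = gamma * Q[s_next, :, g].max()   # bootstrapped TD target
            Q[s, a, g] += lr * (target - Q[s, a, g])
            s = s_next
    return Q

Q = goal_q_learning()
print("Q(s=0, left, g=4):", Q[0, 0, 4], " Q(s=0, right, g=4):", Q[0, 1, 4])
```

Taking the maximum of `Q` over the action axis gives a state-by-goal table of the kind that the later steps factorize and regress onto.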