Paper notes: Universal Value Function Approximators
2022-06-28 19:23:00 【UQI-LIUWJ】
PMLR 2015
1 Introduction
This paper proposes UVFA (universal value function approximators), which estimate the expected return from both a state (an input that ordinary value functions also take) and a goal (an input that ordinary value functions lack).
The challenge in learning a UVFA is that the agent generally sees only a small fraction of the (s, g) combinations; it cannot traverse every state-goal pair. Training by plain supervised learning is therefore likely to fit poorly for lack of data, which makes this a difficult regression problem.
UVFA uses a method similar to matrix factorization: the data is viewed as a sparse matrix in which each row is an observed state s and each column is an observed goal g. The matrix is then factorized into a state embedding Φ(s) and a goal embedding φ(g).
→ We can then learn the nonlinear mappings from state to Φ(s) and from goal to φ(g).
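Written out, the factorization view looks roughly like the following. This is only a sketch in the notation of these notes: M is the sparse table of values, Ω (a symbol introduced here for illustration) is the set of observed state-goal pairs, the hats mark the per-row and per-column factors, and the dot product is assumed for the combining function.

```latex
\min_{\hat{\Phi},\,\hat{\varphi}} \;
\sum_{(s,g)\in\Omega} \left( M_{s,g} - \hat{\Phi}_s^{\top} \hat{\varphi}_g \right)^2,
\qquad
V(s, g) \approx \Phi(s)^{\top} \varphi(g)
```

The left-hand problem only touches observed entries; the right-hand approximation is what lets the learned mappings Φ and φ produce values for pairs that were never seen together.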
2 Model

The two-stream architecture can learn the structure that states and goals have in common.
- In many cases, a goal can be defined as a state or as a combination of states, so Φ and φ should have some features to share.
- In this paper, the MLPs for Φ and φ share the parameters of their first few layers, so the features common to states and goals can be learned.
- → partially symmetric architecture
- In some cases, the UVFA can be fully symmetric.
- For example, a UVFA that computes the distance between state s and goal g.
- In this case we can set Φ = φ and let h be a symmetric operator (such as the dot product).
- → symmetric architecture
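The two variants above can be sketched in NumPy as follows. This is an illustrative forward pass only, not the paper's implementation: the class name `TwoStreamUVFA`, the layer sizes, the single shared layer, and the tanh nonlinearity are all assumptions, and goals are assumed to live in the same input space as states so that the first layer can be shared.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_layer(n_in, n_out):
    """Return (weights, bias) for one dense layer."""
    weights = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out))
    return weights, np.zeros(n_out)


class TwoStreamUVFA:
    """Hypothetical two-stream model: V(s, g) = Phi(s) . phi(g)."""

    def __init__(self, input_dim, hidden_dim, embed_dim, symmetric=False):
        # First layer is shared by both streams (partially symmetric).
        self.shared = init_layer(input_dim, hidden_dim)
        self.state_head = init_layer(hidden_dim, embed_dim)
        # Fully symmetric variant: the goal stream reuses the state head too.
        self.goal_head = (
            self.state_head if symmetric else init_layer(hidden_dim, embed_dim)
        )

    def embed(self, x, head):
        w1, b1 = self.shared
        w2, b2 = head
        return np.tanh(x @ w1 + b1) @ w2 + b2

    def value(self, state, goal):
        phi_s = self.embed(state, self.state_head)
        phi_g = self.embed(goal, self.goal_head)
        # h is the dot product, a symmetric operator.
        return np.sum(phi_s * phi_g, axis=-1)


states = rng.normal(size=(5, 4))
goals = rng.normal(size=(5, 4))
partial = TwoStreamUVFA(4, 8, 3)
symmetric = TwoStreamUVFA(4, 8, 3, symmetric=True)
```

With `symmetric=True`, swapping the two arguments of `value` leaves the output unchanged, which is the property a distance-like UVFA would want; the partially symmetric model has no such constraint.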
2.1 Supervised learning of UVFA
2.1.1 End-to-end learning
Train the whole model with a suitable loss function (such as MSE) and gradient descent.
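As a rough illustration of end-to-end training, the sketch below fits a bilinear model V(s, g) = (s A) . (g B) to sampled targets by full-batch gradient descent on the MSE. Everything here is an assumption made for the example: the synthetic data, the linear embeddings standing in for MLPs, and the learning rate and step count.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, goal_dim, embed_dim, n_samples = 6, 5, 3, 500

# Synthetic regression targets from a hidden bilinear function (toy data).
true_a = rng.normal(scale=1.0 / np.sqrt(state_dim), size=(state_dim, embed_dim))
true_b = rng.normal(scale=1.0 / np.sqrt(goal_dim), size=(goal_dim, embed_dim))
states = rng.normal(size=(n_samples, state_dim))
goals = rng.normal(size=(n_samples, goal_dim))
targets = np.sum((states @ true_a) * (goals @ true_b), axis=1)

# Learnable linear embeddings for the state stream and the goal stream.
a = 0.1 * rng.normal(size=(state_dim, embed_dim))
b = 0.1 * rng.normal(size=(goal_dim, embed_dim))


def mse(a, b):
    pred = np.sum((states @ a) * (goals @ b), axis=1)
    return np.mean((pred - targets) ** 2)


initial_loss = mse(a, b)
lr = 0.05
for _ in range(2000):
    s_emb = states @ a
    g_emb = goals @ b
    err = np.sum(s_emb * g_emb, axis=1) - targets
    # Gradients of 0.5 * mean(err^2) with respect to both streams.
    grad_a = states.T @ (err[:, None] * g_emb) / n_samples
    grad_b = goals.T @ (err[:, None] * s_emb) / n_samples
    a -= lr * grad_a
    b -= lr * grad_b
final_loss = mse(a, b)
```

Both streams are updated from the same scalar error, which is what "end to end" means here: no intermediate embedding targets are used.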
2.1.2 Two-stage learning
- Stage 1: arrange the values V*_g(s) in a matrix whose rows represent states and whose columns represent goals. Factorize the matrix to obtain the target embeddings Φ̂_s and φ̂_g [Figure 1, right half of the third panel]
- Stage 2: using Φ̂_s and φ̂_g as ground truth, learn Φ(s) and φ(g) [Figure 1, left half of the third panel]
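The two stages can be sketched as below. This is a toy version under stated assumptions: the value table is synthetic and exactly low-rank, a truncated SVD is used as the factorization, and linear least-squares regressors stand in for the two embedding networks. The feature matrices and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_goals, feat_dim, rank = 30, 12, 8, 3

# Toy inputs: feature vectors for states and goals, and a value table
# (rows = states, columns = goals) that is exactly rank 3 by construction.
state_feats = rng.normal(size=(n_states, feat_dim))
goal_feats = rng.normal(size=(n_goals, feat_dim))
hidden_s = rng.normal(size=(feat_dim, rank))
hidden_g = rng.normal(size=(feat_dim, rank))
value_table = (state_feats @ hidden_s) @ (goal_feats @ hidden_g).T

# Stage 1: factorize the table into target embeddings, one per row/column.
u, sing, vt = np.linalg.svd(value_table, full_matrices=False)
target_state_emb = u[:, :rank] * np.sqrt(sing[:rank])
target_goal_emb = vt[:rank].T * np.sqrt(sing[:rank])

# Stage 2: regress from features to the target embeddings
# (least squares here; the notes describe MLPs for this step).
state_map, *_ = np.linalg.lstsq(state_feats, target_state_emb, rcond=None)
goal_map, *_ = np.linalg.lstsq(goal_feats, target_goal_emb, rcond=None)

# Combine the two learned embeddings with a dot product.
predicted = (state_feats @ state_map) @ (goal_feats @ goal_map).T
max_error = np.abs(predicted - value_table).max()
```

Splitting the singular values evenly between the two factors is one arbitrary choice among many; any invertible rescaling of the factors gives the same table.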
2.2 Reinforcement learning of UVFA
In reinforcement learning there is no ground-truth V*_g(s), so the Q-values have to be obtained some other way.
The paper uses a Horde-style architecture to produce the corresponding Q-values. I have not read that paper, but bootstrapping (TD) gives similar results [TD is slightly less stable].

[Note: the paper still does not say exactly how the goals are obtained.]
[By step 10 of the algorithm, the Q-values have been computed and reinforcement learning is no longer involved. The remaining steps are matrix factorization plus training the two embedding networks.]
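To make the "Q-values first, factorization afterwards" split concrete, here is a small tabular sketch that learns one Q-table per goal with TD (Q-learning) updates from a shared stream of random transitions. It is a simplified stand-in, not the Horde architecture or the paper's algorithm: the chain environment, the goal set, the reward of 1 on reaching the goal, and the hyperparameters are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 6, 0.9, 0.5
goals = [0, 3, 5]  # hypothetical goal states on a 6-state chain
# One Q-table per goal; actions: 0 = left, 1 = right.
q = np.zeros((len(goals), n_states, 2))


def step(s, a):
    """Deterministic chain dynamics with walls at both ends."""
    return max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)


for _ in range(5000):
    s = rng.integers(n_states)
    a = rng.integers(2)
    s_next = step(s, a)
    # The same transition updates the value function of every goal.
    for i, g in enumerate(goals):
        if s == g:
            continue  # the goal is terminal for its own value function
        reached = s_next == g
        reward = 1.0 if reached else 0.0
        bootstrap = 0.0 if reached else gamma * q[i, s_next].max()
        q[i, s, a] += alpha * (reward + bootstrap - q[i, s, a])

# Rows = states, columns = goals: the table that would then be factorized.
value_table = q.max(axis=2).T
```

From this point on nothing is specific to reinforcement learning: `value_table` plays the role of the state-by-goal matrix used in the two-stage procedure.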