[Five-Minute Paper] Reinforcement Learning Based on Parameterized Action Spaces
2022-07-26 15:21:00 【Little Mr He】
- Paper title: Reinforcement Learning with Parameterized Actions
What problem does it solve?
Background
In a parameterized action space, each discrete action carries a parameter vector. At every decision step, the agent must decide which discrete action to execute and which parameter values to execute it with.
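To make the idea concrete, here is a minimal sketch of how a parameterized action could be represented. The action names (`kick`, `dash`, `turn`) and their parameter dimensions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ParameterizedAction:
    name: str                  # the discrete action a
    params: Tuple[float, ...]  # the parameter vector x for that action


# Hypothetical action set: each discrete action has its own parameter dimension.
PARAM_DIMS: Dict[str, int] = {"kick": 2, "dash": 1, "turn": 1}


def make_action(name: str, params: Tuple[float, ...]) -> ParameterizedAction:
    """Build a parameterized action, checking the parameter dimension."""
    if name not in PARAM_DIMS:
        raise ValueError(f"unknown discrete action: {name}")
    if len(params) != PARAM_DIMS[name]:
        raise ValueError(
            f"{name} expects {PARAM_DIMS[name]} parameters, got {len(params)}"
        )
    return ParameterizedAction(name, tuple(float(p) for p in params))
```

For example, `make_action("kick", (0.5, 30.0))` pairs the discrete choice `kick` with a two-dimensional parameter vector, while `make_action("dash", (1.0, 2.0))` is rejected because the dimension does not match.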
What method is used?
The paper proposes the Q-PAMDP algorithm, which alternates between learning the discrete-action policy and learning the continuous action parameters. The probability of selecting the parameterized action $(a, x)$ in state $s$ is written $\pi(a, x \mid s)$. The policy over discrete actions is $\pi^{d}(a \mid s)$ and the policy over the parameters of action $a$ is $\pi^{a}(x \mid s)$, so the overall policy factorizes as:
$$\pi(a, x \mid s) = \pi^{d}(a \mid s)\,\pi^{a}(x \mid s)$$
The discrete-action policy is parameterized by $\omega$ and written $\pi_{\omega}^{d}(a \mid s)$. The action-parameter policies are parameterized by $\theta$ and written $\pi_{\theta}^{a}(x \mid s)$, where $\theta = [\theta_{a_{1}}, \cdots, \theta_{a_{k}}]$ collects one parameter block per discrete action.
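The factorization can be illustrated with a small sketch. It assumes, purely for illustration, a softmax over linear preferences for the discrete policy and a Gaussian with a linear mean over a single scalar parameter for each action; the paper does not prescribe these forms, and all function names here are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def discrete_probs(omega, s):
    """pi^d_omega(a|s): softmax over linear preferences omega[a] . s (assumed form)."""
    return softmax([dot(w_a, s) for w_a in omega])


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_action(omega, theta, s, std=0.1, rng=random):
    """Sample a ~ pi^d_omega(.|s), then x ~ pi^a_theta(.|s) for that action."""
    probs = discrete_probs(omega, s)
    a = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    x = rng.gauss(dot(theta[a], s), std)  # scalar parameter, Gaussian (assumed form)
    return a, x


def joint_density(omega, theta, s, a, x, std=0.1):
    """pi(a, x|s) = pi^d(a|s) * pi^a(x|s)."""
    return discrete_probs(omega, s)[a] * gaussian_pdf(x, dot(theta[a], s), std)
```

Note that `theta[a]` is the block $\theta_{a}$ belonging to discrete action $a$, so each action has its own parameter policy.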
There are two ways to optimize these parameters. The first is to optimize $\theta$ and $\omega$ jointly by maximizing the expected value of the start state:
$$J(\theta, \omega) = \mathbb{E}_{s_{0} \sim D}\left[V^{\pi_{\Theta}}(s_{0})\right]$$
Here $D$ is the distribution over start states $s_{0}$, and $\Theta$ denotes the combined parameters $(\theta, \omega)$.
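Since the objective above is an expectation over start states, one common way to approximate it is to average returns over sampled episodes. The sketch below assumes two hypothetical callables, `sample_start_state` and `rollout_return`, standing in for the environment and the current policy; it is a generic Monte Carlo estimator, not something specified in the paper.

```python
import random


def estimate_objective(sample_start_state, rollout_return, n_episodes=1000, rng=None):
    """Monte Carlo estimate of J = E_{s0 ~ D}[V(s0)].

    sample_start_state(rng) -> s0        draws a start state from D (hypothetical)
    rollout_return(s0, rng) -> float     return of one episode from s0 under the
                                         current policy (hypothetical)
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_episodes):
        s0 = sample_start_state(rng)
        total += rollout_return(s0, rng)
    return total / n_episodes
```

The estimate becomes less noisy as `n_episodes` grows, which matters when it is used to compare nearby values of $\theta$.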
The second is to update the two alternately. With $\theta$ fixed, optimize $\omega$:
$$W(\theta) = \arg\max_{\omega} J(\theta, \omega) = \omega_{\theta}^{*}$$
Then, with $\omega$ fixed, optimize $\theta$:
$$\begin{aligned} J_{\omega}(\theta) &= J(\theta, \omega) \\ H(\theta) &= J(\theta, W(\theta)) \end{aligned}$$
That is, $J_{\omega}(\theta)$ is the objective with $\omega$ held fixed, and $H(\theta)$ is the objective when $\omega$ is always re-optimized for the current $\theta$.
The paper gives pseudocode for the algorithm (the figure is not reproduced here).
The authors also provide a theoretical analysis with proofs, which I may add here later if needed.
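The alternating scheme described above can be sketched on a toy problem. This is not the paper's pseudocode: it is a minimal illustration that assumes a single state, a degenerate discrete policy where $\omega$ is simply the index of the chosen action, one scalar parameter per action, and a hypothetical quadratic reward. The $\omega$ step is done by enumeration and the $\theta$ step by plain gradient ascent, standing in for whatever learners the paper uses.

```python
def make_toy_objective(targets, peaks):
    """Hypothetical objective: J(theta, omega) = peak - (theta_omega - target)^2."""
    def J(theta, omega):
        return peaks[omega] - (theta[omega] - targets[omega]) ** 2

    def grad_theta(theta, omega):
        # Only the parameter block of the selected action receives gradient.
        g = [0.0] * len(theta)
        g[omega] = -2.0 * (theta[omega] - targets[omega])
        return g

    return J, grad_theta


def best_omega(J, theta, n_actions):
    """W(theta) = argmax_omega J(theta, omega), computed by enumeration."""
    return max(range(n_actions), key=lambda w: J(theta, w))


def alternate(J, grad_theta, theta, n_actions, k=5, lr=0.1, tol=1e-8, max_rounds=1000):
    """Alternate: refit omega with theta fixed, then k ascent steps on theta."""
    theta = list(theta)
    omega = best_omega(J, theta, n_actions)
    for _ in range(max_rounds):
        old = list(theta)
        for _ in range(k):  # optimize J_omega(theta) with omega held fixed
            g = grad_theta(theta, omega)
            theta = [t + lr * gi for t, gi in zip(theta, g)]
        omega = best_omega(J, theta, n_actions)  # omega = W(theta)
        if max(abs(t - o) for t, o in zip(theta, old)) < tol:
            break
    return theta, omega
```

With `targets=[1.0, -2.0]`, `peaks=[1.0, 3.0]` and a start of `theta=[0.0, 0.0]`, the loop settles on action 0 even though action 1 has the higher peak, because the parameter of an unselected action never receives gradient in this toy. It shows that alternating updates can stop at a local optimum that depends on the starting point.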
What are the results?
Publication and author information?
Reference links
Related papers