Policy Gradient Methods
2022-07-07 00:27:00 [Evergreen AAS]
In the last blog post, we covered how to approximate the value function and the action-value function:
V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a)
Once we have approximated the value function or action-value function, we can derive a control policy from it, for example via ϵ-greedy, as in the sketch below.
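For concreteness, here is a minimal sketch of ϵ-greedy action selection over approximated action values (the Q values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q_s = np.array([0.1, 0.5, 0.2])   # approximated Q_theta(s, .) for one state
print(epsilon_greedy(q_s))        # usually 1, occasionally a random action
```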
Let's briefly review the learning objective of RL: an agent interacts with the environment so as to maximize the cumulative return. Since what we ultimately have to learn is how to interact with the environment, we can also learn the policy directly; the earlier route of first approximating a value function and then controlling with a greedy policy is more of a detour. That is the topic of this article: how to learn a policy directly, which in mathematical form is:
\pi_{\theta}(s, a) = P[a | s, \theta]
This is called the policy gradient (PG) method.
As before, this article is concerned with model-free reinforcement learning.
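As a concrete example of such a parameterized policy, here is a minimal softmax policy sketch with linear action preferences θᵀφ(s, a); the feature matrix φ is a made-up toy example:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(s, a) = exp(theta^T phi(s,a)) / sum_b exp(theta^T phi(s,b)).

    phi_s: (n_actions, n_features) feature matrix for the current state s.
    Returns a probability distribution over the actions.
    """
    prefs = phi_s @ theta
    prefs = prefs - prefs.max()        # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

theta = np.zeros(4)                     # policy parameters
phi_s = np.eye(4)[:3]                   # toy features: 3 actions, 4 features
probs = softmax_policy(theta, phi_s)    # uniform when theta = 0
a = np.random.default_rng(0).choice(len(probs), p=probs)
```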
Value-Based vs. Policy-Based RL
Value-Based:
- Learn a value function
- Implicit policy (e.g., ϵ-greedy)
Policy-Based:
- No value function
- Learn the policy directly
Actor-Critic:
- Learn a value function
- Learn a policy
The relationship between the three: Value-Based and Policy-Based methods are two ends of a spectrum, and Actor-Critic lies in their intersection, learning both a value function and a policy.
Having understood the difference between Value-Based and Policy-Based methods, let's look at the advantages and disadvantages of Policy-Based RL:
Advantages:
- Better convergence properties
- More effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converges to a local rather than the global optimum
- Evaluating a policy is generally inefficient and has high variance
Policy Search
We first define the objective function.
Policy Objective Functions
The goal: given a policy π_θ(s, a) with parameters θ, find the best parameters θ. But how do we evaluate the quality of policies π_θ(s, a) with different parameters?
- For episodic tasks, we can use the start value:
J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]
- For continuing tasks, we can use the average value:
J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)
or the average reward per time-step:
J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a
where d^{\pi_{\theta}}(s) is the stationary distribution of the Markov chain induced by π_θ.
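As a quick numerical check of the last objective, here is a tiny made-up example (two states, two actions) computing J_avR directly from its definition:

```python
import numpy as np

# Made-up quantities for illustration
d  = np.array([0.6, 0.4])                  # stationary distribution d(s)
pi = np.array([[0.7, 0.3], [0.2, 0.8]])    # policy pi_theta(s, a)
R  = np.array([[1.0, 0.0], [0.0, 2.0]])    # expected reward R_s^a

# J_avR(theta) = sum_s d(s) * sum_a pi(s, a) * R(s, a)
J_avR = float(d @ (pi * R).sum(axis=1))
print(J_avR)   # 0.6*(0.7*1.0) + 0.4*(0.8*2.0) = 1.06
```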
Policy Optimisation
With the objective defined, policy-based RL turns into a typical optimization problem: find the θ that maximizes J(θ). There are many optimization methods, including gradient-free algorithms:
- Hill climbing
- Simulated annealing
- Evolutionary algorithms
- …
In general, though, when the gradient of the problem is available, gradient-based optimization methods work better:
- Gradient descent
- Conjugate gradient
- Quasi-Newton methods
- …
In this article we focus on gradient-based methods; since we are maximizing J(θ), each update is a gradient ascent step, as in the sketch below.
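A minimal sketch of gradient ascent on J(θ), here paired with a crude finite-difference gradient estimate (the toy objective J is made up; in RL, evaluating J(θ) would require policy rollouts):

```python
import numpy as np

def finite_difference_grad(J, theta, eps=1e-4):
    """Estimate grad J(theta) coordinate by coordinate; simple but noisy
    and expensive, assuming J(theta) can be evaluated."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        grad[k] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

def gradient_ascent(J, theta, alpha=0.1, steps=200):
    """theta <- theta + alpha * grad J(theta): ascent, since we maximize J."""
    for _ in range(steps):
        theta = theta + alpha * finite_difference_grad(J, theta)
    return theta

# Toy objective with maximum at theta = (1, -2)
J = lambda th: -(th[0] - 1) ** 2 - (th[1] + 2) ** 2
print(gradient_ascent(J, np.zeros(2)))   # -> approximately [1, -2]
```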
Policy Gradient Theorem
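For any differentiable policy π_θ, the policy gradient theorem states (in its standard form) that the gradient of each of the objectives above (start value, average value, average reward per step) takes the same likelihood-ratio form:

\nabla_\theta J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)]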
Monte-Carlo Policy Gradient (REINFORCE)
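REINFORCE replaces Q^{π_θ}(s, a) in the theorem with the sampled return v_t and performs stochastic gradient ascent. Here is a minimal sketch with a softmax policy over linear preferences; the episode and phi interfaces are assumptions for illustration:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()        # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, phi, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a completed episode.

    episode: list of (state, action, reward) tuples.
    phi(s):  (n_actions, n_features) feature matrix for state s.
    The Monte-Carlo return v_t is an unbiased sample of Q(s_t, a_t).
    """
    # Compute the returns v_t backwards through the episode
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Stochastic gradient ascent: theta += alpha * grad log pi(s,a) * v_t
    for (s, a, _), v_t in zip(episode, returns):
        probs = softmax(phi(s) @ theta)
        grad_log_pi = phi(s)[a] - probs @ phi(s)   # softmax score function
        theta = theta + alpha * v_t * grad_log_pi
    return theta
```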
Actor-Critic Policy Gradient
The Monte-Carlo policy gradient has high variance, so instead of estimating the action-value function Q from the return, we use a critic to estimate Q:
Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)
This is the famous Actor-Critic algorithm, which maintains two sets of parameters:
- Critic: updates the action-value function parameters w
- Actor: updates the policy parameters θ in the direction suggested by the Critic
The Actor-Critic algorithm follows an approximate policy gradient:
\nabla_\theta J(\theta)\approx E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)\, Q_w(s, a)]\\ \Delta\theta = \alpha\nabla_\theta\log\pi_{\theta}(s,a)\, Q_w(s,a)
The Critic is essentially performing policy evaluation: how good is policy π_θ for the current parameters θ?
We have already covered policy evaluation with MC, TD, and TD(λ), as well as value function approximation. The plain Actor-Critic algorithm uses the simplest linear form for the Critic's approximation of the action-value function, i.e. Q_w(s, a) = \phi(s, a)^T w; a sketch of the algorithm follows:
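A minimal sketch of one episode of action-value Actor-Critic (QAC) with a linear critic updated by TD(0); the env and phi interfaces and the hyperparameters are assumptions for illustration:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def qac_episode(env, theta, w, phi, alpha=0.01, beta=0.1, gamma=0.99, rng=None):
    """Run one episode of action-value Actor-Critic (QAC).

    Critic: Q_w(s, a) = phi(s, a)^T w, updated by TD(0).
    Actor:  softmax policy over theta^T phi(s, a), updated by the
            approximate policy gradient grad log pi * Q_w.
    Assumed interfaces: env.reset() -> s; env.step(a) -> (s', r, done);
    phi(s) -> (n_actions, n_features) matrix of phi(s, a) rows.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()
    a = rng.choice(len(phi(s)), p=softmax(phi(s) @ theta))
    done = False
    while not done:
        s2, r, done = env.step(a)
        q_sa = phi(s)[a] @ w                   # Critic's estimate Q_w(s, a)
        if done:
            delta = r - q_sa                   # TD error at a terminal step
        else:
            a2 = rng.choice(len(phi(s2)), p=softmax(phi(s2) @ theta))
            delta = r + gamma * (phi(s2)[a2] @ w) - q_sa   # TD(0) error
        # Actor update: step theta in the direction the Critic suggests
        probs = softmax(phi(s) @ theta)
        grad_log_pi = phi(s)[a] - probs @ phi(s)
        theta = theta + alpha * grad_log_pi * q_sa
        # Critic update: TD(0) on the linear weights w
        w = w + beta * delta * phi(s)[a]
        if not done:
            s, a = s2, a2
    return theta, w
```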
In the Actor-Critic algorithm, approximating the action-value function introduces bias, but when the following two conditions are satisfied, the policy gradient is still exact:
- The value function approximator is compatible with the policy, i.e.: \nabla_w Q_w(s,a) = \nabla_\theta\log\pi_{\theta}(s,a)
- The value function parameters w minimize the mean-squared error, i.e.: \epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]
Finally, to summarize the policy gradient algorithms: they all update θ in the direction E_{\pi_{\theta}}[\nabla_\theta \log \pi_{\theta}(s, a)\cdot Q], where REINFORCE plugs in the sampled return v_t for Q, while Actor-Critic plugs in the critic's estimate Q_w(s, a).