Policy Gradient Methods
2022-07-07 00:27:00 【Evergreen AAS】
In the previous post, we went over how to approximate the value function and the action-value function:
V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a)
Once we can approximate the value function or the action-value function, we can do control by deriving a policy from it, for example with ϵ-greedy.
Let's briefly recall the learning objective of RL: an agent interacts with the environment so as to maximize the cumulative return. Since what we ultimately need to learn is how to act in the environment, we can also learn the policy directly; the earlier route of first approximating a value function and then doing control with a greedy strategy is something of a detour. That is the subject of this post: learning the policy directly, which in mathematical form is:
\pi_{\theta}(s, a) = P[a | s, \theta]
This is what the policy gradient (PG) algorithms are built on.
As before, this post is concerned with model-free reinforcement learning.
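To make the idea of a parameterized policy π_θ(s, a) = P[a | s, θ] concrete, here is a minimal sketch of a softmax (Gibbs) policy with linear preferences over state-action features. The feature map `phi`, `N_ACTIONS`, and `N_FEATURES` are illustrative assumptions, not part of the original post.

```python
import numpy as np

N_ACTIONS = 4      # hypothetical discrete action set {0, 1, 2, 3}
N_FEATURES = 8     # hypothetical feature dimension

def phi(s, a):
    """Hypothetical state-action features (a fixed random projection)."""
    rng = np.random.default_rng(hash((s, a)) % (2**32))
    return rng.normal(size=N_FEATURES)

def pi(theta, s):
    """Softmax policy: pi_theta(a | s) proportional to exp(theta . phi(s, a))."""
    prefs = np.array([theta @ phi(s, a) for a in range(N_ACTIONS)])
    prefs -= prefs.max()                 # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

theta = np.zeros(N_FEATURES)             # policy parameters to be learned
action = np.random.choice(N_ACTIONS, p=pi(theta, s=0))   # sample a ~ pi_theta(.|s)
```

With all-zero parameters the policy is uniform; learning amounts to adjusting θ so that better actions receive higher preference.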
Value-Based vs. Policy-Based RL
Value-Based:
- Learns a value function
- Implicit policy (e.g. ϵ-greedy)
Policy-Based:
- No value function
- Learns the policy directly
Actor-Critic:
- Learns a value function
- Learns a policy
The relationship between the three can be pictured as overlapping families of methods, with Actor-Critic sitting in the intersection of the value-based and policy-based families (the figure from the original post is omitted here).
With the difference between Value-Based and Policy-Based clear, let's look at the advantages and disadvantages of Policy-Based RL:
Advantages:
- Better convergence properties
- More effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converges to a local rather than the global optimum
- Evaluating a policy is usually inefficient and has high variance
Policy Search
We first define the objective function.
Policy Objective Functions
The goal: given a policy π_θ(s, a) with parameters θ, find the best parameters θ. But how do we measure how good a policy π_θ(s, a) with a particular θ is?
- For episodic tasks, we can use the start value:
J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]
- For continuing tasks, we can use the average value:
J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)
Or the average reward per time-step:
J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a
where d^{\pi_{\theta}}(s) is the stationary distribution of the Markov chain induced by π_θ. (A Monte Carlo estimate of the start value is sketched below.)
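As a concrete example of the start-value objective, the sketch below estimates J_1(θ) = E_{π_θ}[v_1] by rolling out the policy from the start state and averaging the sampled returns. It reuses `pi` and `N_ACTIONS` from the sketch above and assumes a hypothetical episodic `env` whose `reset()` returns the start state and whose `step(a)` returns `(next_state, reward, done)`; this interface is an assumption, not something defined in the original post.

```python
def estimate_start_value(env, theta, n_episodes=100, gamma=1.0):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_1) = E[v_1]."""
    returns = []
    for _ in range(n_episodes):
        s, done, g, discount = env.reset(), False, 0.0, 1.0
        while not done:
            a = np.random.choice(N_ACTIONS, p=pi(theta, s))
            s, r, done = env.step(a)     # assumed environment interface
            g += discount * r            # accumulate the discounted return
            discount *= gamma
        returns.append(g)
    return float(np.mean(returns))
```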
Policy Optimisation
With the objective defined, policy-based RL reduces to a standard optimization problem: find the θ that maximizes J(θ). Many optimization methods are available, for example gradient-free algorithms:
- Hill climbing
- Simulated annealing
- Evolutionary algorithms
- …
In general, however, if the gradient of the objective is available, gradient-based methods tend to be more effective:
- Gradient descent
- Conjugate gradient
- Quasi-Newton methods
- …
In this post we focus on the gradient-based approach.
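Whatever estimator of the gradient is used, the gradient-based approach amounts to repeated ascent steps θ ← θ + α ∇_θ J(θ). A minimal sketch, where `grad_J` is a placeholder for any estimator of the policy gradient (such as the score-function estimators discussed below):

```python
def gradient_ascent(theta, grad_J, alpha=0.01, n_steps=1000):
    """Plain gradient ascent on J(theta).

    grad_J(theta) is assumed to return an estimate of the policy gradient."""
    for _ in range(n_steps):
        theta = theta + alpha * grad_J(theta)   # ascend, since we maximize J
    return theta
```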
Policy Gradient Theorem
Monte-Carlo Policy Gradient (REINFORCE)
Actor-Critic Policy Gradient
The Monte-Carlo policy gradient has high variance, so instead of using the return to estimate the action-value function Q, we use a critic to estimate Q:
Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)
This is the well-known Actor-Critic algorithm, which maintains two sets of parameters:
- Critic: updates the action-value function parameters w
- Actor: updates the policy parameters θ in the direction suggested by the critic
The Actor-Critic algorithm follows an approximate policy gradient:
\triangledown_\theta J(\theta)\approx E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)Q_w(s, a)]\\ \Delta\theta = \alpha\triangledown_\theta\log\pi_{\theta}(s,a)Q_w(s,a)
The critic is essentially performing policy evaluation: how good is policy π_θ for the current parameters θ?
Policy evaluation was covered earlier with MC, TD, and TD(λ), as well as value function approximation. In the plain Actor-Critic algorithm, the critic approximates the action-value function with the simplest linear form, namely Q_w(s, a) = \phi(s, a)^T w. The pseudocode is as follows:
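The pseudocode in the original post is an image that is not reproduced here. Below is a minimal sketch of the algorithm as described in the text: a softmax actor updated with Δθ = α ∇_θ log π_θ(s, a) Q_w(s, a) and a linear critic Q_w(s, a) = φ(s, a)^T w updated with TD(0). It reuses `phi`, `pi`, and `N_ACTIONS` from the earlier sketch and the same assumed `env` interface; the step sizes and the TD(0) critic update are standard choices, not details taken from the original pseudocode.

```python
def grad_log_pi(theta, s, a):
    """Score function of the softmax policy:
    grad_theta log pi_theta(a|s) = phi(s, a) - sum_b pi_theta(b|s) phi(s, b)."""
    probs = pi(theta, s)
    return phi(s, a) - sum(probs[b] * phi(s, b) for b in range(N_ACTIONS))

def qac(env, theta, w, alpha=0.01, beta=0.1, gamma=0.99, n_steps=10_000):
    """Action-value Actor-Critic with a linear critic Q_w(s, a) = phi(s, a)^T w."""
    Q = lambda s, a: phi(s, a) @ w                # always reads the current w
    s = env.reset()
    a = np.random.choice(N_ACTIONS, p=pi(theta, s))
    for _ in range(n_steps):
        s2, r, done = env.step(a)                 # assumed environment interface
        a2 = np.random.choice(N_ACTIONS, p=pi(theta, s2))
        # critic target: one-step TD(0) error
        td_error = r + (0.0 if done else gamma * Q(s2, a2)) - Q(s, a)
        # actor: step along the approximate policy gradient
        theta = theta + alpha * grad_log_pi(theta, s, a) * Q(s, a)
        # critic: TD(0) update of the linear weights
        w = w + beta * td_error * phi(s, a)
        if done:
            s = env.reset()
            a = np.random.choice(N_ACTIONS, p=pi(theta, s))
        else:
            s, a = s2, a2
    return theta, w
```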
In the Actor-Critic algorithm, approximating the action-value function introduces bias into the policy gradient, but when the following two conditions are satisfied, the policy gradient is exact (a short derivation follows the list):
- The value function approximator is compatible with the policy, i.e. \triangledown_w Q_w(s,a) = \triangledown_\theta\log\pi_{\theta}(s,a)
- The value function parameters w minimize the mean-squared error \epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]
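A short derivation (not spelled out in the original post) of why these two conditions make the approximate gradient exact: if w minimizes ε, then ∇_w ε = 0, and substituting the compatibility condition gives
\triangledown_w \epsilon = -2E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s,a) - Q_w(s,a))\triangledown_w Q_w(s,a)] = 0 \\ \Rightarrow E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s,a) - Q_w(s,a))\triangledown_\theta\log\pi_{\theta}(s,a)] = 0 \\ \Rightarrow E_{\pi_{\theta}}[Q_w(s,a)\triangledown_\theta\log\pi_{\theta}(s,a)] = E_{\pi_{\theta}}[Q^{\pi_{\theta}}(s,a)\triangledown_\theta\log\pi_{\theta}(s,a)] = \triangledown_\theta J(\theta)
so replacing the true Q^{π_θ} with the compatible approximation Q_w does not change the expectation that defines the gradient.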
Finally, we summarize the policy gradient algorithms:
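The summary slide of the original post is not reproduced here; restating the two gradient forms discussed above, both instances of the same score-function expression:
\triangledown_\theta J(\theta) = E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)\,v_t] \quad \text{(REINFORCE: } v_t \text{ is the sampled return)} \\ \triangledown_\theta J(\theta) \approx E_{\pi_{\theta}}[\triangledown_\theta\log\pi_{\theta}(s,a)\,Q_w(s,a)] \quad \text{(Actor-Critic: a learned critic replaces the return)}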