Policy Gradient Methods
2022-07-07 00:27:00 【Evergreen AAS】
In the previous post, we looked at how to approximate the value function and the action-value function:
V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a)
Once we have approximated the value function or action-value function, we can derive control from it with a policy such as ϵ-greedy.
Let's briefly review the learning objective of RL: an agent interacts with the environment to maximize cumulative return. Since acting well is the end goal, why not learn the policy directly? The earlier approach of approximating a value function and then controlling with a greedy policy is something of a detour. That is the topic of this article: learning the policy directly, which in mathematical form is:
\pi_{\theta}(s, a) = P[a | s, \theta]
This is called the Policy Gradient (PG) approach.

As before, this article is concerned with model-free reinforcement learning.
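As a concrete illustration, π_θ(s, a) is often parameterized as a softmax over linear features. A minimal sketch (the feature function and action set here are illustrative assumptions, not from the original post):

```python
import math

def softmax_policy(theta, phi, state, actions):
    """pi_theta(a | s) proportional to exp(theta . phi(s, a))."""
    prefs = [sum(t * f for t, f in zip(theta, phi(state, a))) for a in actions]
    m = max(prefs)                      # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]        # probability of each action

# Hypothetical one-hot features over two actions
phi = lambda s, a: [1.0 if a == i else 0.0 for i in range(2)]
probs = softmax_policy([0.0, 0.0], phi, state=0, actions=[0, 1])
```

With θ = 0 the policy is uniform; training shifts probability mass toward high-return actions.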
Value-Based vs. Policy-Based RL
Value-Based:
- Learns a value function
- Implicit policy, e.g. ϵ-greedy
Policy-Based:
- No value function
- Learns the policy directly
Actor-Critic:
- Learns a value function
- Learns a policy
The relationship among the three: Actor-Critic sits at the intersection of Value-Based and Policy-Based, learning both a value function and a policy.
With the difference between Value-Based and Policy-Based clear, let's look at the advantages and disadvantages of Policy-Based RL:
Advantages:
- Better convergence properties
- More effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converges to a local rather than the global optimum
- Evaluating a policy is generally inefficient and high-variance
Policy Search
We first need to define the objective function.
Policy Objective Functions
The goal: given a policy π_θ(s, a) with parameters θ, find the best θ. But how do we measure how good a policy π_θ(s, a) is under different parameters?
- For episodic tasks, we can use the start value:
J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]
- For continuing tasks, we can use the average value:
J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)
or the average reward per time step:
J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a
where d^{\pi_{\theta}}(s) is the stationary distribution of the Markov chain under π_θ.
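To make the average-reward objective concrete, here is a small sketch that power-iterates a transition matrix to approximate d^{π_θ} and then evaluates J_avR (the two-state MDP below is a made-up example):

```python
def stationary_distribution(P, iters=500):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]
    return d

def avg_reward(d, pi, R):
    """J_avR(theta) = sum_s d(s) * sum_a pi(s, a) * R(s, a)."""
    return sum(d[s] * sum(pi[s][a] * R[s][a] for a in range(len(pi[s])))
               for s in range(len(d)))

# Made-up two-state MDP: policy-induced transitions, policy, rewards
P = [[0.9, 0.1], [0.1, 0.9]]     # symmetric, so d = [0.5, 0.5]
pi = [[1.0, 0.0], [0.0, 1.0]]    # deterministic policy
R = [[1.0, 0.0], [0.0, 2.0]]     # reward for each (s, a)
d = stationary_distribution(P)
J = avg_reward(d, pi, R)          # 0.5*1.0 + 0.5*2.0 = 1.5
```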
Policy Optimisation
With the objective in place, policy-based RL is a typical optimization problem: find the θ that maximizes J(θ). Many optimization methods apply, including gradient-free algorithms:
- Hill climbing
- Simulated annealing
- Evolutionary algorithms
- …
In general, though, when the gradient is available, gradient-based methods are more efficient:
- Gradient descent
- Conjugate gradient
- Quasi-Newton methods
- …
In this article we focus on the gradient descent approach.
Policy Gradient Theorem
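The body of this section is not reproduced in the source; for completeness, the policy gradient theorem, as given in standard treatments, states that for any differentiable policy π_θ and any of the objectives J_1, J_avV, J_avR above, the policy gradient is:

\nabla_\theta J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)]

This is the expression that REINFORCE and Actor-Critic below approximate, by estimating Q^{π_θ} with the sampled return or with a critic respectively.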
Monte-Carlo Policy Gradient (REINFORCE)
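The REINFORCE pseudocode is not reproduced in the source; the following is a minimal sketch of the Monte-Carlo update Δθ = α ∇_θ log π_θ(s_t, a_t) G_t with a softmax-in-linear-features policy (the feature function, episode format, and hyperparameters are illustrative assumptions):

```python
import math

def grad_log_softmax(theta, phi, s, a, actions):
    """grad_theta log pi(a|s) for a softmax policy over linear features:
    phi(s, a) - sum_b pi(b|s) * phi(s, b)."""
    prefs = [sum(t * f for t, f in zip(theta, phi(s, b))) for b in actions]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    probs = [e / z for e in exps]
    expected = [sum(probs[i] * phi(s, actions[i])[k] for i in range(len(actions)))
                for k in range(len(theta))]
    return [fa - ex for fa, ex in zip(phi(s, a), expected)]

def reinforce(theta, episode, alpha, gamma, phi, actions):
    """One pass of REINFORCE over a finished episode [(s, a, r), ...]."""
    G = 0.0
    returns = []
    for _, _, r in reversed(episode):      # compute returns G_t backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(episode, returns):
        g = grad_log_softmax(theta, phi, s, a, actions)
        theta = [t + alpha * G_t * gi for t, gi in zip(theta, g)]
    return theta

# Illustrative run: one-hot action features, one-step episode rewarding action 0
phi = lambda s, a: [1.0 if a == i else 0.0 for i in range(2)]
theta = reinforce([0.0, 0.0], [(0, 0, 1.0)], alpha=0.1, gamma=1.0,
                  phi=phi, actions=[0, 1])
```

After the update, θ has moved to make the rewarded action more probable; because G_t is a single sampled return, these updates are unbiased but high-variance, which motivates the critic below.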
Actor-Critic Policy Gradient
The Monte-Carlo policy gradient has high variance, so instead of using the return to estimate the action-value function Q, we use a critic to estimate Q:
Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)
This is the famous Actor-Critic algorithm, which maintains two sets of parameters:
- Critic: updates the action-value function parameters w
- Actor: updates the policy parameters θ in the direction suggested by the critic
Actor-Critic is an approximate policy gradient algorithm:
\nabla_\theta J(\theta)\approx E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)Q_w(s, a)]\\ \Delta\theta = \alpha\nabla_\theta\log\pi_{\theta}(s,a)Q_w(s,a)
The critic is, in essence, performing policy evaluation: how good is policy π_θ for the current parameters θ? We have already covered policy evaluation with MC, TD, and TD(λ), as well as value function approximation. In the plain Actor-Critic algorithm, the critic approximates the action-value function with the simplest linear form, i.e. Q_w(s, a) = \phi(s, a)^T w. The specific pseudocode is as follows:
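The pseudocode referenced above is missing from the source; here is a sketch of a single QAC update with the linear critic Q_w(s, a) = φ(s, a)ᵀw (the SARSA-style critic target, feature function, and learning rates are illustrative assumptions):

```python
import math

def qac_update(theta, w, trans, phi, actions, alpha, beta, gamma):
    """One Actor-Critic step on transition (s, a, r, s2, a2).

    Critic: TD(0) update of the linear Q_w(s, a) = phi(s, a) . w
    Actor:  theta += alpha * grad log pi(a|s) * Q_w(s, a)
    """
    s, a, r, s2, a2 = trans
    q_sa = sum(wi * fi for wi, fi in zip(w, phi(s, a)))
    q_next = sum(wi * fi for wi, fi in zip(w, phi(s2, a2)))
    delta = r + gamma * q_next - q_sa          # TD error for the critic
    # softmax actor: grad log pi(a|s) = phi(s, a) - E_b[phi(s, b)]
    prefs = [sum(t * f for t, f in zip(theta, phi(s, b))) for b in actions]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    probs = [e / sum(exps) for e in exps]
    expected = [sum(probs[i] * phi(s, actions[i])[k] for i in range(len(actions)))
                for k in range(len(theta))]
    grad_log = [fa - ex for fa, ex in zip(phi(s, a), expected)]
    theta = [t + alpha * g * q_sa for t, g in zip(theta, grad_log)]   # actor step
    w = [wi + beta * delta * fi for wi, fi in zip(w, phi(s, a))]      # critic step
    return theta, w

# Illustrative one-hot features over two actions (state ignored)
phi = lambda s, a: [1.0 if a == i else 0.0 for i in range(2)]
theta, w = qac_update([0.0, 0.0], [0.0, 0.0], (0, 0, 1.0, 0, 1),
                      phi, [0, 1], alpha=0.1, beta=0.1, gamma=0.9)
```

The full algorithm simply repeats this step along a trajectory, sampling a2 from the current policy at each state; note the actor update uses Q_w rather than the sampled return, trading a little bias for much lower variance.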
In Actor-Critic, the policy gradient is estimated, which introduces bias. However, when the following two conditions are satisfied, the policy gradient is exact:
- The value function approximator is compatible with the policy, i.e.: \nabla_w Q_w(s,a) = \nabla_\theta\log\pi_{\theta}(s,a)
- The value function parameters w minimize the mean-squared error, i.e.: \epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]
Finally, to summarize the policy gradient algorithms: they all share the update form Δθ = α∇_θ log π_θ(s, a)·Q, where Q is estimated by the sampled return (REINFORCE) or by a critic Q_w (Actor-Critic).