Policy Gradient Methods
2022-07-07 00:27:00 【Evergreen AAS】
In the last post, we looked at how to approximate the value function or the action-value function:
V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a)
Once we have approximated the value function or action-value function, we can derive a control policy from it, for example with ϵ-greedy.
Let's briefly recall the learning objective of RL: an agent interacts with the environment so as to maximize the cumulative return. Since interacting well with the environment is what we ultimately want to learn, approximating a value function first and then controlling with a greedy policy is a roundabout route. The topic of this article is how to learn the policy directly, which in mathematical form is:
\pi_{\theta}(s, a) = P[a | s, \theta]
This is known as the policy gradient (PG) approach.
As before, this article is concerned with model-free reinforcement learning.
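As a concrete illustration of such a parameterised policy, here is a minimal sketch of a softmax policy over a discrete action set with a linear feature map; the feature function `phi` and the `actions` list are assumptions made for the example, not part of the original post.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) = exp(phi(s, a)^T theta) / sum_b exp(phi(s, b)^T theta)."""
    prefs = np.array([phi(s, a) @ theta for a in actions])
    prefs -= prefs.max()                       # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def sample_action(theta, phi, s, actions, rng=None):
    """Draw an action a ~ pi_theta(. | s) and return it with the action probabilities."""
    rng = rng or np.random.default_rng()
    probs = softmax_policy(theta, phi, s, actions)
    idx = rng.choice(len(actions), p=probs)
    return actions[idx], probs
```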
Value-Based vs. Policy-Based RL
Value-Based:
- Learn a value function
- Implicit policy, e.g. ϵ-greedy
Policy-Based:
- No value function
- Learn the policy directly
Actor-Critic:
- Learn a value function
- Learn a policy
In other words, Actor-Critic methods sit at the intersection of the two: they learn both a value function and a policy.
Having made clear the difference between Value-Based and Policy-Based RL, let's look at the advantages and disadvantages of Policy-Based RL:
Advantages:
- Better convergence properties
- More effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Usually converges to a local rather than a global optimum
- Evaluating a policy is typically inefficient and high-variance
Policy Search
We first define the objective function.
Policy Objective Functions
The goal: given a policy π_θ(s, a) with parameters θ, find the best parameters θ. But how do we measure how good a policy π_θ(s, a) is for different values of θ?
- For episodic tasks, we can use the start value:
J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]
- For continuing tasks, we can use the average value:
J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)
or the average reward per time-step:
J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a
where d^{\pi_{\theta}}(s) is the stationary distribution of the Markov chain induced by π_θ.
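For intuition, the start value J_1(θ) can be estimated simply by rolling out the policy from the start state and averaging the discounted returns. The sketch below assumes a simplified environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and a `policy` callable; these are illustrative assumptions.

```python
def estimate_start_value(env, policy, n_episodes=100, gamma=0.99):
    """Monte-Carlo estimate of J_1(theta) = E_pi[v_1]: the average discounted
    return obtained when following the policy from the start state."""
    total = 0.0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        discount, episode_return = 1.0, 0.0
        while not done:
            a = policy(s)                      # a ~ pi_theta(. | s)
            s, r, done = env.step(a)
            episode_return += discount * r
            discount *= gamma
        total += episode_return
    return total / n_episodes
```

The large number of rollouts needed for a stable estimate is one reason policy evaluation is slow and high-variance, as noted above.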
Policy Optimisation
With the objective defined, policy-based RL becomes a typical optimization problem: find θ that maximizes J(θ). Many optimization methods apply, for example gradient-free algorithms:
- Hill climbing
- Simulated annealing
- Evolutionary algorithms
- …
In general, however, if the gradient is available, gradient-based optimization tends to work better:
- Gradient descent
- Conjugate gradient
- Quasi-Newton methods
- …
In this article we focus on gradient-based methods.
Policy Gradient Theorem
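For any differentiable policy π_θ(s, a) and any of the objective functions defined above (J_1, J_{avR}, or \frac{1}{1-\gamma}J_{avV}), the policy gradient theorem states that the gradient takes the likelihood-ratio (score-function) form:
\nabla_\theta J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)]
The parameters are then improved by ascending this gradient, \Delta\theta = \alpha\nabla_\theta J(\theta), where α is a step-size.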
Monte-Carlo Policy Gradient (REINFORCE)
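The Monte-Carlo policy gradient (REINFORCE) algorithm uses the sampled return v_t as an unbiased estimate of Q^{\pi_{\theta}}(s_t, a_t) and updates the parameters by stochastic gradient ascent after each episode:
\Delta\theta_t = \alpha\nabla_\theta\log\pi_{\theta}(s_t, a_t)\, v_t
Below is a minimal sketch of one REINFORCE episode using the linear softmax policy from earlier, for which \nabla_\theta\log\pi_{\theta}(s, a) = \phi(s, a) - \sum_b\pi_{\theta}(s, b)\phi(s, b); the environment interface is again an illustrative assumption.

```python
def grad_log_softmax(theta, phi, s, a, actions):
    """grad_theta log pi_theta(s, a) for a linear softmax policy:
    phi(s, a) - sum_b pi_theta(s, b) phi(s, b)."""
    probs = softmax_policy(theta, phi, s, actions)
    return phi(s, a) - sum(p * phi(s, b) for p, b in zip(probs, actions))

def reinforce_episode(env, theta, phi, actions, alpha=0.01, gamma=0.99):
    """Generate one episode with pi_theta, then update theta using the returns v_t."""
    trajectory, s, done = [], env.reset(), False
    while not done:
        a, _ = sample_action(theta, phi, s, actions)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r))
        s = s_next
    v = 0.0
    for s, a, r in reversed(trajectory):       # accumulate returns from the end
        v = r + gamma * v
        theta = theta + alpha * grad_log_softmax(theta, phi, s, a, actions) * v
    return theta
```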
Actor-Critic Policy Gradient
The Monte-Carlo policy gradient has high variance, so instead of using the return to estimate the action-value function Q, we use a critic to estimate it:
Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)
This is the well-known Actor-Critic algorithm, which maintains two sets of parameters:
- Critic: updates the action-value function parameters w
- Actor: updates the policy parameters θ in the direction suggested by the critic
Actor-Critic algorithms follow an approximate policy gradient:
\nabla_\theta J(\theta)\approx E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)Q_w(s, a)]\\ \Delta\theta = \alpha\nabla_\theta\log\pi_{\theta}(s,a)Q_w(s,a)
The critic is essentially performing policy evaluation: how good is policy π_θ for the current parameters θ?
We have already covered policy evaluation with MC, TD, TD(λ), and value function approximation. In the basic Actor-Critic algorithm, the critic approximates the action-value function with the simplest linear form, i.e. Q_w(s, a) = \phi(s, a)^T w, and the algorithm is sketched below:
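A minimal sketch of this action-value Actor-Critic (QAC) loop, reusing the softmax policy and score function from the earlier snippets; the critic uses the linear form Q_w(s, a) = \phi(s, a)^T w and is updated by TD(0), while the environment interface is once more an illustrative assumption.

```python
def qac_episode(env, theta, w, phi, actions, alpha=0.01, beta=0.1, gamma=0.99):
    """One episode of the basic Actor-Critic (QAC) algorithm:
    critic: Q_w(s, a) = phi(s, a)^T w, updated by TD(0);
    actor:  theta updated in the direction grad log pi_theta(s, a) * Q_w(s, a)."""
    s, done = env.reset(), False
    a, _ = sample_action(theta, phi, s, actions)
    while not done:
        s_next, r, done = env.step(a)
        q_sa = phi(s, a) @ w
        if done:
            a_next, td_error = None, r - q_sa
        else:
            a_next, _ = sample_action(theta, phi, s_next, actions)
            td_error = r + gamma * (phi(s_next, a_next) @ w) - q_sa
        theta = theta + alpha * grad_log_softmax(theta, phi, s, a, actions) * q_sa
        w = w + beta * td_error * phi(s, a)
        s, a = s_next, a_next
    return theta, w
```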
In the Actor-Critic algorithm the action-value function is only approximated, which in general introduces bias into the policy gradient. However, when the following two conditions are satisfied, the policy gradient is still exact:
- The value function approximator is compatible with the policy, i.e.: \nabla_w Q_w(s,a) = \nabla_\theta\log\pi_{\theta}(s,a)
- The value function parameters w minimize the mean-squared error: \epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]
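To see why, note that if w minimizes ε then \nabla_w\epsilon = 0, and substituting the compatibility condition shows the critic introduces no bias:
\nabla_w \epsilon = -2E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))\nabla_w Q_w(s,a)] = 0 \\ \Rightarrow E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))\nabla_\theta\log\pi_{\theta}(s,a)] = 0 \\ \Rightarrow E_{\pi_{\theta}}[\nabla_\theta\log\pi_{\theta}(s,a)Q_w(s,a)] = E_{\pi_{\theta}}[\nabla_\theta\log\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)] = \nabla_\theta J(\theta)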
Finally, we summarize the policy gradient algorithms covered in this article:
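Both follow the score-function form given by the policy gradient theorem and differ only in how the action value is estimated:
\nabla_\theta J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s, a)\, v_t] \qquad \text{(REINFORCE: Monte-Carlo return)} \\ \nabla_\theta J(\theta) \approx E_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s, a)\, Q_w(s, a)] \qquad \text{(Actor-Critic: critic estimate)}
In each case the parameters are updated by stochastic gradient ascent on the corresponding estimate, \Delta\theta = \alpha\nabla_\theta\log\pi_{\theta}(s,a) multiplied by that action-value estimate.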