Policy Gradient Methods
2022-07-07 00:27:00 【Evergreen AAS】
In the previous post, we worked out how to approximate the value function or the action-value function:
V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a)
Once we have approximated the value function or action-value function, we can derive a control policy from it, for example with ϵ-greedy.
Let's briefly review the learning objective of RL: an agent interacts with the environment to maximize the cumulative return. Since in the end we need to learn how to act in the environment, why not learn the policy directly? Approximating a value function first and then controlling with a greedy strategy is more of a roundabout route. That is the topic of this post: how to learn the policy directly. In mathematical form:
\pi_{\theta}(s, a) = P[a | s, \theta]
This is called the Policy Gradient (PG) approach.
As before, this post is concerned with model-free reinforcement learning.
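To make the parameterization π_θ(s, a) concrete, here is a minimal sketch of a softmax policy with linear features; the feature function `phi` and the discrete action set are hypothetical, introduced only for illustration (a Gaussian policy is the usual choice for continuous actions).

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) = exp(theta . phi(s, a)) / sum_b exp(theta . phi(s, b)).
    `phi(s, a)` is an assumed feature function returning a vector the same
    shape as theta; `actions` is a list of discrete actions."""
    prefs = np.array([theta @ phi(s, a) for a in actions])
    prefs -= prefs.max()            # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()      # probability distribution over `actions`
```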
Value-Based vs. Policy-Based RL
Value-Based:
- Learns a value function
- Implicit policy, e.g. ϵ-greedy
Policy-Based:
- No value function
- Learns the policy directly
Actor-Critic:
- Learns a value function
- Learns a policy
The relationship between the three: Actor-Critic sits at the intersection of Value-Based and Policy-Based methods, since it learns both a value function and a policy.
With the difference between Value-Based and Policy-Based methods clear, let's look at the advantages and disadvantages of Policy-Based RL:
Advantages:
- Better convergence properties
- More effective for problems with high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Usually converges to a local optimum rather than the global optimum
- Evaluating a policy is generally inefficient and has high variance
Policy Search
We first define the objective function.
Policy Objective Functions
The goal: given a policy \pi_{\theta}(s, a) with parameters θ, find the best parameters θ. But how do we evaluate how good a policy \pi_{\theta}(s, a) is for different values of the parameters?
- For episodic tasks, we can use the start value:
J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]
- For continuing tasks, we can use the average value:
J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)
Or the average reward per time step:
J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a
where d^{\pi_{\theta}}(s) is the stationary distribution of the Markov chain under \pi_{\theta}.
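To make J_{avR} concrete, here is a toy calculation on a hypothetical two-state, two-action MDP; the stationary distribution, policy, and rewards are made-up numbers purely for illustration.

```python
import numpy as np

# Hypothetical two-state, two-action MDP (all numbers made up for illustration).
d  = np.array([0.6, 0.4])            # stationary distribution d^{pi_theta}(s)
pi = np.array([[0.7, 0.3],           # pi_theta(s, a): rows are states, columns are actions
               [0.2, 0.8]])
R  = np.array([[1.0, 0.0],           # R_s^a: expected reward for taking a in s
               [0.5, 2.0]])

# J_avR(theta) = sum_s d(s) * sum_a pi_theta(s, a) * R_s^a
J_avR = float(np.sum(d[:, None] * pi * R))
print(J_avR)  # 0.6*(0.7*1.0 + 0.3*0.0) + 0.4*(0.2*0.5 + 0.8*2.0) = 1.1
```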
Policy Optimisation
With the objective defined, policy-based RL becomes a typical optimization problem: find θ to maximize J(θ). There are many optimization methods, for example gradient-free algorithms:
- Hill climbing
- Simulated annealing
- Evolutionary algorithms
- …
In general, however, if the gradient of the problem is available, gradient-based optimization methods work better:
- Gradient descent
- Conjugate gradient
- Quasi-Newton methods
- …
In this post, we focus on the gradient-based approach.
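Since the objective is to maximize J(θ), the gradient-based update is a gradient ascent step on the policy parameters, with step size α:

\Delta\theta = \alpha\triangledown_{\theta}J(\theta)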
Policy Gradient Theorem
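For reference, the theorem states that for the objective functions defined above (start value, average value, or average reward per step) the policy gradient takes the same form (up to a constant factor):

\triangledown_\theta J(\theta) = E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)\,Q^{\pi_{\theta}}(s, a)]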
Monte-Carlo Policy Gradient (REINFORCE)
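REINFORCE uses the sampled return v_t from step t in place of Q^{\pi_{\theta}}(s_t, a_t), applying Δθ = α ∇_θ log π_θ(s_t, a_t) v_t at every step of an episode. Below is a minimal sketch, reusing the `softmax_policy` helper from the earlier snippet and assuming a hypothetical environment interface where `env.step(a)` returns `(next_state, reward, done)`; none of these names come from the original post.

```python
import numpy as np

def grad_log_softmax(theta, phi, s, a, actions):
    """grad_theta log pi_theta(s, a) for the linear softmax policy:
    phi(s, a) - sum_b pi_theta(s, b) * phi(s, b)."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_phi

def reinforce_episode(env, theta, phi, actions, alpha=0.01, gamma=1.0):
    """Run one episode, then update theta with
    Delta_theta = alpha * grad log pi_theta(s_t, a_t) * v_t."""
    states, acts, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                      # sample a full trajectory
        probs = softmax_policy(theta, phi, s, actions)
        a = actions[np.random.choice(len(actions), p=probs)]
        s_next, r, done = env.step(a)                    # assumed environment interface
        states.append(s); acts.append(a); rewards.append(r)
        s = s_next
    v = 0.0
    for t in reversed(range(len(rewards))):              # returns accumulated backwards
        v = rewards[t] + gamma * v
        theta = theta + alpha * v * grad_log_softmax(theta, phi, states[t], acts[t], actions)
    return theta
```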
Actor-Critic Policy Gradient
The Monte-Carlo policy gradient has high variance, so instead of using the return to estimate the action-value function Q, we use a critic to estimate Q:
Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)
This is the well-known Actor-Critic algorithm, which has two sets of parameters:
- Critic: updates the action-value function parameters w
- Actor: updates the policy parameters θ in the direction suggested by the critic
The Actor-Critic algorithm follows an approximate policy gradient:
\triangledown_\theta J(\theta)\approx E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)Q_w(s, a)]\\ \Delta\theta = \alpha\triangledown_\theta\log\pi_{\theta}(s,a)Q_w(s,a)
The critic is essentially performing policy evaluation: how good is policy π_θ for the current parameters θ?
We have already covered policy evaluation with MC, TD, TD(λ), and value function approximation. In the simplest Actor-Critic algorithm, the critic approximates the action-value function with the simplest possible linear function, i.e. Q_w(s, a) = \phi(s, a)^T w. The specific pseudocode is as follows:
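In place of the original pseudocode, here is a minimal sketch of this action-value Actor-Critic with a linear critic Q_w(s, a) = \phi(s, a)^T w updated by TD(0), reusing the `softmax_policy` and `grad_log_softmax` helpers sketched above; the environment interface and step sizes are assumptions for illustration.

```python
import numpy as np

def actor_critic_episode(env, theta, w, phi, actions, alpha=0.01, beta=0.1, gamma=1.0):
    """One episode of action-value Actor-Critic with a linear critic
    Q_w(s, a) = phi(s, a) . w and a TD(0) (SARSA-style) critic target.
    Actor step:  theta += alpha * grad log pi_theta(s, a) * Q_w(s, a)
    Critic step: w     += beta  * td_error * phi(s, a)"""
    s, done = env.reset(), False
    probs = softmax_policy(theta, phi, s, actions)
    a = actions[np.random.choice(len(actions), p=probs)]
    while not done:
        s_next, r, done = env.step(a)                    # assumed environment interface
        q_sa = phi(s, a) @ w                             # critic's estimate Q_w(s, a)
        if done:
            td_error, a_next = r - q_sa, None
        else:
            probs = softmax_policy(theta, phi, s_next, actions)
            a_next = actions[np.random.choice(len(actions), p=probs)]
            td_error = r + gamma * (phi(s_next, a_next) @ w) - q_sa
        theta = theta + alpha * q_sa * grad_log_softmax(theta, phi, s, a, actions)  # actor
        w = w + beta * td_error * phi(s, a)                                         # critic
        s, a = s_next, a_next
    return theta, w
```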
In the Actor-Critic algorithm, the action value is only estimated, which introduces bias. However, when the following two conditions are satisfied, the policy gradient is still exact:
- The value function approximator is compatible with the policy, i.e.: \triangledown_w Q_w(s,a) = \triangledown_\theta\log\pi_{\theta}(s,a)
- The value function parameters w minimize the mean-squared error, i.e.: \epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]
Finally, we summarize the policy gradient algorithms:
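Restricted to the two variants discussed above, both use the same gradient form and differ only in how the action value is estimated:

REINFORCE (Monte-Carlo return v_t as the action-value estimate):
\triangledown_\theta J(\theta) = E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)\,v_t]
Actor-Critic (critic estimate Q_w of the action value):
\triangledown_\theta J(\theta) \approx E_{\pi_{\theta}}[\triangledown_{\theta}\log \pi_{\theta}(s, a)\,Q_w(s, a)]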