Policy Gradient Methods
2022-07-07 00:27:00 [Evergreen AAS]
In the last blog post, we covered how to approximate the value function and the action-value function:
V_{\theta}(s)\approx V^{\pi}(s) \\ Q_{\theta}(s, a)\approx Q^{\pi}(s, a)
Once we have approximated the value function or action-value function, we can derive a control policy from it, for example via ϵ-greedy, as in the sketch below.
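For concreteness, here is a minimal sketch of ϵ-greedy action selection over approximated action values (the Q values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q_s = np.array([0.1, 0.5, 0.2])   # approximated Q_theta(s, .) for one state
print(epsilon_greedy(q_s))        # usually 1, occasionally a random action
```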
Let's briefly review the learning objective of RL: an agent interacts with the environment so as to maximize the cumulative return. Since what we ultimately have to learn is how to interact with the environment, we can also learn the policy directly; the earlier route of first approximating a value function and then controlling with a greedy policy is more of a detour. That is the topic of this article: how to learn a policy directly, which in mathematical form is:
\pi_{\theta}(s, a) = P[a | s, \theta]
This is called the policy gradient (PG) method.
As before, this article is concerned with model-free reinforcement learning.
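As a concrete example of such a parameterized policy, here is a minimal softmax policy sketch with linear action preferences θᵀφ(s, a); the feature matrix φ is a made-up toy example:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(s, a) = exp(theta^T phi(s,a)) / sum_b exp(theta^T phi(s,b)).

    phi_s: (n_actions, n_features) feature matrix for the current state s.
    Returns a probability distribution over the actions.
    """
    prefs = phi_s @ theta
    prefs = prefs - prefs.max()        # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

theta = np.zeros(4)                     # policy parameters
phi_s = np.eye(4)[:3]                   # toy features: 3 actions, 4 features
probs = softmax_policy(theta, phi_s)    # uniform when theta = 0
a = np.random.default_rng(0).choice(len(probs), p=probs)
```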
Value-Based vs. Policy-Based RL
Value-Based:
- Learn a value function
- Implicit policy (e.g., ϵ-greedy)
Policy-Based:
- No value function
- Learn the policy directly
Actor-Critic:
- Learn a value function
- Learn a policy
The relationship between the three: Value-Based and Policy-Based methods are two ends of a spectrum, and Actor-Critic lies in their intersection, learning both a value function and a policy.
Having understood the difference between Value-Based and Policy-Based methods, let's look at the advantages and disadvantages of Policy-Based RL:
Advantages:
- Better convergence properties
- More effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converges to a local rather than the global optimum
- Evaluating a policy is generally inefficient and has high variance
Policy Search
We first define the objective function.
Policy Objective Functions
The goal: given a policy π_θ(s, a) with parameters θ, find the best parameters θ. But how do we evaluate the quality of policies π_θ(s, a) with different parameters?
- For episodic tasks, we can use the start value:
J_1(\theta)=V^{\pi_{\theta}}(s_1)=E_{\pi_{\theta}}[v_1]
- For continuing tasks, we can use the average value:
J_{avV}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)V^{\pi_{\theta}}(s)
or the average reward per time-step:
J_{avR}(\theta)=\sum_{s}d^{\pi_{\theta}}(s)\sum_{a}\pi_{\theta}(s, a)R_s^a
where d^{\pi_{\theta}}(s) is the stationary distribution of the Markov chain induced by π_θ.
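As a quick numerical check of the last objective, here is a tiny made-up example (two states, two actions) computing J_avR directly from its definition:

```python
import numpy as np

# Made-up quantities for illustration
d  = np.array([0.6, 0.4])                  # stationary distribution d(s)
pi = np.array([[0.7, 0.3], [0.2, 0.8]])    # policy pi_theta(s, a)
R  = np.array([[1.0, 0.0], [0.0, 2.0]])    # expected reward R_s^a

# J_avR(theta) = sum_s d(s) * sum_a pi(s, a) * R(s, a)
J_avR = float(d @ (pi * R).sum(axis=1))
print(J_avR)   # 0.6*(0.7*1.0) + 0.4*(0.8*2.0) = 1.06
```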
Policy Optimisation
With the objective defined, policy-based RL turns into a typical optimization problem: find the θ that maximizes J(θ). There are many optimization methods, including gradient-free algorithms:
- Hill climbing
- Simulated annealing
- Evolutionary algorithms
- …
In general, though, when the gradient of the problem is available, gradient-based optimization methods work better:
- Gradient descent
- Conjugate gradient
- Quasi-Newton methods
- …
In this article we focus on gradient-based methods; since we are maximizing J(θ), each update is a gradient ascent step, as in the sketch below.
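A minimal sketch of gradient ascent on J(θ), here paired with a crude finite-difference gradient estimate (the toy objective J is made up; in RL, evaluating J(θ) would require policy rollouts):

```python
import numpy as np

def finite_difference_grad(J, theta, eps=1e-4):
    """Estimate grad J(theta) coordinate by coordinate; simple but noisy
    and expensive, assuming J(theta) can be evaluated."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        grad[k] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

def gradient_ascent(J, theta, alpha=0.1, steps=200):
    """theta <- theta + alpha * grad J(theta): ascent, since we maximize J."""
    for _ in range(steps):
        theta = theta + alpha * finite_difference_grad(J, theta)
    return theta

# Toy objective with maximum at theta = (1, -2)
J = lambda th: -(th[0] - 1) ** 2 - (th[1] + 2) ** 2
print(gradient_ascent(J, np.zeros(2)))   # -> approximately [1, -2]
```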
Policy Gradient Theorem
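For any differentiable policy π_θ, the policy gradient theorem states (in its standard form) that the gradient of each of the objectives above (start value, average value, average reward per step) takes the same likelihood-ratio form:

\nabla_\theta J(\theta) = E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)]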
Monte-Carlo Policy Gradient (REINFORCE)
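REINFORCE replaces Q^{π_θ}(s, a) in the theorem with the sampled return v_t and performs stochastic gradient ascent. Here is a minimal sketch with a softmax policy over linear preferences; the episode and phi interfaces are assumptions for illustration:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()        # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, phi, alpha=0.01, gamma=0.99):
    """One REINFORCE update from a completed episode.

    episode: list of (state, action, reward) tuples.
    phi(s):  (n_actions, n_features) feature matrix for state s.
    The Monte-Carlo return v_t is an unbiased sample of Q(s_t, a_t).
    """
    # Compute the returns v_t backwards through the episode
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Stochastic gradient ascent: theta += alpha * grad log pi(s,a) * v_t
    for (s, a, _), v_t in zip(episode, returns):
        probs = softmax(phi(s) @ theta)
        grad_log_pi = phi(s)[a] - probs @ phi(s)   # softmax score function
        theta = theta + alpha * v_t * grad_log_pi
    return theta
```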
Actor-Critic Policy Gradient
The Monte-Carlo policy gradient has high variance, so instead of estimating the action-value function Q from the return, we use a critic to estimate Q:
Q_w(s, a)\approx Q^{\pi_{\theta}}(s, a)
This is the famous Actor-Critic algorithm, which maintains two sets of parameters:
- Critic: updates the action-value function parameters w
- Actor: updates the policy parameters θ in the direction suggested by the Critic
The Actor-Critic algorithm follows an approximate policy gradient:
\nabla_\theta J(\theta)\approx E_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s, a)\, Q_w(s, a)]\\ \Delta\theta = \alpha\nabla_\theta\log\pi_{\theta}(s,a)\, Q_w(s,a)
The Critic is essentially performing policy evaluation: how good is policy π_θ for the current parameters θ?
We have already covered policy evaluation with MC, TD, and TD(λ), as well as value function approximation. The plain Actor-Critic algorithm uses the simplest linear form for the Critic's approximation of the action-value function, i.e. Q_w(s, a) = \phi(s, a)^T w; a sketch of the algorithm follows:
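A minimal sketch of one episode of action-value Actor-Critic (QAC) with a linear critic updated by TD(0); the env and phi interfaces and the hyperparameters are assumptions for illustration:

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def qac_episode(env, theta, w, phi, alpha=0.01, beta=0.1, gamma=0.99, rng=None):
    """Run one episode of action-value Actor-Critic (QAC).

    Critic: Q_w(s, a) = phi(s, a)^T w, updated by TD(0).
    Actor:  softmax policy over theta^T phi(s, a), updated by the
            approximate policy gradient grad log pi * Q_w.
    Assumed interfaces: env.reset() -> s; env.step(a) -> (s', r, done);
    phi(s) -> (n_actions, n_features) matrix of phi(s, a) rows.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()
    a = rng.choice(len(phi(s)), p=softmax(phi(s) @ theta))
    done = False
    while not done:
        s2, r, done = env.step(a)
        q_sa = phi(s)[a] @ w                   # Critic's estimate Q_w(s, a)
        if done:
            delta = r - q_sa                   # TD error at a terminal step
        else:
            a2 = rng.choice(len(phi(s2)), p=softmax(phi(s2) @ theta))
            delta = r + gamma * (phi(s2)[a2] @ w) - q_sa   # TD(0) error
        # Actor update: step theta in the direction the Critic suggests
        probs = softmax(phi(s) @ theta)
        grad_log_pi = phi(s)[a] - probs @ phi(s)
        theta = theta + alpha * grad_log_pi * q_sa
        # Critic update: TD(0) on the linear weights w
        w = w + beta * delta * phi(s)[a]
        if not done:
            s, a = s2, a2
    return theta, w
```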
In the Actor-Critic algorithm, approximating the action-value function introduces bias, but when the following two conditions are satisfied, the policy gradient is still exact:
- The value function approximator is compatible with the policy, i.e.: \nabla_w Q_w(s,a) = \nabla_\theta\log\pi_{\theta}(s,a)
- The value function parameters w minimize the mean-squared error, i.e.: \epsilon = E_{\pi_{\theta}}[(Q^{\pi_{\theta}}(s, a) - Q_w(s,a))^2]
Finally, to summarize the policy gradient algorithms: they all update θ in the direction E_{\pi_{\theta}}[\nabla_\theta \log \pi_{\theta}(s, a)\cdot Q], where REINFORCE plugs in the sampled return v_t for Q, while Actor-Critic plugs in the critic's estimate Q_w(s, a).