Model-Free Prediction
2022-07-07 00:26:00 【Evergreen AAS】
The previous article introduced the general Model-based approach: dynamic programming. This article covers the Prediction problem in the Model-Free setting, i.e., "Estimate the value function of an unknown MDP".
- Model-based: the MDP is known, i.e., the transition matrix and reward function are given
- Model-Free: the MDP is unknown
Monte Carlo learning
The Monte Carlo method (Monte-Carlo Methods, MC for short), also called Monte Carlo simulation, uses random numbers (or, more commonly, pseudo-random numbers) to solve computational problems. In essence, it acts as randomly as possible to produce a posterior, and then characterizes the target system with that posterior.
In the Model-Free setting, MC is applied in reinforcement learning to estimate the value function. Its characteristics are:
- MC learns from complete episodes (no bootstrapping)
- MC estimates the value as the mean return, i.e., value = mean(return)
- MC only applies to episodic MDPs (MDPs that terminate)
First-Visit Monte Carlo Policy Evaluation
To evaluate the value function v_{\pi}(s) of state s under a given policy π: within an episode, the first time state s is visited at time step t,
- increment the counter: N(s) ← N(s) + 1
- accumulate the return: S(s) ← S(s) + G_t
When all episodes have been processed, the value of state s is V(s) = \frac{S(s)}{N(s)}.
By the Law of Large Numbers: V(s) → v_{\pi}(s) \text{ as } N(s) → \infty
Every-Visit Monte Carlo Policy Evaluation
To evaluate v_{\pi}(s): within an episode, every time state s is visited at time step t,
- increment the counter: N(s) ← N(s) + 1
- accumulate the return: S(s) ← S(s) + G_t
When all episodes have been processed, V(s) = \frac{S(s)}{N(s)}.
By the Law of Large Numbers: V(s) → v_{\pi}(s) \text{ as } N(s) → \infty
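To make the two variants concrete, here is a minimal tabular sketch in Python. The function name and the episode format (a `sample_episode()` callable returning `(state, reward)` pairs) are illustrative assumptions, not from the original post; flipping `first_visit` switches between first-visit and every-visit MC.

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, num_episodes, gamma=1.0, first_visit=True):
    """Tabular MC prediction under a fixed policy pi.

    `sample_episode()` is assumed to return one episode as a list of
    (state, reward) pairs, where reward is R_{t+1}, received after leaving that state.
    """
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # accumulated return S(s)

    for _ in range(num_episodes):
        episode = sample_episode()
        states = [s for s, _ in episode]

        # Backwards pass: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G

        for t, (s, _) in enumerate(episode):
            # First-visit MC: skip s if it already appeared earlier in this episode.
            if first_visit and s in states[:t]:
                continue
            N[s] += 1
            S[s] += returns[t]

    # V(s) = S(s) / N(s)
    return {s: S[s] / N[s] for s in N}
```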
Incremental Monte-Carlo
Incremental mean: the means μ_1, μ_2, … of a sequence x_1, x_2, … can be computed incrementally:
\mu_k = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})
Applying this formula, we obtain the incremental MC update: after each episode ends, V(s) is updated incrementally. For each state S_t with return G_t:
N(S_t) ← N(S_t) + 1 \\ V(S_t) ← V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))
For non-stationary problems, the update can instead use a constant step size α:
V(S_t) ← V(S_t) + \alpha (G_t - V(S_t))
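A minimal sketch of the incremental update, using the same hypothetical `(state, reward)` episode format as above. Passing `alpha=None` uses the exact running-mean step size 1/N(S_t); passing a constant `alpha` gives the non-stationary variant.

```python
from collections import defaultdict

def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Incremental every-visit MC update, applied once an episode has finished.

    V: dict mapping state -> current value estimate
    N: dict mapping state -> visit count
    episode: list of (state, reward) pairs, reward being R_{t+1}
    """
    # Compute G_t for every step with a backwards pass.
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        G = episode[t][1] + gamma * G
        returns[t] = G

    for t, (s, _) in enumerate(episode):
        N[s] += 1
        step = alpha if alpha is not None else 1.0 / N[s]   # 1/N(S_t) or constant alpha
        V[s] += step * (returns[t] - V[s])                  # V(S_t) += step * (G_t - V(S_t))
    return V


# Usage sketch (sample_episode and num_episodes are hypothetical):
# V, N = defaultdict(float), defaultdict(int)
# for _ in range(num_episodes):
#     incremental_mc_update(V, N, sample_episode(), gamma=0.9)
```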
Temporal-Difference Learning
Characteristics of the Temporal-Difference method (Temporal-Difference Methods, TD for short):
- TD can learn from incomplete episodes by bootstrapping
- TD updates a guess towards a guess
TD(λ)
The figure below illustrates the TD target for different values of n:
As the figure shows, when n extends all the way to termination, the target spans a whole episode and the corresponding method is MC; from this point of view, MC is a special case of TD.
n-step Return
The n-step return can be written as follows:
n = 1: G_{t}^{(1)} = R_{t+1} + \gamma V(S_{t+1})
n = 2: G_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})
…
n = ∞: G_{t}^{(\infty)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-t-1} R_T
In general, the n-step return is G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + … + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
The n-step TD update is:
V(S_t) ← V(S_t) + \alpha (G_t^{(n)} - V(S_t))
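A sketch of the n-step return and the corresponding update over a recorded episode. The trajectory layout (`states[t]` is S_t, `rewards[t]` is R_{t+1}), the function names, and the assumption that `V` is something like a `collections.defaultdict(float)` are all illustrative; the bootstrap term is dropped when t + n reaches the end of the episode.

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    rewards[k] is R_{k+1}, states[k] is S_k; if t + n reaches the terminal
    step T, the bootstrap term is omitted (terminal value is 0).
    """
    T = len(rewards)
    G = 0.0
    for k in range(t, min(t + n, T)):
        G += (gamma ** (k - t)) * rewards[k]
    if t + n < T:
        G += (gamma ** n) * V[states[t + n]]
    return G


def n_step_td_update(V, states, rewards, n, alpha=0.1, gamma=1.0):
    """Offline n-step TD: V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t)) for each t."""
    for t in range(len(rewards)):
        G = n_step_return(rewards, states, V, t, n, gamma)
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```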
Forward View of TD(λ)
Can we combine all the n-step returns? Yes: the combined return is called the λ-return, where λ is the weighting factor introduced to mix the different n-step returns.
λ-return:
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
Forward-view TD(λ):
V(S_t) ← V(S_t) + \alpha\Bigl(G_t^{\lambda} - V(S_t)\Bigr)
TD(λ) weights the n-step return G_t^{(n)} by (1-\lambda)\lambda^{n-1}; the weight distribution is shown in the figure below:
Characteristics of forward-view TD(λ):
- Update value function towards the λ-return
- Forward-view looks into the future to compute G_t^{\lambda}
- Like MC, can only be computed from complete episodes
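A forward-view sketch that reuses the hypothetical `n_step_return` helper defined above. Since the episode is finite, the weights of all n-step returns that would run past termination collapse onto the full (MC) return with weight λ^{T−t−1}, so the weights still sum to one.

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * G_t^(n), truncated at termination."""
    T = len(rewards)
    G_lambda, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):                    # n-step returns that still bootstrap
        G_lambda += weight * n_step_return(rewards, states, V, t, n, gamma)
        weight *= lam
    full_return = n_step_return(rewards, states, V, t, T - t, gamma)  # plain MC return
    G_lambda += (lam ** (T - t - 1)) * full_return
    return G_lambda


def forward_td_lambda_update(V, states, rewards, lam=0.9, alpha=0.1, gamma=1.0):
    """Forward-view TD(lambda): V(S_t) <- V(S_t) + alpha * (G_t^lambda - V(S_t))."""
    for t in range(len(rewards)):
        G = lambda_return(rewards, states, V, t, lam, gamma)
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```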
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
TD(λ) with eligibility traces:
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
V(s) ← V(s) + \alpha \delta_t E_t(s)
where δ_t is the TD error and E_t(s) is the eligibility trace.
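A minimal online sketch of backward-view TD(λ) with accumulating eligibility traces. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy` callable are illustrative assumptions, not from the original post.

```python
from collections import defaultdict

def backward_td_lambda(env, policy, num_episodes, lam=0.9, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)                  # eligibility traces, reset each episode
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]
            E[state] += 1.0                     # accumulate trace for the visited state
            for s in list(E):
                V[s] += alpha * delta * E[s]    # V(s) <- V(s) + alpha * delta_t * E_t(s)
                E[s] *= gamma * lam             # decay: E(s) <- gamma * lam * E(s)
            state = next_state
    return V
```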
Eligibility Traces
The essence of an eligibility trace is to give higher credit/weight to frequently and recently visited states; a common (accumulating) form is E_0(s) = 0, E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s).
The figure below illustrates eligibility traces:
There is a known result about TD(λ):
The sum of offline updates is identical for forward-view and backward-view TD(λ).
This article will not go deeper into this; interested readers can refer to Sutton's book and David Silver's tutorial.
Monte Carlo Learning vs. Temporal-Difference Learning
Similarities and differences between MC and TD
Similarity: both learn the value function v_{\pi} of a given policy π online from experience.
Differences:
- Incremental every-visit Monte-Carlo: update V(S_t) toward the actual return G_t
V(S_t) ← V(S_t) + \alpha (\textcolor{Red}{G_t} - V(S_t))
- Simplest temporal-difference learning algorithm: TD(0) (a minimal sketch follows this list)
- Update V(S_t) toward the estimated return \color{Red}{R_{t+1} + \gamma V(S_{t+1})}
- \color{Red}{R_{t+1} + \gamma V(S_{t+1})} is called the TD target
- \color{Red}{R_{t+1} + \gamma V(S_{t+1})} - V(S_t) is called the TD error
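For comparison, a minimal TD(0) sketch using the same hypothetical environment interface as the backward-view sketch above; note that, unlike MC, it updates after every step using the TD target rather than waiting for the full return G_t.

```python
from collections import defaultdict

def td_zero(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0): move V(S_t) towards the TD target R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            td_target = reward + (0.0 if done else gamma * V[next_state])  # TD target
            td_error = td_target - V[state]                                # TD error
            V[state] += alpha * td_error
            state = next_state
    return V
```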
The Driving Home example in the figure below illustrates the difference between the two: MC can only revise its prediction of the arrival time after actually getting home, whereas TD adjusts its prediction at every step according to what actually happens along the way.
Advantages and disadvantages of MC and TD
Learning style
- TD can learn before knowing the final outcome (as in the example above)
- TD can learn online after every step
- MC must wait until end of episode before return is known
- TD can learn without a final outcome (e.g., infinite / continuing MDPs)
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments
Variance and bias
- MC has high variance, zero bias
- Good convergence properties
- Not very sensitive to initial value
- Very simple to understand and use
- TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to v_{\pi}(s)
- More sensitive to initial value
Explanation of the variance and bias of MC and TD:
- The MC update is based on the actual return G_t = R_{t+1} + \gamma R_{t+2} + … + \gamma^{T-t-1}R_{T}, which is an unbiased estimate of v_{\pi}(S_t).
- The true TD target R_{t+1} + \gamma v_{\pi}(S_{t+1}) is also an unbiased estimate of v_{\pi}(S_t), but the TD target actually used in the update, R_{t+1} + \gamma V(S_{t+1}), is a biased estimate of v_{\pi}(S_t).
- The TD target has much lower variance:
  - the return depends on many random actions, transitions, and rewards
  - the TD target depends on only one random action, transition, and reward
Markov property
- TD exploits Markov property
- Usually more efficient in Markov environments
- MC does not exploit Markov property
- Usually more effective in non-Markov environments
DP, MC, and TD(0)
First, let's build an intuition for the differences among the three from their backup trees.
- DP backup tree: full-width one-step backup
- MC backup tree: a complete episode
- TD(0) backup tree: a single sampled step
Bootstrapping vs. Sampling
Bootstrapping: the update is based on an estimated value
- DP bootstraps
- MC does not bootstrap
- TD bootstraps
Sampling: the update samples an expectation
- DP does not sample (model-based methods do not need samples)
- MC samples (model-free methods need samples)
- TD samples (model-free methods need samples)
The figure below shows the differences among several basic RL methods: