Understand HMM
2022-06-13 02:14:00 【zjuPeco】
1 Summary
This article contains my study notes on the Bilibili series Machine Learning - Whiteboard Derivation Series (14): Hidden Markov Model (HMM). The uploader explains everything very clearly; I am writing it down here in case I forget later.
Some details have been changed according to my personal understanding.

HMM is short for Hidden Markov Model; its schematic diagram is shown in the figure above, and it is a probabilistic graphical model. The observation variables are, as the name suggests, the quantities we observe — in speech recognition, the sound signal we hear. The state variables are hidden features; in speech recognition they may be pronunciation units (phonemes) or even smaller units (tri-phones). Whatever they are, they must form a discrete, enumerable set. When the state variable becomes continuous, the typical representative is the Kalman filter if the dynamics are linear, and the particle filter if they are nonlinear. This article covers only the HMM.
The mapping from a state variable to an observation variable obeys some distribution, commonly a Gaussian mixture model (GMM).
The transition of the state variable from one time step to the next also obeys a distribution; since the states are discrete, this is a categorical distribution, given by the state transition matrix introduced below.
Note that the observation variables at different moments are definitely not independent and identically distributed.
2 Symbol description
Suppose the whole sequence has $T$ time steps.
The state sequence is $S = [s_1, s_2, ..., s_t, ..., s_{T-1}, s_T]$, where $s_t$ is an enumerable discrete variable taking values in $\{q_1, q_2, ..., q_N\}$; $N$ is the number of states.
The observation sequence is $O = [o_1, o_2, ..., o_t, ..., o_{T-1}, o_T]$, where $o_t$ can be a continuous variable.
$\pi_i$ is the state probability at the initial time, i.e., $\pi_i = P(s_1 = q_i)$.
$A$ is the state transition matrix $[a_{ij}]_{N \times N}$, with entries $a_{ij} = P(s_t = q_j \mid s_{t-1} = q_i)$: the probability that the state changes from $q_i$ to $q_j$ between any two adjacent time steps.
$b_j(o_t)$ is the emission probability, the probability of emitting observation $o_t$ from state $q_j$, i.e., $b_j(o_t) = P(o_t \mid s_t = q_j)$.
$\lambda = (\pi, a, b)$ denotes all the learnable parameters of the model.
Annotating Figure 1 with these symbols, the diagram becomes

3 Two assumptions
(1) Homogeneous Markov assumption
The state at time $t+1$ depends only on the state at time $t$:
$$P(s_{t+1} \mid s_1, s_2, ..., s_t, o_1, o_2, ..., o_t) = P(s_{t+1} \mid s_t) \tag{3-1}$$
(2) Observation independence assumption
The observation variable at time $t$ depends only on the state variable at time $t$:
$$P(o_t \mid s_1, s_2, ..., s_t, o_1, o_2, ..., o_{t-1}) = P(o_t \mid s_t) \tag{3-2}$$
These two assumptions play an extremely important role in the derivations below.
4 Evaluation
What Evaluation does is: given all the model parameters $\lambda$, compute the probability of an observation sequence $O = [o_1, o_2, ..., o_T]$, written $P(O|\lambda)$. Note that all the model parameters are known here, so this is purely an inference process.
Let's first look at the direct solution. We massage the conditional probability a little and bring in the state variables:
$$P(O|\lambda) = \sum_{\text{all } S} P(S, O|\lambda) = \sum_{\text{all } S} P(O|S, \lambda)\,P(S|\lambda) \tag{4-1}$$
This works: we take every possible state sequence $S$ into account, which is just total probability.
Next we expand $P(S|\lambda)$. Since $\lambda$ merely indicates that all the model parameters are known, it may be written or omitted:
$$\begin{aligned}
P(S|\lambda) &= P(s_1, s_2, ..., s_T | \lambda) \\
&= P(s_T|s_1, s_2, ..., s_{T-1}, \lambda)\,P(s_1, s_2, ..., s_{T-1} | \lambda) \\
&\quad \text{(apply the homogeneous Markov assumption (3-1); } \lambda \text{ may be written or omitted)} \\
&= P(s_T|s_{T-1})\,P(s_1, s_2, ..., s_{T-1}) \\
&\quad \text{(keep splitting in the same way)} \\
&= P(s_T|s_{T-1})\,P(s_{T-1}|s_{T-2}) \cdots P(s_2|s_1)\,P(s_1) \\
&\quad \text{(every factor except the last comes from the transition matrix)} \\
&= \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1}
\end{aligned} \tag{4-2}$$
Then we expand $P(O|S, \lambda)$:
$$\begin{aligned}
P(O|S, \lambda) &= P(o_1, o_2, ..., o_T | s_1, s_2, ..., s_T, \lambda) \\
&= P(o_T | o_1, ..., o_{T-1}, s_1, ..., s_T, \lambda)\,P(o_1, ..., o_{T-1} | s_1, ..., s_T, \lambda) \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= P(o_T|s_T)\,P(o_1, ..., o_{T-1} | s_1, ..., s_T, \lambda) \\
&\quad \text{(keep splitting in the same way)} \\
&= P(o_T|s_T)\,P(o_{T-1}|s_{T-1}) \cdots P(o_1|s_1) \\
&\quad \text{(write each factor with the emission probability function)} \\
&= \prod_{t=1}^{T} b_{s_t}(o_t)
\end{aligned} \tag{4-3}$$
Substituting (4-2) and (4-3) into (4-1) gives
$$\begin{aligned}
P(O|\lambda) &= \sum_{\text{all } S} \prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1} \\
&\quad \text{(expand "all } S\text{")} \\
&= \sum_{s_1=q_1}^{q_N} \sum_{s_2=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1}
\end{aligned} \tag{4-4}$$
The complexity of this is $O(N^T)$: the amount of computation grows exponentially with the sequence length, so it does not work in practice. The sketch below makes this concrete.
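Here is a minimal Python sketch of the brute-force sum (4-4). The toy parameters (`pi`, `A`, `B`, `obs`) are made-up numbers purely for illustration — they are not from the original video — and for simplicity the emission is a discrete lookup table rather than the GMM mentioned earlier.

```python
import itertools
import numpy as np

# Toy HMM with N = 2 hidden states and 3 discrete observation symbols.
pi = np.array([0.6, 0.4])          # pi_i = P(s_1 = q_i)
A = np.array([[0.7, 0.3],          # a_ij = P(s_t = q_j | s_{t-1} = q_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # b_j(o) = P(o_t = o | s_t = q_j)
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 0]                 # observation sequence, T = 4

def evaluate_brute_force(pi, A, B, obs):
    """P(O | lambda) via (4-4): enumerate all N^T state paths."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):       # "all S"
        p = pi[path[0]] * B[path[0], obs[0]]                  # pi_{s_1} b_{s_1}(o_1)
        for t in range(1, T):                                 # a_{s_t,s_{t+1}} b_{s_{t+1}}(o_{t+1})
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

print(evaluate_brute_force(pi, A, B, obs))  # fine for T = 4, hopeless for large T
```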
Therefore, the forward and backward algorithms were proposed to reduce the computational cost.
4.1 Forward algorithm
As shown in Figure 3, the forward algorithm considers the joint probability of the variables in the orange box, denoted
$$\alpha_t(q_i) = P(o_1, o_2, ..., o_t, s_t = q_i \mid \lambda) \tag{4-5}$$
As for why (4-5) takes exactly this form — I honestly cannot answer that. It is a design choice; other designs would presumably also work.
Let's first see what $\alpha_T(q_i)$ looks like:
$$\alpha_T(q_i) = P(o_1, o_2, ..., o_T, s_T = q_i \mid \lambda) = P(O, s_T = q_i \mid \lambda) \tag{4-6}$$
The states $s_T$ can take are enumerable, so we traverse all the possibilities for $s_T$ and sum, giving
$$\sum_{i=1}^N \alpha_T(q_i) = \sum_{i=1}^N P(O, s_T = q_i \mid \lambda) = P(O \mid \lambda) \tag{4-7}$$
See — $P(O|\lambda)$ appears.
Our next job is to work out how to compute $\alpha_t(q_i)$. We already know $\alpha_1(q_i)$:
$$\alpha_1(q_i) = P(o_1, s_1 = q_i \mid \lambda) = P(o_1 \mid s_1 = q_i, \lambda)\,P(s_1 = q_i \mid \lambda)$$
See that? One factor is our emission probability and the other is our initial probability, so
$$\alpha_1(q_i) = b_i(o_1)\,\pi_i \tag{4-8}$$
Now that we know $\alpha_1(q_i)$, if we can also find the recurrence from $\alpha_t(q_i)$ to $\alpha_{t+1}(q_i)$, isn't the problem as good as solved? Let's try! Whether $\lambda$ is written or not does not matter as long as we keep it in mind, so I will omit it from here on.
$$\begin{aligned}
\alpha_{t+1}(q_i) &= P(o_1, o_2, ..., o_{t+1}, s_{t+1}=q_i) \\
&\quad \text{(use total probability to bring in an } s_t\text{)} \\
&= \sum_{j=1}^N P(o_1, o_2, ..., o_{t+1}, s_t=q_j, s_{t+1}=q_i) \\
&\quad \text{(split out } o_{t+1}\text{)} \\
&= \sum_{j=1}^N P(o_{t+1}|o_1, o_2, ..., o_t, s_t=q_j, s_{t+1}=q_i)\,P(o_1, o_2, ..., o_t, s_t=q_j, s_{t+1}=q_i) \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= \sum_{j=1}^N P(o_{t+1}|s_{t+1}=q_i)\,P(o_1, o_2, ..., o_t, s_t=q_j, s_{t+1}=q_i) \\
&\quad \text{(split } s_{t+1} \text{ out of the last factor)} \\
&= \sum_{j=1}^N P(o_{t+1}|s_{t+1}=q_i)\,P(s_{t+1}=q_i|o_1, o_2, ..., o_t, s_t=q_j)\,P(o_1, o_2, ..., o_t, s_t=q_j) \\
&\quad \text{(apply the homogeneous Markov assumption (3-1))} \\
&= \sum_{j=1}^N P(o_{t+1}|s_{t+1}=q_i)\,P(s_{t+1}=q_i|s_t=q_j)\,P(o_1, o_2, ..., o_t, s_t=q_j)
\end{aligned}$$
Found it? The three factors are exactly the emission probability, the state transition probability, and $\alpha_t(q_j)$.
So we obtain the recurrence
$$\alpha_{t+1}(q_i) = \sum_{j=1}^N b_i(o_{t+1})\,a_{ji}\,\alpha_t(q_j) \tag{4-9}$$
Combining (4-8) and (4-9), we can compute $\alpha_T(q_i)$ for every state, and (4-7) is solved. The complexity is now $O(TN^2)$. A sketch in code follows.
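Here is a minimal sketch of the forward recursion (4-8)/(4-9), reusing the toy parameters from the brute-force example above (again, made-up numbers):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1, ..., o_{t+1}, s_{t+1} = q_i | lambda), 0-based t."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = B[:, obs[0]] * pi                 # (4-8): alpha_1(q_i) = b_i(o_1) pi_i
    for t in range(1, T):
        # (4-9): alpha_{t+1}(q_i) = b_i(o_{t+1}) * sum_j a_ji alpha_t(q_j)
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha

alpha = forward(pi, A, B, obs)
print(alpha[-1].sum())     # (4-7): matches evaluate_brute_force(pi, A, B, obs)
```

Each step multiplies a length-$N$ vector by the $N \times N$ transition matrix, which is exactly where the $O(TN^2)$ complexity comes from.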
4.2 Backward algorithm
As shown in Figure 4, the backward algorithm considers the joint probability of the variables in the green box, denoted
$$\beta_t(q_i) = P(o_{t+1}, ..., o_{T-1}, o_T \mid s_t = q_i, \lambda) \tag{4-10}$$
This too is a design choice, complementary to the forward one. Note the difference from (4-5); the backward derivation is a little more convoluted than the forward one.
Let's see what $\beta_1(q_i)$ looks like:
$$\beta_1(q_i) = P(o_2, ..., o_{T-1}, o_T \mid s_1 = q_i, \lambda) \tag{4-11}$$
Now let's see how this $\beta_1(q_i)$ relates to the $P(O|\lambda)$ we want:
$$\begin{aligned}
P(O|\lambda) &= P(o_1, o_2, ..., o_T|\lambda) \\
&\quad \text{(omit } \lambda \text{ and introduce } s_1\text{)} \\
&= \sum_{i=1}^{N} P(o_1, o_2, ..., o_T, s_1=q_i) \\
&\quad \text{(move } s_1 \text{ into the condition)} \\
&= \sum_{i=1}^{N} P(o_1, o_2, ..., o_T | s_1=q_i)\,P(s_1=q_i) \\
&\quad \text{(split out } o_1\text{; note that for the backward case the initial probability appears here)} \\
&= \sum_{i=1}^{N} P(o_1 | o_2, ..., o_T, s_1=q_i)\,P(o_2, ..., o_T | s_1=q_i)\,\pi_i \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= \sum_{i=1}^{N} P(o_1 | s_1=q_i)\,P(o_2, ..., o_T | s_1=q_i)\,\pi_i \\
&\quad \text{(plug in (4-11) and the emission probability)} \\
&= \sum_{i=1}^{N} b_i(o_1)\,\beta_1(q_i)\,\pi_i
\end{aligned} \tag{4-12}$$
At this point, $P(O|\lambda)$ and $\beta_1(q_i)$ are connected; all that remains is to compute $\beta_1(q_i)$.
We set
$$\beta_T(q_i) = 1 \tag{4-13}$$
Then we work out the recurrence between $\beta_t(q_i)$ and $\beta_{t+1}(q_j)$, with $\lambda$ simply omitted:
$$\begin{aligned}
\beta_t(q_i) &= P(o_{t+1}, ..., o_{T-1}, o_T | s_t = q_i) \\
&\quad \text{(use total probability to introduce } s_{t+1}\text{)} \\
&= \sum_{j=1}^{N} P(o_{t+1}, ..., o_T, s_{t+1}=q_j | s_t=q_i) \\
&\quad \text{(move } s_{t+1} \text{ into the condition)} \\
&= \sum_{j=1}^{N} P(o_{t+1}, ..., o_T | s_{t+1}=q_j, s_t=q_i)\,P(s_{t+1}=q_j | s_t=q_i) \\
&\quad \text{(in the first factor, } o_{t+1}, ..., o_T \text{ depends only on } s_{t+1}=q_j \text{ — this can be proved, but is not proved here;} \\
&\quad \text{the second factor is the state transition probability)} \\
&= \sum_{j=1}^{N} P(o_{t+1}, ..., o_T | s_{t+1}=q_j)\,a_{ij} \\
&\quad \text{(split out } o_{t+1}\text{)} \\
&= \sum_{j=1}^{N} P(o_{t+1} | o_{t+2}, ..., o_T, s_{t+1}=q_j)\,P(o_{t+2}, ..., o_T | s_{t+1}=q_j)\,a_{ij} \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= \sum_{j=1}^{N} P(o_{t+1} | s_{t+1}=q_j)\,\beta_{t+1}(q_j)\,a_{ij} \\
&\quad \text{(the first factor is the emission probability)} \\
&= \sum_{j=1}^{N} b_j(o_{t+1})\,\beta_{t+1}(q_j)\,a_{ij}
\end{aligned} \tag{4-14}$$
Combining (4-13) and (4-14), we can compute $\beta_1(q_i)$ and hence obtain $P(O|\lambda)$; see the sketch below.
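A matching sketch of the backward recursion (4-13)/(4-14), again with the same toy parameters; the last line checks (4-12):

```python
import numpy as np

def backward(pi, A, B, obs):
    """beta[t, i] = P(o_{t+2}, ..., o_T | s_{t+1} = q_i, lambda), 0-based t."""
    N, T = len(pi), len(obs)
    beta = np.ones((T, N))                      # (4-13): beta_T(q_i) = 1
    for t in range(T - 2, -1, -1):
        # (4-14): beta_t(q_i) = sum_j b_j(o_{t+1}) beta_{t+1}(q_j) a_ij
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward(pi, A, B, obs)
print((B[:, obs[0]] * beta[0] * pi).sum())      # (4-12): sum_i b_i(o_1) beta_1(q_i) pi_i
```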
It is worth noting that at any time $t$, combining the forward and backward quantities gives
$$P(O|\lambda) = \sum_{i=1}^N \alpha_t(q_i)\,\beta_t(q_i) \tag{4-15}$$
Here is a brief derivation. It relies on the fact that, given $s_t$, the future observations $o_{t+1}, ..., o_T$ are independent of the past observations $o_1, ..., o_t$ — like the similar step in (4-14), this follows from the model's structure.
$$\begin{aligned}
P(O|\lambda) &= \sum_{i=1}^N P(O, s_t=q_i|\lambda) \\
&= \sum_{i=1}^N P(o_1, ..., o_t, o_{t+1}, ..., o_T, s_t=q_i|\lambda) \\
&= \sum_{i=1}^N P(o_1, ..., o_t, s_t=q_i|\lambda)\,P(o_{t+1}, ..., o_T | o_1, ..., o_t, s_t=q_i) \\
&\quad \text{(the first factor is } \alpha_t(q_i)\text{; by the conditional independence above, the second drops } o_1, ..., o_t \text{ and is } \beta_t(q_i)\text{)} \\
&= \sum_{i=1}^N \alpha_t(q_i)\,\beta_t(q_i)
\end{aligned}$$
Although that conditional-independence step is used here without proof, it simplifies the calculation, and this is how it is always done. A quick numerical check follows.
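As a quick check of (4-15), reusing `alpha` and `beta` from the forward/backward sketches above: the sum $\sum_i \alpha_t(q_i)\beta_t(q_i)$ comes out the same at every $t$.

```python
# (4-15): at every t, sum_i alpha_t(q_i) beta_t(q_i) = P(O | lambda),
# reusing alpha and beta computed by the sketches above.
for t in range(len(obs)):
    print(t, (alpha[t] * beta[t]).sum())    # the same value at every time step
```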
5 Learning
What Learning does is: given an observation sequence, find the set of parameters $\lambda$ that maximizes the probability of obtaining that observation sequence, i.e.
$$\lambda_{MLE} = \arg\max_{\lambda} P(O|\lambda) \tag{5-1}$$
Here MLE means Maximum Likelihood Estimation.
By rights, if we could write down the derivative and solve it, this $\lambda_{MLE}$ would come out in one shot. But here $P(O|\lambda)$ generally involves a Gaussian mixture, which cannot be differentiated and solved directly, so we need the EM algorithm.
This article does not cover what the EM algorithm is; we simply use it. If you want to learn what EM is, I recommend Mr. Xu Yida's explanation of the EM algorithm. In one sentence: when the derivative cannot be solved directly, we introduce a latent variable and differentiate a different objective instead; that objective is comparatively easy to differentiate, but it cannot be solved in one step — it must be iterated, gradually approaching a local optimum.
The iterative formula of the EM algorithm is
$$\theta^{(t+1)} = \arg\max_{\theta} \int_{z} \log P(x, z|\theta)\,P(z|x, \theta^{(t)})\,dz \tag{5-2}$$
Here $\theta$ is our model parameters $\lambda$, $x$ is our observation variable $O$, and $z$ is our state variable $S$. Since our $S$ is discrete, the integral becomes a sum. Rewriting (5-2) gives
$$\lambda^{(t+1)} = \arg\max_{\lambda} \sum_{\text{all } S} \log P(O, S|\lambda)\,P(S|O, \lambda^{(t)}) \tag{5-3}$$
Next we make a small change to $P(S|O, \lambda^{(t)})$:
$$P(S|O, \lambda^{(t)}) = \frac{P(S, O|\lambda^{(t)})}{P(O|\lambda^{(t)})}$$
Here $\lambda^{(t)}$ is a constant and $O$ does not depend on $\lambda$, so the term $P(O|\lambda^{(t)})$ is a constant and can be ignored. Therefore (5-3) becomes
$$\lambda^{(t+1)} = \arg\max_{\lambda} \sum_{\text{all } S} \log P(O, S|\lambda)\,P(O, S|\lambda^{(t)}) \tag{5-4}$$
At run time, we just iterate (5-4). But look — this argmax still cannot be solved outright. It is actually rather involved, so below I only take the initial-probability parameter $\pi$ as an example and describe it briefly; I will not derive the rest — it is too complicated, and I can't bear it.
We define
$$Q(\lambda, \lambda^{(t)}) = \sum_{\text{all } S} \log P(O, S|\lambda)\,P(O, S|\lambda^{(t)}) \tag{5-5}$$
Substituting (4-4) in, we get
$$\begin{aligned}
Q(\lambda, \lambda^{(t)}) &= \sum_{s_1=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \log\!\left(\prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1}\right) P(O, S|\lambda^{(t)}) \\
&= \sum_{s_1=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \left(\log \pi_{s_1} + \sum_{t=1}^{T} \log b_{s_t}(o_t) + \sum_{t=1}^{T-1} \log a_{s_t, s_{t+1}}\right) P(O, S|\lambda^{(t)})
\end{aligned}$$
Good. Now let the initial-probability parameter at the $t$-th iteration be $\pi^{(t)}$. Then
$$\begin{aligned}
\pi^{(t+1)} &= \arg\max_{\pi} Q(\lambda, \lambda^{(t)}) \\
&\quad \text{(filter out the terms that do not involve } \pi\text{)} \\
&= \arg\max_{\pi} \sum_{s_1=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \log \pi_{s_1}\,P(O, s_1, ..., s_T|\lambda^{(t)}) \\
&\quad \text{(the state variables unrelated to } s_1 \text{ are summed out by total probability)} \\
&= \arg\max_{\pi} \sum_{s_1=q_1}^{q_N} \log \pi_{s_1}\,P(O, s_1|\lambda^{(t)})
\end{aligned} \tag{5-6}$$
At this point the derivative is easy to take. But don't forget there is a constraint, namely
$$\text{s.t.} \quad \sum_{s_1=q_1}^{q_N} \pi_{s_1} = 1 \tag{5-7}$$
A constrained extremum calls for the Lagrange multiplier method. Let
$$L(\pi, \eta) = \sum_{s_1=q_1}^{q_N} \log \pi_{s_1}\,P(O, s_1|\lambda^{(t)}) + \eta \left(\sum_{s_1=q_1}^{q_N} \pi_{s_1} - 1\right) \tag{5-8}$$
Taking the partial derivative of (5-8), we have
$$\frac{\partial L}{\partial \pi_{s_1}} = \frac{1}{\pi_{s_1}}\,P(O, s_1|\lambda^{(t)}) + \eta \tag{5-9}$$
Setting the partial derivative to zero, we have
$$P(O, s_1|\lambda^{(t)}) + \pi_{s_1}^{(t+1)}\,\eta = 0 \tag{5-10}$$
Summing over all the state values, we have
$$\sum_{s_1=q_1}^{q_N} \left( P(O, s_1|\lambda^{(t)}) + \pi_{s_1}^{(t+1)}\,\eta \right) = 0$$
Since the $\pi_{s_1}^{(t+1)}$ sum to 1 by (5-7), this gives
$$P(O|\lambda^{(t)}) + \eta = 0$$
i.e.
$$\eta = -P(O|\lambda^{(t)}) \tag{5-11}$$
Substituting (5-11) into (5-10) gives
$$\pi_{s_1}^{(t+1)} = \frac{P(O, s_1|\lambda^{(t)})}{P(O|\lambda^{(t)})} \tag{5-12}$$
Finally we have it. The other parameters can be found in a similar way, just more complicated. A sketch of this $\pi$ update in code follows.
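As a minimal sketch of one EM step for $\pi$, reusing the `forward` and `backward` sketches from the Evaluation section: by the argument behind (4-15) applied at $t = 1$, the numerator $P(O, s_1 = q_i \mid \lambda^{(t)})$ equals $\alpha_1(q_i)\beta_1(q_i)$. The full Baum-Welch updates for $a$ and $b$ are omitted here, matching the text.

```python
# One EM update of pi via (5-12), reusing forward() and backward() above.
def update_pi(pi, A, B, obs):
    alpha = forward(pi, A, B, obs)
    beta = backward(pi, A, B, obs)
    joint = alpha[0] * beta[0]       # P(O, s_1 = q_i | lambda^(t))
    return joint / joint.sum()       # (5-12): divide by P(O | lambda^(t))

print(update_pi(pi, A, B, obs))      # the new pi^(t+1); its entries sum to 1
```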
6 Decoding
The problem Decoding solves is
$$\hat{S} = \arg\max_{S} P(S|O, \lambda) \tag{6-1}$$
In words: given the model parameters and the observations, find the best corresponding sequence of state variables.
Because $P(O|\lambda)$ involves only the observed variables and does not depend on $S$, we can also say that (6-1) is equivalent to
$$\hat{S} = \arg\max_{S} P(S|O, \lambda)\,P(O|\lambda) = \arg\max_{S} P(S, O|\lambda) \tag{6-2}$$
Let's draw a picture:

One look at the picture makes it clear: at every time step the state variable has $N$ possible states; we pick one state at each time step to form a path, and we want the joint probability along the whole path to be as large as possible.
There are $N^T$ paths in total. Computing the probability of every path and taking the largest is far too expensive in time. Instead we solve the problem with dynamic programming — this is the Viterbi algorithm.
We define
$$\delta_t(q_i) = \max_{s_1, ..., s_{t-1}} P(o_1, ..., o_t, s_1, ..., s_{t-1}, s_t = q_i \mid \lambda) \tag{6-3}$$
In words: given that the state at time $t$ is $q_i$, $\delta_t(q_i)$ is the maximum, over state paths $[s_1, ..., s_{t-1}]$, of the joint probability up to time $t$.
Now let's look at the relationship between $\delta_{t+1}(q_j)$ and $\delta_t(q_i)$:
$$\begin{aligned}
\delta_{t+1}(q_j) &= \max_{s_1, ..., s_t} P(o_1, ..., o_{t+1}, s_1, ..., s_t, s_{t+1}=q_j | \lambda) \\
&\quad \text{(traverse all the } \delta_t(q_i) \text{ at time } t\text{)} \\
&= \max_{1 \leq i \leq N} \delta_t(q_i)\,a_{ij}\,b_j(o_{t+1})
\end{aligned} \tag{6-4}$$
This is the recurrence. With it, all the $\delta_t(q_i)$ can be computed.
At the same time, we also need to record, for each state at each time step, the optimal previous state:
$$\psi_t(j) = \arg\max_{i} \delta_t(q_i)\,a_{ij} \tag{6-5}$$
Formula (6-5) is used to backtrack the path — anyone who has studied dynamic programming will find this familiar.
Finally, at the last time step, we find the state with the largest probability, written
$$q_T^* = \arg\max_{i} \delta_T(q_i) \tag{6-6}$$
Then we just keep backtracking with (6-5). A sketch of the whole procedure follows.
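Here is a minimal sketch of Viterbi following (6-3) through (6-6), reusing the toy parameters from the Evaluation section. In practice one usually works with log probabilities to avoid underflow on long sequences; this sketch keeps plain probabilities to match the equations.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path, following (6-3) through (6-6)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # delta[t, i]: best joint prob ending in q_i
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best previous state, as in (6-5)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # (6-4): delta_{t+1}(q_j) = max_i delta_t(q_i) a_ij b_j(o_{t+1})
        cand = delta[t - 1][:, None] * A        # cand[i, j] = delta_t(q_i) a_ij
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]            # (6-6): best final state
    for t in range(T - 1, 0, -1):               # backtrack with psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

path, prob = viterbi(pi, A, B, obs)
print(path, prob)
```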
References
[1] Machine Learning - Whiteboard Derivation Series (14): Hidden Markov Model HMM (Bilibili)