Understand HMM
2022-06-13 02:14:00 【zjuPeco】
1 Summary
This article contains my study notes on the Bilibili series Machine Learning - Whiteboard Derivation Series (14): Hidden Markov Model (HMM). The uploader explains everything very clearly; I am writing it down here in case I forget later.
Some details have been changed according to my personal understanding.

HMM is short for Hidden Markov Model; its schematic diagram is shown in the figure above, and it is a probabilistic graphical model. The observation variables are, as the name suggests, the quantities we observe — in speech recognition, the sound signal we hear. The state variables are hidden features; in speech recognition they may be pronunciation units (phonemes) or even smaller units (tri-phones). Whatever they are, they must form a discrete, enumerable set. When the state variable becomes continuous, the typical representative is the Kalman filter if the dynamics are linear, and the particle filter if they are nonlinear. This article covers only the HMM.
The mapping from a state variable to an observation variable obeys some distribution, commonly a Gaussian mixture model (GMM).
The transition of the state variable from one time step to the next also obeys a distribution; since the states are discrete, this is a categorical distribution, given by the state transition matrix introduced below.
Note that the observation variables at different moments are definitely not independent and identically distributed.
2 Symbol description
Suppose the whole sequence has $T$ time steps.
The state sequence is $S = [s_1, s_2, ..., s_t, ..., s_{T-1}, s_T]$, where $s_t$ is an enumerable discrete variable taking values in $\{q_1, q_2, ..., q_N\}$; $N$ is the number of states.
The observation sequence is $O = [o_1, o_2, ..., o_t, ..., o_{T-1}, o_T]$, where $o_t$ can be a continuous variable.
$\pi_i$ is the state probability at the initial time, i.e., $\pi_i = P(s_1 = q_i)$.
$A$ is the state transition matrix $[a_{ij}]_{N \times N}$, with entries $a_{ij} = P(s_t = q_j \mid s_{t-1} = q_i)$: the probability that the state changes from $q_i$ to $q_j$ between any two adjacent time steps.
$b_j(o_t)$ is the emission probability, the probability of emitting observation $o_t$ from state $q_j$, i.e., $b_j(o_t) = P(o_t \mid s_t = q_j)$.
$\lambda = (\pi, a, b)$ denotes all the learnable parameters of the model.
Annotating Figure 1 with these symbols, the diagram becomes

3 Two assumptions
(1) Homogeneous Markov assumption
The state at time $t+1$ depends only on the state at time $t$:
$$P(s_{t+1} \mid s_1, s_2, ..., s_t, o_1, o_2, ..., o_t) = P(s_{t+1} \mid s_t) \tag{3-1}$$
(2) Observation independence assumption
The observation variable at time $t$ depends only on the state variable at time $t$:
$$P(o_t \mid s_1, s_2, ..., s_t, o_1, o_2, ..., o_{t-1}) = P(o_t \mid s_t) \tag{3-2}$$
These two assumptions play an extremely important role in the derivations below.
4 Evaluation
What Evaluation does is: given all the model parameters $\lambda$, compute the probability of an observation sequence $O = [o_1, o_2, ..., o_T]$, written $P(O|\lambda)$. Note that all the model parameters are known here, so this is purely an inference process.
Let's first look at the direct solution. We massage the conditional probability a little and bring in the state variables:
$$P(O|\lambda) = \sum_{\text{all } S} P(S, O|\lambda) = \sum_{\text{all } S} P(O|S, \lambda)\,P(S|\lambda) \tag{4-1}$$
This works: we take every possible state sequence $S$ into account, which is just total probability.
Next we expand $P(S|\lambda)$. Since $\lambda$ merely indicates that all the model parameters are known, it may be written or omitted:
$$\begin{aligned}
P(S|\lambda) &= P(s_1, s_2, ..., s_T | \lambda) \\
&= P(s_T|s_1, s_2, ..., s_{T-1}, \lambda)\,P(s_1, s_2, ..., s_{T-1} | \lambda) \\
&\quad \text{(apply the homogeneous Markov assumption (3-1); } \lambda \text{ may be written or omitted)} \\
&= P(s_T|s_{T-1})\,P(s_1, s_2, ..., s_{T-1}) \\
&\quad \text{(keep splitting in the same way)} \\
&= P(s_T|s_{T-1})\,P(s_{T-1}|s_{T-2}) \cdots P(s_2|s_1)\,P(s_1) \\
&\quad \text{(every factor except the last comes from the transition matrix)} \\
&= \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1}
\end{aligned} \tag{4-2}$$
Then we expand $P(O|S, \lambda)$:
$$\begin{aligned}
P(O|S, \lambda) &= P(o_1, o_2, ..., o_T | s_1, s_2, ..., s_T, \lambda) \\
&= P(o_T | o_1, ..., o_{T-1}, s_1, ..., s_T, \lambda)\,P(o_1, ..., o_{T-1} | s_1, ..., s_T, \lambda) \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= P(o_T|s_T)\,P(o_1, ..., o_{T-1} | s_1, ..., s_T, \lambda) \\
&\quad \text{(keep splitting in the same way)} \\
&= P(o_T|s_T)\,P(o_{T-1}|s_{T-1}) \cdots P(o_1|s_1) \\
&\quad \text{(write each factor with the emission probability function)} \\
&= \prod_{t=1}^{T} b_{s_t}(o_t)
\end{aligned} \tag{4-3}$$
Substituting (4-2) and (4-3) into (4-1) gives
$$\begin{aligned}
P(O|\lambda) &= \sum_{\text{all } S} \prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1} \\
&\quad \text{(expand "all } S\text{")} \\
&= \sum_{s_1=q_1}^{q_N} \sum_{s_2=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1}
\end{aligned} \tag{4-4}$$
The complexity of this is $O(N^T)$: the amount of computation grows exponentially with the sequence length, so it does not work in practice. The sketch below makes this concrete.
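Here is a minimal Python sketch of the brute-force sum (4-4). The toy parameters (`pi`, `A`, `B`, `obs`) are made-up numbers purely for illustration — they are not from the original video — and for simplicity the emission is a discrete lookup table rather than the GMM mentioned earlier.

```python
import itertools
import numpy as np

# Toy HMM with N = 2 hidden states and 3 discrete observation symbols.
pi = np.array([0.6, 0.4])          # pi_i = P(s_1 = q_i)
A = np.array([[0.7, 0.3],          # a_ij = P(s_t = q_j | s_{t-1} = q_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # b_j(o) = P(o_t = o | s_t = q_j)
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 0]                 # observation sequence, T = 4

def evaluate_brute_force(pi, A, B, obs):
    """P(O | lambda) via (4-4): enumerate all N^T state paths."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):       # "all S"
        p = pi[path[0]] * B[path[0], obs[0]]                  # pi_{s_1} b_{s_1}(o_1)
        for t in range(1, T):                                 # a_{s_t,s_{t+1}} b_{s_{t+1}}(o_{t+1})
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

print(evaluate_brute_force(pi, A, B, obs))  # fine for T = 4, hopeless for large T
```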
Therefore, the forward and backward algorithms were proposed to reduce the computational cost.
4.1 Forward algorithm
As shown in Figure 3, the forward algorithm considers the joint probability of the variables in the orange box, denoted
$$\alpha_t(q_i) = P(o_1, o_2, ..., o_t, s_t = q_i \mid \lambda) \tag{4-5}$$
As for why (4-5) takes exactly this form — I honestly cannot answer that. It is a design choice; other designs would presumably also work.
Let's first see what $\alpha_T(q_i)$ looks like:
$$\alpha_T(q_i) = P(o_1, o_2, ..., o_T, s_T = q_i \mid \lambda) = P(O, s_T = q_i \mid \lambda) \tag{4-6}$$
The states $s_T$ can take are enumerable, so we traverse all the possibilities for $s_T$ and sum, giving
$$\sum_{i=1}^N \alpha_T(q_i) = \sum_{i=1}^N P(O, s_T = q_i \mid \lambda) = P(O \mid \lambda) \tag{4-7}$$
See — $P(O|\lambda)$ appears.
Our next job is to work out how to compute $\alpha_t(q_i)$. We already know $\alpha_1(q_i)$:
$$\alpha_1(q_i) = P(o_1, s_1 = q_i \mid \lambda) = P(o_1 \mid s_1 = q_i, \lambda)\,P(s_1 = q_i \mid \lambda)$$
See that? One factor is our emission probability and the other is our initial probability, so
$$\alpha_1(q_i) = b_i(o_1)\,\pi_i \tag{4-8}$$
Now that we know $\alpha_1(q_i)$, if we can also find the recurrence from $\alpha_t(q_i)$ to $\alpha_{t+1}(q_i)$, isn't the problem as good as solved? Let's try! Whether $\lambda$ is written or not does not matter as long as we keep it in mind, so I will omit it from here on.
$$\begin{aligned}
\alpha_{t+1}(q_i) &= P(o_1, o_2, ..., o_{t+1}, s_{t+1}=q_i) \\
&\quad \text{(use total probability to bring in an } s_t\text{)} \\
&= \sum_{j=1}^N P(o_1, o_2, ..., o_{t+1}, s_t=q_j, s_{t+1}=q_i) \\
&\quad \text{(split out } o_{t+1}\text{)} \\
&= \sum_{j=1}^N P(o_{t+1}|o_1, o_2, ..., o_t, s_t=q_j, s_{t+1}=q_i)\,P(o_1, o_2, ..., o_t, s_t=q_j, s_{t+1}=q_i) \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= \sum_{j=1}^N P(o_{t+1}|s_{t+1}=q_i)\,P(o_1, o_2, ..., o_t, s_t=q_j, s_{t+1}=q_i) \\
&\quad \text{(split } s_{t+1} \text{ out of the last factor)} \\
&= \sum_{j=1}^N P(o_{t+1}|s_{t+1}=q_i)\,P(s_{t+1}=q_i|o_1, o_2, ..., o_t, s_t=q_j)\,P(o_1, o_2, ..., o_t, s_t=q_j) \\
&\quad \text{(apply the homogeneous Markov assumption (3-1))} \\
&= \sum_{j=1}^N P(o_{t+1}|s_{t+1}=q_i)\,P(s_{t+1}=q_i|s_t=q_j)\,P(o_1, o_2, ..., o_t, s_t=q_j)
\end{aligned}$$
Found it? The three factors are exactly the emission probability, the state transition probability, and $\alpha_t(q_j)$.
So we obtain the recurrence
$$\alpha_{t+1}(q_i) = \sum_{j=1}^N b_i(o_{t+1})\,a_{ji}\,\alpha_t(q_j) \tag{4-9}$$
Combining (4-8) and (4-9), we can compute $\alpha_T(q_i)$ for every state, and (4-7) is solved. The complexity is now $O(TN^2)$. A sketch in code follows.
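Here is a minimal sketch of the forward recursion (4-8)/(4-9), reusing the toy parameters from the brute-force example above (again, made-up numbers):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1, ..., o_{t+1}, s_{t+1} = q_i | lambda), 0-based t."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = B[:, obs[0]] * pi                 # (4-8): alpha_1(q_i) = b_i(o_1) pi_i
    for t in range(1, T):
        # (4-9): alpha_{t+1}(q_i) = b_i(o_{t+1}) * sum_j a_ji alpha_t(q_j)
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha

alpha = forward(pi, A, B, obs)
print(alpha[-1].sum())     # (4-7): matches evaluate_brute_force(pi, A, B, obs)
```

Each step multiplies a length-$N$ vector by the $N \times N$ transition matrix, which is exactly where the $O(TN^2)$ complexity comes from.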
4.2 Backward algorithm
As shown in Figure 4, the backward algorithm considers the joint probability of the variables in the green box, denoted
$$\beta_t(q_i) = P(o_{t+1}, ..., o_{T-1}, o_T \mid s_t = q_i, \lambda) \tag{4-10}$$
This too is a design choice, complementary to the forward one. Note the difference from (4-5); the backward derivation is a little more convoluted than the forward one.
Let's see what $\beta_1(q_i)$ looks like:
$$\beta_1(q_i) = P(o_2, ..., o_{T-1}, o_T \mid s_1 = q_i, \lambda) \tag{4-11}$$
Now let's see how this $\beta_1(q_i)$ relates to the $P(O|\lambda)$ we want:
$$\begin{aligned}
P(O|\lambda) &= P(o_1, o_2, ..., o_T|\lambda) \\
&\quad \text{(omit } \lambda \text{ and introduce } s_1\text{)} \\
&= \sum_{i=1}^{N} P(o_1, o_2, ..., o_T, s_1=q_i) \\
&\quad \text{(move } s_1 \text{ into the condition)} \\
&= \sum_{i=1}^{N} P(o_1, o_2, ..., o_T | s_1=q_i)\,P(s_1=q_i) \\
&\quad \text{(split out } o_1\text{; note that for the backward case the initial probability appears here)} \\
&= \sum_{i=1}^{N} P(o_1 | o_2, ..., o_T, s_1=q_i)\,P(o_2, ..., o_T | s_1=q_i)\,\pi_i \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= \sum_{i=1}^{N} P(o_1 | s_1=q_i)\,P(o_2, ..., o_T | s_1=q_i)\,\pi_i \\
&\quad \text{(plug in (4-11) and the emission probability)} \\
&= \sum_{i=1}^{N} b_i(o_1)\,\beta_1(q_i)\,\pi_i
\end{aligned} \tag{4-12}$$
At this point, $P(O|\lambda)$ and $\beta_1(q_i)$ are connected; all that remains is to compute $\beta_1(q_i)$.
We set
$$\beta_T(q_i) = 1 \tag{4-13}$$
Then we work out the recurrence between $\beta_t(q_i)$ and $\beta_{t+1}(q_j)$, with $\lambda$ simply omitted:
$$\begin{aligned}
\beta_t(q_i) &= P(o_{t+1}, ..., o_{T-1}, o_T | s_t = q_i) \\
&\quad \text{(use total probability to introduce } s_{t+1}\text{)} \\
&= \sum_{j=1}^{N} P(o_{t+1}, ..., o_T, s_{t+1}=q_j | s_t=q_i) \\
&\quad \text{(move } s_{t+1} \text{ into the condition)} \\
&= \sum_{j=1}^{N} P(o_{t+1}, ..., o_T | s_{t+1}=q_j, s_t=q_i)\,P(s_{t+1}=q_j | s_t=q_i) \\
&\quad \text{(in the first factor, } o_{t+1}, ..., o_T \text{ depends only on } s_{t+1}=q_j \text{ — this can be proved, but is not proved here;} \\
&\quad \text{the second factor is the state transition probability)} \\
&= \sum_{j=1}^{N} P(o_{t+1}, ..., o_T | s_{t+1}=q_j)\,a_{ij} \\
&\quad \text{(split out } o_{t+1}\text{)} \\
&= \sum_{j=1}^{N} P(o_{t+1} | o_{t+2}, ..., o_T, s_{t+1}=q_j)\,P(o_{t+2}, ..., o_T | s_{t+1}=q_j)\,a_{ij} \\
&\quad \text{(apply the observation independence assumption (3-2))} \\
&= \sum_{j=1}^{N} P(o_{t+1} | s_{t+1}=q_j)\,\beta_{t+1}(q_j)\,a_{ij} \\
&\quad \text{(the first factor is the emission probability)} \\
&= \sum_{j=1}^{N} b_j(o_{t+1})\,\beta_{t+1}(q_j)\,a_{ij}
\end{aligned} \tag{4-14}$$
Combining (4-13) and (4-14), we can compute $\beta_1(q_i)$ and hence obtain $P(O|\lambda)$; see the sketch below.
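A matching sketch of the backward recursion (4-13)/(4-14), again with the same toy parameters; the last line checks (4-12):

```python
import numpy as np

def backward(pi, A, B, obs):
    """beta[t, i] = P(o_{t+2}, ..., o_T | s_{t+1} = q_i, lambda), 0-based t."""
    N, T = len(pi), len(obs)
    beta = np.ones((T, N))                      # (4-13): beta_T(q_i) = 1
    for t in range(T - 2, -1, -1):
        # (4-14): beta_t(q_i) = sum_j b_j(o_{t+1}) beta_{t+1}(q_j) a_ij
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward(pi, A, B, obs)
print((B[:, obs[0]] * beta[0] * pi).sum())      # (4-12): sum_i b_i(o_1) beta_1(q_i) pi_i
```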
It is worth noting that at any time $t$, combining the forward and backward quantities gives
$$P(O|\lambda) = \sum_{i=1}^N \alpha_t(q_i)\,\beta_t(q_i) \tag{4-15}$$
Here is a brief derivation. It relies on the fact that, given $s_t$, the future observations $o_{t+1}, ..., o_T$ are independent of the past observations $o_1, ..., o_t$ — like the similar step in (4-14), this follows from the model's structure.
$$\begin{aligned}
P(O|\lambda) &= \sum_{i=1}^N P(O, s_t=q_i|\lambda) \\
&= \sum_{i=1}^N P(o_1, ..., o_t, o_{t+1}, ..., o_T, s_t=q_i|\lambda) \\
&= \sum_{i=1}^N P(o_1, ..., o_t, s_t=q_i|\lambda)\,P(o_{t+1}, ..., o_T | o_1, ..., o_t, s_t=q_i) \\
&\quad \text{(the first factor is } \alpha_t(q_i)\text{; by the conditional independence above, the second drops } o_1, ..., o_t \text{ and is } \beta_t(q_i)\text{)} \\
&= \sum_{i=1}^N \alpha_t(q_i)\,\beta_t(q_i)
\end{aligned}$$
Although that conditional-independence step is used here without proof, it simplifies the calculation, and this is how it is always done. A quick numerical check follows.
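As a quick check of (4-15), reusing `alpha` and `beta` from the forward/backward sketches above: the sum $\sum_i \alpha_t(q_i)\beta_t(q_i)$ comes out the same at every $t$.

```python
# (4-15): at every t, sum_i alpha_t(q_i) beta_t(q_i) = P(O | lambda),
# reusing alpha and beta computed by the sketches above.
for t in range(len(obs)):
    print(t, (alpha[t] * beta[t]).sum())    # the same value at every time step
```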
5 Learning
What Learning does is: given an observation sequence, find the set of parameters $\lambda$ that maximizes the probability of obtaining that observation sequence, i.e.
$$\lambda_{MLE} = \arg\max_{\lambda} P(O|\lambda) \tag{5-1}$$
Here MLE means Maximum Likelihood Estimation.
By rights, if we could write down the derivative and solve it, this $\lambda_{MLE}$ would come out in one shot. But here $P(O|\lambda)$ generally involves a Gaussian mixture, which cannot be differentiated and solved directly, so we need the EM algorithm.
This article does not cover what the EM algorithm is; we simply use it. If you want to learn what EM is, I recommend Mr. Xu Yida's explanation of the EM algorithm. In one sentence: when the derivative cannot be solved directly, we introduce a latent variable and differentiate a different objective instead; that objective is comparatively easy to differentiate, but it cannot be solved in one step — it must be iterated, gradually approaching a local optimum.
The iterative formula of the EM algorithm is
$$\theta^{(t+1)} = \arg\max_{\theta} \int_{z} \log P(x, z|\theta)\,P(z|x, \theta^{(t)})\,dz \tag{5-2}$$
Here $\theta$ is our model parameters $\lambda$, $x$ is our observation variable $O$, and $z$ is our state variable $S$. Since our $S$ is discrete, the integral becomes a sum. Rewriting (5-2) gives
$$\lambda^{(t+1)} = \arg\max_{\lambda} \sum_{\text{all } S} \log P(O, S|\lambda)\,P(S|O, \lambda^{(t)}) \tag{5-3}$$
Next we make a small change to $P(S|O, \lambda^{(t)})$:
$$P(S|O, \lambda^{(t)}) = \frac{P(S, O|\lambda^{(t)})}{P(O|\lambda^{(t)})}$$
Here $\lambda^{(t)}$ is a constant and $O$ does not depend on $\lambda$, so the term $P(O|\lambda^{(t)})$ is a constant and can be ignored. Therefore (5-3) becomes
$$\lambda^{(t+1)} = \arg\max_{\lambda} \sum_{\text{all } S} \log P(O, S|\lambda)\,P(O, S|\lambda^{(t)}) \tag{5-4}$$
At run time, we just iterate (5-4). But look — this argmax still cannot be solved outright. It is actually rather involved, so below I only take the initial-probability parameter $\pi$ as an example and describe it briefly; I will not derive the rest — it is too complicated, and I can't bear it.
We define
$$Q(\lambda, \lambda^{(t)}) = \sum_{\text{all } S} \log P(O, S|\lambda)\,P(O, S|\lambda^{(t)}) \tag{5-5}$$
Substituting (4-4) in, we get
$$\begin{aligned}
Q(\lambda, \lambda^{(t)}) &= \sum_{s_1=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \log\!\left(\prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T-1} a_{s_t, s_{t+1}} \pi_{s_1}\right) P(O, S|\lambda^{(t)}) \\
&= \sum_{s_1=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \left(\log \pi_{s_1} + \sum_{t=1}^{T} \log b_{s_t}(o_t) + \sum_{t=1}^{T-1} \log a_{s_t, s_{t+1}}\right) P(O, S|\lambda^{(t)})
\end{aligned}$$
Good. Now let the initial-probability parameter at the $t$-th iteration be $\pi^{(t)}$. Then
$$\begin{aligned}
\pi^{(t+1)} &= \arg\max_{\pi} Q(\lambda, \lambda^{(t)}) \\
&\quad \text{(filter out the terms that do not involve } \pi\text{)} \\
&= \arg\max_{\pi} \sum_{s_1=q_1}^{q_N} \cdots \sum_{s_T=q_1}^{q_N} \log \pi_{s_1}\,P(O, s_1, ..., s_T|\lambda^{(t)}) \\
&\quad \text{(the state variables unrelated to } s_1 \text{ are summed out by total probability)} \\
&= \arg\max_{\pi} \sum_{s_1=q_1}^{q_N} \log \pi_{s_1}\,P(O, s_1|\lambda^{(t)})
\end{aligned} \tag{5-6}$$
At this point the derivative is easy to take. But don't forget there is a constraint, namely
$$\text{s.t.} \quad \sum_{s_1=q_1}^{q_N} \pi_{s_1} = 1 \tag{5-7}$$
A constrained extremum calls for the Lagrange multiplier method. Let
$$L(\pi, \eta) = \sum_{s_1=q_1}^{q_N} \log \pi_{s_1}\,P(O, s_1|\lambda^{(t)}) + \eta \left(\sum_{s_1=q_1}^{q_N} \pi_{s_1} - 1\right) \tag{5-8}$$
Taking the partial derivative of (5-8), we have
$$\frac{\partial L}{\partial \pi_{s_1}} = \frac{1}{\pi_{s_1}}\,P(O, s_1|\lambda^{(t)}) + \eta \tag{5-9}$$
Setting the partial derivative to zero, we have
$$P(O, s_1|\lambda^{(t)}) + \pi_{s_1}^{(t+1)}\,\eta = 0 \tag{5-10}$$
Summing over all the state values, we have
$$\sum_{s_1=q_1}^{q_N} \left( P(O, s_1|\lambda^{(t)}) + \pi_{s_1}^{(t+1)}\,\eta \right) = 0$$
Since the $\pi_{s_1}^{(t+1)}$ sum to 1 by (5-7), this gives
$$P(O|\lambda^{(t)}) + \eta = 0$$
i.e.
$$\eta = -P(O|\lambda^{(t)}) \tag{5-11}$$
Substituting (5-11) into (5-10) gives
$$\pi_{s_1}^{(t+1)} = \frac{P(O, s_1|\lambda^{(t)})}{P(O|\lambda^{(t)})} \tag{5-12}$$
Finally we have it. The other parameters can be found in a similar way, just more complicated. A sketch of this $\pi$ update in code follows.
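As a minimal sketch of one EM step for $\pi$, reusing the `forward` and `backward` sketches from the Evaluation section: by the argument behind (4-15) applied at $t = 1$, the numerator $P(O, s_1 = q_i \mid \lambda^{(t)})$ equals $\alpha_1(q_i)\beta_1(q_i)$. The full Baum-Welch updates for $a$ and $b$ are omitted here, matching the text.

```python
# One EM update of pi via (5-12), reusing forward() and backward() above.
def update_pi(pi, A, B, obs):
    alpha = forward(pi, A, B, obs)
    beta = backward(pi, A, B, obs)
    joint = alpha[0] * beta[0]       # P(O, s_1 = q_i | lambda^(t))
    return joint / joint.sum()       # (5-12): divide by P(O | lambda^(t))

print(update_pi(pi, A, B, obs))      # the new pi^(t+1); its entries sum to 1
```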
6 Decoding
The problem Decoding solves is
$$\hat{S} = \arg\max_{S} P(S|O, \lambda) \tag{6-1}$$
In words: given the model parameters and the observations, find the best corresponding sequence of state variables.
Because $P(O|\lambda)$ involves only the observed variables and does not depend on $S$, we can also say that (6-1) is equivalent to
$$\hat{S} = \arg\max_{S} P(S|O, \lambda)\,P(O|\lambda) = \arg\max_{S} P(S, O|\lambda) \tag{6-2}$$
Let's draw a picture:

One look at the picture makes it clear: at every time step the state variable has $N$ possible states; we pick one state at each time step to form a path, and we want the joint probability along the whole path to be as large as possible.
There are $N^T$ paths in total. Computing the probability of every path and taking the largest is far too expensive in time. Instead we solve the problem with dynamic programming — this is the Viterbi algorithm.
We define
$$\delta_t(q_i) = \max_{s_1, ..., s_{t-1}} P(o_1, ..., o_t, s_1, ..., s_{t-1}, s_t = q_i \mid \lambda) \tag{6-3}$$
In words: given that the state at time $t$ is $q_i$, $\delta_t(q_i)$ is the maximum, over state paths $[s_1, ..., s_{t-1}]$, of the joint probability up to time $t$.
Now let's look at the relationship between $\delta_{t+1}(q_j)$ and $\delta_t(q_i)$:
$$\begin{aligned}
\delta_{t+1}(q_j) &= \max_{s_1, ..., s_t} P(o_1, ..., o_{t+1}, s_1, ..., s_t, s_{t+1}=q_j | \lambda) \\
&\quad \text{(traverse all the } \delta_t(q_i) \text{ at time } t\text{)} \\
&= \max_{1 \leq i \leq N} \delta_t(q_i)\,a_{ij}\,b_j(o_{t+1})
\end{aligned} \tag{6-4}$$
This is the recurrence. With it, all the $\delta_t(q_i)$ can be computed.
At the same time, we also need to record, for each state at each time step, the optimal previous state:
$$\psi_t(j) = \arg\max_{i} \delta_t(q_i)\,a_{ij} \tag{6-5}$$
Formula (6-5) is used to backtrack the path — anyone who has studied dynamic programming will find this familiar.
Finally, at the last time step, we find the state with the largest probability, written
$$q_T^* = \arg\max_{i} \delta_T(q_i) \tag{6-6}$$
Then we just keep backtracking with (6-5). A sketch of the whole procedure follows.
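Here is a minimal sketch of Viterbi following (6-3) through (6-6), reusing the toy parameters from the Evaluation section. In practice one usually works with log probabilities to avoid underflow on long sequences; this sketch keeps plain probabilities to match the equations.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path, following (6-3) through (6-6)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # delta[t, i]: best joint prob ending in q_i
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best previous state, as in (6-5)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # (6-4): delta_{t+1}(q_j) = max_i delta_t(q_i) a_ij b_j(o_{t+1})
        cand = delta[t - 1][:, None] * A        # cand[i, j] = delta_t(q_i) a_ij
        psi[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]            # (6-6): best final state
    for t in range(T - 1, 0, -1):               # backtrack with psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

path, prob = viterbi(pi, A, B, obs)
print(path, prob)
```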
References
[1] Machine Learning - Whiteboard Derivation Series (14): Hidden Markov Model HMM (Bilibili)