Hidden Markov model (HMM): model parameter estimation
2022-07-01 16:53:00 【HadesZ~】
Estimation of HMM model parameters can be divided into supervised and unsupervised learning algorithms, according to whether the observation sequences come paired with their corresponding state sequences.
1. Supervised learning estimation of HMM model parameters
Suppose the training data contain $n$ observation sequences together with their corresponding state sequences (the sequences may or may not have the same length), $\{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$. When both the state sequences and the observation sequences are known, the parameter values on each sample can be obtained by direct counting according to the definitions of the model parameters, and the estimates of the model parameters are then obtained by taking the expectation over all samples of the training data set.
1.1 Estimation of the transition probability $a_{ij}$
Suppose that in the $k$-th sample, the number of times the chain is in state $s_i$ at time $t$ and in state $s_j$ at time $t+1$ is $A_{ij}$. Then the estimate of the state transition probability $a_{ij}$ is:
$$\hat{a}_{ij} = \frac{ \sum_{k=1}^{n} A_{ij} }{ \sum_{k=1}^{n} \sum_{j=1}^{N} A_{ij} }, \qquad i = 1, 2, \cdots, N; \quad j = 1, 2, \cdots, N \tag{1.1}$$
1.2 Estimation of the observation probability $b_j(o_l)$
Suppose that in the $k$-th sample, the number of times the state is $s_j$ and the observation is $o_l$ is $B_{jl}$. Then the estimate of the observation probability $b_j(o_l)$ is:
$$\hat{b}_{j}(o_l) = \frac{ \sum_{k=1}^{n} B_{jl} }{ \sum_{k=1}^{n} \sum_{l=1}^{M} B_{jl} }, \qquad j = 1, 2, \cdots, N; \quad l = 1, 2, \cdots, M \tag{1.2}$$
1.3 Estimation of the initial state probability $\pi_i$
The estimate of the initial state probability $\pi_i$ is the relative frequency, among the $n$ samples, with which $s_i$ is the initial state:
$$\hat{\pi}_i = \frac{1}{n}\sum_{k=1}^{n} \mathbf{1}(y_1 = s_i), \qquad i = 1, 2, \cdots, N \tag{1.3}$$
where $\mathbf{1}(\cdot)$ is the indicator function.
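The counting estimators (1.1)–(1.3) are easy to implement directly. Below is a minimal NumPy sketch; the function name, argument layout, and integer encoding of states and observations are illustrative assumptions, not from the original article:

```python
import numpy as np

def supervised_hmm_estimate(samples, N, M):
    """Counting-based estimates of (A, B, pi) from paired sequences.

    samples: list of (x, y) pairs, where x is a list of observation indices
             (0..M-1) and y is an equally long list of state indices (0..N-1).
    """
    A = np.zeros((N, N))   # A[i, j]: count of transitions s_i -> s_j
    B = np.zeros((N, M))   # B[j, l]: count of state s_j emitting o_l
    pi = np.zeros(N)       # pi[i]: count of sequences starting in s_i

    for x, y in samples:
        pi[y[0]] += 1
        for t in range(len(y) - 1):
            A[y[t], y[t + 1]] += 1      # numerator counts of Eq. (1.1)
        for t in range(len(y)):
            B[y[t], x[t]] += 1          # numerator counts of Eq. (1.2)

    # Normalize counts into probabilities (Eqs. 1.1-1.3)
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi
```

In practice a state that never occurs in the training data would produce a zero row; a real implementation would add smoothing, which this sketch omits.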
2. Unsupervised learning estimation of HMM model parameters
Because labeling states is costly, the more common situation is that only observation sequence data are given and the HMM parameters must be estimated from them. Assume the training data contain only an observation sequence $X$ of length $T$ and no corresponding state sequence $Y$; the goal is to estimate the hidden Markov model parameters $\lambda = (A, B, \pi)$. In this case, the state sequence $Y$ is an unobservable hidden variable, and the HMM is a probability model with hidden variables:
$$P(X \mid \lambda)= \sum_{Y}P(X, Y \mid \lambda) = \sum_{Y}P(X \mid Y, \lambda)P(Y \mid \lambda) \tag{2.1}$$
According to the EM algorithm, the maximum likelihood estimate of the HMM parameters is[1]:
$$\hat{\lambda} = \argmax_{\lambda} Q(\lambda, \bar{\lambda}) \tag{2.2}$$
$$Q(\lambda, \bar{\lambda}) = \sum_{Y} P(X, Y \mid \bar{\lambda}) \cdot \log P(X, Y \mid \lambda) \tag{2.3}$$
According to the definition of the $Q$ function, Eq. (2.3) omits the factor $1/P(X \mid \bar{\lambda})$, which is a constant with respect to $\lambda$.
By the definition of the HMM,
$$P(X, Y \mid \lambda) = \pi_{i_1}b_{i_1}(x_1) \, a_{i_1 i_2}b_{i_2}(x_2) \cdots a_{i_{T-1} i_T}b_{i_T}(x_T)$$
Taking the logarithm turns this product into a sum, so the terms involving each group of model parameters can be separated and collected, and Eq. (2.3) can be rewritten as:
$$Q(\lambda, \bar{\lambda}) = \sum_{Y} \log(\pi_{i_1}) \, P(X, Y \mid \bar{\lambda}) + \sum_{Y} \Big[\sum_{t=1}^{T-1} \log(a_{i_t i_{t+1}})\Big] P(X, Y \mid \bar{\lambda}) + \sum_{Y} \Big[\sum_{t=1}^{T} \log(b_{i_t}(x_t)) \Big] P(X, Y \mid \bar{\lambda})$$
Therefore, finding the maximum likelihood estimate of the model parameters $\lambda$ can be converted into maximizing each of the three terms separately:
$$\hat{\pi}_i = \argmax_{\pi_i} \sum_{Y} \log(\pi_{i_1}) \, P(X, Y \mid \bar{\lambda}) \tag{2.4}$$
$$\hat{a}_{ij} = \argmax_{a_{ij}} \sum_{Y} \Big[\sum_{t=1}^{T-1} \log(a_{i_t i_{t+1}})\Big] P(X, Y \mid \bar{\lambda}) \tag{2.5}$$
$$\hat{b}_j(k) = \argmax_{b_j(k)} \sum_{Y} \Big[\sum_{t=1}^{T} \log(b_{i_t}(x_t))\Big] P(X, Y \mid \bar{\lambda}) \tag{2.6}$$
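To see the decomposition concretely, the sketch below (variable names are hypothetical; NumPy assumed) evaluates $\log P(X, Y \mid \lambda)$ for one complete state path: it is exactly the sum of one initial-state term, the transition terms, and the emission terms that make up the three summands of $Q$ above.

```python
import numpy as np

def log_joint(x, y, A, B, pi):
    """log P(X, Y | lambda) for observation sequence x and state path y (integer indices)."""
    logp = np.log(pi[y[0]]) + np.log(B[y[0], x[0]])   # initial-state and first emission term
    for t in range(1, len(x)):
        logp += np.log(A[y[t - 1], y[t]])             # transition term a_{i_t i_{t+1}}
        logp += np.log(B[y[t], x[t]])                 # emission term b_{i_t}(x_t)
    return logp
```

Weighting `log_joint` by $P(X, Y \mid \bar{\lambda})$ and summing over all paths $Y$ gives $Q(\lambda, \bar{\lambda})$, which is why the maximization splits into Eqs. (2.4)–(2.6).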
2.1 Estimation of the initial state probability $\pi_i$
$$\hat{\pi}_i = \argmax_{\pi_i} \sum_{Y} \log(\pi_{i_1}) \, P(X, Y \mid \bar{\lambda}) \tag{2.4}$$
Since only the initial state appears in this term, summing over $Y$ reduces to summing over the possible values of $y_1$:
$$\hat{\pi}_i = \argmax_{\pi_i} \sum_{i=1}^{N} \log(\pi_{i}) \, P(X, y_1=s_i \mid \bar{\lambda}) \tag{2.1.1}$$
Noting that $\pi_i$ satisfies the constraint $\sum_{i=1}^{N} \pi_i = 1$, write the Lagrangian of Eq. (2.1.1):
$$\sum_{i=1}^{N} \log(\pi_{i}) \, P(X, y_1=s_i \mid \bar{\lambda}) + \gamma\Big(\sum_{i=1}^{N} \pi_i - 1\Big) \tag{2.1.2}$$
Taking the partial derivative of Eq. (2.1.2) with respect to $\pi_i$ and setting the result to zero gives:
$$\frac{P(X, y_1=s_i \mid \bar{\lambda})}{\pi_{i}} + \gamma = 0$$
$$P(X, y_1=s_i \mid \bar{\lambda}) + \gamma \pi_{i} = 0 \tag{2.1.3}$$
Summing Eq. (2.1.3) over all possible values of $i$ gives:
$$\sum_{i=1}^{N} \big[P(X, y_1=s_i \mid \bar{\lambda}) + \gamma \pi_{i}\big] = 0$$
$$\sum_{i=1}^{N} P(X, y_1=s_i \mid \bar{\lambda}) + \gamma \sum_{i=1}^{N} \pi_{i} = 0 \tag{2.1.4}$$
Since
$$\begin{cases} \sum_{i=1}^{N} P(X, y_1=s_i \mid \bar{\lambda}) = P(X \mid \bar{\lambda}) \\ \sum_{i=1}^{N} \pi_{i} = 1 \end{cases}$$
it follows that
$$\gamma = -P(X \mid \bar{\lambda}) \tag{2.1.5}$$
Substituting this back into Eq. (2.1.3) gives the maximum likelihood estimate of $\pi_i$:
$$\hat{\pi}_i = \frac{P(X, y_1=s_i \mid \bar{\lambda})}{P(X \mid \bar{\lambda})} \tag{2.1.6}$$
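The constrained maximization above can also be checked symbolically. A minimal SymPy sketch (assuming SymPy is available; not part of the original derivation) for $N = 3$: writing $c_i$ for $P(X, y_1=s_i \mid \bar{\lambda})$, maximizing $\sum_i c_i \log \pi_i$ subject to $\sum_i \pi_i = 1$ recovers $\pi_i = c_i / \sum_j c_j$, i.e., Eq. (2.1.6).

```python
import sympy as sp

N = 3
pi = sp.symbols(f'pi1:{N + 1}', positive=True)   # pi1, pi2, pi3
c = sp.symbols(f'c1:{N + 1}', positive=True)     # c_i stands for P(X, y_1 = s_i | lambda_bar)
gamma = sp.symbols('gamma')

# Lagrangian of Eq. (2.1.2)
L = sum(ci * sp.log(pii) for ci, pii in zip(c, pi)) + gamma * (sum(pi) - 1)

# Stationarity conditions plus the normalization constraint
eqs = [sp.diff(L, pii) for pii in pi] + [sum(pi) - 1]
sol = sp.solve(eqs, list(pi) + [gamma], dict=True)[0]

print(sp.simplify(sol[pi[0]] - c[0] / sum(c)))   # expected output: 0
```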
2.2 Estimation of the transition probability $a_{ij}$
$$\hat{a}_{ij} = \argmax_{a_{ij}} \sum_{Y} \Big[\sum_{t=1}^{T-1} \log(a_{i_t i_{t+1}})\Big] P(X, Y \mid \bar{\lambda}) \tag{2.5}$$
Collecting terms according to the pair of states occupied at successive times, this becomes:
$$\hat{a}_{ij} = \argmax_{a_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} \log(a_{ij}) \, P(X, y_t=s_i, y_{t+1}=s_j \mid \bar{\lambda}) \tag{2.2.1}$$
Similarly, applying the Lagrange multiplier method with the constraint $\sum_{j=1}^{N} a_{ij} = 1$, we obtain the maximum likelihood estimate of $a_{ij}$:
$$\hat{a}_{ij} = \frac{ \sum_{t=1}^{T-1} P(X, y_t=s_i, y_{t+1}=s_j \mid \bar{\lambda}) }{ \sum_{t=1}^{T-1} P(X, y_t=s_i \mid \bar{\lambda}) } \tag{2.2.2}$$
2.3 Estimation of the observation probability $b_j(k)$
$$\hat{b}_j(k) = \argmax_{b_j(k)} \sum_{Y} \Big[\sum_{t=1}^{T} \log(b_{i_t}(x_t))\Big] P(X, Y \mid \bar{\lambda}) \tag{2.6}$$
Collecting terms according to the state occupied at each time, this becomes:
$$\hat{b}_j(k) = \argmax_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} \log(b_{j}(x_t)) \, P(X, y_t=s_j \mid \bar{\lambda}) \tag{2.3.1}$$
The Lagrange multiplier method is applied again, this time with the constraint $\sum_{k=1}^{M} b_j(k) = 1$. Note that the partial derivative of $b_j(x_t)$ with respect to $b_j(k)$ is nonzero only when $x_t = o_k$. This gives the maximum likelihood estimate of $b_j(k)$:
$$\hat{b}_j(k) = \frac{ \sum_{t=1}^{T} P(X \cap \bar{x}_t, \ x_t=o_k, \ y_t=s_j \mid \bar{\lambda}) }{ \sum_{t=1}^{T} \sum_{k=1}^{M} P(X \cap \bar{x}_t, \ x_t=o_k, \ y_t=s_j \mid \bar{\lambda}) }$$
2.4 Baum-Welch Algorithm implementation
Input: the observation sequence of the random process.
Output: the maximum likelihood estimates of the hidden Markov model parameters.
(1) Initialization
For $n=0$, choose arbitrary $a_{ij}^{(0)}, \ b_{j}(k)^{(0)}, \ \pi_i^{(0)}$ within their admissible ranges, which gives the initial model parameters $\lambda^{(0)} = (A^{(0)}, B^{(0)}, \pi^{(0)})$;
(2) Iterative training
$$a_{ij}^{(n+1)} = \frac{ \sum_{t=1}^{T-1} \xi_t(i,j \mid X, \lambda^{(n)}) }{ \sum_{t=1}^{T-1} \gamma_t(i \mid X, \lambda^{(n)}) }$$
$$b_j(k)^{(n+1)} = \frac{ \sum_{t=1, \ x_t=o_k}^{T} \gamma_t(j \mid X, \lambda^{(n)}) }{ \sum_{t=1}^{T} \gamma_t(j \mid X, \lambda^{(n)}) }$$
$$\pi_i^{(n+1)} = \gamma_1(i \mid X, \lambda^{(n)})$$
Here $\xi_t(i,j)$ and $\gamma_t(i)$ are obtained from the forward and backward algorithms of the HMM; for the detailed derivation, see Section 4 of the author's article "Hidden Markov model (HMM): computing the probability of the observation sequence".
(3) Termination
When $\lambda^{(n+1)}$ barely changes or changes by less than a given threshold (i.e., it has converged), stop the iterative training. The maximum likelihood estimates of the model parameters are:
$$\begin{cases} \hat{a}_{ij} = a_{ij}^{(n+1)} \\ \hat{b}_j(k) = b_j(k)^{(n+1)} \\ \hat{\pi}_i = \pi_i^{(n+1)} \end{cases}$$
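The iteration above can be sketched in a few lines of NumPy. The version below is a minimal, unscaled implementation for a single observation sequence encoded as integer indices; it is adequate only for short sequences (a practical implementation would add scaling or work in log space), and the function and variable names are illustrative rather than taken from the article.

```python
import numpy as np

def forward(x, A, B, pi):
    """Unscaled forward probabilities alpha[t, i] = P(x_1..x_t, y_t = s_i | lambda)."""
    T, N = len(x), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha

def backward(x, A, B):
    """Unscaled backward probabilities beta[t, i] = P(x_{t+1}..x_T | y_t = s_i, lambda)."""
    T, N = len(x), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(x, A, B, pi):
    """One iteration of step (2): returns the updated (A, B, pi)."""
    x = np.asarray(x)
    T, N = len(x), A.shape[0]
    alpha, beta = forward(x, A, B, pi), backward(x, A, B)
    px = alpha[-1].sum()                                   # P(X | lambda^(n))

    # gamma[t, i] = P(y_t = s_i | X, lambda^(n))
    gamma = alpha * beta / px

    # xi[t, i, j] = P(y_t = s_i, y_{t+1} = s_j | X, lambda^(n))
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, x[t + 1]] * beta[t + 1] / px

    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[x == k].sum(axis=0)            # sum over t with x_t = o_k
    B_new /= gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]
    return A_new, B_new, pi_new
```

Starting from an initial $\lambda^{(0)}$, repeat `baum_welch_step` until the parameters (or the log-likelihood, the logarithm of `px`) change by less than a threshold, as in step (3).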
[1] For the derivation, see the author's article: "EM algorithm (expectation maximization algorithm): maximum likelihood estimation of the parameters of probability models with hidden variables".