当前位置：网站首页>Hands on deep learning (45) -- bundle search

Hands on deep learning (45) -- bundle search

2022-07-04 09:41:00 【Stay a little star】

List of articles

Beam search

Beam search

stay seq2seq in , We predict the markers of the output sequence one by one , Until the end of the sequence marker appears in the prediction sequence “<eos>”. In this section , We will first discuss this Greedy search （greedy search） Strategy , And discuss its existing problems , Then compare this strategy with other alternative strategies ： Exhaustive search （exhaustive search） and Beam search （beam search）.

Before introducing greedy search , Let's use seq2seq The same mathematical symbols in define the search problem . At any time step $t^{'}$ , Decoder output $y_{t'}$ The probability of depends on the time step $t^{'}$ Previous output subsequence $y_1, \ldots, y_{t'-1}$ The context variable encoded with the information of the input sequence $\mathbf{c}$ . In order to quantify the cost , use $\mathcal{Y}$ （ It contains “<eos>”） Represents the output vocabulary . So the cardinality of this vocabulary set $\left|\mathcal{Y}\right|$ Is the size of the vocabulary . We also specify the maximum number of tags in the output sequence as $T^{'}$ . therefore , Our goal is from all $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ Find the ideal output in a possible output sequence . Of course , For all output sequences , These sequences contain “<eos>” And subsequent parts will be discarded in the actual output .

One 、 Greedy search

First , Let's look at a simple strategy ： Greedy search . This strategy has been used seq2seq Sequence prediction of . For any time step of the output sequence $t^{'}$ , We will all search from... Based on greed $\mathcal{Y}$ Find the marker with the highest conditional probability , namely ：

$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$

Once the output sequence contains “<eos>” Or reach its maximum length $T^{'}$ , Then the output is completed .

So what's the problem with greedy search ?

actually , The optimal sequence （optimal sequence） It should be maximization $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$ Output sequence of values , This is the conditional probability of generating the output sequence based on the input sequence . Unfortunately , There is no guarantee that the optimal sequence can be obtained through greedy search .

Let's use an example to describe . Suppose there are four tags in the output “A”、“B”、“C” and “<eos>”. In the diagram above , The four numbers under each time step represent the generation of “A”、“B”、“C” and “<eos>” Conditional probability of . At each time step , Greedy search selects the marker with the highest conditional probability . therefore , The output sequence will be predicted in the figure “A”、“B”、“C” and “<eos>”. The conditional probability of this output sequence is $0.5 \times 0.4 \times 0.4 \times 0.6 = 0.048$ .

Next , Let's take a look at another example in the figure above . Different from Figure 1 , In time step 2 in , We choose the tag in Figure 1 “C”, It has second High conditional probability . Because of the time step 3 Based on time steps 1 and 2 The output subsequence at has been removed from “A” and “B” Change for “A” and “C”, So time step 3 The conditional probability of each marker at is also shown in the figure 2 Change in China . Let's say we're in the time step 3 Choose the mark “B”. Now? , Time step 4 The previous three time steps “A”、“C” and “B” The output subsequence of is conditional , This is similar to “A”、“B” and “C” Different . therefore , In the figure 2 Time step in 4 The conditional probability of generating each tag is also different from the graph 1 Conditional probabilities in . result , chart 2 Output sequence in “A”、“C”、“B” and “<eos>” The conditional probability of is $0.5\times0.3 \times0.6\times0.6=0.054$ , This is larger than figure 1 Conditional probability of greedy search in .

In this case , Output sequence obtained by greedy search “A”、“B”、“C” and “<eos>” Not the best sequence .

Two 、 Exhaustive search

If the goal is to obtain the optimal sequence , We can consider using Exhaustive search （exhaustive search）： Enumerate exhaustively all possible output sequences and their conditional probabilities , Then output the one with the highest conditional probability .

Although we can use exhaustive search to obtain the optimal sequence , But it's Amount of computation $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ May be too high . for example , When $|\mathcal{Y}|=10000$ and $T^{'} = 10$ when , We need to evaluate $10000^{10} = 10^{40}$ Sequence . It's almost impossible . On the other hand , The calculation amount of greedy search is $\mathcal{O}(\left|\mathcal{Y}\right|T')$ ： It is usually significantly smaller than exhaustive search . for example , When $|\mathcal{Y}|=10000$ and $T^{'} = 10$ when , We just need to evaluate $10000\times10=10^5$ A sequence of .

3、 ... and 、 Beam search

Determining the sequence search strategy depends on a range , In any extreme case, there are problems . If only accuracy is the most important ？ Is obviously an exhaustive search . If calculating the cost is the most important ？ Is obviously greedy search . Practical applications lie between these two extremes .

Beam search （beam search） Is an improved version of greedy search . It has a super parameter , be known as Beam width （beam size） $k$ . In time step $1$ , We choose the one with the highest conditional probability $k$ A sign . this $k$ The marks will be $k$ The first tag of the candidate output sequence . At each subsequent time step , Based on the last time step $k$ Candidate output sequences , We will continue from $k\left|\mathcal{Y}\right|$ Choose the one with the highest conditional probability among the possible choices $k$ Candidate output sequences .

The above figure illustrates the process of beam search . Suppose the output vocabulary contains only five elements ： $\mathcal{Y} = \{A, B, C, D, E\}$ , One of them is “<eos>”. Set the beam width to 2, The maximum length of the output sequence is 3. In time step 1, Assume the highest conditional probability $P(y_1 \mid \mathbf{c})$ The sign of is $A$ and $C$ . In time step 2, We calculate all $y_2 \in \mathcal{Y}$ ：

$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$

Choose the largest two of these ten values , such as $\mid \mathbf{c})$ and $\mid \mathbf{c})$ . Then in the time step 3, For all $y_3 \in \mathcal{Y}$ , We calculated ：

$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$

Choose the largest two of these ten values , namely $\mid \mathbf{c})$ and $\mid \mathbf{c})$ . result , We get six candidate output sequences ：（1） $A$ ;（2） $C$ ;（3） $A, B$ ;（4） $C, E$ ;（5） $A, B, D$ ;（6） $C, E, D$ .

Last , We are based on these six sequences （ for example , Discarding includes “<eos>” And the rest ） Obtain the final set of candidate output sequences . Then we choose the following sequence with the highest score as the output sequence ：

$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$

among $L$ Is the length of the final candidate sequence , $\alpha$ Usually set to 0.75. Because a longer sequence will have more logarithmic terms in the sum of the above formula , So in the denominator $L^\alpha$ Used to punish long sequences .

The calculation amount of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$ . This result is between greedy search and exhaustive search . actually , Greedy search can be seen as a beam width of 1 Special type of bundle search . By flexibly selecting the beam width , Beam search can make a trade-off between accuracy and computational cost .

Summary

The sequence search strategy includes greedy search 、 Exhaustive search and beam search .
Beam search through flexible selection of beam width , Find a balance between accuracy and computational cost .