当前位置:网站首页>Hands on deep learning (45) -- bundle search
Hands on deep learning (45) -- bundle search
2022-07-04 09:41:00 【Stay a little star】
List of articles
Beam search
stay seq2seq in , We predict the markers of the output sequence one by one , Until the end of the sequence marker appears in the prediction sequence “<eos>”. In this section , We will first discuss this Greedy search (greedy search) Strategy , And discuss its existing problems , Then compare this strategy with other alternative strategies : Exhaustive search (exhaustive search) and Beam search (beam search).
Before introducing greedy search , Let's use seq2seq The same mathematical symbols in define the search problem . At any time step t ′ t' t′, Decoder output y t ′ y_{t'} yt′ The probability of depends on the time step t ′ t' t′ Previous output subsequence y 1 , … , y t ′ − 1 y_1, \ldots, y_{t'-1} y1,…,yt′−1 The context variable encoded with the information of the input sequence c \mathbf{c} c. In order to quantify the cost , use Y \mathcal{Y} Y( It contains “<eos>”) Represents the output vocabulary . So the cardinality of this vocabulary set ∣ Y ∣ \left|\mathcal{Y}\right| ∣Y∣ Is the size of the vocabulary . We also specify the maximum number of tags in the output sequence as T ′ T' T′. therefore , Our goal is from all O ( ∣ Y ∣ T ′ ) \mathcal{O}(\left|\mathcal{Y}\right|^{T'}) O(∣Y∣T′) Find the ideal output in a possible output sequence . Of course , For all output sequences , These sequences contain “<eos>” And subsequent parts will be discarded in the actual output .
One 、 Greedy search
First , Let's look at a simple strategy : Greedy search . This strategy has been used seq2seq Sequence prediction of . For any time step of the output sequence t ′ t' t′, We will all search from... Based on greed Y \mathcal{Y} Y Find the marker with the highest conditional probability , namely :
y t ′ = * a r g m a x y ∈ Y P ( y ∣ y 1 , … , y t ′ − 1 , c ) y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}) yt′=*argmaxy∈YP(y∣y1,…,yt′−1,c)
Once the output sequence contains “<eos>” Or reach its maximum length T ′ T' T′, Then the output is completed .
So what's the problem with greedy search ?
actually , The optimal sequence (optimal sequence) It should be maximization ∏ t ′ = 1 T ′ P ( y t ′ ∣ y 1 , … , y t ′ − 1 , c ) \prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}) ∏t′=1T′P(yt′∣y1,…,yt′−1,c) Output sequence of values , This is the conditional probability of generating the output sequence based on the input sequence . Unfortunately , There is no guarantee that the optimal sequence can be obtained through greedy search .

Let's use an example to describe . Suppose there are four tags in the output “A”、“B”、“C” and “<eos>”. In the diagram above , The four numbers under each time step represent the generation of “A”、“B”、“C” and “<eos>” Conditional probability of . At each time step , Greedy search selects the marker with the highest conditional probability . therefore , The output sequence will be predicted in the figure “A”、“B”、“C” and “<eos>”. The conditional probability of this output sequence is 0.5 × 0.4 × 0.4 × 0.6 = 0.048 0.5 \times 0.4 \times 0.4 \times 0.6 = 0.048 0.5×0.4×0.4×0.6=0.048.

Next , Let's take a look at another example in the figure above . Different from Figure 1 , In time step 2 in , We choose the tag in Figure 1 “C”, It has second High conditional probability . Because of the time step 3 Based on time steps 1 and 2 The output subsequence at has been removed from “A” and “B” Change for “A” and “C”, So time step 3 The conditional probability of each marker at is also shown in the figure 2 Change in China . Let's say we're in the time step 3 Choose the mark “B”. Now? , Time step 4 The previous three time steps “A”、“C” and “B” The output subsequence of is conditional , This is similar to “A”、“B” and “C” Different . therefore , In the figure 2 Time step in 4 The conditional probability of generating each tag is also different from the graph 1 Conditional probabilities in . result , chart 2 Output sequence in “A”、“C”、“B” and “<eos>” The conditional probability of is 0.5 × 0.3 × 0.6 × 0.6 = 0.054 0.5\times0.3 \times0.6\times0.6=0.054 0.5×0.3×0.6×0.6=0.054, This is larger than figure 1 Conditional probability of greedy search in .
In this case , Output sequence obtained by greedy search “A”、“B”、“C” and “<eos>” Not the best sequence .
Two 、 Exhaustive search
If the goal is to obtain the optimal sequence , We can consider using Exhaustive search (exhaustive search): Enumerate exhaustively all possible output sequences and their conditional probabilities , Then output the one with the highest conditional probability .
Although we can use exhaustive search to obtain the optimal sequence , But it's Amount of computation O ( ∣ Y ∣ T ′ ) \mathcal{O}(\left|\mathcal{Y}\right|^{T'}) O(∣Y∣T′) May be too high . for example , When ∣ Y ∣ = 10000 |\mathcal{Y}|=10000 ∣Y∣=10000 and T ′ = 10 T'=10 T′=10 when , We need to evaluate 1000 0 10 = 1 0 40 10000^{10} = 10^{40} 1000010=1040 Sequence . It's almost impossible . On the other hand , The calculation amount of greedy search is O ( ∣ Y ∣ T ′ ) \mathcal{O}(\left|\mathcal{Y}\right|T') O(∣Y∣T′): It is usually significantly smaller than exhaustive search . for example , When ∣ Y ∣ = 10000 |\mathcal{Y}|=10000 ∣Y∣=10000 and T ′ = 10 T'=10 T′=10 when , We just need to evaluate 10000 × 10 = 1 0 5 10000\times10=10^5 10000×10=105 A sequence of .
3、 ... and 、 Beam search
Determining the sequence search strategy depends on a range , In any extreme case, there are problems . If only accuracy is the most important ? Is obviously an exhaustive search . If calculating the cost is the most important ? Is obviously greedy search . Practical applications lie between these two extremes .
Beam search (beam search) Is an improved version of greedy search . It has a super parameter , be known as Beam width (beam size) k k k. In time step 1 1 1, We choose the one with the highest conditional probability k k k A sign . this k k k The marks will be k k k The first tag of the candidate output sequence . At each subsequent time step , Based on the last time step k k k Candidate output sequences , We will continue from k ∣ Y ∣ k\left|\mathcal{Y}\right| k∣Y∣ Choose the one with the highest conditional probability among the possible choices k k k Candidate output sequences .

The above figure illustrates the process of beam search . Suppose the output vocabulary contains only five elements : Y = { A , B , C , D , E } \mathcal{Y} = \{A, B, C, D, E\} Y={ A,B,C,D,E}, One of them is “<eos>”. Set the beam width to 2, The maximum length of the output sequence is 3. In time step 1, Assume the highest conditional probability P ( y 1 ∣ c ) P(y_1 \mid \mathbf{c}) P(y1∣c) The sign of is A A A and C C C. In time step 2, We calculate all y 2 ∈ Y y_2 \in \mathcal{Y} y2∈Y:
P ( A , y 2 ∣ c ) = P ( A ∣ c ) P ( y 2 ∣ A , c ) , P ( C , y 2 ∣ c ) = P ( C ∣ c ) P ( y 2 ∣ C , c ) , \begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned} P(A,y2∣c)=P(A∣c)P(y2∣A,c),P(C,y2∣c)=P(C∣c)P(y2∣C,c),
Choose the largest two of these ten values , such as P ( A , B ∣ c ) P(A, B \mid \mathbf{c}) P(A,B∣c) and P ( C , E ∣ c ) P(C, E \mid \mathbf{c}) P(C,E∣c). Then in the time step 3, For all y 3 ∈ Y y_3 \in \mathcal{Y} y3∈Y, We calculated :
P ( A , B , y 3 ∣ c ) = P ( A , B ∣ c ) P ( y 3 ∣ A , B , c ) , P ( C , E , y 3 ∣ c ) = P ( C , E ∣ c ) P ( y 3 ∣ C , E , c ) , \begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned} P(A,B,y3∣c)=P(A,B∣c)P(y3∣A,B,c),P(C,E,y3∣c)=P(C,E∣c)P(y3∣C,E,c),
Choose the largest two of these ten values , namely P ( A , B , D ∣ c ) P(A, B, D \mid \mathbf{c}) P(A,B,D∣c) and P ( C , E , D ∣ c ) P(C, E, D \mid \mathbf{c}) P(C,E,D∣c). result , We get six candidate output sequences :(1) A A A;(2) C C C;(3) A , B A,B A,B;(4) C , E C,E C,E;(5) A , B , D A,B,D A,B,D ;(6) C , E , D C,E,D C,E,D.
Last , We are based on these six sequences ( for example , Discarding includes “<eos>” And the rest ) Obtain the final set of candidate output sequences . Then we choose the following sequence with the highest score as the output sequence :
1 L α log P ( y 1 , … , y L ) = 1 L α ∑ t ′ = 1 L log P ( y t ′ ∣ y 1 , … , y t ′ − 1 , c ) , \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}), Lα1logP(y1,…,yL)=Lα1t′=1∑LlogP(yt′∣y1,…,yt′−1,c),
among L L L Is the length of the final candidate sequence , α \alpha α Usually set to 0.75. Because a longer sequence will have more logarithmic terms in the sum of the above formula , So in the denominator L α L^\alpha Lα Used to punish long sequences .
The calculation amount of beam search is O ( k ∣ Y ∣ T ′ ) \mathcal{O}(k\left|\mathcal{Y}\right|T') O(k∣Y∣T′). This result is between greedy search and exhaustive search . actually , Greedy search can be seen as a beam width of 1 Special type of bundle search . By flexibly selecting the beam width , Beam search can make a trade-off between accuracy and computational cost .
Summary
- The sequence search strategy includes greedy search 、 Exhaustive search and beam search .
- Beam search through flexible selection of beam width , Find a balance between accuracy and computational cost .
practice
- Can we regard exhaustive search as a special kind of bundle search ? Why? ?
k = 1 Greedy search ;k = n It's time to search exhaustively
- stay seq2seq Application of bundle search in machine translation problems . How beam width affects results and prediction speed ?
The smaller the beam width , The fewer times it is exhausted , So the prediction speed will be faster , But the result may not be the optimal result
- stay seq2seq in , We use a language model to generate user prefixed text . What search strategy does it use ? Can you improve ?
边栏推荐
- pcl::fromROSMsg报警告Failed to find match for field ‘intensity‘.
- How can people not love the amazing design of XXL job
- How web pages interact with applets
- 165 webmaster online toolbox website source code / hare online tool system v2.2.7 Chinese version
- 2022-2028 global small batch batch batch furnace industry research and trend analysis report
- Ultimate bug finding method - two points
- Global and Chinese market of bipolar generators 2022-2028: Research Report on technology, participants, trends, market size and share
- Investment analysis and future production and marketing demand forecast report of China's paper industry Ⓥ 2022 ~ 2028
- Kotlin:集合使用
- Reload CUDA and cudnn (for tensorflow and pytorch) [personal sorting summary]
猜你喜欢
2022-2028 global elastic strain sensor industry research and trend analysis report
C language pointer classic interview question - the first bullet
Daughter love: frequency spectrum analysis of a piece of music
If you can quickly generate a dictionary from two lists
libmysqlclient.so.20: cannot open shared object file: No such file or directory
Fabric of kubernetes CNI plug-in
Hands on deep learning (33) -- style transfer
回复评论的sql
C语言指针面试题——第二弹
You can see the employment prospects of PMP project management
随机推荐
ArrayBuffer
C language pointer classic interview question - the first bullet
Flutter 小技巧之 ListView 和 PageView 的各種花式嵌套
"How to connect the network" reading notes - Web server request and response (4)
ASP. Net to access directory files outside the project website
技术管理进阶——如何设计并跟进不同层级同学的绩效
2022-2028 global edible probiotic raw material industry research and trend analysis report
Hands on deep learning (36) -- language model and data set
Global and Chinese market of wheel hubs 2022-2028: Research Report on technology, participants, trends, market size and share
Solution to null JSON after serialization in golang
Some points needing attention in PMP learning
Upgrading Xcode 12 caused Carthage to build cartfile containing only rxswift to fail
Rules for using init in golang
How web pages interact with applets
mmclassification 标注文件生成
Hands on deep learning (35) -- text preprocessing (NLP)
Golang 类型比较
Golang defer
Basic data types in golang
Problems encountered by scan, scanf and scanln in golang