ABM Paper Translation
2022-07-02 07:36:00 【wxplol】
Paper: https://arxiv.org/pdf/2112.03603.pdf
Code: https://github.com/XH-B/ABM
One 、Abstract
This paper proposes an Attention aggregation based Bi-directional Mutual learning network (ABM), which consists of two parallel decoder branches with opposite decoding directions (L2R and R2L). The two branches distill knowledge from each other, transferring information step by step during training, so that the complementary information of both directions is fully exploited. In addition, to handle mathematical symbols at different scales, an Attention Aggregation Module (AAM) is proposed to aggregate attention over multiple scales. Notably, in the inference stage, since the model has already learned knowledge from both directions, only the L2R branch is used, so the original parameter size and inference speed are preserved.
Two 、Introduction
WAP first introduced two-dimensional coverage attention to address the lack of coverage. As shown in Figure 1, this coverage attention accumulates all past attention, tracking past alignment information so as to guide the attention model to assign higher attention probability to the untranslated regions of the image. However, the main limitation of this approach is that it only uses historical alignment information and does not consider future information (the untranslated regions). For example, many mathematical expressions have symmetric structures: the left brace "{" and the right brace "}" always appear in pairs, sometimes far apart, and some symbols in an expression are correlated, such as $f$ and $dx$. Most methods recognize the current symbol using only left-to-right attention and ignore future information coming from the right, which may cause attention drift. Moreover, the dependency between a symbol and earlier symbols weakens as their distance grows. As a result, these methods do not make full use of long-range dependencies or the grammatical regularities of mathematical expressions.
BTTR uses a transformer decoder with two decoding directions to alleviate attention drift, but it has no effective way for the two directions to learn supervision information from each other, and attention is never explicitly aligned during training, so it is still limited when recognizing long formulas.
DWAP-MSA encodes multi-scale features to alleviate the recognition difficulty or uncertainty caused by character-scale variation in mathematical expressions. However, it does not scale the local receptive field but only rescales the feature maps, so it cannot accurately attend to small characters during recognition.
Therefore, we propose the ABM framework, which contains three modules: (1) Feature Extraction, which uses DenseNet to extract features; (2) Attention Aggregation Module (AAM), where multi-scale coverage attention is proposed to recognize characters of different sizes in mathematical expressions, improving the recognition accuracy of the current step and alleviating error accumulation; (3) Bi-directional Mutual Learning Module (BML), a new decoder framework with two parallel decoder branches in opposite decoding directions (L2R and R2L) that learn from each other through mutual distillation. Note that although two decoders are used for training, only the L2R branch is used for inference.

Three 、Method
We propose a new end-to-end attention aggregation and bi-directional mutual learning (ABM) framework, as shown in Figure 2. It consists of three main modules:
Feature Extraction Module (FEM), which extracts feature information from the input mathematical expression image.
Attention Aggregation Module (AAM), which integrates multi-scale coverage attention to align historical attention information and, during decoding, effectively aggregate features at different scales for symbols of different sizes.
Bi-directional Mutual Learning Module (BML), which consists of two parallel decoders with opposite decoding directions (L2R and R2L) that complement each other. During training, each decoder branch learns not only from the ground-truth LaTeX sequence but also from the reversed prediction of the opposite branch, thereby improving its decoding ability.

3.1、Feature Extraction Module (FEM)
DenseNet is used to extract features, producing a feature map of size $H \times W \times D$. The feature map is reshaped into $M$ vectors ($M = H \times W$) of dimension $D$, giving the output $a = (a_1, a_2, \dots, a_M)$.
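As a rough illustration of this step, the snippet below flattens a CNN feature map into $M = H \times W$ vectors; the torchvision DenseNet-121 backbone and the dummy image size are assumptions for the sketch, not necessarily the encoder configuration used in the paper.

```python
# Minimal sketch: extract a feature map and flatten it into M = H*W vectors.
# The torchvision DenseNet-121 backbone here is only illustrative.
import torch
from torchvision.models import densenet121

encoder = densenet121().features          # CNN part only, randomly initialized
img = torch.randn(1, 3, 128, 384)         # dummy formula image (B, C, H, W)
fmap = encoder(img)                       # (1, D, H', W') feature map
B, D, H, W = fmap.shape
a = fmap.flatten(2).transpose(1, 2)       # (1, M, D) with M = H' * W'
print(a.shape)                            # e.g. torch.Size([1, 48, 1024])
```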
3.2、Attention Aggregation Module (AAM)
The attention mechanism guides the decoder to focus on specific regions of the input image. In particular, coverage-based attention can better track alignment information and guide the model to assign higher attention probability to the untranslated regions. Inspired by this, the proposed AAM aggregates different receptive fields on top of the coverage attention. Unlike conventional attention, AAM attends not only to local information but also to global information over a larger receptive field, so it produces finer alignment information and helps the model capture more accurate spatial relationships. Different from DWAP-MSA, which generates low-level and high-level features via multi-scale branches of the dense encoder, AAM computes the current attention weights $\alpha_t$ from the hidden state $h_t$, the feature map $F$, and the coverage attention $\beta_t$, and then obtains the context vector $c_t$:
$$A_s = U_s \beta_t, \quad A_l = U_l \beta_t, \quad \beta_t = \sum_{l=1}^{t-1}\alpha_l$$
where $U_s$ and $U_l$ denote convolutions with a small kernel and a large kernel (e.g., 5 and 11), respectively, $\beta_t$ is the sum of all past attention probabilities, initialized as a zero vector, and $\alpha_l$ is the attention score at step $l$.
The current attention $\alpha_t$ is then computed as:
$$\alpha_t = v_a^{T}\tanh(W_h h_t + U_f F + W_s A_s + W_l A_l)$$
The final context vector $c_t$ is the weighted sum of the feature vectors $a$ with the attention weights $\alpha_t$:
$$c_t = \sum_{i=1}^{M}\alpha_{t,i} a_i$$
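A minimal PyTorch sketch of how this AAM attention step could be implemented is given below; the class name, layer dimensions, and the folding of $W_s$/$W_l$ into the two coverage convolutions are assumptions for illustration, not the authors' released code.

```python
# Sketch of the Attention Aggregation Module: coverage attention with a small
# (5x5) and a large (11x11) kernel aggregated into one attention map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAggregation(nn.Module):
    def __init__(self, feat_dim=684, hidden_dim=256, attn_dim=512):
        super().__init__()
        # U_s / U_l (with W_s / W_l folded in): convolutions over the coverage map.
        self.conv_s = nn.Conv2d(1, attn_dim, kernel_size=5, padding=2)
        self.conv_l = nn.Conv2d(1, attn_dim, kernel_size=11, padding=5)
        self.W_h = nn.Linear(hidden_dim, attn_dim)   # projects decoder state h_t
        self.U_f = nn.Linear(feat_dim, attn_dim)     # projects feature map F
        self.v_a = nn.Linear(attn_dim, 1)

    def forward(self, feats, h_t, beta_t):
        # feats:  (B, M, D) flattened encoder features, M = H*W
        # h_t:    (B, hidden_dim) current decoder state
        # beta_t: (B, 1, H, W) coverage = sum of past attention maps
        B, M, D = feats.shape
        H, W = beta_t.shape[2], beta_t.shape[3]
        A_s = self.conv_s(beta_t).flatten(2).transpose(1, 2)   # (B, M, attn_dim)
        A_l = self.conv_l(beta_t).flatten(2).transpose(1, 2)   # (B, M, attn_dim)
        energy = self.v_a(torch.tanh(
            self.W_h(h_t).unsqueeze(1) + self.U_f(feats) + A_s + A_l)).squeeze(-1)
        alpha_t = F.softmax(energy, dim=1)                       # (B, M)
        c_t = torch.bmm(alpha_t.unsqueeze(1), feats).squeeze(1)  # (B, D)
        beta_next = beta_t + alpha_t.view(B, 1, H, W)            # update coverage
        return c_t, alpha_t, beta_next
```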
3.3、Bi-directional Mutual Learning Module (BML)
Given an input image of a mathematical formula, conventional methods decode only from left to right (L2R), which does not account for long-range dependencies. We therefore propose a bi-directional decoder that translates the input image into LaTeX sequences in two opposite directions (L2R and R2L), with the two branches learning decoding information from each other. The two branches share the same architecture and differ only in decoding direction.
For bi-directional training, we add $<sos>$ and $<eos>$ as the start and end symbols of the LaTeX sequence. Specifically, for a LaTeX sequence $Y=(Y_1, Y_2, ..., Y_T)$ of length $T$:
the left-to-right (L2R) target is $\vec{y}=(<sos>, Y_1, Y_2, ..., Y_T, <eos>)$,
and the right-to-left (R2L) target is $\overleftarrow{y}=(<eos>, Y_T, Y_{T-1}, ..., Y_1, <sos>)$.
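For concreteness, the two training targets can be built from a single label sequence as in this small sketch; the token strings are placeholders.

```python
# Sketch: build the L2R and R2L targets from one ground-truth LaTeX sequence.
SOS, EOS = "<sos>", "<eos>"

def make_targets(tokens):
    """tokens: list of LaTeX symbols, e.g. ['x', '^', '{', '2', '}']"""
    l2r = [SOS] + tokens + [EOS]        # left-to-right target
    r2l = [EOS] + tokens[::-1] + [SOS]  # right-to-left target
    return l2r, r2l

l2r, r2l = make_targets(["x", "^", "{", "2", "}"])
print(l2r)  # ['<sos>', 'x', '^', '{', '2', '}', '<eos>']
print(r2l)  # ['<eos>', '}', '2', '{', '^', 'x', '<sos>']
```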
The prediction probabilities of the L2R and R2L branches at step $t$ are computed as:
$$p(\vec{y}_t \mid \vec{y}_{t-1}) = W_o \max(W_y E \vec{y}_{t-1} + W_h h_t + W_t c_t)$$
$$p(\overleftarrow{y}_t \mid \overleftarrow{y}_{t-1}) = W'_o \max(W_y E' \overleftarrow{y}_{t-1} + W'_h h'_t + W'_t c'_t)$$
where $h_t$ and $\vec{y}_{t-1}$ denote the current hidden state at step $t$ and the previous predicted symbol of the L2R branch, and $*'$ denotes the corresponding quantities of the R2L branch. $W_o \in \mathbb{R}^{K \times d}$, $W_y \in \mathbb{R}^{d \times n}$, $W_h \in \mathbb{R}^{d \times n}$, and $W_t \in \mathbb{R}^{d \times D}$ are trainable matrices, where $d$, $K$, and $n$ denote the attention dimension, the number of symbol classes, and the GRU dimension, respectively. $E$ is the embedding matrix and $\max$ denotes the maxout activation function. The hidden states $\{h_1, h_2, ..., h_t\}$ are computed by:
$$\widehat{h}_t = f_1(h_{t-1}, E\vec{y}_{t-1}), \quad h_t = f_2(\widehat{h}_t, c_t)$$
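Putting these equations together, one L2R decoder step could look like the sketch below; it reuses the `AttentionAggregation` class from the AAM sketch above, and the use of GRU cells for $f_1$/$f_2$, the maxout grouping of two, and the use of $\widehat{h}_t$ as the attention query are assumptions for illustration.

```python
# Sketch of one L2R decoding step: two GRU cells (f1, f2), AAM attention,
# and a maxout classifier producing the logits z_t.
import torch
import torch.nn as nn


class DecoderStep(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256, feat_dim=684):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # E
        self.gru1 = nn.GRUCell(embed_dim, hidden_dim)           # f1
        self.gru2 = nn.GRUCell(feat_dim, hidden_dim)            # f2
        self.attn = AttentionAggregation(feat_dim, hidden_dim)  # AAM sketch above
        self.W_y = nn.Linear(embed_dim, hidden_dim)
        self.W_h = nn.Linear(hidden_dim, hidden_dim)
        self.W_t = nn.Linear(feat_dim, hidden_dim)
        self.W_o = nn.Linear(hidden_dim // 2, vocab_size)       # after maxout

    def forward(self, y_prev, h_prev, feats, beta):
        e_prev = self.embed(y_prev)                  # E y_{t-1}
        h_hat = self.gru1(e_prev, h_prev)            # \hat h_t = f1(h_{t-1}, E y_{t-1})
        c_t, alpha_t, beta = self.attn(feats, h_hat, beta)
        h_t = self.gru2(c_t, h_hat)                  # h_t = f2(\hat h_t, c_t)
        z = self.W_y(e_prev) + self.W_h(h_t) + self.W_t(c_t)
        z = z.view(z.size(0), -1, 2).max(-1).values  # maxout over pairs of units
        logits = self.W_o(z)                         # unnormalized scores z_t
        return logits, h_t, beta
```

The R2L branch would be an identical module with its own parameters, fed the reversed target sequence.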
We denote the output probabilities of the L2R branch as $\vec{p}_{l2r} = \{<sos>, \vec{y}_1, \vec{y}_2, ..., \vec{y}_T, <eos>\}$ and those of the R2L branch as $\overleftarrow{p}_{r2l} = \{<eos>, \overleftarrow{y}_1, \overleftarrow{y}_2, ..., \overleftarrow{y}_T, <sos>\}$, where $y_i$ is the predicted probability of the symbol at decoding step $i$. To let the two branches learn each other's predicted distributions, the LaTeX sequences generated by the L2R and R2L decoders must first be aligned, and a Kullback-Leibler (KL) loss is introduced to quantify the difference between their predicted distributions. During training, we use the soft probabilities produced by the model, which carry more information than hard labels. Thus, for class $k$, the soft probability from the L2R branch is defined as:
$$\sigma(\vec{z}_{i,k}, S) = \frac{\exp(\vec{z}_{i,k}/S)}{\sum_{j=1}^{K}\exp(\vec{z}_{i,j}/S)}$$
where $S$ is the temperature used to generate soft labels, and $z_i = \{z_{i,1}, z_{i,2}, ..., z_{i,K}\}$ are the logits computed by the decoder for the $i$-th symbol of the sequence. The goal is to minimize the distance between the probability distributions of the two branches, so the KL distance between $\vec{p}_{l2r}$ and $\overleftarrow{p}_{r2l}$ is computed as:
$$L_{KL} = S^{2}\sum_{i=1}^{T}\sum_{j=1}^{K}\sigma(\vec{z}_{i,j}, S)\log\frac{\sigma(\vec{z}_{i,j}, S)}{\sigma(\overleftarrow{z}_{T+1-i,j}, S)}$$
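A hedged sketch of this mutual-distillation term in PyTorch follows; the tensor shapes, the per-sequence summation, and the alignment by flipping the R2L logits are assumptions about how the formula could be realized.

```python
# Sketch: temperature-softened KL loss between the aligned L2R and R2L logits.
import torch
import torch.nn.functional as F

def mutual_kl_loss(z_l2r, z_r2l, S=1.0):
    """z_l2r, z_r2l: (T, K) logits of the two branches for one sequence."""
    # Step i of L2R is compared with step T+1-i of R2L, i.e. the flipped logits.
    z_r2l_aligned = torch.flip(z_r2l, dims=[0])
    p = F.softmax(z_l2r / S, dim=-1)          # soft distribution of the L2R branch
    log_p = F.log_softmax(z_l2r / S, dim=-1)
    log_q = F.log_softmax(z_r2l_aligned / S, dim=-1)
    # KL(p || q), summed over classes and time steps, scaled by S^2.
    return (S ** 2) * (p * (log_p - log_q)).sum()
```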
3.4、Loss Function
Specifically, for a LaTeX sequence $\vec{y}_{l2r}=\{<sos>, Y_1, Y_2, ..., Y_T, <eos>\}$ of length $T$, the one-hot ground-truth label at the $i$-th time step is denoted as $Y_i = \{x_1, x_2, ..., x_K\}$. The softmax probability of the $k$-th symbol is computed as:
$$\vec{y}_{i,k} = \frac{\exp(\vec{z}_{i,k})}{\sum_{j=1}^{K}\exp(\vec{z}_{i,j})}$$
For multi-class classification, the cross-entropy losses between the target labels and the softmax probabilities of the two branches are defined as:
$$L_{ce}^{l2r} = -\sum_{i=1}^{T}\sum_{j=1}^{K} Y_{i,j}\log(\vec{y}_{i,j}), \qquad L_{ce}^{r2l} = -\sum_{i=1}^{T}\sum_{j=1}^{K} Y_{i,j}\log(\overleftarrow{y}_{T+1-i,j})$$
The overall loss function is:
$$L = L_{ce}^{l2r} + L_{ce}^{r2l} + \lambda L_{KL}$$
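The sketch below combines the three terms; it reuses `mutual_kl_loss` from the previous sketch, and `lambda_kl` (and its default value) is a hypothetical hyperparameter name, not a value reported here.

```python
# Sketch of the overall training loss L = L_ce^l2r + L_ce^r2l + lambda * L_KL
# for one sequence; logits are (T, K), targets are class indices in L2R order.
import torch
import torch.nn.functional as F

def abm_loss(z_l2r, z_r2l, targets, lambda_kl=0.5, S=1.0):
    ce_l2r = F.cross_entropy(z_l2r, targets, reduction="sum")
    # The R2L branch predicts the symbols in reverse order, so flip its logits
    # back to L2R order before comparing with the same targets.
    ce_r2l = F.cross_entropy(torch.flip(z_r2l, dims=[0]), targets, reduction="sum")
    return ce_l2r + ce_r2l + lambda_kl * mutual_kl_loss(z_l2r, z_r2l, S)
```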
Four 、Experiments



