
Text Error Correction -- CRASpell Model

2022-06-13 11:39:00 xuanningmeng

Text Correction – CRASpell Model

The paper *CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction* was published at ACL 2022 and achieves state-of-the-art results on the Chinese spelling correction (CSC) task. BERT-based CSC models have two main limitations:
(1) They perform poorly on texts with multiple errors. When a sentence contains more than one misspelled character, the other typos act as noise in the context, and this noisy context degrades performance on multi-typo text.
(2) Because of BERT's masked-language-model pretraining objective, these models tend to over-correct valid high-frequency expressions.
For each training sample, CRASpell constructs a noisy sample and trains the correction model so that its output on the noisy context stays close to its output on the original context, which addresses the multi-typo problem. To address over-correction, a copy mechanism is incorporated so that the model can keep the input character whenever it is already valid in the given context.
Paper address: article
Code address: code

Model

Task description

The goal of Chinese spelling correction is to detect and correct spelling errors in text. The task is usually formulated as follows: $\boldsymbol{X} = \{x_{1}, x_{2}, \dots, x_{n}\}$ is a text of length $n$ that may contain misspellings, and $\boldsymbol{Y} = \{y_{1}, y_{2}, \dots, y_{n}\}$ is the corresponding correct text of the same length $n$; the model takes $\boldsymbol{X}$ as input and generates the correct text $\boldsymbol{Y}$.

CRASpell Model

(Figure: CRASpell model architecture)
The left part is the correction module and the right part is the noise modeling module; both are described in detail below.
(1) Correction Module
Given the input text $\boldsymbol{X} = \{x_{1}, x_{2}, \dots, x_{n}\}$, we first obtain the embedding matrix $\boldsymbol{E} = \{e_{1}, e_{2}, \dots, e_{n}\}$, where $e_{i}$ is the embedding vector of character $x_{i}$. Feeding $\boldsymbol{E}$ into the Transformer encoder yields the hidden state matrix $\boldsymbol{H} = \{h_{1}, h_{2}, \dots, h_{n}\}$, where $h_{i}\in\mathbb{R}^{768}$ is the feature of character $x_{i}$ produced by the Transformer encoder.
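As a concrete illustration, here is a minimal sketch of this encoding step using the HuggingFace `transformers` library as a stand-in for the Transformer encoder; the `bert-base-chinese` checkpoint, the example sentence, and the variable names are illustrative assumptions, not taken from the official CRASpell code.

```python
# Characters -> embeddings -> Transformer encoder -> hidden states H,
# one 768-dimensional vector h_i per (sub)token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

sentence = "今天天气很好"  # example input X
inputs = tokenizer(sentence, return_tensors="pt")  # adds [CLS]/[SEP] tokens
with torch.no_grad():
    outputs = encoder(**inputs)
H = outputs.last_hidden_state  # shape: (1, seq_len, 768); rows are the h_i
print(H.shape)
```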
(2) Generative Distribution
$\boldsymbol{X} = \{x_{1}, x_{2}, \dots, x_{n}\}$ is passed through the Transformer encoder to obtain the feature vectors $\boldsymbol{H} = \{h_{1}, h_{2}, \dots, h_{n}\}$; a feed-forward linear layer followed by a softmax layer then produces the generative distribution over the vocabulary for each token:

$$p_{g} = \mathrm{softmax}(W_{g}h_{i} + b_{g})$$
where $W_{g}\in\mathbb{R}^{n_{v}\times 768}$, $b_{g}\in\mathbb{R}^{n_{v}}$, and $n_{v}$ is the vocabulary size of the pretrained model.
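A minimal PyTorch sketch of this generative head; the vocabulary size (21128, the `bert-base-chinese` vocabulary) and the class name are assumptions for illustration.

```python
# Generative distribution p_g = softmax(W_g h_i + b_g) over the vocabulary.
import torch
import torch.nn as nn

class GenerativeHead(nn.Module):
    def __init__(self, hidden_size: int = 768, vocab_size: int = 21128):
        super().__init__()
        # nn.Linear holds W_g (vocab_size x hidden_size) and b_g (vocab_size)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_size) -> p_g: (batch, seq_len, vocab_size)
        return torch.softmax(self.proj(h), dim=-1)

p_g = GenerativeHead()(torch.randn(1, 6, 768))
print(p_g.shape, p_g.sum(dim=-1))  # probabilities sum to 1 along the vocab axis
```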

(3) Copy Distribution
The copy distribution of $x_{i}$, $p_{c}\in\{0,1\}^{n_{v}}$, is the one-hot representation of the vocabulary index $idx(x_{i})$ of $x_{i}$:
$$p_{c}[j] = \begin{cases} 1 & j = idx(x_{i}) \\ 0 & \text{otherwise} \end{cases}$$
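For illustration, a short PyTorch construction of this one-hot copy distribution; the vocabulary size and token ids are toy values.

```python
# p_c is the one-hot vector of the input character's vocabulary index idx(x_i).
import torch
import torch.nn.functional as F

vocab_size = 21128                                  # illustrative vocabulary size
input_ids = torch.tensor([[101, 791, 1921, 102]])   # toy idx(x_i) per position
p_c = F.one_hot(input_ids, num_classes=vocab_size).float()  # (batch, seq_len, vocab_size)
print(p_c.shape, p_c.sum(dim=-1))                   # exactly one 1 per position
```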
(4) Copy Probability
The copy probability is the output $\omega\in\mathbb{R}$ of the Copy Block in the model diagram: the hidden feature $h_{i}$ produced by the Transformer encoder is passed through two feed-forward linear layers and a layer normalization to obtain $\omega$, as follows:

$$h_{c} = W_{ch}f_{ln}(h_{i}) + b_{ch}$$
$$h_{c}^{'} = f_{ln}(f_{act}(h_{c}))$$
$$\omega = \mathrm{Sigmoid}(W_{c}h_{c}^{'})$$
where $W_{ch}\in\mathbb{R}^{768\times d_{c}}$, $b_{ch}\in\mathbb{R}^{d_{c}}$, $W_{c}\in\mathbb{R}^{d_{c}\times 1}$, $f_{ln}$ is layer normalization, and $f_{act}$ is the activation function (GELU in the released code; see the code for details).

The copy probability $\omega$ output by the Copy Block combines the generative distribution $p_{g}$ and the copy distribution $p_{c}$ into the final distribution $p$:
$$p = \omega\times p_{c} + (1 - \omega)\times p_{g}$$
The difference from previous CSC models is that CRASpell takes the copy distribution $p_{c}$ into account in the final output, which gives the model more opportunity to keep an input character that is valid in the given context even when it is not the candidate BERT considers most likely, thereby avoiding over-correction.
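Below is a PyTorch sketch of the Copy Block and the final mixture. The dimensions and the GELU activation follow the formulas above, while the module names and the value of $d_{c}$ are illustrative assumptions rather than the official implementation.

```python
# Copy Block: layer norm -> linear -> activation -> layer norm -> linear -> sigmoid,
# followed by the mixture p = w * p_c + (1 - w) * p_g.
import torch
import torch.nn as nn

class CopyBlock(nn.Module):
    def __init__(self, hidden_size: int = 768, d_c: int = 384):
        super().__init__()
        self.ln_in = nn.LayerNorm(hidden_size)   # f_ln applied to h_i
        self.fc1 = nn.Linear(hidden_size, d_c)   # W_ch, b_ch
        self.act = nn.GELU()                     # f_act (GELU per the text)
        self.ln_mid = nn.LayerNorm(d_c)          # f_ln applied to f_act(h_c)
        self.fc2 = nn.Linear(d_c, 1)             # W_c

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h_c = self.fc1(self.ln_in(h))            # h_c = W_ch f_ln(h_i) + b_ch
        h_c2 = self.ln_mid(self.act(h_c))        # h'_c = f_ln(f_act(h_c))
        return torch.sigmoid(self.fc2(h_c2))     # w = Sigmoid(W_c h'_c), shape (..., 1)

def mix_distributions(w: torch.Tensor, p_c: torch.Tensor, p_g: torch.Tensor) -> torch.Tensor:
    # p = w * p_c + (1 - w) * p_g ; w broadcasts over the vocabulary axis
    return w * p_c + (1.0 - w) * p_g
```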
(5) Noise Modeling Module
The noise modeling module makes the correction model produce similar distributions for the original context and a noisy context, which addresses the interference caused by other misspellings in the context. As shown on the right side of the model diagram, the noise modeling module can be divided roughly into the following steps:
a. Generate a noisy context $\widetilde{\boldsymbol{X}}$ from the input sample $\boldsymbol{X}$.
b. Feed the noisy context $\widetilde{\boldsymbol{X}}$ into the Transformer encoder to obtain the hidden feature vectors $\widetilde{\boldsymbol{H}}$.
c. Produce the generative distribution $\widetilde{p_{g}}$ from the hidden feature vectors $\widetilde{\boldsymbol{H}}$.
d. Encourage this distribution to stay close to the one produced by the correction model by minimizing the bidirectional Kullback-Leibler divergence (a sketch follows the formula):
$$\mathcal{L}_{KL} = \frac{1}{2}\left(\mathcal{D}_{KL}(p_{g}\,\Vert\,\widetilde{p_{g}}) + \mathcal{D}_{KL}(\widetilde{p_{g}}\,\Vert\,p_{g})\right)$$
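A small PyTorch sketch of this bidirectional KL term, assuming both arguments are already probability distributions over the vocabulary; the function name and reduction choice are illustrative.

```python
# Bidirectional KL between the original-context distribution p_g and the
# noisy-context distribution p_g_tilde.
import torch
import torch.nn.functional as F

def bidirectional_kl(p_g: torch.Tensor, p_g_tilde: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # Both inputs have shape (..., n_v) and sum to 1 along the last axis.
    log_p = torch.log(p_g + eps)
    log_q = torch.log(p_g_tilde + eps)
    kl_pq = F.kl_div(log_q, p_g, reduction="batchmean")        # D_KL(p_g || p_g_tilde)
    kl_qp = F.kl_div(log_p, p_g_tilde, reduction="batchmean")  # D_KL(p_g_tilde || p_g)
    return 0.5 * (kl_pq + kl_qp)
```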
Note: the noise modeling module is used only during training; at inference time only the correction network is used.
Noisy Block
The following describes how noisy data is constructed. A noisy sample is generated by replacing characters of the original training sample. Only characters within $d_{t}$ positions of an existing misspelled character are replaced; if a training sample contains no misspelling, no noisy sample is generated for it. An example is shown below:
(Figure: an example of noisy-sample construction)
(Figure: experimental results for different choices of $d_{t}$)
Each selected position is replaced with a similar character based on a publicly available confusion set. Specifically, the replacement character for the selected position is chosen as follows (a sketch follows the list):
(i) 70% of the time, a randomly chosen phonetically similar character;
(ii) 15% of the time, a randomly chosen visually similar (glyph-similar) character;
(iii) 15% of the time, a randomly chosen character from the vocabulary.
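A rough Python sketch of this noisy-sample construction, assuming pre-built phonetic and visual confusion dictionaries; the helper name and data structures are placeholders rather than the official implementation.

```python
# Build a noisy sample by replacing one character near an existing typo,
# using the 70% / 15% / 15% replacement strategy described above.
import random

def make_noisy_sample(chars, typo_positions, d_t,
                      phonetic_confusion, visual_confusion, vocab):
    """chars: list of characters; typo_positions: indices of misspelled characters."""
    if not typo_positions:          # samples without typos get no noisy counterpart
        return None
    # Candidate positions: within d_t of a typo, excluding the typos themselves.
    candidates = {j for t in typo_positions
                  for j in range(max(0, t - d_t), min(len(chars), t + d_t + 1))
                  if j not in typo_positions}
    if not candidates:
        return None
    pos = random.choice(sorted(candidates))
    noisy = list(chars)
    r = random.random()
    if r < 0.70 and phonetic_confusion.get(noisy[pos]):
        noisy[pos] = random.choice(phonetic_confusion[noisy[pos]])   # phonetic confusion
    elif r < 0.85 and visual_confusion.get(noisy[pos]):
        noisy[pos] = random.choice(visual_confusion[noisy[pos]])     # glyph confusion
    else:
        noisy[pos] = random.choice(vocab)                            # random character
    return noisy
```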

Loss

Given a training pair $(\boldsymbol{X}, \boldsymbol{Y})$, where $\boldsymbol{X}$ is the input sample containing errors and $\boldsymbol{Y}$ is the corrected sample, the correction loss for each target character $Y_{i}$ is
$$\mathcal{L}_{c}^{i} = -\log\big(p(Y_{i}\mid\boldsymbol{X})\big)$$
where $p = \omega\times p_{c} + (1 - \omega)\times p_{g}$ is the combined distribution introduced above.
The overall loss at position $i$ is
$$\mathcal{L}^{i} = (1 - \alpha_{i})\mathcal{L}_{c}^{i} + \alpha_{i}\mathcal{L}_{KL}^{i}$$
where $\alpha_{i}$ is defined as follows:
(Figure: definition of $\alpha_{i}$)
Here $\alpha$ is the trade-off factor between $\mathcal{L}_{c}$ and $\mathcal{L}_{KL}$. The constructed noisy samples are never used as training targets themselves; they participate only as context. This strategy ensures that the constructed noisy data does not change the ratio of positive to negative samples in the training corpus.
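A minimal PyTorch sketch of this combined loss, assuming the mixed distribution $p$, the per-position KL losses, and the per-position $\alpha_{i}$ weights have already been computed; the function and argument names are illustrative.

```python
# Per-position loss L^i = (1 - alpha_i) * L_c^i + alpha_i * L_KL^i,
# with L_c^i the negative log-likelihood of the gold character under p.
import torch

def craspell_loss(p: torch.Tensor, target_ids: torch.Tensor,
                  l_kl: torch.Tensor, alpha: torch.Tensor,
                  eps: float = 1e-12) -> torch.Tensor:
    # p: (batch, seq_len, n_v) mixed distribution; target_ids: (batch, seq_len)
    # l_kl, alpha: (batch, seq_len) per-position KL loss and trade-off weights
    p_target = p.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # p(Y_i | X)
    l_c = -torch.log(p_target + eps)                               # correction loss
    return ((1.0 - alpha) * l_c + alpha * l_kl).mean()
```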

Experimental results

Results on the SIGHAN dataset will be added later.
(Figure: CRASpell experimental results)
