2022-06-13 11:39:00 【xuanningmeng】
Text Error Correction: the CRASpell Model
The paper *CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction* was published at ACL 2022 and reports state-of-the-art results on the Chinese spelling correction (CSC) task. CSC models based on the pretrained BERT model have two limitations:
(1) They perform poorly on text that contains multiple errors. A sentence with spelling errors often contains more than one misspelled character; each misspelling then acts as noise in the context of the others, and this noisy context degrades performance on multi-typo text.
(2) Because of BERT's masked language modeling objective, these models tend to over-correct valid expressions toward high-frequency words.
CRASpell constructs a noisy sample for each training sample and trains the correction model so that its output on the noisy sample stays close to its output on the original sample. To address the over-correction problem, a copy mechanism is added so that the model can keep an input character whenever it is already valid in the given context.
Paper link: article
Code link: code
Model
Task description
The goal of Chinese spelling correction is to detect and correct spelling errors in text. The input is usually written as $X = \{x_{1}, x_{2}, \dots, x_{n}\}$, a text of length $n$ that contains spelling errors, and $Y = \{y_{1}, y_{2}, \dots, y_{n}\}$ is the corresponding correct text, also of length $n$. Given the input $X$, the model generates the correct text $Y$.
CRASpell Model
In the model diagram, the Correction Model is on the left and the Noise Model is on the right. The components are described in detail below.
(1) Correction Module
Given the input text $X = \{x_{1}, x_{2}, \dots, x_{n}\}$, we first obtain the embedding vectors $E = \{e_{1}, e_{2}, \dots, e_{n}\}$, where each character $x_{i}$ corresponds to an embedding vector $e_{i}$. Feeding $E$ into the Transformer encoder yields the hidden state matrix $H = \{h_{1}, h_{2}, \dots, h_{n}\}$, where $h_{i}\in\mathbb{R}^{768}$ is the feature of character $x_{i}$ produced by the Transformer encoder.
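As a rough illustration (not the paper's released code), the hidden states $h_{i}$ can be obtained with any BERT-style encoder; the HuggingFace transformers library and the bert-base-chinese checkpoint used below are assumptions made for this sketch:

```python
# Minimal sketch: encode a sentence with a BERT-style Transformer encoder
# and take the per-character hidden states h_i (768-dimensional).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "我今天很高心"  # "高心" is a typo for "高兴"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

H = outputs.last_hidden_state  # shape (1, seq_len, 768); H[0, i] is h_i for token x_i
```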
(2) Generative Distribution
The input $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is passed through the Transformer encoder to obtain the feature vectors $H = \{h_{1}, h_{2}, \dots, h_{n}\}$. A feed-forward linear layer followed by a softmax layer then gives the generative distribution over the vocabulary for each token:

$$p_{g} = \mathrm{softmax}(W_{g}h_{i} + b_{g})$$

where $W_{g}\in\mathbb{R}^{n_{v}\times 768}$, $b_{g}\in\mathbb{R}^{n_{v}}$, and $n_{v}$ is the size of the pretrained model's vocabulary.
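A minimal sketch of this generation head (the hidden size of 768 and the bert-base-chinese vocabulary size of 21128 are assumed defaults, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class GenerationHead(nn.Module):
    """Maps each hidden state h_i (768-d) to a probability distribution p_g over the vocabulary."""
    def __init__(self, hidden_size: int = 768, vocab_size: int = 21128):
        super().__init__()
        self.linear = nn.Linear(hidden_size, vocab_size)  # W_g and b_g

    def forward(self, h):                        # h: (batch, seq_len, hidden_size)
        logits = self.linear(h)                  # (batch, seq_len, vocab_size)
        return torch.softmax(logits, dim=-1)     # p_g for every position
```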
(3) Copy Distribution
The copy distribution of $x_{i}$ is $p_{c} \in \{0,1\}^{n_{v}}$, the one-hot representation of $idx(x_{i})$, the index of $x_{i}$ in the vocabulary: the entry at position $idx(x_{i})$ is 1 and all other entries are 0.
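A one-line sketch of this one-hot copy distribution (the batched tensor layout is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def copy_distribution(input_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """p_c for every position: a one-hot vector over the vocabulary that puts all
    probability mass on the input character's own index idx(x_i)."""
    return F.one_hot(input_ids, num_classes=vocab_size).float()  # (batch, seq_len, vocab_size)
```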
(4) Copy Probability
The copy probability $\omega \in \mathbb{R}$ is the output of the Copy Block in the model diagram: the hidden feature vector $h_{i}$ produced by the Transformer encoder is passed through two feed-forward linear layers and layer normalization to obtain $\omega$, as follows:
$$h_{c} = W_{ch}f_{ln}(h_{i}) + b_{ch}$$
$$h_{c}^{'} = f_{ln}(f_{act}(h_{c}))$$
$$\omega = \mathrm{Sigmoid}(W_{c}h_{c}^{'})$$
where $W_{ch}\in\mathbb{R}^{768\times d_{c}}$, $b_{ch}\in\mathbb{R}^{d_{c}}$, $W_{c}\in\mathbb{R}^{d_{c}\times 1}$, $f_{ln}$ is layer normalization, and $f_{act}$ is the activation function (GELU in the released code; see the code for details).
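A sketch of the Copy Block following these formulas (the intermediate width $d_{c}=256$ and the exact placement of the layer norms are assumptions; consult the released code for the authors' choices):

```python
import torch
import torch.nn as nn

class CopyBlock(nn.Module):
    """Computes the copy probability omega in (0, 1) for each position, following
    h_c = W_ch * LN(h_i) + b_ch,  h_c' = LN(act(h_c)),  omega = sigmoid(W_c h_c')."""
    def __init__(self, hidden_size: int = 768, copy_dim: int = 256):
        super().__init__()
        self.ln_in = nn.LayerNorm(hidden_size)
        self.linear1 = nn.Linear(hidden_size, copy_dim)  # W_ch, b_ch
        self.act = nn.GELU()                             # f_act
        self.ln_mid = nn.LayerNorm(copy_dim)
        self.linear2 = nn.Linear(copy_dim, 1)            # W_c

    def forward(self, h):                                # h: (batch, seq_len, hidden_size)
        h_c = self.linear1(self.ln_in(h))
        h_c = self.ln_mid(self.act(h_c))
        return torch.sigmoid(self.linear2(h_c))          # omega: (batch, seq_len, 1)
```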
The final output distribution $p$ combines the generative distribution $p_{g}$ and the copy distribution $p_{c}$ using the copy probability $\omega$:

$$p = \omega\times p_{c} + (1 - \omega)\times p_{g}$$
The difference from previous CSC models is that CRASpell lets the copy distribution $p_{c}$ contribute to the final output, so an input character that is valid in the given context but is not BERT's most preferred choice has a better chance of being kept, which mitigates over-correction.
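The combination itself is a one-liner; the broadcasting shapes below are assumptions consistent with the earlier sketches:

```python
import torch

def final_distribution(omega: torch.Tensor, p_c: torch.Tensor, p_g: torch.Tensor) -> torch.Tensor:
    """p = omega * p_c + (1 - omega) * p_g.
    omega: (batch, seq_len, 1);  p_c, p_g: (batch, seq_len, vocab_size)."""
    return omega * p_c + (1.0 - omega) * p_g
```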
(5) Noise Modeling Module
The noise modeling module forces the correction model to produce similar distributions for the original context and the noisy context, addressing the interference caused by nearby misspellings. As shown on the right side of the model diagram, the noise modeling module roughly works as follows:
a. Generate a noisy context $\widetilde{X}$ from the input sample $X$.
b. Feed the noisy context $\widetilde{X}$ into the Transformer encoder to obtain the hidden feature vectors $\widetilde{H}$.
c. Compute the generative distribution $\widetilde{p_{g}}$ from the hidden feature vectors $\widetilde{H}$.
d. Encourage this distribution to stay close to the distribution produced by the correction model on the original context, by minimizing the bidirectional Kullback-Leibler divergence between the two:
$$\mathcal{L}_{KL} = \frac{1}{2}\left(\mathcal{D}_{KL}(p_{g}\Vert\widetilde{p_{g}}) + \mathcal{D}_{KL}(\widetilde{p_{g}}\Vert p_{g})\right)$$
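A sketch of this symmetric KL term (the epsilon smoothing is an assumption added for numerical stability, not something from the paper):

```python
import torch
import torch.nn.functional as F

def bidirectional_kl(p_g: torch.Tensor, p_g_noise: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """L_KL = 0.5 * (KL(p_g || p_g~) + KL(p_g~ || p_g)), computed per token.
    Both inputs are probability distributions of shape (batch, seq_len, vocab_size)."""
    log_p = (p_g + eps).log()
    log_q = (p_g_noise + eps).log()
    kl_pq = F.kl_div(log_q, p_g, reduction="none").sum(-1)        # KL(p_g || p_g~)
    kl_qp = F.kl_div(log_p, p_g_noise, reduction="none").sum(-1)  # KL(p_g~ || p_g)
    return 0.5 * (kl_pq + kl_qp)                                  # (batch, seq_len)
```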
Note: the noise modeling module is used only during training; at inference time only the correction network is used.
Noisy Block
This section describes how noisy samples are constructed. A noisy sample is generated by replacing a character of the original training sample; the replacement position is restricted to within $d_{t}$ characters of an existing misspelled character. If a training sample contains no misspelling, no noisy sample is generated for it. As shown in the figure below:
Figure: experimental results for different choices of $d_{t}$.
Each selected position is replaced with a similar character drawn from the publicly available confusion set. Specifically, the character at the selected position is replaced with (a sketch of the procedure follows the list):
(i) a randomly chosen phonetically similar character, 70% of the time;
(ii) a randomly chosen visually similar (glyph-similar) character, 15% of the time;
(iii) a random character from the vocabulary, 15% of the time.
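A rough sketch of the Noisy Block, assuming the confusion sets are given as dictionaries mapping a character to its phonetically or visually similar candidates; the window handling and the default $d_{t}=5$ are illustrative assumptions, not the paper's exact settings:

```python
import random

def make_noisy_sample(chars, typo_positions, phonetic_conf, glyph_conf, vocab, d_t=5):
    """Replace one character within d_t positions of an existing typo:
    70% phonetically similar, 15% glyph-similar, 15% random vocabulary character.
    Samples without typos are returned unchanged (no noisy sample is built for them)."""
    if not typo_positions:
        return "".join(chars)
    chars = list(chars)
    typo = random.choice(typo_positions)
    window = [i for i in range(max(0, typo - d_t), min(len(chars), typo + d_t + 1)) if i != typo]
    if not window:
        return "".join(chars)
    pos = random.choice(window)
    c, r = chars[pos], random.random()
    if r < 0.70 and phonetic_conf.get(c):
        chars[pos] = random.choice(phonetic_conf[c])
    elif r < 0.85 and glyph_conf.get(c):
        chars[pos] = random.choice(glyph_conf[c])
    else:
        chars[pos] = random.choice(vocab)
    return "".join(chars)
```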
Loss
Given a training sample $(X, Y)$, where $X$ is the (possibly erroneous) input text and $Y$ is the corrected text, the correction loss for each target character $Y_{i}$ is

$$\mathcal{L}_{c}^{i} = -\log p(Y_{i}\mid X)$$
where $p = \omega\times p_{c} + (1 - \omega)\times p_{g}$ is the final output distribution described above.
The overall model loss $\mathcal{L}$ at position $i$ is

$$\mathcal{L}^{i} = (1 - \alpha_{i})\mathcal{L}_{c}^{i} + \alpha_{i}\mathcal{L}_{KL}^{i}$$
where $\alpha_{i}$ is the trade-off factor between $\mathcal{L}_{c}$ and $\mathcal{L}_{KL}$. The constructed noisy characters themselves do not contribute to the loss; they participate only as context. This keeps the constructed noise from changing the ratio of positive and negative samples in the training corpus.
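Putting the pieces together, a per-token loss sketch (the final distribution $p$ and the KL term come from the earlier sketches; treating $\alpha$ as a single scalar here is a simplification):

```python
import torch

def craspell_loss(p, l_kl, target_ids, alpha, eps=1e-12):
    """L^i = (1 - alpha) * L_c^i + alpha * L_KL^i,  with  L_c^i = -log p(y_i | X).
    p: final distribution (batch, seq_len, vocab);  l_kl: per-token KL term (batch, seq_len);
    target_ids: correct character ids (batch, seq_len)."""
    nll = -torch.log(torch.gather(p, -1, target_ids.unsqueeze(-1)).squeeze(-1) + eps)
    return ((1.0 - alpha) * nll + alpha * l_kl).mean()
```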
Experimental results
Results on the SIGHAN datasets will be added later.
Figure: CRASpell model experimental results.