2022-06-13 11:39:00 【xuanningmeng】
Text Correction – CRASpell Model
CRASpell: A Contextual Typo Robust Approach to Improve Chinese Spelling Correction was published at ACL 2022 and achieves state-of-the-art results on the Chinese Spelling Correction (CSC) task. CSC models based on pre-trained BERT have two limitations:
(1) They perform poorly on text containing multiple errors. In such text, each misspelled character acts as noise in the context of the others, and this noisy context degrades performance on multi-typo text.
(2) Because of BERT's masked-language-model pre-training, these models tend to over-correct expressions that are valid in context in favor of high-frequency words.
CRASpell constructs a noisy sample for each training sample and trains the correction model to produce similar outputs on the original training data and on the noisy sample. To address the over-correction problem, a copy mechanism is added so that the model can keep the input character when it is valid in the given context.
Paper link: article
Code link: code
Model
Task description
The goal of Chinese spelling correction is to detect and correct spelling errors in text. The task is usually formulated as follows: $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is a text of length $n$ containing spelling errors, and $Y = \{y_{1}, y_{2}, \dots, y_{n}\}$ is the corresponding correct text of length $n$. The model takes $X$ as input and generates the correct text $Y$.
CRASpell Model
In the model diagram, the Correction Module is on the left and the Noise Modeling Module is on the right. Both modules are described in detail below.
(1) Correction Module
Given the input text $X = \{x_{1}, x_{2}, \dots, x_{n}\}$, we obtain the embedding matrix $E = \{e_{1}, e_{2}, \dots, e_{n}\}$, where $e_{i}$ is the embedding vector of character $x_{i}$. $E$ is fed into the Transformer encoder to obtain the hidden-state matrix $H = \{h_{1}, h_{2}, \dots, h_{n}\}$, where $h_{i} \in \mathbb{R}^{768}$ is the feature of character $x_{i}$ produced by the Transformer encoder.
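As a rough illustration (not the authors' released code), the hidden states $H$ can be obtained from a BERT-style encoder with the HuggingFace transformers library; the checkpoint name below is an assumption for illustration:

```python
# Minimal sketch: obtain the hidden states H from a BERT-style Transformer encoder.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "我今天很高心"  # contains the typo 高心 (should be 高兴)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

H = outputs.last_hidden_state  # shape (1, n, 768); h_i is the feature of x_i
```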
(2) Generative Distribution
$X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is encoded by the Transformer encoder into the feature vectors $H = \{h_{1}, h_{2}, \dots, h_{n}\}$. A feed-forward linear layer followed by a softmax layer then produces, for each character, a generative distribution over the vocabulary tokens:

$$p_{g} = \mathrm{softmax}(W_{g} h_{i} + b_{g})$$

where $W_{g} \in \mathbb{R}^{n_{v} \times 768}$, $b_{g} \in \mathbb{R}^{n_{v}}$, and $n_{v}$ is the vocabulary size of the pre-trained model.
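A minimal PyTorch sketch of this generative head; the vocabulary size 21128 (that of bert-base-chinese) is an assumed value, and `nn.Linear` holds $W_{g}$ and $b_{g}$:

```python
import torch
import torch.nn as nn

class GenerativeHead(nn.Module):
    """Maps each hidden state h_i (768-d) to p_g = softmax(W_g h_i + b_g) over the vocabulary."""
    def __init__(self, hidden_size: int = 768, vocab_size: int = 21128):
        super().__init__()
        self.linear = nn.Linear(hidden_size, vocab_size)  # holds W_g and b_g

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, n, 768) -> p_g: (batch, n, n_v)
        return torch.softmax(self.linear(H), dim=-1)

# usage on dummy hidden states
p_g = GenerativeHead()(torch.randn(2, 6, 768))
```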
(3) Copy Distribution
The copy distribution of $x_{i}$, $p_{c} \in \{0,1\}^{n_{v}}$, is the one-hot representation of $idx(x_{i})$, the index of $x_{i}$ in the vocabulary:

$$p_{c}^{(j)} = \begin{cases} 1, & j = idx(x_{i}) \\ 0, & \text{otherwise} \end{cases}$$
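The copy distribution can be built directly from the input token ids; a minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def copy_distribution(input_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """p_c for each position: a one-hot vector at idx(x_i), the vocabulary id of the
    input character. input_ids: (batch, n) -> p_c: (batch, n, n_v)."""
    return F.one_hot(input_ids, num_classes=vocab_size).float()

p_c = copy_distribution(torch.tensor([[101, 2769, 102]]), vocab_size=21128)
```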
(4) Copy Probability
The copy probability is the output $\omega \in \mathbb{R}$ of the Copy Block in the model diagram: the hidden feature $h_{i}$ produced by the Transformer encoder is passed through two feed-forward linear layers and layer normalization to obtain $\omega$:

$$h_{c} = W_{ch} f_{ln}(h_{i}) + b_{ch}$$
$$h_{c}^{'} = f_{ln}(f_{act}(h_{c}))$$
$$\omega = \mathrm{Sigmoid}(W_{c} h_{c}^{'})$$

where $W_{ch} \in \mathbb{R}^{768 \times d_{c}}$, $b_{ch} \in \mathbb{R}^{d_{c}}$, $W_{c} \in \mathbb{R}^{d_{c} \times 1}$, $f_{ln}$ is layer normalization, and $f_{act}$ is the activation function (the released code uses GELU; see the code for details).
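A minimal PyTorch sketch of the Copy Block following these formulas; the inner size $d_{c} = 384$ is an assumed value rather than one taken from the paper:

```python
import torch
import torch.nn as nn

class CopyBlock(nn.Module):
    """Copy probability omega in (0, 1) from each hidden state h_i:
    h_c = W_ch * f_ln(h_i) + b_ch,  h_c' = f_ln(f_act(h_c)),  omega = sigmoid(W_c * h_c')."""
    def __init__(self, hidden_size: int = 768, d_c: int = 384):
        super().__init__()
        self.ln_in = nn.LayerNorm(hidden_size)      # f_ln applied to h_i
        self.linear1 = nn.Linear(hidden_size, d_c)  # W_ch, b_ch
        self.act = nn.GELU()                        # f_act (GELU, as in the code)
        self.ln_mid = nn.LayerNorm(d_c)             # f_ln applied after activation
        self.linear2 = nn.Linear(d_c, 1)            # W_c

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        h_c = self.linear1(self.ln_in(H))
        h_c_prime = self.ln_mid(self.act(h_c))
        return torch.sigmoid(self.linear2(h_c_prime))  # (batch, n, 1)

omega = CopyBlock()(torch.randn(2, 6, 768))
```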
The final output probability $p$ combines the generative distribution $p_{g}$ and the copy distribution $p_{c}$ using the copy probability $\omega$ from the Copy Block:

$$p = \omega \times p_{c} + (1 - \omega) \times p_{g}$$

The difference from previous CSC models is that CRASpell incorporates the copy distribution $p_{c}$ into the final output, which gives the model more opportunities to keep an input character that is valid in the given context but is not the candidate BERT ranks highest, thereby avoiding over-correction.
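Combining the two distributions is a one-line broadcast; a small sketch:

```python
import torch

def final_distribution(omega: torch.Tensor, p_c: torch.Tensor, p_g: torch.Tensor) -> torch.Tensor:
    """p = omega * p_c + (1 - omega) * p_g; omega has shape (batch, n, 1) and
    broadcasts over the vocabulary dimension of p_c and p_g (batch, n, n_v)."""
    return omega * p_c + (1.0 - omega) * p_g

# predicted character ids: argmax over the vocabulary for each position
# pred_ids = final_distribution(omega, p_c, p_g).argmax(dim=-1)
```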
(5) Noise Modeling Module
The Noise Modeling Module makes the correction model produce similar distributions for the original context and the noisy context, which addresses the interference caused by contextual misspellings. As shown on the right side of the model diagram above, the Noise Modeling Module works roughly as follows:
a. Generate a noisy context $\widetilde{X}$ from the input sample $X$.
b. Feed the noisy context $\widetilde{X}$ into the Transformer encoder to obtain the hidden feature vectors $\widetilde{H}$.
c. Generate the generative distribution $\widetilde{p_{g}}$ from the hidden feature vectors $\widetilde{H}$.
d. Constrain $\widetilde{p_{g}}$ to be close to the distribution produced by the correction model by minimizing the bidirectional Kullback-Leibler divergence (a code sketch follows the note below):

$$\mathcal{L}_{KL} = \frac{1}{2}\left(\mathcal{D}_{KL}(p_{g} \Vert \widetilde{p_{g}}) + \mathcal{D}_{KL}(\widetilde{p_{g}} \Vert p_{g})\right)$$
Note: the Noise Modeling Module is used only during training; inference uses only the correction network.
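A sketch of the bidirectional KL term, computed per position from the two generative distributions (probabilities, not logits):

```python
import torch

def bidirectional_kl(p_g: torch.Tensor, p_g_tilde: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """L_KL per position: 0.5 * (KL(p_g || p_g~) + KL(p_g~ || p_g)).
    Both inputs are probability tensors of shape (batch, n, n_v)."""
    log_p = (p_g + eps).log()
    log_q = (p_g_tilde + eps).log()
    kl_pq = (p_g * (log_p - log_q)).sum(dim=-1)        # KL(p_g || p_g~)
    kl_qp = (p_g_tilde * (log_q - log_p)).sum(dim=-1)  # KL(p_g~ || p_g)
    return 0.5 * (kl_pq + kl_qp)                       # shape (batch, n)
```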
Noisy Block
The following describes how noisy data is generated. A noisy sample is produced by replacing characters of the original training sample. Only characters within $d_{t}$ positions of a misspelled character are candidates for replacement; if a training sample contains no misspelling, no noisy sample is generated from it. This is illustrated in the figure below.
(Figure: $d_{t}$ selection experimental results.)
Each selected position is replaced with a similar character drawn from a publicly available confusion set. Specifically, the character at the selected position is replaced as follows (see the sketch after this list):
(i) with probability 70%, by a randomly chosen phonetically similar character;
(ii) with probability 15%, by a randomly chosen visually (glyph-) similar character;
(iii) with probability 15%, by a character chosen at random from the vocabulary.
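A minimal sketch of this replacement strategy; `phonetic_conf`, `glyph_conf`, and `vocab` are placeholders for the public confusion sets and vocabulary actually used, and `d_t = 2` is an illustrative value:

```python
import random

def make_noisy_sample(chars, typo_positions, phonetic_conf, glyph_conf, vocab, d_t=2):
    """Sketch of the Noisy Block: replace one character within d_t positions of a typo."""
    chars = list(chars)
    if not typo_positions:                 # samples without typos get no noisy version
        return "".join(chars)
    typo = random.choice(typo_positions)
    candidates = [i for i in range(len(chars)) if i != typo and abs(i - typo) <= d_t]
    if not candidates:
        return "".join(chars)
    pos = random.choice(candidates)
    r = random.random()
    if r < 0.70 and phonetic_conf.get(chars[pos]):     # 70%: phonetically similar
        chars[pos] = random.choice(phonetic_conf[chars[pos]])
    elif r < 0.85 and glyph_conf.get(chars[pos]):      # 15%: visually similar
        chars[pos] = random.choice(glyph_conf[chars[pos]])
    else:                                              # 15%: random vocabulary character
        chars[pos] = random.choice(vocab)
    return "".join(chars)
```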
Loss
Given a training sample $(X, Y)$, where $X$ is the input erroneous text and $Y$ is the corrected text, the correction loss for each target character $Y_{i}$ is

$$\mathcal{L}_{c}^{i} = -\log\big(p(Y_{i} \mid X)\big)$$

where $p$ is the combined distribution introduced above,

$$p = \omega \times p_{c} + (1 - \omega) \times p_{g}.$$

The overall loss $\mathcal{L}$ for position $i$ is

$$\mathcal{L}^{i} = (1 - \alpha_{i})\,\mathcal{L}_{c}^{i} + \alpha_{i}\,\mathcal{L}_{KL}^{i}$$

where $\alpha_{i}$ is the coefficient that trades off $\mathcal{L}_{c}$ against $\mathcal{L}_{KL}$. The constructed noisy samples themselves do not serve as training targets; they participate only as context. This ensures that the constructed noisy data does not change the ratio of positive to negative samples in the training corpus.
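A sketch of the per-position loss combination; `l_kl` is the per-position KL term (for example from the `bidirectional_kl` sketch above), and `alpha` is assumed to be a per-position weight tensor of shape (batch, n):

```python
import torch

def craspell_loss(p: torch.Tensor, l_kl: torch.Tensor, target_ids: torch.Tensor,
                  alpha: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """L^i = (1 - alpha_i) * L_c^i + alpha_i * L_KL^i, averaged over all positions.
    p: (batch, n, n_v) final distribution; target_ids: (batch, n) gold character ids."""
    p_gold = p.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # p(Y_i | X), (batch, n)
    l_c = -(p_gold + eps).log()                                  # L_c^i = -log p(Y_i | X)
    return ((1.0 - alpha) * l_c + alpha * l_kl).mean()
```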
Experimental results
Results on the SIGHAN datasets will be added later.
(Figure: CRASpell experimental results.)