Wonderful! MarkBERT
2022-07-01 11:10:00 【kaiyuan_sjtu】

Author | Prince Changqin
Edited by | NewBeeNLP
Hello everyone, this is NewBeeNLP. Today let's look at a joint paper from Tencent and Fudan University: MarkBERT: Marking Word Boundaries Improves Chinese BERT [1].
One-sentence summary: insert boundary markers for the words you care about into the token sequence.
MarkBERT is not a word-based BERT; it is still character-based, but it cleverly integrates 「word boundary marker」 information into the model. This way any word can be handled uniformly, whether or not it is OOV. In addition, MarkBERT brings two extra benefits:
First, it is convenient to attach word-level learning objectives to the boundary markers, complementing the traditional character- and sentence-level pre-training tasks;
Second, generic markers can be replaced with POS-tag-specific markers to easily incorporate richer semantics.
It achieves a 2-point improvement on NER tasks, and better accuracy on text classification, keyword recognition, and semantic similarity tasks as well.
This simple but effective Chinese pre-trained model, MarkBERT, takes word information into account without suffering from the OOV problem. It has the following advantages:
Common words and low-frequency words are handled uniformly; there is no OOV problem.
The introduction of markers makes it possible to design word-level pre-training tasks, complementing character-level MLM and sentence-level NSP.
It is easy to extend with richer word semantics (part of speech, morphology, etc.).
There are two tasks in the pre-training stage:
MLM: the markers themselves are also MASKed, so that the model learns boundary knowledge.
Replaced word detection: a word is artificially replaced, and the model must judge whether the word preceding the marker is correct.
MarkBERT Pre-training
MarkBERT
As shown in the figure below:

The text is first segmented into words, and special markers are inserted between the words. These markers are treated as ordinary characters: they occupy positions and can also be MASKed, so the encoder must pay attention to word boundaries when encoding rather than simply filling in blanks, and the MASK prediction task becomes more challenging (prediction requires a better understanding of word boundaries). The model thus remains character-level, but it knows where the word boundaries are (because word information is given explicitly).
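To make this concrete, here is a minimal sketch of marker insertion. This is not the authors' code: the segmentation is hard-coded for illustration, and the marker symbol `[S]` is a placeholder for whatever special token the vocabulary actually reserves.

```python
def insert_markers(words, marker="[S]"):
    """Flatten segmented words into a character-level token sequence,
    inserting a boundary marker between adjacent words."""
    tokens = []
    for i, word in enumerate(words):
        tokens.extend(list(word))      # character-level tokens
        if i < len(words) - 1:
            tokens.append(marker)      # explicit word-boundary marker
    return tokens

# Example: "北京欢迎你" segmented as ["北京", "欢迎", "你"]
print(insert_markers(["北京", "欢迎", "你"]))
# ['北', '京', '[S]', '欢', '迎', '[S]', '你']
```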
Replaced word detection
Specifically, when a word has been replaced with a confusion word, the marker should predict 「replaced」, i.e., the label False; otherwise True.
This loss is added to the MLM loss, forming a multi-task training objective. Confusion words come from synonyms or from words with similar pronunciation; through this task, the markers become more sensitive to word spans in their context. The model that uses POS markers is called MarkBERT-POS.
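A hedged sketch of the combined objective follows. Names such as `encoder`, `mlm_head`, `marker_head`, and `marker_mask` are illustrative assumptions, not the paper's modules; the binary replaced/not-replaced head matches the description above (the pre-training section below refines this into multiple confusion labels).

```python
import torch.nn.functional as F

def markbert_loss(encoder, mlm_head, marker_head,
                  input_ids, mlm_labels, marker_mask, replaced_labels):
    """MLM cross-entropy plus a binary 'was the preceding word
    replaced?' classification at marker positions."""
    hidden = encoder(input_ids)                      # (batch, seq, dim)

    # Standard MLM loss; unmasked positions carry the ignore label -100.
    mlm_loss = F.cross_entropy(
        mlm_head(hidden).transpose(1, 2),            # (batch, vocab, seq)
        mlm_labels, ignore_index=-100)

    # Gather marker positions and classify replaced vs. not replaced.
    marker_hidden = hidden[marker_mask]              # (n_markers, dim)
    rwd_loss = F.binary_cross_entropy_with_logits(
        marker_head(marker_hidden).squeeze(-1),      # (n_markers,)
        replaced_labels.float())

    # The two losses are simply added, as described above.
    return mlm_loss + rwd_loss
```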
Pre-training details
The MASK proportion is still 15%. 30% of the time no markers are inserted (reducing to the original BERT); 50% of the time the whole-word masking (WWM) prediction task is performed; the rest of the time the ordinary MLM prediction task is used.
When markers are inserted, 30% of the time words are replaced with pronunciation-based or synonym-based confusion words, and the markers predict the pronunciation-confusion or synonym-confusion label; the rest of the time the markers predict the normal-word label. To avoid label imbalance, the loss is computed on only 15% of the normal markers.
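Here is a minimal sketch of this sampling schedule, assuming the proportions above apply per training example; the strategy and label names are illustrative, not from the paper.

```python
import random

def choose_masking_strategy():
    """Sample which masking task to apply to an example
    (proportions follow the post; names are illustrative)."""
    r = random.random()
    if r < 0.30:
        return "char_mlm_no_markers"     # original BERT behavior
    elif r < 0.80:                       # next 50%
        return "whole_word_masking"      # WWM prediction task
    else:                                # remaining 20%
        return "char_mlm_with_markers"   # ordinary MLM

def choose_marker_label():
    """When markers are inserted: 30% of words get a confusion
    replacement; normal markers contribute loss only 15% of the time."""
    if random.random() < 0.30:
        return random.choice(["phonetic_confusion", "synonym_confusion"])
    return "normal" if random.random() < 0.15 else "normal_no_loss"
```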
Experiments
The results on NER tasks are shown in the table below:

As can be seen, the improvement is clear.
Ablation experiments compared three variants:
MarkBERT-MLM: only the MLM task.
MarkBERT-rwd: in replaced word detection, remove the homophone confusions or the synonym confusions, respectively.
MarkBERT-w/o: remove the markers when fine-tuning on downstream tasks (used the same way as the original BERT).
The results are shown in the table below:

The conclusions are as follows:
MarkBERT-MLM achieves a significant improvement on NER tasks, showing that word boundary information is very important in fine-grained tasks.
Even without inserting markers, MarkBERT-w/o reaches performance similar to the baseline, showing that MarkBERT can be used exactly like BERT.
For NER tasks, inserting the markers still matters, indicating that the MarkBERT structure is effective at learning word boundaries for tasks that require this fine-grained representation.
Discussion
Existing Chinese BERTs have two strategies for integrating word information:
Use word information in the pre-training stage but character sequences in downstream tasks, e.g., Chinese-BERT-WWM, Lattice-BERT.
Use word information when applying the pre-trained model to downstream tasks, e.g., WoBERT, AmBERT, Lichee.
In addition, the idea of inserting markers has been discussed for entity-related NLU tasks, especially relation classification. Given a subject entity and an object entity, existing work injects untyped markers or entity-specific markers to better predict the relation between the entities.

This paper was a real pleasure to read. The method is simple but ingenious: it solves in one stroke the problem of handling 「words」 in Chinese pre-trained models, and it makes it very convenient to introduce word-level tasks and richer word semantics. In fact, we could even add markers only for 「certain words of interest」 and process the rest character by character.
Resources for this article
[1] MarkBERT: Marking Word Boundaries Improves Chinese BERT: https://arxiv.org/abs/2203.06378
