Don't Stop Pre-training in Practice (II): A Full Overview of MLM
2022-06-10 07:41:00 【JMXGODLZ】
Welcome to my personal blog: https://jmxgodlz.xyz
Preface

Building on the previous post, Don't Stop Pre-training Practice - RoBERTa and ALBERT, this article further covers:
- Keras pre-training
- N-gram mask task
- Span mask task
Mask task
The mask task in pre-trained models such as BERT mainly involves the following elements:
- Mask ratio
- Replacement strategy
- Mask mode
Mask ratio
The mask ratio is commonly set to 15%. This ratio has been studied extensively and shown to work well.
As for the intuition, one explanation the author found online is: "with a 15% ratio, roughly one word in every seven is masked, which corresponds to predicting the center word of a length-7 sliding window as in CBOW, so it tends to work well."
Recently, the paper Should You Mask 15% in Masked Language Modeling? from Danqi Chen's group shows that masking 40% can achieve results roughly on par with 15%.
The paper argues that **"the so-called optimal masking rate is not a fixed magic number, but a function that varies with model size, masking strategy, training recipe, and downstream task."**
Replacement strategy
The common replacement strategy is as follows:
- 80% of the selected words are replaced with [MASK]
- 10% are kept unchanged
- 10% are replaced with a random word
The purpose is to force the model to learn contextual semantics: since any word may have been replaced, the model cannot rely on the current token alone and must use the surrounding context to predict it; a minimal sketch of this rule is given below.
However, the [MASK] token never appears in downstream tasks, so there is an inconsistency between pre-training and fine-tuning.
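As an illustration only (not the repo's code), the 80/10/10 rule for a token that has already been selected for masking can be sketched as follows; `mask_id` and `vocab_size` are assumed to come from the tokenizer:

```python
import random

def replace_token(token_id, mask_id, vocab_size):
    """BERT-style replacement for a token already chosen to be masked."""
    r = random.random()
    if r < 0.8:
        return mask_id                            # 80%: replace with [MASK]
    elif r < 0.9:
        return token_id                           # 10%: keep the original token
    else:
        return random.randint(0, vocab_size - 1)  # 10%: replace with a random token
```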
MacBERT proposes the MLM as correction (Mac) approach, with the following replacement strategy:
- 80% of the selected words are replaced with synonyms
- 10% are kept unchanged
- 10% are replaced with a random word
The MacBERT paper compares the following replacement strategies; the results are shown in the figures:
- MacBERT: 80% replaced with synonyms, 10% replaced with random words, 10% unchanged;
- Random Replace: 90% replaced with random words, 10% unchanged;
- Partial Mask: same as the original BERT, i.e. 80% replaced with [MASK], 10% replaced with random words, 10% unchanged;
- All Mask: 90% replaced with [MASK], 10% unchanged.

In the figures, the x-axis is the number of training steps and the y-axis is the EM score; the first figure shows results on the CMRC dataset and the second on the DRCD dataset.
Mask mode
Current masking methods mainly fall into the following categories:
- Word mask
- Full word mask
- Entity mask
- N-gram Mask
- Span Mask
| | Chinese | English |
|---|---|---|
| Original sentence | 使用语言模型来预测下一个词的概率。 | we use a language model to predict the probability of the next word. |
| Word segmentation (CWS) | 使用 语言 模型 来 预测 下一个 词 的 概率 。 | - |
| BERT tokenizer | 使 用 语 言 模 型 来 预 测 下 一 个 词 的 概 率 。 | we use a language model to pre ##di ##ct the pro ##ba ##bility of the next word. |
| Word mask | 使 用 语 言 [M] 型 来 [M] 测 下 一 个 词 的 概 率 。 | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word. |
| Full word mask | 使 用 语 言 [M] [M] 来 [M] [M] 下 一 个 词 的 概 率 。 | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word. |
| Entity mask | 使 用 [M] [M] [M] [M] 来 [M] [M] 下 一 个 词 的 概 率 。 | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] of the next word. |
| N-gram mask | 使 用 [M] [M] [M] [M] 来 [M] [M] 下 一 个 词 的 概 率 。 | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word. |
| Span mask | 使 用 [M] [M] [M] [M] [M] [M] [M] 下 一 个 词 的 概 率 。 | we use a [M] [M] [M] [M] [M] [M] the [M] [M] [M] [M] [M] next word. |
| MAC mask | 使 用 语 法 建 模 来 预 见 下 一 个 单 词 的 几 率 。 | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word. |
Full word mask
Masking is performed with word-segmentation results as the minimum granularity; a minimal sketch of the idea follows.
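A minimal sketch (illustrative only, not the repo's implementation), assuming `tokenize` is a subword tokenizer and the masking decision is made once per segmented word:

```python
import random

def whole_word_mask_flags(words, tokenize, mask_rate=0.15):
    """Decide masking per whole word, then expand the decision to all of its sub-tokens."""
    flags = []
    for word in words:
        sub_tokens = tokenize(word)            # e.g. the WordPiece pieces of this word
        masked = random.random() < mask_rate   # one decision for the whole word
        flags.extend([masked] * len(sub_tokens))
    return flags
```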
N-gram Mask
With word-segmentation results as the minimum granularity, masking is applied over n-grams of words.
For example, MacBERT uses word-segmentation-based n-gram masking, with probabilities of 40%, 30%, 20%, and 10% for 1-gram to 4-gram masking respectively, as sketched below.
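A minimal sketch of sampling the n-gram length with these probabilities (illustrative only):

```python
import numpy as np

ngram_lens = [1, 2, 3, 4]
ngram_probs = [0.4, 0.3, 0.2, 0.1]   # MacBERT's reported 1-gram to 4-gram probabilities
span_len = np.random.choice(ngram_lens, p=ngram_probs)  # number of consecutive words to mask
```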
Entity mask
Representative model: ERNIE.
Named entity information is introduced, and entities are taken as the minimum granularity for masking.
Span Mask
Representative model: SpanBERT.
The practices above suggest that introducing word-boundary-like information may be necessary to help training. However, the MASS model showed that this may not be required and that random masking can also work well, which led to SpanBERT's idea:
A span length is first sampled from a geometric distribution, then the starting position is sampled uniformly at random, and finally the span of that length is masked. The geometric distribution uses p = 0.2 with a maximum span length of 10, which yields an average sampled span length of about 3.8; a short calculation of this distribution is given below.
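A short sketch of the truncated geometric distribution (the same formula appears in the `__init__` code below):

```python
import numpy as np

p, max_len = 0.2, 10
lens = np.arange(1, max_len + 1)
probs = p * (1 - p) ** (lens - 1)   # geometric distribution, truncated at max_len
probs /= probs.sum()                # renormalize after truncation
print((lens * probs).sum())         # expected span length, approximately 3.8
```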
Code implementation
The full code is available at: https://github.com/447428054/Pretrain/tree/master/KerasExample/pretraining
The core span-mask code is as follows:
```python
import math

import numpy as np

# The two methods below belong to the TrainingDatasetRoBERTa subclass
# (as implied by the super() call); merge_intervals is a helper defined
# elsewhere in the repo that merges overlapping intervals.

def __init__(
    self, tokenizer, word_segment, lower=1, upper=10, p=0.3, mask_rate=0.15, sequence_length=512
):
    """Parameters:
    tokenizer must be the tokenizer class shipped with bert4keras;
    word_segment is an arbitrary word-segmentation function.
    """
    super(TrainingDatasetRoBERTa, self).__init__(tokenizer, sequence_length)
    self.word_segment = word_segment
    self.mask_rate = mask_rate
    self.lower = lower
    self.upper = upper
    self.p = p
    self.lens = list(range(self.lower, self.upper + 1))
    # Truncated geometric distribution over span lengths, renormalized to sum to 1
    self.len_distrib = [
        self.p * (1 - self.p) ** (i - self.lower)
        for i in range(self.lower, self.upper + 1)
    ] if self.p >= 0 else None
    self.len_distrib = [x / sum(self.len_distrib) for x in self.len_distrib]
    print(self.len_distrib, self.lens)

def sentence_process(self, text):
    """Process a single text.
    Pipeline: tokenize, convert to ids, then build the span mask sequence
    according to mask_rate, marking which tokens are masked.
    """
    word_tokens = self.tokenizer.tokenize(text=text)[1:-1]  # drop [CLS] / [SEP]
    word_token_ids = self.tokenizer.tokens_to_ids(word_tokens)
    sent_length = len(word_tokens)
    mask_num = math.ceil(sent_length * self.mask_rate)
    mask = set()
    spans = []
    while len(mask) < mask_num:
        span_len = np.random.choice(self.lens, p=self.len_distrib)  # sample a span length
        anchor = np.random.choice(sent_length)  # sample a starting position
        if anchor in mask:  # skip starting positions that are already masked
            continue
        left1 = anchor
        spans.append([left1, left1])
        right1 = min(anchor + span_len, sent_length)
        for i in range(left1, right1):
            if len(mask) >= mask_num:
                break
            mask.add(i)
            spans[-1][-1] = i
    spans = merge_intervals(spans)  # merge overlapping spans (helper from the repo)
    word_mask_ids = [0] * len(word_tokens)
    for (st, ed) in spans:
        for idx in range(st, ed + 1):
            wid = word_token_ids[idx]
            # 0 means "not masked"; otherwise store token_process(wid) + 1
            word_mask_ids[idx] = self.token_process(wid) + 1
    return [word_token_ids, word_mask_ids]
```
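A minimal usage sketch under assumptions: the two methods above live in the TrainingDatasetRoBERTa class of the linked repo, jieba is used for word segmentation, and the vocab path is a placeholder; check the repository for the actual entry points.

```python
import jieba
from bert4keras.tokenizers import Tokenizer

dict_path = '/path/to/vocab.txt'  # placeholder vocab path
tokenizer = Tokenizer(dict_path, do_lower_case=True)

dataset = TrainingDatasetRoBERTa(
    tokenizer,
    word_segment=jieba.lcut,   # any word-segmentation function works
    p=0.2,                     # SpanBERT-style geometric parameter
    mask_rate=0.15,
    sequence_length=512,
)
token_ids, mask_ids = dataset.sentence_process(u'使用语言模型来预测下一个词的概率。')
```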