
Don't Stop Pretraining in Practice (II) - All of MLM in One Day

2022-06-10 07:41:00 JMXGODLZ


Welcome to my personal blog: https://jmxgodlz.xyz

Preface

This article builds on the previous post, Don't Stop Pretraining in Practice - RoBERTa and ALBERT, and further covers the following:

  • Keras pretraining
  • N-gram mask task
  • Span mask task

Mask task

The mask task in pretrained models such as BERT mainly involves the following elements:

  • Mask ratio
  • Replacement strategy
  • Masking scheme

Mask ratio

The mask ratio is commonly set to 15%. This ratio has been studied many times and has been shown to achieve good results.

As for the theoretical justification, one explanation the author found online is: "With a 15% ratio, roughly one word in every 7 is masked, which corresponds to predicting the center word of a length-7 sliding window as in CBOW, and therefore works well."

Recently, the paper Should You Mask 15% in Masked Language Modeling? by Danqi Chen et al. showed that masking 40% can achieve results roughly on par with 15%.

The paper points out that **"the so-called optimal masking rate is not a fixed magic number, but a function that varies with model size, masking strategy, training recipe, and downstream task."**

Replacement strategy

The common replacement strategy, as used in BERT, is as follows:

  • 80%: replace the word with [MASK]
  • 10%: keep the word unchanged
  • 10%: replace the word with a random word

The purpose is to force the model to learn the semantic information of each word's context: since any word may have been replaced, the model cannot rely on the current token alone and must also use contextual information to predict it.
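
To make the 80/10/10 rule concrete, here is a minimal sketch of the decision for a single token that has already been selected for masking; the function name and the mask_id / vocab_size arguments are illustrative and not taken from the repository linked later in this post.

import numpy as np

def replace_token(token_id, mask_id, vocab_size):
    """Apply BERT's 80/10/10 replacement rule to one token selected for masking."""
    r = np.random.random()
    if r < 0.8:
        return mask_id                            # 80%: replace with [MASK]
    elif r < 0.9:
        return token_id                           # 10%: keep the original token
    else:
        return np.random.randint(0, vocab_size)   # 10%: replace with a random token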

However, the [MASK] token never appears in downstream tasks, so there is an inconsistency between pretraining and fine-tuning.

MacBERT proposed the MLM-as-correction (Mac) approach, with the following replacement strategy:

  • 80%: replace the word with a similar word (synonym)
  • 10%: keep the word unchanged
  • 10%: replace the word with a random word
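
The only change from BERT's rule is the first branch. A minimal sketch, assuming a hypothetical get_similar_word lookup (MacBERT obtains similar words from word-embedding similarity; this helper is not part of any library API):

import numpy as np

def mac_replace(word, get_similar_word, vocab):
    """Mac-style replacement for one word selected for masking."""
    r = np.random.random()
    if r < 0.8:
        return get_similar_word(word)   # 80%: replace with a similar word
    elif r < 0.9:
        return word                     # 10%: keep the word unchanged
    else:
        return np.random.choice(vocab)  # 10%: replace with a random word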

The MacBERT paper compares the following replacement strategies; the comparison results are shown in the figure below:

  • MacBERT: 80% replaced with similar words, 10% replaced with random words, 10% unchanged;
  • Random Replace: 90% replaced with random words, 10% unchanged;
  • Partial Mask: same as the original BERT, 80% replaced with [MASK], 10% replaced with random words, 10% unchanged;
  • All Mask: 90% replaced with [MASK], 10% unchanged.


In the figure, the x-axis is the number of training steps and the y-axis is the EM score. The first plot shows results on the CMRC dataset, and the second on the DRCD dataset.

Masking scheme

Current masking schemes mainly fall into the following categories:

  • Token (character-level) mask
  • Whole word mask
  • Entity mask
  • N-gram mask
  • Span mask
| Masking scheme | Chinese example | English example |
| --- | --- | --- |
| Original sentence | 使用语言模型来预测下一个词的概率。 | we use a language model to predict the probability of the next word. |
| Word segmentation | 使用 语言 模型 来 预测 下 一个 词 的 概率 。 | - |
| BERT Tokenizer | 使 用 语 言 模 型 来 预 测 下 一 个 词 的 概 率 。 | we use a language model to pre ##di ##ct the pro ##ba ##bility of the next word. |
| Token (character) mask | 使 用 语 言 [M] 型 来 [M] 测 下 一 个 词 的 概 率 。 | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word. |
| Whole word mask | 使 用 语 言 [M] [M] 来 [M] [M] 下 一 个 词 的 概 率 。 | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word. |
| Entity mask | 使 用 [M] [M] [M] [M] 来 [M] [M] 下 一 个 词 的 概 率 。 | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] of the next word. |
| N-gram mask | 使 用 [M] [M] [M] [M] 来 [M] [M] 下 一 个 词 的 概 率 。 | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word. |
| Span mask | 使 用 [M] [M] [M] [M] [M] [M] [M] 下 一 个 词 的 概 率 。 | we use a [M] [M] [M] [M] [M] [M] the [M] [M] [M] [M] [M] next word. |
| Mac mask | 使 用 语 法 建 模 来 预 见 下 一 个 词 的 几 率 。 | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word. |

Whole word mask

Masking is applied with word-segmentation results as the minimum granularity, i.e. all characters (sub-tokens) of a segmented word are masked together.
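
A minimal sketch of the word-level decision, assuming any word_segment function that splits text into words (the function names here are illustrative, not from the repository linked below):

import numpy as np

def whole_word_mask_flags(text, word_segment, mask_rate=0.15):
    """Decide masking at word granularity, then expand the decision to characters."""
    flags = []
    for word in word_segment(text):
        masked = np.random.random() < mask_rate
        # every character (sub-token) of the word shares the same decision
        flags.extend([masked] * len(word))
    return flags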

N-gram Mask

With word-segmentation results as the minimum granularity, masking is applied to n-grams of consecutive words.

For example, MacBERT uses word-segmentation-based n-gram masking, where 1-gram to 4-gram masking is chosen with probabilities of 40%, 30%, 20%, and 10%, respectively.
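
A minimal sketch of just the length-sampling step, assuming the 40/30/20/10 split above:

import numpy as np

ngram_lens = [1, 2, 3, 4]
ngram_probs = [0.4, 0.3, 0.2, 0.1]

def sample_ngram_length():
    """Sample how many consecutive segmented words form the next masked n-gram."""
    return int(np.random.choice(ngram_lens, p=ngram_probs))

# e.g. repeatedly pick a start word and mask sample_ngram_length() consecutive
# words until the target mask ratio is reached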

Entity mask

The representative model is ERNIE.

Named-entity information is introduced, and entities are taken as the minimum masking granularity.
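
A minimal sketch of the entity-level decision, assuming a hypothetical find_entities(text) function that returns (start, end) character offsets; non-entity tokens would still be handled by ordinary word/character masking, which is omitted here:

import numpy as np

def entity_mask_flags(text, find_entities, mask_rate=0.15):
    """Mask whole entities: all characters of an entity share one decision."""
    flags = [False] * len(text)
    for start, end in find_entities(text):
        if np.random.random() < mask_rate:
            for i in range(start, end):
                flags[i] = True
    return flags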

Span Mask

The representative model is SpanBERT.

The practices above suggest that introducing word-boundary-like information may be needed to help training. However, the MASS model showed that this may not be necessary and that randomly masking contiguous segments can also work well, which led to SpanBERT's idea:

First, a span length is sampled from a geometric distribution; then a starting position is sampled uniformly at random; finally, a span of that length starting at that position is masked. The geometric distribution uses p = 0.2 and the maximum span length is capped at 10, which yields an average sampled span length of about 3.8.
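
The truncated geometric distribution can be checked directly. The sketch below mirrors the len_distrib construction in the code later in this post and prints the expected span length (about 3.8 for p = 0.2 and a cap of 10):

import numpy as np

p, lower, upper = 0.2, 1, 10
lens = np.arange(lower, upper + 1)
# geometric probabilities p * (1 - p)^(k - 1), truncated at `upper` and renormalized
probs = p * (1 - p) ** (lens - lower)
probs = probs / probs.sum()

print("expected span length:", float((lens * probs).sum()))   # ~3.8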

Code implementation

The full code is available at: https://github.com/447428054/Pretrain/tree/master/KerasExample/pretraining

The core code for span masking (an excerpt from the dataset class; it relies on numpy, math, and a merge_intervals helper) is as follows:

def __init__(
    self, tokenizer, word_segment, lower=1, upper=10, p=0.3, mask_rate=0.15, sequence_length=512
):
    """Parameter description:
        tokenizer must be a tokenizer instance from bert4keras;
        word_segment is an arbitrary word-segmentation function.
    """
    super(TrainingDatasetRoBERTa, self).__init__(tokenizer, sequence_length)
    self.word_segment = word_segment
    self.mask_rate = mask_rate

    self.lower = lower  # minimum span length
    self.upper = upper  # maximum span length
    self.p = p          # parameter of the geometric distribution

    # span lengths and their truncated geometric distribution, renormalized to sum to 1
    self.lens = list(range(self.lower, self.upper + 1))
    self.len_distrib = [self.p * (1 - self.p) ** (i - self.lower) for i in range(self.lower, self.upper + 1)] if self.p >= 0 else None
    self.len_distrib = [x / sum(self.len_distrib) for x in self.len_distrib]
    print(self.len_distrib, self.lens)

def sentence_process(self, text):
    """Process a single text.
    Flow: tokenize, convert tokens to ids, then build the span mask sequence
    according to mask_rate, marking whether each token is masked.
    """
    word_tokens = self.tokenizer.tokenize(text=text)[1:-1]  # drop [CLS] and [SEP]
    word_token_ids = self.tokenizer.tokens_to_ids(word_tokens)

    sent_length = len(word_tokens)
    mask_num = math.ceil(sent_length * self.mask_rate)  # target number of masked tokens
    mask = set()
    spans = []

    while len(mask) < mask_num:
        # sample a span length from the truncated geometric distribution
        span_len = np.random.choice(self.lens, p=self.len_distrib)

        # sample a starting position uniformly at random
        anchor = np.random.choice(sent_length)
        if anchor in mask:
            continue
        left1 = anchor
        spans.append([left1, left1])
        right1 = min(anchor + span_len, sent_length)
        for i in range(left1, right1):
            if len(mask) >= mask_num:
                break
            mask.add(i)
            spans[-1][-1] = i

    # merge overlapping or adjacent spans, then apply the replacement strategy
    spans = merge_intervals(spans)
    word_mask_ids = [0] * len(word_tokens)
    for (st, ed) in spans:
        for idx in range(st, ed + 1):
            wid = word_token_ids[idx]
            # token_process picks the replacement id (80/10/10 rule); +1 reserves 0 for "not masked"
            word_mask_ids[idx] = self.token_process(wid) + 1

    return [word_token_ids, word_mask_ids]
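
merge_intervals is not shown in the excerpt above. A minimal sketch of such a helper, assuming it merges overlapping or adjacent [start, end] index intervals (both ends inclusive); the actual implementation lives in the repository linked above:

def merge_intervals(intervals):
    """Merge overlapping or adjacent [start, end] intervals (ends inclusive)."""
    intervals = sorted(intervals, key=lambda x: x[0])
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged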