RoBERTa: A Robustly Optimized BERT Pretraining Approach
2022-07-29 07:31:00 【Live up to your youth】
Model overview
RoBERTa can be seen as an improved version of BERT. In terms of model architecture, RoBERTa introduces essentially nothing new compared with BERT; it is more a further exploration of BERT's pretraining. It improves many of BERT's pretraining strategies, and the results show that the original BERT may have been undertrained and did not fully learn the linguistic knowledge in its training data.
Compared with BERT, RoBERTa makes the following improvements in model scale, compute, and data:
A larger batch size. RoBERTa trains with larger batch sizes, experimenting with values from 256 up to 8,000.
More training data. RoBERTa uses 160GB of training text, while BERT uses only 16GB.
Longer training. With 160GB of data and a batch size of 8K, RoBERTa trains for up to 500K steps.
Compared with BERT, RoBERTa's changes to the training method include: removing the Next Sentence Prediction (NSP) task; using dynamic masking; and using byte-level BPE encoding.
To sum up, the differences can be drawn as a diagram:
Model optimization
Using dynamic masking
BERT's pretraining includes a Masked Language Model (MLM) task: when the training data is prepared, some tokens are masked out, and the model is asked to predict those tokens during training. Once the data has been masked, it never changes; the same masked copies are used until training ends. This approach is called static masking.
If instead the masked positions change from one training epoch to the next, we get dynamic masking, which is what RoBERTa uses.
In RoBERTa this is implemented by duplicating the original training data several times and masking each copy independently, so the same example ends up masked at different random positions, achieving the effect of dynamic masking. For example, if the data is duplicated 10 times and training runs for 40 epochs in total, each masked version is seen 4 times during training.
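The idea can be sketched in a few lines of Python. The token ids, the mask probability, and the -100 "ignore" label below are illustrative assumptions, not RoBERTa's actual implementation:

```python
# Minimal sketch of static vs. dynamic masking (assumed token ids and
# masking rate; not the original RoBERTa/fairseq code).
import random

MASK_ID = 103          # assumed mask token id
MASK_PROB = 0.15       # BERT/RoBERTa-style masking rate

def mask_tokens(token_ids):
    """Randomly replace ~15% of tokens with the mask id."""
    masked, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            masked.append(MASK_ID)
            labels.append(tok)      # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(-100)     # position ignored by the MLM loss
    return masked, labels

sentence = [7592, 2088, 2003, 2307, 999]

# Static masking: mask once, reuse the same masked copy in every epoch.
static_copy = mask_tokens(sentence)
static_epochs = [static_copy for _ in range(4)]

# Dynamic masking (the effect RoBERTa aims for): re-mask on every epoch,
# so the masked positions differ between epochs.
dynamic_epochs = [mask_tokens(sentence) for _ in range(4)]
```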
Removing the NSP task and constructing inputs with FULL-SENTENCES
When BERT constructs data for NSP, it concatenates two segments into a single input sequence for the model, and the NSP task predicts whether the two segments are contiguous in the original text; the total sequence length is kept under 512 tokens.
However, RoBERTa found experimentally that removing the NSP task improves downstream task metrics, as shown in the figure:
Here SEGMENT-PAIR, SENTENCE-PAIR, FULL-SENTENCES and DOC-SENTENCES denote different ways of constructing the input data. RoBERTa uses FULL-SENTENCES and drops the NSP task.
FULL-SENTENCES means sentences are sampled contiguously from one or more documents to fill the model's input sequence; in other words, a single input sequence may cross document boundaries. Concretely, sentences are taken consecutively from one document to fill the sequence, and when the end of that document is reached, sentences from the next document continue to fill it, with content from different documents still separated by the SEP token.
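A rough sketch of this packing logic is shown below. The SEP token id, the 512-token limit, and the assumption that every sentence is shorter than the limit are all illustrative, not taken from the original implementation:

```python
# Sketch of FULL-SENTENCES packing. `docs` is a list of documents, each a
# list of already-tokenized sentences (lists of token ids).
SEP_ID = 2          # assumed separator token id
MAX_LEN = 512       # maximum input sequence length

def pack_full_sentences(docs, max_len=MAX_LEN):
    sequences, current = [], []
    for doc in docs:
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                sequences.append(current)   # emit a full input sequence
                current = []
            current.extend(sent)
        # Document boundary: keep filling the same sequence, but separate
        # content from different documents with the SEP token.
        if len(current) + 1 > max_len:
            sequences.append(current)
            current = []
        else:
            current.append(SEP_ID)
    if current:
        sequences.append(current)
    return sequences

# Toy usage: the first packed sequence crosses a document boundary.
docs = [[[11, 12, 13], [14, 15]], [[21, 22], [23, 24, 25]]]
print(pack_full_sentences(docs, max_len=8))
# [[11, 12, 13, 14, 15, 2, 21, 22], [23, 24, 25, 2]]
```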
Using byte-level BPE encoding
Byte-Pair Encoding (BPE) is a way of segmenting words and building a vocabulary. The BPE used in BERT is a character-based BPE algorithm, so the "words" it builds often sit between characters and full words; typically, fragments of a word are treated as independent "words", especially for longer words. For example, the word "wonderful" might be split into the two sub-words "wonder" and "ful".
Unlike BERT, RoBERTa uses a byte-level BPE with a vocabulary of around 50K units. With this scheme there is no need to worry about out-of-vocabulary words, because any word can be decomposed at the byte level.
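The snippet below only illustrates why working at the byte level removes the out-of-vocabulary problem; it is not RoBERTa's tokenizer (in practice, libraries such as HuggingFace tokenizers provide trained byte-level BPE tokenizers):

```python
# Any string, even a word never seen during training, decomposes into
# UTF-8 bytes, so there are at most 256 base symbols and no true OOV.
word = "wonderful"
byte_symbols = list(word.encode("utf-8"))
print(byte_symbols)   # [119, 111, 110, 100, 101, 114, 102, 117, 108]

# Byte-level BPE then merges frequent byte pairs into larger units, so the
# byte sequences for "wonder" and "ful" may become single vocabulary entries.
```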
More training data
Compared with BERT, RoBERTa uses far more training data:

Longer training steps
As the amount of training data and the number of training steps increase, RoBERTa's downstream performance keeps improving.

Bigger batch size
RoBERTa also studies how increasing the batch size during pretraining affects both the pretraining task and downstream tasks. It finds that a larger batch size lowers the perplexity on held-out training data and improves downstream metrics.
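For reference, perplexity here can be read as the exponential of the held-out masked-LM cross-entropy loss; a generic computation (not the paper's code) looks like this:

```python
import math

def perplexity(mean_cross_entropy_loss: float) -> float:
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(mean_cross_entropy_loss)

print(perplexity(1.9))   # a held-out loss of 1.9 nats gives perplexity ~6.7
```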

In addition, following the Transformer training setup, RoBERTa uses the Adam optimizer with an adaptive learning-rate schedule and the hyperparameters β1 = 0.9, β2 = 0.999, ε = 1e-6, weight_decay_rate = 0.01, num_warmup_steps = 10000, and init_lr = 1e-4.
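As a concrete sketch, this configuration could look as follows in PyTorch; the placeholder model and the linear decay after warmup are assumptions for illustration, not the original training code:

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the actual model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                # init_lr
    betas=(0.9, 0.999),     # beta_1, beta_2
    eps=1e-6,
    weight_decay=0.01,      # weight_decay_rate
)

num_warmup_steps = 10_000
num_training_steps = 500_000

def lr_lambda(step):
    # Linear warmup over the first 10k steps, then (assumed) linear decay.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step)
               / max(1, num_training_steps - num_warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```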
