RoBERTa: A Robustly Optimized BERT Pretraining Approach
2022-07-29 07:31:00 【Live up to your youth】
Model overview
RoBERTa can be seen as an improved version of BERT. In terms of model architecture, RoBERTa introduces essentially nothing new compared with BERT; it is more a further exploration of BERT's pretraining. It improves many of BERT's pretraining strategies, and the results show that the original BERT may have been undertrained and did not fully learn the linguistic knowledge in its training data.
Compared with BERT, RoBERTa makes the following improvements in model scale, compute, and data:
A larger batch size. RoBERTa trains with larger batch sizes, experimenting with values from 256 up to 8,000.
More training data. RoBERTa uses 160GB of training text, while BERT uses only 16GB.
Longer training. With 160GB of data and a batch size of 8K, RoBERTa trains for up to 500K steps.
Compared with BERT, RoBERTa's changes to the training method include: removing the Next Sentence Prediction (NSP) task; using dynamic masking; and using byte-level BPE encoding.
To sum up, the differences can be drawn as a diagram:
Model optimization
Using dynamic masking
BERT's pretraining includes a Masked Language Model (MLM) task: when the training data is prepared, some tokens are masked out, and the model is asked to predict those tokens during training. Once the data has been masked, it never changes; the same masked copies are used until training ends. This approach is called static masking.
If instead the masked positions change from one training epoch to the next, we get dynamic masking, which is what RoBERTa uses.
In RoBERTa this is implemented by duplicating the original training data several times and masking each copy independently, so the same example ends up masked at different random positions, achieving the effect of dynamic masking. For example, if the data is duplicated 10 times and training runs for 40 epochs in total, each masked version is seen 4 times during training.
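The idea can be sketched in a few lines of Python. The token ids, the mask probability, and the -100 "ignore" label below are illustrative assumptions, not RoBERTa's actual implementation:

```python
# Minimal sketch of static vs. dynamic masking (assumed token ids and
# masking rate; not the original RoBERTa/fairseq code).
import random

MASK_ID = 103          # assumed mask token id
MASK_PROB = 0.15       # BERT/RoBERTa-style masking rate

def mask_tokens(token_ids):
    """Randomly replace ~15% of tokens with the mask id."""
    masked, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            masked.append(MASK_ID)
            labels.append(tok)      # the model must predict the original token
        else:
            masked.append(tok)
            labels.append(-100)     # position ignored by the MLM loss
    return masked, labels

sentence = [7592, 2088, 2003, 2307, 999]

# Static masking: mask once, reuse the same masked copy in every epoch.
static_copy = mask_tokens(sentence)
static_epochs = [static_copy for _ in range(4)]

# Dynamic masking (the effect RoBERTa aims for): re-mask on every epoch,
# so the masked positions differ between epochs.
dynamic_epochs = [mask_tokens(sentence) for _ in range(4)]
```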
Removing the NSP task and constructing inputs with FULL-SENTENCES
When BERT constructs data for NSP, it concatenates two segments into a single input sequence for the model, and the NSP task predicts whether the two segments are contiguous in the original text; the total sequence length is kept under 512 tokens.
However, RoBERTa found experimentally that removing the NSP task improves downstream task metrics, as shown in the figure:
Here SEGMENT-PAIR, SENTENCE-PAIR, FULL-SENTENCES and DOC-SENTENCES denote different ways of constructing the input data. RoBERTa uses FULL-SENTENCES and drops the NSP task.
FULL-SENTENCES means sentences are sampled contiguously from one or more documents to fill the model's input sequence; in other words, a single input sequence may cross document boundaries. Concretely, sentences are taken consecutively from one document to fill the sequence, and when the end of that document is reached, sentences from the next document continue to fill it, with content from different documents still separated by the SEP token.
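A rough sketch of this packing logic is shown below. The SEP token id, the 512-token limit, and the assumption that every sentence is shorter than the limit are all illustrative, not taken from the original implementation:

```python
# Sketch of FULL-SENTENCES packing. `docs` is a list of documents, each a
# list of already-tokenized sentences (lists of token ids).
SEP_ID = 2          # assumed separator token id
MAX_LEN = 512       # maximum input sequence length

def pack_full_sentences(docs, max_len=MAX_LEN):
    sequences, current = [], []
    for doc in docs:
        for sent in doc:
            if current and len(current) + len(sent) > max_len:
                sequences.append(current)   # emit a full input sequence
                current = []
            current.extend(sent)
        # Document boundary: keep filling the same sequence, but separate
        # content from different documents with the SEP token.
        if len(current) + 1 > max_len:
            sequences.append(current)
            current = []
        else:
            current.append(SEP_ID)
    if current:
        sequences.append(current)
    return sequences

# Toy usage: the first packed sequence crosses a document boundary.
docs = [[[11, 12, 13], [14, 15]], [[21, 22], [23, 24, 25]]]
print(pack_full_sentences(docs, max_len=8))
# [[11, 12, 13, 14, 15, 2, 21, 22], [23, 24, 25, 2]]
```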
Using byte-level BPE encoding
Byte-Pair Encoding (BPE) is a way of segmenting words and building a vocabulary. The BPE used in BERT is a character-based BPE algorithm, so the "words" it builds often sit between characters and full words; typically, fragments of a word are treated as independent "words", especially for longer words. For example, the word "wonderful" might be split into the two sub-words "wonder" and "ful".
Unlike BERT, RoBERTa uses a byte-level BPE with a vocabulary of around 50K units. With this scheme there is no need to worry about out-of-vocabulary words, because any word can be decomposed at the byte level.
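The snippet below only illustrates why working at the byte level removes the out-of-vocabulary problem; it is not RoBERTa's tokenizer (in practice, libraries such as HuggingFace tokenizers provide trained byte-level BPE tokenizers):

```python
# Any string, even a word never seen during training, decomposes into
# UTF-8 bytes, so there are at most 256 base symbols and no true OOV.
word = "wonderful"
byte_symbols = list(word.encode("utf-8"))
print(byte_symbols)   # [119, 111, 110, 100, 101, 114, 102, 117, 108]

# Byte-level BPE then merges frequent byte pairs into larger units, so the
# byte sequences for "wonder" and "ful" may become single vocabulary entries.
```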
More training data
Compared with BERT, RoBERTa uses far more training data:

Longer training steps
As the amount of training data and the number of training steps increase, RoBERTa's downstream performance keeps improving.

Bigger batch size
RoBERTa also studies how increasing the batch size during pretraining affects both the pretraining task and downstream tasks. It finds that a larger batch size lowers the perplexity on held-out training data and improves downstream metrics.
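For reference, perplexity here can be read as the exponential of the held-out masked-LM cross-entropy loss; a generic computation (not the paper's code) looks like this:

```python
import math

def perplexity(mean_cross_entropy_loss: float) -> float:
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(mean_cross_entropy_loss)

print(perplexity(1.9))   # a held-out loss of 1.9 nats gives perplexity ~6.7
```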

In addition, following the Transformer training setup, RoBERTa uses the Adam optimizer with an adaptive learning-rate schedule and the hyperparameters β1 = 0.9, β2 = 0.999, ε = 1e-6, weight_decay_rate = 0.01, num_warmup_steps = 10000, and init_lr = 1e-4.
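As a concrete sketch, this configuration could look as follows in PyTorch; the placeholder model and the linear decay after warmup are assumptions for illustration, not the original training code:

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the actual model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                # init_lr
    betas=(0.9, 0.999),     # beta_1, beta_2
    eps=1e-6,
    weight_decay=0.01,      # weight_decay_rate
)

num_warmup_steps = 10_000
num_training_steps = 500_000

def lr_lambda(step):
    # Linear warmup over the first 10k steps, then (assumed) linear decay.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(0.0, (num_training_steps - step)
               / max(1, num_training_steps - num_warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```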
