BERT Fine-Tuning Tricks: Experiments
2022-07-05 · Necther
Background

Text classification is a classic NLP task, and models pre-trained on large corpora generally do well on it; word2vec, CoVe (contextualized word embeddings), and ELMo have all achieved good results. BERT is a bidirectional Transformer pre-trained with masked word prediction and NSP (next sentence prediction) and then fine-tuned on downstream tasks. On its release, BERT swept the leaderboards. But has its potential been fully exploited? The paper reviewed here explores several methods for getting better text-classification results out of BERT.
The methods are:
- Fine-tuning strategies
- Further pre-training
- Multi-task fine-tuning
Method overview
1. Fine-tuning strategies
Different layers of a neural network capture different syntactic and semantic information. Fine-tuning BERT on a downstream task raises several issues:
- Handling long texts, since BERT's maximum input sequence length is 512 tokens
- Layer selection: as noted above, each layer captures different information, so we need to choose the most suitable layer(s)
- Over-fitting, which calls for an appropriate learning rate. BERT's lower layers capture more general information, so the paper assigns different learning rates to different layers. The parameters of each layer are updated as

$$\theta_t^l = \theta_{t-1}^l - \eta^l \cdot \nabla_{\theta^l} J(\theta)$$

where $\theta_t^l$ denotes the parameters of layer $l$ at iteration step $t$, and $\eta^l$ is the learning rate of layer $l$, set layer by layer via $\eta^{l-1} = \eta^l / \xi$. Here $\xi$ is the decay coefficient: $\xi > 1$ means the learning rate decays layer by layer from top to bottom, $\xi < 1$ means it grows layer by layer, and $\xi = 1$ recovers standard BERT fine-tuning with a single learning rate.
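As a concrete illustration, here is a minimal PyTorch sketch of layer-wise learning-rate decay using optimizer parameter groups. The paper publishes no code, so the checkpoint, base rate, and $\xi$ value below are illustrative assumptions:

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # illustrative checkpoint / label count

base_lr = 2e-5  # learning rate of the top (12th) encoder layer
xi = 1.05       # decay coefficient; xi > 1 shrinks the rate layer by layer

# One parameter group per encoder layer, implementing eta^{l-1} = eta^l / xi.
param_groups = [
    {"params": model.bert.encoder.layer[l].parameters(),
     "lr": base_lr / (xi ** (11 - l))}
    for l in range(12)
]
# Embeddings sit below the first layer; pooler and classifier sit on top.
param_groups.append({"params": model.bert.embeddings.parameters(),
                     "lr": base_lr / (xi ** 12)})
param_groups.append({"params": model.bert.pooler.parameters(), "lr": base_lr})
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups)
```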
2. Further pre-training
BERT is pre-trained on general-domain corpora, so when applying it to text classification in a specific domain, there is inevitably a gap in data distribution. In that case, consider further pre-training, in one of three flavors (a masked-LM sketch follows the list):
- Within-task pre-training: further pre-train BERT on the training data of the target task
- In-domain pre-training: further pre-train on a corpus from the same domain as the task
- Cross-domain pre-training: further pre-train on corpora from other domains
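A minimal sketch of within-task further pre-training with the Hugging Face Trainer, assuming an unlabeled one-document-per-line file task_corpus.txt (a hypothetical name); for brevity it continues only the masked-LM objective, not NSP:

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# task_corpus.txt: one unlabeled task document per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "task_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic masking at BERT's usual 15% rate.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-further-pretrained",
                         max_steps=100_000,  # the step budget the paper reports as best
                         per_device_train_batch_size=16,
                         learning_rate=5e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset, data_collator=collator)
trainer.train()
```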
3. Multi-task fine-tuning

Multi-task fine-tuning trains BERT on several downstream tasks at once, sharing the parameters of every layer except the final task-specific classification layer.
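A minimal sketch of this sharing scheme, with hypothetical label counts matching the four tasks used later in the post:

```python
import torch.nn as nn
from transformers import BertModel

class MultiTaskBert(nn.Module):
    """One shared BERT encoder with a separate classification head per task.

    All encoder layers are shared across tasks; only the final
    task-specific linear layers differ (a sketch, names illustrative).
    """
    def __init__(self, num_labels_per_task):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n) for n in num_labels_per_task])

    def forward(self, input_ids, attention_mask, task_id):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output is the transformed [CLS] representation.
        return self.heads[task_id](out.pooler_output)

# Four tasks, e.g. IMDB(2), Yelp P.(2), AG(4), DBPedia(14) label counts.
model = MultiTaskBert([2, 2, 4, 14])
```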
Experimental results
1. Datasets

The paper uses the IMDB and Yelp review datasets for sentiment analysis; TREC (an open-domain question answering dataset) and Yahoo! Answers for question classification; and AG's News, DBPedia, and Sogou News for topic classification. Texts are tokenized into WordPiece embeddings, with ## marking sub-word continuations. Sogou News documents are split into sentences on ".", "?", and "!".
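A sketch of that splitting step (assuming the full-width Chinese punctuation that the translated ".", "?", "!" stand for, plus their ASCII forms):

```python
import re

def split_sentences(document: str):
    # Split on (full-width and ASCII) periods, question marks, and
    # exclamation marks, dropping empty fragments.
    return [s for s in re.split(r"[。？！.?!]", document) if s.strip()]
```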
2. Fine-tuning strategies
- Long text processing

There are two ways to handle long text: truncation and segmentation.

- Truncation: the most important information in a document is usually at the beginning and the end, so the paper truncates long texts in three ways (a head+tail sketch follows this list):
head-only: keep the first 510 tokens
tail-only: keep the last 510 tokens
head+tail: keep the first 128 and the last 382 tokens
- Segmentation: split the text into k fragments, each fed to BERT like a normal input, with [CLS] as the first token representing that fragment. The fragment representations are then combined with max-pooling, mean-pooling, or self-attention.
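As mentioned above, here is a minimal sketch of head+tail truncation (the tokenizer checkpoint is an illustrative assumption); 128 + 382 tokens plus [CLS] and [SEP] exactly fill BERT's 512-token limit:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def head_tail_truncate(text, head_len=128, tail_len=382):
    """Keep the first 128 and last 382 tokens of a long text, then add
    [CLS]/[SEP] so the sequence fits BERT's 512-token limit (a sketch)."""
    tokens = tokenizer.tokenize(text)
    if len(tokens) > head_len + tail_len:
        tokens = tokens[:head_len] + tokens[-tail_len:]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return tokenizer.build_inputs_with_special_tokens(ids)
```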
In the experimental results below, head+tail performs best on both datasets, presumably because combining the beginning and end of a long document yields more balanced information. Strangely, segmentation is overall worse than truncation; my guess is that cutting a document into several fragments adds instability to the model, and the errors may be magnified when the fragments are combined. Max-pooling and self-attention also emphasize the more useful information in the text, so they beat mean-pooling overall.
2. Layer selection

The paper tests each layer individually, along with concatenating the first four layers, concatenating the last four layers, and concatenating all 12 layers. Concatenating the last four layers turns out to work about as well as the 11th layer alone.
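A sketch of how such layer combinations can be extracted (checkpoint and input text are illustrative):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)

inputs = tokenizer("An example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of 13 tensors (embedding output + 12 layers),
# each of shape (batch, seq_len, 768).
hidden_states = outputs.hidden_states
# Concatenate the [CLS] vector from the last four encoder layers.
cls_last4 = torch.cat([h[:, 0] for h in hidden_states[-4:]], dim=-1)
print(cls_last4.shape)  # torch.Size([1, 3072])
```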
3. Catastrophic forgetting

Catastrophic forgetting means pre-trained knowledge is erased while learning new knowledge. The paper probes BERT's catastrophic forgetting: the figure below plots error-rate curves on IMDB for different learning rates, and a relatively small learning rate clearly works better.
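A sketch of the corresponding optimizer setup: the 2e-5 rate and 10% warmup match commonly reported BERT fine-tuning settings, while the total step budget here is an illustrative assumption:

```python
import torch
from transformers import (BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# A small learning rate such as 2e-5 disturbs the pre-trained weights
# less and so helps avoid catastrophic forgetting.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

total_steps = 10_000  # illustrative training budget
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warm up over the first 10%
    num_training_steps=total_steps)
```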
4. Layer-wise learning rate

On how the layer-wise learning rate affects the model: when the initial (top-layer) learning rate is higher, the decay factor should be lower. Since the deeper model has less left to learn, it needs a relatively low learning rate to fit. Does this also suggest that a suitably fixed learning rate at each layer could make the model optimal?
3. Further pre-training
1. Within-task further pre-training

Here BERT is further pre-trained on the task's own training data. The figure below plots test error rate against the number of pre-training steps: after 100K further pre-training steps, fine-tuning performance improves.
2. In-domain and cross-domain further pre-training

The corpora are grouped into sentiment analysis, question classification, and topic classification domains, and BERT is further pre-trained within and across those domains. The figure below shows the results: "all" means pre-training on the corpora of all domains, and "w/o" is the original BERT. Further pre-training improves on the original BERT, but note that the small-scale TREC corpus actually gets worse after in-domain further pre-training.
The paper also evaluates feeding BERT features into a BiLSTM + self-attention classifier, with the results below (a sketch of that setup follows the list), where:
- BERT-Feat: BERT as features
- BERT-FiT: BERT + Fine-Tuning
- BERT-ITPT-FiT: BERT + withIn-Task Pre-Training + Fine-Tuning
- BERT-IDPT-FiT: BERT + In-Domain Pre-Training + Fine-Tuning
- BERT-CDPT-FiT: BERT + Cross-Domain Pre-Training + Fine-Tuning
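A minimal sketch of the BERT-Feat setup (frozen BERT token features feeding a BiLSTM with self-attention pooling); the hidden size and pooling details are illustrative assumptions, and for brevity padding positions are not masked out of the attention:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertFeatClassifier(nn.Module):
    """BERT-Feat sketch: frozen BERT token features feed a BiLSTM with
    self-attention pooling (layer sizes are illustrative)."""
    def __init__(self, num_labels, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False          # features only, no fine-tuning
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # scores each timestep
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            feats = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        seq, _ = self.lstm(feats)                       # (B, T, 2H)
        weights = torch.softmax(self.attn(seq), dim=1)  # (B, T, 1)
        pooled = (weights * seq).sum(dim=1)             # attention pooling
        return self.classifier(pooled)
```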
4. Multi-task fine-tuning

The paper runs multi-task fine-tuning across four English classification datasets (IMDB, Yelp P., AG, DBP), and also compares models built on the cross-domain pre-trained BERT. The results show that multi-task learning improves BERT, and that multi-task fine-tuning on top of the cross-domain pre-trained model works best of all.
5. Training set size

The paper explores how the size of the fine-tuning training set affects the model. With a relatively small training set the error rate is relatively high; as the training set grows, the error rate drops. (Why the abscissa jumps from 20 straight to 100 confuses me a little.) Here BERT-FiT means BERT + Fine-Tuning, and BERT-ITPT-FiT means BERT + withIn-Task Pre-Training + Fine-Tuning.
6. BERT-Large further pre-training

The paper also applies within-task further pre-training to BERT-Large. Brute force works wonders: sure enough, BERT-Large comes out much better.
Summary

This feels like a very solid piece of work, with a fairly comprehensive set of experiments, though it offers few reflections on or explanations of the results. In short, when fine-tuning BERT, consider further pre-training in your own domain so the model can learn more; and deep learning, as ever, keeps getting miracles out of brute force.
References

How to Fine-Tune BERT for Text Classification: https://arxiv.org/pdf/1905.05583.pdf