BERT fine-tuning tricks: experiments
2022-07-05 02:03:00 【Necther】
Background introduction
Text classification is a classic NLP task, and models pre-trained on large corpora generally do well on it; word2vec, CoVe (contextualized word vectors), and ELMo have all achieved good results. BERT pre-trains a bidirectional Transformer with masked word prediction and NSP (next sentence prediction), and is then fine-tuned on downstream tasks. Since its release BERT has swept the leaderboards, but has its potential been fully exploited? The paper discussed here explores several ways to squeeze more out of BERT for text classification.
These methods are:
- Fine-tuning strategies
- Further pre-training
- Multi-task fine-tuning
Strategy introduction
1. Fine-tuning strategies
Different layers of a neural network capture different syntactic and semantic information. Using BERT for downstream tasks raises several questions:
- How to handle long text, since BERT's maximum input sequence length is 512 tokens
- Which layer to use: as mentioned above, each layer captures different information, so the most suitable layer has to be chosen
- How to avoid over-fitting, which requires an appropriate learning rate. BERT's lower layers learn more general information, so different layers can be given different learning rates. The parameter update of layer $l$ at step $t$ can be written as

$$\theta_t^l = \theta_{t-1}^l - \eta^l \cdot \nabla_{\theta^l} J(\theta)$$

where
- $\theta_t^l$ denotes the parameters of layer $l$ after the $t$-th update step
- $\eta^l$ is the learning rate of layer $l$, set layer by layer via $\eta^{k-1} = \xi \cdot \eta^k$, where $\xi$ is the decay factor. When $\xi < 1$ the learning rate decays layer by layer toward the bottom; when $\xi > 1$ it grows layer by layer; when $\xi = 1$ every layer shares one rate, which is identical to standard BERT fine-tuning.
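As a rough illustration (not the authors' code), here is a minimal sketch of layer-wise learning rate decay, assuming PyTorch and the Hugging Face `transformers` library; the model name and hyper-parameter values are placeholders:

```python
# Minimal sketch of layer-wise learning rate decay for BERT fine-tuning.
# Assumes PyTorch + Hugging Face transformers; not the paper's original code.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

base_lr = 2e-5   # learning rate of the top (last) encoder layer
xi = 0.95        # decay factor: each lower layer gets its rate multiplied by xi

num_layers = model.config.num_hidden_layers  # 12 for bert-base
param_groups = []

# Embeddings sit below layer 0, so they receive the most decayed rate.
param_groups.append({
    "params": model.bert.embeddings.parameters(),
    "lr": base_lr * (xi ** num_layers),
})
# Encoder layer l gets base_lr * xi^(num_layers - 1 - l): lower layers -> smaller lr.
for l, layer in enumerate(model.bert.encoder.layer):
    param_groups.append({
        "params": layer.parameters(),
        "lr": base_lr * (xi ** (num_layers - 1 - l)),
    })
# Pooler and classification head use the full base learning rate.
param_groups.append({"params": model.bert.pooler.parameters(), "lr": base_lr})
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr)
```

With $\xi = 0.95$ and a top-layer rate of 2e-5, layer 0 ends up at roughly 2e-5 × 0.95^11 ≈ 1.1e-5.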
2. Further pre-training
BERT is pre-trained on general-domain corpora. When it is applied to text classification in a specific domain, there is inevitably a gap in data distribution, and further pre-training can help close it. Three variants are considered (a sketch follows the list):
- Within-task pre-training: further pre-train BERT on the training corpus of the task itself
- In-domain pre-training: further pre-train on a corpus from the same domain
- Cross-domain pre-training: further pre-train on corpora from different domains
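As a sketch of what within-task (or in-domain) further pre-training might look like with the Hugging Face `transformers` and `datasets` libraries; the file name `task_corpus.txt`, step count, and hyper-parameters are assumptions, not the paper's setup:

```python
# Sketch of further pre-training: continue masked-LM training on an
# unlabeled task/domain corpus before fine-tuning. Not the paper's code.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

raw = load_dataset("text", data_files={"train": "task_corpus.txt"})  # placeholder file
tokenized = raw["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-further-pretrained",
                         max_steps=100_000,            # assumed, mirrors the 100K steps mentioned later
                         per_device_train_batch_size=32,
                         learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
# The saved checkpoint is then loaded for ordinary fine-tuning on the labeled task.
```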
3. Multi-task fine-tuning
Multi-task fine-tuning trains BERT on several downstream tasks at the same time: all layers are shared across tasks except the final, task-specific classification layer, as in the sketch below.
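A minimal sketch of this setup, assuming PyTorch and Hugging Face `transformers`; the task names and label counts are made up for illustration:

```python
# Sketch of multi-task fine-tuning: one shared BERT encoder, one classification
# head per task. Only the heads are task-specific.
import torch
import torch.nn as nn
from transformers import BertModel

class MultiTaskBert(nn.Module):
    def __init__(self, task_num_labels: dict):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")   # shared layers
        hidden = self.bert.config.hidden_size
        self.heads = nn.ModuleDict({                                 # per-task last layer
            task: nn.Linear(hidden, n) for task, n in task_num_labels.items()
        })

    def forward(self, task, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.pooler_output             # [CLS]-based sentence representation
        return self.heads[task](cls)        # route through the matching task head

model = MultiTaskBert({"imdb": 2, "ag_news": 4, "dbpedia": 14})  # illustrative tasks
# During training, batches from the different tasks are interleaved and the loss
# is computed with the head that matches each batch's task.
```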
Experimental results
1. Data sets
The paper uses the IMDB and Yelp review datasets for sentiment analysis, TREC (open-domain questions) and Yahoo Answers for question classification, and AG News, DBPedia, and Sogou News for topic classification. WordPiece embeddings are used for tokenization, with ## marking sub-word pieces. For Sogou News, sentences are split on ".", "?", and "!".

2. Fine-tuning strategies
1. Long text processing
There are two ways to handle long text: truncation and segmentation (a truncation sketch follows this list).
- Truncation: generally the most important information in a document sits at the beginning and the end, so the paper truncates long text in three ways:
head-only: keep the first 510 tokens
tail-only: keep the last 510 tokens
head+tail: keep the first 128 and the last 382 tokens
- Segmentation: split the text into k segments; each segment is fed to BERT as a normal input, with [CLS] as the first token representing that segment. The segment representations are then combined with max-pooling, average pooling, or self-attention.
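A minimal sketch of the head+tail truncation strategy, assuming the Hugging Face `BertTokenizerFast` (not the paper's original code):

```python
# Sketch of "head+tail" truncation for long documents: keep the first 128 and
# the last 382 tokens, so that 510 tokens + [CLS] + [SEP] = 512.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def head_tail_encode(text, head=128, tail=382):
    tokens = tokenizer.tokenize(text)
    if len(tokens) > head + tail:
        tokens = tokens[:head] + tokens[-tail:]      # drop the middle of the document
    ids = tokenizer.convert_tokens_to_ids(tokens)
    # build_inputs_with_special_tokens wraps the ids with [CLS] ... [SEP]
    return tokenizer.build_inputs_with_special_tokens(ids)

input_ids = head_tail_encode("some very long review ..." * 200)
assert len(input_ids) <= 512
```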
In the experimental results, head+tail performs best on both datasets, presumably because it combines information from the beginning and the end of the document and is therefore more balanced. Oddly, the segmentation methods are overall worse than truncation; my guess is that cutting the text into several segments adds instability, and errors may be amplified when the segments are combined. Max-pooling and self-attention also emphasize the more useful parts of the text, so they do better overall than average pooling.

2. Layer selection
The paper tests each layer individually, as well as concatenating the first four layers, concatenating the last four layers, and concatenating all 12 layers. Concatenating the last four layers turns out to perform about the same as using layer 11 (the last layer) on its own.
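For reference, a minimal sketch of how the last four hidden layers could be concatenated into a text representation, assuming the Hugging Face `transformers` library:

```python
# Sketch: concatenate the [CLS] vectors of BERT's last four hidden layers.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("an example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of 13 tensors (embeddings + 12 layers), each (batch, seq, 768)
last_four = outputs.hidden_states[-4:]
cls_vectors = [h[:, 0, :] for h in last_four]    # [CLS] token of each of the 4 layers
concat_repr = torch.cat(cls_vectors, dim=-1)     # shape: (batch, 4 * 768)
# concat_repr would then be fed to a classification layer.
```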

3. Catastrophic forgetting
Catastrophic forgetting means that pre-trained knowledge is lost while learning new knowledge. The paper examines this problem for BERT: the figure below plots error rate on IMDB over training for several learning rates, and a relatively small learning rate clearly works better.

4. Layer-wise learning rate
On the effect of the layer-wise learning rate: when the initial (top-layer) learning rate is high, the decay factor should be relatively low. The deeper layers have less left to learn and need a relatively low learning rate to fit well. Does this also suggest that a suitably fixed learning rate per layer could bring the model to its optimum?

3. Further pre-training
1. Within-Task Further Pre-Training
Here the task's own training data is used for further pre-training. The figure below plots test error rate against the number of further pre-training steps; after about 100K steps of further pre-training, the subsequent fine-tuning results improve.

2. In-Domain and Cross-Domain Further Pre-Training
The corpora are grouped into sentiment analysis, question classification, and topic classification, and further pre-training is run both within domain and across domains. The figure below shows the results, where "all" means pre-training on the corpora of all domains and "w/o" is the original BERT. Further pre-training improves on the original BERT, but note that for the small-scale TREC corpus the result becomes worse after further pre-training on its own task data.

The paper also evaluates using BERT purely as a feature extractor feeding a BiLSTM with self-attention. The results are shown below (a sketch of the feature-based setting follows the list), where:
- BERT-Feat: BERT as features
- BERT-FiT: BERT + Fine-Tuning
- BERT-ITPT-FiT: BERT + withIn-Task Pre-Training + Fine-Tuning
- BERT-IDPT-FiT: BERT + In-Domain Pre-Training + Fine-Tuning
- BERT-CDPT-FiT: BERT + Cross-Domain Pre-Training + Fine-Tuning
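A minimal sketch of the BERT-Feat setting (frozen BERT feeding a small BiLSTM classifier), assuming PyTorch and Hugging Face `transformers`; the paper's feature model also uses self-attention, which is omitted here:

```python
# Sketch of BERT-as-features: BERT is frozen and used only as a feature
# extractor; its token representations feed a small BiLSTM classifier.
import torch
import torch.nn as nn
from transformers import BertModel

class BertFeatBiLSTM(nn.Module):
    def __init__(self, num_labels, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():          # features only: no gradient updates
            p.requires_grad = False
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            feats = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(feats)
        return self.classifier(lstm_out.mean(dim=1))   # mean-pool over tokens
```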

4. Multi-task fine-tuning
The paper runs multi-task fine-tuning on four English classification datasets (IMDB, Yelp P., AG, DBP), and also compares against a model further pre-trained across domains. The results below show that multi-task learning improves BERT, and that multi-task fine-tuning on top of the cross-domain further pre-trained model works best.

5. Training set size
The paper also examines how the size of the training set affects fine-tuning. When the training set is small the error rate is relatively high, and it drops as the training set grows. Why the x-axis jumps from 20 straight to 100 is a bit puzzling, though. Here BERT-FiT means BERT + Fine-Tuning, and BERT-ITPT-FiT means BERT + withIn-Task Pre-Training + Fine-Tuning.

6. BERT-Large further pre-training
Within-task further pre-training is also applied to BERT-Large. Brute force works wonders: as expected, BERT-Large does much better.

Summary
This feels like a very solid paper: a fairly comprehensive experimental report, though with little analysis or explanation of the results. In short, when fine-tuning BERT it is worth considering further pre-training on in-domain data so the model learns more, and, as always, deep learning keeps getting its miracles out of sheer scale.
Related information
How to Fine-Tune BERT for Text Classification: arxiv.org/pdf/1905.05583.pdf