A Closer Look at How Fine-tuning Changes BERT
ACL 2022. The authors are from the University of Utah.
Paper link: [2106.14282] A Closer Look at How Fine-tuning Changes BERT (arxiv.org)
From the title, I expected the paper to use interpretability methods to compare the BERT model before and after fine-tuning and analyze the effect of the target-task data, perhaps with some theoretical analysis and proofs.
Abstract
The abstract first notes that pretrained models have been popular in recent years and are usually fine-tuned on downstream tasks, where they perform better. The authors argue that fine-tuning increases the distance between the representations of examples with different labels, and they set up five groups of experiments to verify this. They also find that not all fine-tuning makes the model better, and finally note that after fine-tuning the representations still preserve the original spatial structure.
So I was overthinking it; there is not even a comparison of interpretability methods, just the claim that vectors for different labels move farther apart. On reflection, though, this is not vacuous: the original pretrained model does not perform well on downstream tasks without fine-tuning, and applying the model means putting an fc layer and a softmax on top of the representation to classify, so of course fine-tuning pushes the different labels apart. Let's see what the exceptions are and how the authors set up experiments to compare the two spatial structures.
1. Introduction
The introduction first presents the BERT paper, then surveys related work on fine-tuning, and finally states the authors' motivation: how does fine-tuning change the representation, and why does it work? Three questions are raised:
- Does fine-tuning always work?
- How does fine-tuning adjust the representation?
- How much does fine-tuning change each of BERT's layers?
Two probing methods are used:
- a classifier-based probe
- DirectProbe
on five kinds of tasks (POS tagging, dependency head prediction, preposition supersense role, preposition supersense function, and text classification).
The conclusions are as follows:
- The divergence fine-tuning introduces between train and test representations generally has little effect on the results.
- Fine-tuning pushes the representations of different labels farther apart; the distance between label clusters increases.
- Fine-tuning mostly changes only the top layers; the relative positions of the label clusters in the representation space are preserved.
2. Preliminaries: Probing Methods
The paper focuses on analyzing representations, so this section introduces the analysis methods, i.e., the probing methods.
Classifiers as Probes
Simply put, this is a classifier whose input is the embedding representation from the top of the BERT model and whose output is the classification result. The embeddings are frozen and only the classifier is trained, and then the experimental results are compared. Here the classifier is two fc layers with a ReLU activation in between, plus the usual hyperparameter settings.
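As a concrete illustration, here is a minimal sketch of such a probe in PyTorch, assuming frozen representations of dimension `hidden_size` and `num_labels` classes; the class name, layer sizes, and optimizer settings are my own illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class ClassifierProbe(nn.Module):
    """Two fc layers with a ReLU in between, trained on frozen representations."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, frozen_reprs: torch.Tensor) -> torch.Tensor:
        # `frozen_reprs` comes from BERT with gradients disabled;
        # only the probe's own parameters are updated during training.
        return self.net(frozen_reprs)

probe = ClassifierProbe(hidden_size=768, num_labels=17)  # e.g. 17 POS tags
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
```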
DirectProbe: Probing the Geometric Structure
Because a classifier probe cannot directly reflect the geometry of the representation, DirectProbe uses a clustering-like approach: it partitions the embeddings into clusters, computes the distances between the clusters, and compares the number of clusters with the number of labels. The Pearson coefficient between the inter-cluster distance matrices reflects the spatial similarity, which is used to show the difference between the representations before and after fine-tuning.
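A rough sketch of that spatial-similarity comparison, not the actual DirectProbe implementation: given the same labeled points embedded before and after fine-tuning, compute centroid-to-centroid distances in each space and correlate the two distance matrices. The function names and the centroid simplification are my own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

def cluster_distance_matrix(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # One centroid per label cluster, then all pairwise Euclidean distances.
    centroids = np.stack([embeddings[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    return cdist(centroids, centroids)

def spatial_similarity(before: np.ndarray, after: np.ndarray,
                       labels: np.ndarray) -> float:
    d_before = cluster_distance_matrix(before, labels)
    d_after = cluster_distance_matrix(after, labels)
    # The matrices are symmetric, so compare only the upper triangles.
    iu = np.triu_indices_from(d_before, k=1)
    r, _ = pearsonr(d_before[iu], d_after[iu])
    return r  # close to 1.0 => the relative cluster layout is preserved
```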
Both probes, one a classifier and one clustering-based that expresses structure through inter-cluster relations, are very simple methods. Results obtained this way, I think, are not sufficient to explain how fine-tuning optimizes the vector representations.
3. Experimental setup
3.1 Representations
The experiments use vector representations from different layers of BERT models with different hidden sizes. The models are for English text and case-insensitive (uncased); the tokenizer splits words into subwords, and a token's representation is the average pooling of its subword vectors. The implementation uses the huggingface code.
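A minimal sketch of this subword average pooling with the huggingface transformers library; the model name and the choice of the top layer are illustrative assumptions on my part.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

enc = tokenizer("Fine-tuning changes representations.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc).hidden_states  # tuple: embeddings + 12 layers

layer = hidden_states[12][0]  # top layer, shape (num_subwords, 768)
word_ids = enc.word_ids()     # maps each subword to its source word (None = special token)

# Average-pool the subword vectors of each word to get its token representation.
token_repr = {}
for word_idx in set(i for i in word_ids if i is not None):
    piece_rows = [j for j, w in enumerate(word_ids) if w == word_idx]
    token_repr[word_idx] = layer[piece_rows].mean(dim=0)
```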
3.2 Tasks
Common tasks for BERT, covering both syntactic and semantic tasks:
POS: part-of-speech tagging
DEP: dependency head prediction
PS-role / PS-fxn: preposition supersense role and function
Text-Classification: text classification, using the CLS vector as the sentence representation
3.3 Fine-tuning settings
Fine-tuning runs for 10 epochs, and the paper points out that fine-tuning and then training the classifier probe form a two-stage training process. (Isn't that stating the obvious?)
4. Analysis of experimental results
4.1 Fine-tuning performance
Experiments show that fine-tuning makes the training-set and test-set representations diverge. The authors also find that under the BERT-small model, fine-tuning makes the PS-fxn task worse, mainly because the similarity between the training set and the test set is low; no more specific cause is identified. (I find this unsurprising: if the fine-tuning data differs greatly from the test data, fine-tuning pulls the model in the wrong direction, so worse results are entirely possible; models like CLIP also often get worse after fine-tuning.)
4.2 The linearity of vector representation
As the results figure shows, after fine-tuning the number of clusters decreases and linearity increases. Fine-tuning turns the originally complex spatial representation into a simple one: after fine-tuning, the vector clusters move convergently toward their labels.
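One quick proxy for this linearity claim, under my own simplification (not the paper's measurement): fit a linear model on the frozen representations and check its training accuracy; `reprs` and `labels` here are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def linear_separability(reprs: np.ndarray, labels: np.ndarray) -> float:
    # Near-perfect training accuracy of a linear model suggests the
    # label clusters have become (close to) linearly separable.
    clf = LinearSVC(max_iter=10000).fit(reprs, labels)
    return clf.score(reprs, labels)
```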
4.3 Spatial structure of labels
The figure above shows PCA projections of the vector representations at the top and bottom layers of bert-base, indicating that fine-tuning increases the distance between clusters of different labels. (If fine-tuning improves the results, isn't it obvious that the distances between the representations grow?)
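A sketch of the kind of 2-D PCA view behind that figure, with a crude separation proxy of my own (mean distance between label centroids in the projected space); `reprs` and `labels` are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(reprs: np.ndarray) -> np.ndarray:
    # Project high-dimensional representations onto their top two PCs.
    return PCA(n_components=2).fit_transform(reprs)

def mean_centroid_distance(points2d: np.ndarray, labels: np.ndarray) -> float:
    # Larger values after fine-tuning = label clusters pushed farther apart.
    cents = np.stack([points2d[labels == c].mean(axis=0) for c in np.unique(labels)])
    dists = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=-1)
    iu = np.triu_indices_from(dists, k=1)
    return dists[iu].mean()
```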
4.4 Cross-task fine-tuning
The authors also reason that since fine-tuning on one task increases the distances between that task's labels, the distances between another task's labels should shrink accordingly. They check this experimentally: fine-tune on different tasks, then run the probe test on PS-fxn. The results show that fine-tuning on a highly similar task can transfer well across tasks, while fine-tuning on a weakly related task reduces the effect. (The experiments in this part also feel rather obvious.)
Finally, the Pearson coefficients between the vector representations of different layers show that fine-tuning hardly modifies the information encoded by the pre-trained model: the lower layers change very little, with the changes concentrated near the top.
Summary
To summarize: the article mainly describes why fine-tuning works well, makes a comparative analysis of the vector representations, and finally analyzes how fine-tuning affects the representations at BERT's different layers. It uses probes (classification and clustering on top of the representation vectors), the experimental ideas are very simple, and what is proved is mostly what we would take for granted. Still, the paper is well written, the experiments are thorough, and the amount of work is sufficient. After reading it I see no methods worth borrowing, but proving things that are taken for granted is sometimes necessary.
Working from home over the recent holidays, I have been playing basketball and working out every day; study time has shrunk a lot and my pace has slowed. I need to push harder from now on.