A Closer Look at How Fine-tuning Changes BERT
2022-07-06 07:51:00 【be_humble】
ACL 2022. The authors are from the University of Utah.
Paper link: [2102.12452] Probing Classifiers: Promises, Shortcomings, and Advances (arxiv.org)
From the title alone, I expected the paper to use interpretability methods to compare the BERT model before and after fine-tuning, analyzing the effect of the target-task data, perhaps with some theoretical analysis or proofs.
Abstract
The abstract first notes that pre-trained models have been hugely popular in recent years and are usually fine-tuned on downstream tasks, where they perform better. The authors argue that fine-tuning increases the distance between the representations of different labels, and design five groups of experiments to demonstrate it. Along the way they find that not all fine-tuning improves the model, and they conclude that after fine-tuning the representations still preserve the original spatial structure.
So I was overthinking it: there is not even a comparison of interpretability methods, just the claim that vectors of different labels move apart. On reflection, though, it is not a trivial statement. A pre-trained model does not perform well on downstream tasks without fine-tuning, and applications classify by feeding the representation through an FC layer and a softmax, so of course fine-tuning pushes different labels apart. The interesting parts are the exceptions, and how the authors set up experiments to compare the two spatial structures.
1. Introduction
The introduction first presents the BERT paper, then surveys related work on fine-tuning, and finally states the motivation: how does fine-tuning change the representation, and why does it work? Three questions are raised:

- Does fine-tuning always work?
- How does fine-tuning adjust the representation?
- How much does fine-tuning change BERT's different layers?
Two probing methods are used:

- A classifier-based probe
- DirectProbe

across five tasks (part-of-speech tagging, dependency head prediction, preposition supersense role, preposition supersense function, text classification).
The conclusions are:

- Differences between the training and test sets generally have little effect on the results of fine-tuning.
- Fine-tuning pushes the representations of different labels farther apart; the distance between clusters of different labels increases.
- Fine-tuning only slightly changes the top layers, and the relative positions of the label clusters in the representation space are preserved.
2. Preliminaries: Probing Methods
The paper is mainly an analysis of representations, so this section introduces the analysis tools, i.e. the probing methods.
Classifiers as Probes
Simply put, this is a classifier whose input is the embedding/representation from the top of the BERT model and whose output is the classification result. The embeddings are frozen and only the classifier is trained, and the resulting accuracies are compared. Here the classifier is two FC layers with a ReLU activation in between, plus the usual hyperparameter settings.
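To make this concrete, here is a minimal sketch of such a classifier probe in PyTorch. The layer sizes, learning rate, and the random data are illustrative stand-ins, not the paper's exact settings; the point is that only the probe's parameters are trained while the BERT embeddings stay frozen.

```python
import torch
import torch.nn as nn

class ClassifierProbe(nn.Module):
    """Two fully connected layers with a ReLU in between."""
    def __init__(self, embed_dim: int, num_labels: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Embeddings come from a frozen BERT, so only the probe's parameters train.
probe = ClassifierProbe(embed_dim=768, num_labels=17)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

embeddings = torch.randn(32, 768)        # stand-in for frozen BERT output
labels = torch.randint(0, 17, (32,))     # stand-in for gold labels
loss = loss_fn(probe(embeddings), labels)
loss.backward()
optimizer.step()
```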
DirectProbe: Probing the Geometric Structure
A classifier probe cannot directly reflect the geometry of the representation space, so a clustering-like method is used instead: clusters are derived from the embeddings, distances between clusters are computed, and the number of clusters is compared against the number of labels. The Pearson coefficient between inter-cluster distances measures how similar two spaces are, which is used to show the difference in representations before and after fine-tuning.
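As a rough illustration of this idea (not DirectProbe's actual algorithm, which builds finer-grained clusters), the sketch below groups embeddings by gold label, computes pairwise distances between label centroids, and correlates those distance profiles before and after fine-tuning; a Pearson coefficient near 1 means the relative layout of the clusters is preserved.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def label_centroid_distances(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Mean embedding per label, then condensed pairwise centroid distances."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return pdist(centroids)

def spatial_similarity(before: np.ndarray, after: np.ndarray, labels: np.ndarray) -> float:
    """Pearson correlation between the two distance profiles."""
    r, _ = pearsonr(label_centroid_distances(before, labels),
                    label_centroid_distances(after, labels))
    return r
```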
Both probes, one a classifier and one a clustering method that expresses structure through relations among clusters, are very simple. To me, results obtained this way make only a weak case for how fine-tuning optimizes the vector representations.
3. Experimental setup
3.1 Representations
Vector representations are taken from different layers of the BERT model, with different hidden sizes. The models are for English text, uncased, and tokenize into subwords; average pooling over a word's subwords yields its token representation. The code uses huggingface.
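A minimal sketch of this setup with the huggingface transformers library is shown below; the example words are made up, and bert-base-uncased stands in for whichever checkpoint is analyzed. Each word's representation is the average of its subword vectors, and for the text-classification task (see 3.2) the [CLS] vector serves as the sentence representation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

words = ["fine-tuning", "changes", "representations"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)

# Average-pool the subword pieces belonging to each original word.
word_ids = enc.word_ids()   # maps each subword position to its word index
token_reps = []
for i in range(len(words)):
    piece_idx = [p for p, w in enumerate(word_ids) if w == i]
    token_reps.append(hidden[piece_idx].mean(dim=0))

cls_rep = hidden[0]   # [CLS] vector, used as the sentence representation
```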
3.2 Tasks
Standard BERT evaluation tasks, covering both syntactic and semantic tasks:

- POS: part-of-speech tagging
- DEP: dependency parsing (head prediction)
- PS-role: preposition supersense role disambiguation
- PS-fxn: preposition supersense function prediction
- Text-Classification: text classification, using the [CLS] vector as the sentence representation
3.3 Fine-tuning settings
Ten epochs. The paper also points out that fine-tuning and then training the classifier probe is a two-stage training process, which is rather stating the obvious.
4. Analysis of experimental results
4.1 Fine-tuning performance
The experiments show that fine-tuning can make training-set and test-set performance diverge, and that under the BERT-small model fine-tuning actually hurts the PS-fxn task. The main cause given is low similarity between the training and test sets, and no more specific reason is found. (This feels like a non-finding to me: if the fine-tuning data differ greatly from the test data, fine-tuning is simply pulling in the wrong direction, so poor results are entirely expected; models like CLIP likewise often fine-tune poorly in such settings.)
4.2 Linearity of the vector representations

As the results figure shows, after fine-tuning the number of clusters decreases and linearity increases: fine-tuning simplifies the originally complex spatial representation, and the vector clusters move purposefully toward their labels.
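One hedged way to illustrate the linearity claim: fit a plain linear classifier on frozen representations before and after fine-tuning; if the fine-tuned features score markedly higher for a linear model, the label regions have become more linearly separable. The arrays below are random stand-ins for extracted representations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_before = rng.normal(size=(200, 768))   # stand-in: pre-trained representations
X_after = rng.normal(size=(200, 768))    # stand-in: fine-tuned representations
y = rng.integers(0, 5, size=200)         # stand-in: gold labels

for name, X in [("pre-trained", X_before), ("fine-tuned", X_after)]:
    # Cross-validated accuracy of a purely linear probe on frozen features.
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    print(f"{name}: linear-probe accuracy = {acc:.3f}")
```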
4.3 Spatial structure of labels
The figures show PCA projections of the top- and bottom-layer vector representations of bert-base, indicating that fine-tuning pushes the clusters of different labels farther apart. (Given that fine-tuning improves performance, isn't it obvious that the representation distances grow?)
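For reference, a minimal sketch of this kind of PCA plot, with random placeholders for one layer's representations and labels:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 768))   # stand-in for one layer's representations
y = rng.integers(0, 5, size=300)  # stand-in for gold labels

# Project to 2D and color by label to eyeball cluster separation.
proj = PCA(n_components=2).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y, s=8, cmap="tab10")
plt.title("PCA of token representations (colored by label)")
plt.show()
```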
4.4 Cross-task fine-tuning

The authors also reason that since fine-tuning on one task increases the distances between that task's labels, the label distances for other tasks might shrink correspondingly. They check this with the experiment above: fine-tune on different tasks, then run the PS-fxn probe. The results show that fine-tuning on a highly similar task can transfer well across tasks, while fine-tuning on a weakly related task reduces performance. (This part's findings also feel rather predictable.)
Finally, the Pearson coefficients between the vector representations of the different layers show that fine-tuning barely modifies the information encoded by the pre-trained model, and the changes in the high layers are very small.
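The layer-wise comparison can be sketched by reusing the centroid-distance idea from section 2: compute a Pearson coefficient per layer between the pre-trained and fine-tuned distance profiles. The helper is repeated here so the snippet runs on its own; the inputs are assumed to be lists of per-layer embedding matrices.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def label_centroid_distances(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Mean embedding per label, then condensed pairwise centroid distances."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return pdist(centroids)

def layerwise_similarity(layers_before, layers_after, labels):
    """Pearson r per layer; values near 1 mean the spatial structure survived."""
    return [pearsonr(label_centroid_distances(b, labels),
                     label_centroid_distances(a, labels))[0]
            for b, a in zip(layers_before, layers_after)]
```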
Summary

To summarize: the paper sets out to explain why fine-tuning works, compares and analyzes the vector representations, and finally examines how fine-tuning affects the representations at BERT's different layers. It uses probes (classify and cluster on top of the representation vectors), so the experimental idea is very simple, and what it proves is largely what we would take for granted. But the writing is clear, the experiments are thorough, and the work is solid. After reading it I see little to borrow methodologically, yet rigorously proving the things everyone takes for granted is sometimes necessary.
I have been back home working over the recent holidays, playing basketball and working out almost every day, so study time has shrunk a lot and my pace has slowed. I need to push harder.