当前位置:网站首页>A Closer Look at How Fine-tuning Changes BERT

A Closer Look at How Fine-tuning Changes BERT

2022-07-06 07:51:00 be_ humble

A Closer Look at How Fine-tuning Changes BERT

ACL 2022, The author comes from every Utah University

Thesis link :[2102.12452] Probing Classifiers: Promises, Shortcomings, and Advances (arxiv.org)


After reading the title, I think it should be compared by using interpretable methods fine-tune Front and rear bert The model analyzes the effect of target task data , There may be some theoretical analysis , Proof and so on .


Abstract

First of all, let's talk about recent years as a whole pretrain Model fire , Usually in downstream tasks fine-tune, Perform better , The author reasoned fine-tune Added different label data representation Distance of , Then set the 5 A group of experiments proved , At the same time, I found that not all fine-tune Will make the model better , In the end, I mentioned fine-tune After representation The original spatial structure will still be preserved .


It seems that I think too much , Even the comparison of interpretable methods is useless , It's simply different label The vector is far away , Here's a look , It's not nonsense , original pretrain Our model does not perform directly in downstream tasks fine-tune good , Model applications are representation One after another fc layer softmax To classify , Of course fine-tune It's different label It's a long way away . Then we'll see what the exceptions are and how the author sets up experiments to compare the two spatial structures .


1. Introduction

First introduced Bert The paper , Then on fine-tune Use related work to introduce , Finally, I put forward myself motivation:fine-tuning How to change representation And why it works , Put forward 3 A question

  • fine-tuning Whether it has always worked
  • fine-tuning How to adjust representation
  • fine-tuning stay bert Different laryers How much has changed

Use two kinds of probing Method :

  • Based on classification probe
  • Direct Probe

stay 5 Class different tasks (POS, Rely on header prediction , Deixis disambiguation , Function prediction , Text classification )

The conclusion is as follows :

  • fine-tune stay train and test The differences of , Generally, it has little effect on the results

  • fine-tune take representation Different label Distance increases , Different label Of cluster Distance increases

  • fine-tune Only slightly changed the top ,representation The relative position of the label cluster is preserved .


2. Preliminaries: Probing Methods

This paper mainly aims at representation analysis , The following mainly introduces the analysis method , Probe method

Classifier as Probes

Simply speaking , Is a classifier , The input is us bert At the top of the model embedding representation, The output is the classification result , adopt freeze embedding, Only train classifiers , Then compare the experimental results , Here the classifier is used twice fc, Followed by relu Activation and other super parameter settings

DirectProbe: Probing the Geometric Structure

Because it cannot be reflected directly with classifier probe representation The performance of the , Use similar clustering methods , according to embedding Get different clusters , Calculate the distance between clusters , The number of clusters is the same as label Number comparison , By calculating the distance between clusters Person coefficient To reflect the spatial similarity . To show fine-tune Before and after representation The difference .


Probe method , One by one classifier , A cluster , Express by clustering among clusters , Are very simple methods , I think I get the result like this , explain fine-tune The optimization of vector performance is not sufficient .


3. Experimental setup

3.1 representations

Use bert Different layers of the model , Different hidden_size Vector representation of , The model is for English text , Case insensitive (uncased), Use word segmentation subwords, Use average pooling to represent token representation, Code using huggingface Code .

3.2 Tasks

in the light of bert Common tasks , Covering grammatical and semantic tasks

  • POS Morphological tagging

  • DEP dependency parsing

  • PS-role Deixis disambiguation

  • Text -Classification Text classification , Use CLS As a sentence, it means

3.3 Fine tune settings

10 individual epoch, And it is pointed out that fine-tune And training classifier probe two-stage training process , Isn't that bullshit ,,

4. Analysis of experimental results

4.1 fine-tune performance

 Insert picture description here

Experiments show that fine-tune Make the training set and test set diverge , And found in Bert-small Under the model PS-fxn Mission fine-tune The effect is worse , But the main reason is that the similarity between the training set and the test set is low , Then no specific reason was found ( I think this is nonsense , you fine-tune Data and test data are very different , that fine-tune Not only in the wrong direction , It is very possible that the effect is poor , And it's similar clip Wait for the model fine-tune There are also many with poor results )

4.2 The linearity of vector representation

 Insert picture description here

As shown in the result chart , It shows fine-tune After clustering, the number of clusters decreases , Linear enhancement . Fine tuning makes the original complex spatial representation simple , After fine tuning, the vector cluster has a purpose label Convergence mobile .

 Insert picture description here

4.3 Spatial structure of labels

 Insert picture description here

Above, bert-base Represented by vectors at the top and bottom PCA Dimension reduction result graph , indicate fine-tune Can make a difference label Increase the distance between clusters ( this fine-tune The effect increases , Isn't it obvious that the vector represents the increase of distance )

4.4 Cross task fine-tune

 Insert picture description here

The author also considers that since fine-tune One task Will increase the corresponding label Distance of , So for others task Of label The distance should be reduced accordingly , Check the effect through the experiment above , Different task The task is going on fine-tune, And then again PS-fxn Probe test results , The results show that the tasks with high similarity fine-tune It is possible to cross-task fine-tune The effect is good , Low correlation task Conduct fine-tune It will reduce the effect .( The experimental feeling of this part is also obvious )

 Insert picture description here

Finally, the vector representation of different layers person coefficient , prove fine-tune The information representation of the pre training model is hardly modified ,high layer The change is very small .

summary

To summarize this article , Mainly to describe fine-tune Why the effect is good , And then make a comparative analysis on the vector representation , Final analysis fine-tune Yes bert Different layer vectors represent the effect . Usage probe ( Based on the representation vector, then classify and cluster ), The experimental idea is very simple , The content of the proof is also the result we take for granted , But the article is well described , The experiment is full , Enough work , I think there is no reference value in methods after reading , But this kind of thing proves to be taken for granted , Sometimes it is necessary .


Usage probe ( Based on the representation vector, then classify and cluster ), The experimental idea is very simple , The content of the proof is also the result we take for granted , But the article is well described , The experiment is full , Enough work , I think there is no reference value in methods after reading , But this kind of thing proves to be taken for granted , Sometimes it is necessary .


Go home and work during recent holidays , I often play basketball every day , Bodybuilding , The time for study has been reduced a lot , Slower , I must refuel recently .

原网站

版权声明
本文为[be_ humble]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/187/202207060748208381.html