A Closer Look at How Fine-tuning Changes BERT
2022-07-06 07:51:00 【be_humble】
ACL 2022. The authors are from the University of Utah.
Paper link: [2102.12452] Probing Classifiers: Promises, Shortcomings, and Advances (arxiv.org)
From the title alone, I expected the paper to use interpretability methods to compare the BERT model before and after fine-tuning, analyzing the effect of the target-task data, perhaps with some theoretical analysis or proofs.
Abstract
The abstract first notes that pre-trained models have been hugely popular in recent years and are usually fine-tuned on downstream tasks, where they perform better. The authors argue that fine-tuning increases the distance between the representations of different labels, and design five groups of experiments to demonstrate it. Along the way they find that not all fine-tuning improves the model, and they conclude that after fine-tuning the representations still preserve the original spatial structure.
So I was overthinking it: there is not even a comparison of interpretability methods, just the claim that vectors of different labels move apart. On reflection, though, it is not a trivial statement. A pre-trained model does not perform well on downstream tasks without fine-tuning, and applications classify by feeding the representation through an FC layer and a softmax, so of course fine-tuning pushes different labels apart. The interesting parts are the exceptions, and how the authors set up experiments to compare the two spatial structures.
1. Introduction
The introduction first presents the BERT paper, then surveys related work on fine-tuning, and finally states the motivation: how does fine-tuning change the representation, and why does it work? Three questions are raised:

- Does fine-tuning always work?
- How does fine-tuning adjust the representation?
- How much does fine-tuning change BERT's different layers?
Two probing methods are used:

- A classifier-based probe
- DirectProbe

across five tasks (part-of-speech tagging, dependency head prediction, preposition supersense role, preposition supersense function, text classification).
The conclusions are:

- Differences between the training and test sets generally have little effect on the results of fine-tuning.
- Fine-tuning pushes the representations of different labels farther apart; the distance between clusters of different labels increases.
- Fine-tuning only slightly changes the top layers, and the relative positions of the label clusters in the representation space are preserved.
2. Preliminaries: Probing Methods
The paper is mainly an analysis of representations, so this section introduces the analysis tools, i.e. the probing methods.
Classifiers as Probes
Simply put, this is a classifier whose input is the embedding/representation from the top of the BERT model and whose output is the classification result. The embeddings are frozen and only the classifier is trained, and the resulting accuracies are compared. Here the classifier is two FC layers with a ReLU activation in between, plus the usual hyperparameter settings.
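To make this concrete, here is a minimal sketch of such a classifier probe in PyTorch. The layer sizes, learning rate, and the random data are illustrative stand-ins, not the paper's exact settings; the point is that only the probe's parameters are trained while the BERT embeddings stay frozen.

```python
import torch
import torch.nn as nn

class ClassifierProbe(nn.Module):
    """Two fully connected layers with a ReLU in between."""
    def __init__(self, embed_dim: int, num_labels: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Embeddings come from a frozen BERT, so only the probe's parameters train.
probe = ClassifierProbe(embed_dim=768, num_labels=17)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

embeddings = torch.randn(32, 768)        # stand-in for frozen BERT output
labels = torch.randint(0, 17, (32,))     # stand-in for gold labels
loss = loss_fn(probe(embeddings), labels)
loss.backward()
optimizer.step()
```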
DirectProbe: Probing the Geometric Structure
A classifier probe cannot directly reflect the geometry of the representation space, so a clustering-like method is used instead: clusters are derived from the embeddings, distances between clusters are computed, and the number of clusters is compared against the number of labels. The Pearson coefficient between inter-cluster distances measures how similar two spaces are, which is used to show the difference in representations before and after fine-tuning.
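As a rough illustration of this idea (not DirectProbe's actual algorithm, which builds finer-grained clusters), the sketch below groups embeddings by gold label, computes pairwise distances between label centroids, and correlates those distance profiles before and after fine-tuning; a Pearson coefficient near 1 means the relative layout of the clusters is preserved.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def label_centroid_distances(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Mean embedding per label, then condensed pairwise centroid distances."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return pdist(centroids)

def spatial_similarity(before: np.ndarray, after: np.ndarray, labels: np.ndarray) -> float:
    """Pearson correlation between the two distance profiles."""
    r, _ = pearsonr(label_centroid_distances(before, labels),
                    label_centroid_distances(after, labels))
    return r
```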
Both probes, one a classifier and one a clustering method that expresses structure through relations among clusters, are very simple. To me, results obtained this way make only a weak case for how fine-tuning optimizes the vector representations.
3. Experimental setup
3.1 Representations
Vector representations are taken from different layers of the BERT model, with different hidden sizes. The models are for English text, uncased, and tokenize into subwords; average pooling over a word's subwords yields its token representation. The code uses huggingface.
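A minimal sketch of this setup with the huggingface transformers library is shown below; the example words are made up, and bert-base-uncased stands in for whichever checkpoint is analyzed. Each word's representation is the average of its subword vectors, and for the text-classification task (see 3.2) the [CLS] vector serves as the sentence representation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

words = ["fine-tuning", "changes", "representations"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)

# Average-pool the subword pieces belonging to each original word.
word_ids = enc.word_ids()   # maps each subword position to its word index
token_reps = []
for i in range(len(words)):
    piece_idx = [p for p, w in enumerate(word_ids) if w == i]
    token_reps.append(hidden[piece_idx].mean(dim=0))

cls_rep = hidden[0]   # [CLS] vector, used as the sentence representation
```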
3.2 Tasks
Standard BERT evaluation tasks, covering both syntactic and semantic tasks:

- POS: part-of-speech tagging
- DEP: dependency parsing (head prediction)
- PS-role: preposition supersense role disambiguation
- PS-fxn: preposition supersense function prediction
- Text-Classification: text classification, using the [CLS] vector as the sentence representation
3.3 Fine-tuning settings
Ten epochs. The paper also points out that fine-tuning and then training the classifier probe is a two-stage training process, which is rather stating the obvious.
4. Analysis of experimental results
4.1 Fine-tuning performance
The experiments show that fine-tuning can make training-set and test-set performance diverge, and that under the BERT-small model fine-tuning actually hurts the PS-fxn task. The main cause given is low similarity between the training and test sets, and no more specific reason is found. (This feels like a non-finding to me: if the fine-tuning data differ greatly from the test data, fine-tuning is simply pulling in the wrong direction, so poor results are entirely expected; models like CLIP likewise often fine-tune poorly in such settings.)
4.2 Linearity of the vector representations

As the results figure shows, after fine-tuning the number of clusters decreases and linearity increases: fine-tuning simplifies the originally complex spatial representation, and the vector clusters move purposefully toward their labels.
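One hedged way to illustrate the linearity claim: fit a plain linear classifier on frozen representations before and after fine-tuning; if the fine-tuned features score markedly higher for a linear model, the label regions have become more linearly separable. The arrays below are random stand-ins for extracted representations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_before = rng.normal(size=(200, 768))   # stand-in: pre-trained representations
X_after = rng.normal(size=(200, 768))    # stand-in: fine-tuned representations
y = rng.integers(0, 5, size=200)         # stand-in: gold labels

for name, X in [("pre-trained", X_before), ("fine-tuned", X_after)]:
    # Cross-validated accuracy of a purely linear probe on frozen features.
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    print(f"{name}: linear-probe accuracy = {acc:.3f}")
```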
4.3 Spatial structure of labels
The figures show PCA projections of the top- and bottom-layer vector representations of bert-base, indicating that fine-tuning pushes the clusters of different labels farther apart. (Given that fine-tuning improves performance, isn't it obvious that the representation distances grow?)
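For reference, a minimal sketch of this kind of PCA plot, with random placeholders for one layer's representations and labels:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 768))   # stand-in for one layer's representations
y = rng.integers(0, 5, size=300)  # stand-in for gold labels

# Project to 2D and color by label to eyeball cluster separation.
proj = PCA(n_components=2).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=y, s=8, cmap="tab10")
plt.title("PCA of token representations (colored by label)")
plt.show()
```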
4.4 Cross-task fine-tuning

The authors also reason that since fine-tuning on one task increases the distances between that task's labels, the label distances for other tasks might shrink correspondingly. They check this with the experiment above: fine-tune on different tasks, then run the PS-fxn probe. The results show that fine-tuning on a highly similar task can transfer well across tasks, while fine-tuning on a weakly related task reduces performance. (This part's findings also feel rather predictable.)
Finally, the Pearson coefficients between the vector representations of the different layers show that fine-tuning barely modifies the information encoded by the pre-trained model, and the changes in the high layers are very small.
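The layer-wise comparison can be sketched by reusing the centroid-distance idea from section 2: compute a Pearson coefficient per layer between the pre-trained and fine-tuned distance profiles. The helper is repeated here so the snippet runs on its own; the inputs are assumed to be lists of per-layer embedding matrices.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def label_centroid_distances(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Mean embedding per label, then condensed pairwise centroid distances."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return pdist(centroids)

def layerwise_similarity(layers_before, layers_after, labels):
    """Pearson r per layer; values near 1 mean the spatial structure survived."""
    return [pearsonr(label_centroid_distances(b, labels),
                     label_centroid_distances(a, labels))[0]
            for b, a in zip(layers_before, layers_after)]
```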
Summary

To summarize: the paper sets out to explain why fine-tuning works, compares and analyzes the vector representations, and finally examines how fine-tuning affects the representations at BERT's different layers. It uses probes (classify and cluster on top of the representation vectors), so the experimental idea is very simple, and what it proves is largely what we would take for granted. But the writing is clear, the experiments are thorough, and the work is solid. After reading it I see little to borrow methodologically, yet rigorously proving the things everyone takes for granted is sometimes necessary.
I have been back home working over the recent holidays, playing basketball and working out almost every day, so study time has shrunk a lot and my pace has slowed. I need to push harder.