
Revisiting Self-Training for Few-Shot Learning of Language Model,EMNLP2021

2022-06-11 09:34:00 Echo in May


Method

This paper proposes SFLM, a self-training framework for few-shot text learning built on prompt-based masked language models. Given a text sample, SFLM generates a weakly augmented view and a strongly augmented view of the same input, produces a pseudo label on the weakly augmented view, and, when fine-tuning on the strongly augmented view, trains the model to predict that same pseudo label.
The overall model is as follows:
[Figure: overview of the SFLM framework]
First, let $\mathcal{X}$ denote the labeled dataset with $N$ examples per class, and $\mathcal{U}$ the unlabeled training set with $\mu N$ examples per class, where $\mu$ is an integer greater than 1, ensuring that the number of labeled samples is always smaller than the number of unlabeled samples. As shown in the figure, the overall loss of the model consists of three terms:
$$\mathcal{L} = \mathcal{L}_s + \lambda_1 \mathcal{L}_{st} + \lambda_2 \mathcal{L}_{ssl}$$
$\mathcal{L}_s$ is the supervised loss, and the other two terms are self-supervised. There are two kinds of self-supervision: one is an MLM-style objective that predicts the masked-out words of a sentence, namely $\lambda_2 \mathcal{L}_{ssl}$; the other takes the pseudo label generated from the weakly augmented view and forces the prediction on the strongly augmented view of the same sample to agree with it. Since labels are predicted via prompt learning, we first describe the prompt method.
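As a rough illustration of how these three terms are combined, here is a minimal sketch; the weight values and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch

# Illustrative weights for the two auxiliary terms (values are assumptions).
lambda_1, lambda_2 = 1.0, 1.0

def combine_losses(loss_s: torch.Tensor,
                   loss_st: torch.Tensor,
                   loss_ssl: torch.Tensor) -> torch.Tensor:
    """Overall objective: supervised + self-training + masked-LM terms."""
    return loss_s + lambda_1 * loss_st + lambda_2 * loss_ssl

# Example with dummy scalar losses:
total = combine_losses(torch.tensor(0.7), torch.tensor(0.4), torch.tensor(1.2))
```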

Prompt-based supervised loss

Prompt learning builds a template that connects the input text with the label to be predicted. This paper uses a suffix template of the form "It is [MASK]", i.e.:
$$\tilde{x} = x \;\oplus\; \text{"It is [MASK]."}$$
The model then predicts the word at the [MASK] position, which corresponds to, for example, the sentiment of the sentence. A verbalizer mapping $\mathcal{M}$ associates each label word with a class, so predicting [MASK] directly yields the class. Formally:
$$p(y \mid x) = p\big([\text{MASK}] = \mathcal{M}(y) \mid \tilde{x}\big) = \frac{\exp\big(\mathbf{w}_{\mathcal{M}(y)} \cdot \mathbf{h}_{[\text{MASK}]}\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(\mathbf{w}_{\mathcal{M}(y')} \cdot \mathbf{h}_{[\text{MASK}]}\big)}$$
where $\mathbf{h}_{[\text{MASK}]}$ is the hidden state at the [MASK] position and $\mathbf{w}_v$ is the MLM head weight of word $v$.
The model is then fine-tuned with a cross-entropy loss over the predicted label words:
$$\mathcal{L}_s = -\frac{1}{|\mathcal{X}|} \sum_{(x_i, y_i) \in \mathcal{X}} \log p(y_i \mid x_i)$$
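To make this concrete, below is a minimal sketch of prompt-based classification with a masked LM using the HuggingFace `transformers` library. The checkpoint (`bert-base-uncased`), the verbalizer words ("terrible"/"great"), and the exact template wording are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Verbalizer M: map each class to a single label word (illustrative choice).
verbalizer = {0: "terrible", 1: "great"}
label_token_ids = torch.tensor(
    [tokenizer.convert_tokens_to_ids(w) for w in verbalizer.values()]
)

def prompt_logits(sentence: str) -> torch.Tensor:
    """Append the suffix template and read MLM logits at the [MASK] position,
    restricted to the verbalizer words."""
    text = f"{sentence} It is {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits                    # (1, seq_len, vocab)
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return logits[0, mask_pos, label_token_ids]        # (num_classes,)

# Supervised loss L_s: cross-entropy between verbalizer logits and the gold label.
logits = prompt_logits("The movie was a delight.")
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
```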

Self-training loss

For an unlabeled sample $u_i$, the weakly and strongly augmented views are denoted $\alpha(u_i)$ and $\mathcal{A}(u_i)$ respectively. Strong augmentation masks tokens with the same 15% probability used by BERT. Both augmented views then pass through a dropout layer with rate 0.1. The pseudo label is computed in the same way as the prompt-based prediction of label words:
$$q_i = p\big(y \mid \alpha(u_i)\big)$$
To obtain the label corresponding to the distribution $q_i$, we take the arg max:
$$\hat{q}_i = \arg\max_y \, q_i(y)$$
The self-training loss can then be written as:
$$\mathcal{L}_{st} = \frac{1}{|\mathcal{U}|} \sum_{u_i \in \mathcal{U}} \mathbb{1}\big[\max(q_i) \ge \tau\big]\; H\big(\hat{q}_i,\; p(y \mid \mathcal{A}(u_i))\big)$$
Here $\tau$ is a confidence threshold: the indicator $\mathbb{1}[\cdot]$ equals 1 only when the maximum probability exceeds this value. $H$ denotes the cross-entropy between the pseudo label and the LM's prediction on the strongly augmented view.
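Below is a minimal sketch of this self-training step: a 15%-masking strong augmentation and a FixMatch-style consistency loss. The function names and the threshold value are my own illustrative choices; the class probabilities are assumed to come from the prompt-based prediction shown earlier.

```python
import torch
import torch.nn.functional as F

TAU = 0.95        # confidence threshold tau (illustrative value)
MASK_PROB = 0.15  # same token-masking probability as BERT's MLM

def strong_augment(input_ids: torch.Tensor, mask_token_id: int) -> torch.Tensor:
    """Strong augmentation A(u): randomly replace 15% of tokens with [MASK].
    (In practice special tokens and the template's own [MASK] are excluded.)"""
    ids = input_ids.clone()
    noise = torch.rand(ids.shape, device=ids.device) < MASK_PROB
    ids[noise] = mask_token_id
    return ids

def self_training_loss(q_weak: torch.Tensor, p_strong: torch.Tensor) -> torch.Tensor:
    """Consistency loss between views.
    q_weak:   (batch, C) class probabilities from the weakly augmented view
    p_strong: (batch, C) class probabilities from the strongly augmented view"""
    q = q_weak.detach()                            # pseudo labels carry no gradient
    pseudo = q.argmax(dim=-1)                      # hard pseudo label \hat{q}_i
    keep = (q.max(dim=-1).values >= TAU).float()   # keep only confident samples
    ce = F.nll_loss(torch.log(p_strong + 1e-12), pseudo, reduction="none")
    return (keep * ce).mean()
```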

Experiment

The method itself is quite simple, and most of the paper is devoted to experiments. Two kinds of tasks are evaluated, text classification and text-pair matching, with $N=6$ and $\mu=4$. The overall results are as follows:
[Table: overall few-shot results]
Experiments on the parameters $N$ and $\mu$:
[Figure: effect of varying $N$ and $\mu$]
Comparison of different data augmentation strategies:
[Figure: comparison of data augmentation strategies]
Results with another language model, DistilRoBERTa:
[Table: results with DistilRoBERTa]
Cross-dataset zero-shot study:
[Table: cross-dataset zero-shot results]

Copyright notice: this article was written by [Echo in May]; please include a link to the original when reposting: https://yzsam.com/2022/03/202203012252469160.html