
Revisiting Self-Training for Few-Shot Learning of Language Model,EMNLP2021

2022-06-11 09:34:00 Echo in May


Method

This paper proposes SFLM, a self-training framework for few-shot text learning built on prompt-based masked language models. Given a text sample, SFLM generates a weakly augmented view and a strongly augmented view of the same input, produces a pseudo label on the weakly augmented view, and, when fine-tuning on the strongly augmented view, trains the model to predict that same pseudo label.
The overall model is as follows:
[Figure: overview of the SFLM framework]
First, let $\mathcal{X}$ denote the labeled dataset with $N$ examples per class, and $\mathcal{U}$ the unlabeled training set with $\mu N$ examples per class, where $\mu$ is an integer greater than 1, ensuring that the number of labeled samples is always smaller than the number of unlabeled samples. As shown in the figure, the overall loss of the model consists of three terms:
$$\mathcal{L} = \mathcal{L}_s + \lambda_1 \mathcal{L}_{st} + \lambda_2 \mathcal{L}_{ssl}$$
$\mathcal{L}_s$ is the supervised loss, and the other two terms are self-supervised. There are two kinds of self-supervision: one is an MLM-style objective that predicts the masked-out words of a sentence, namely $\lambda_2 \mathcal{L}_{ssl}$; the other takes the pseudo label generated from the weakly augmented view and forces the prediction on the strongly augmented view of the same sample to agree with it. Since labels are predicted via prompt learning, we first describe the prompt method.
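As a rough illustration of how these three terms are combined, here is a minimal sketch; the weight values and variable names are illustrative assumptions, not the paper's implementation.

```python
import torch

# Illustrative weights for the two auxiliary terms (values are assumptions).
lambda_1, lambda_2 = 1.0, 1.0

def combine_losses(loss_s: torch.Tensor,
                   loss_st: torch.Tensor,
                   loss_ssl: torch.Tensor) -> torch.Tensor:
    """Overall objective: supervised + self-training + masked-LM terms."""
    return loss_s + lambda_1 * loss_st + lambda_2 * loss_ssl

# Example with dummy scalar losses:
total = combine_losses(torch.tensor(0.7), torch.tensor(0.4), torch.tensor(1.2))
```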

Prompt-based supervised loss

Prompt learning builds a template that connects the input text with the label to be predicted. This paper uses a suffix template of the form "It is [MASK]", i.e.:
$$\tilde{x} = x \;\oplus\; \text{"It is [MASK]."}$$
The model then predicts the word at the [MASK] position, which corresponds to, for example, the sentiment of the sentence. A verbalizer mapping $\mathcal{M}$ associates each label word with a class, so predicting [MASK] directly yields the class. Formally:
$$p(y \mid x) = p\big([\text{MASK}] = \mathcal{M}(y) \mid \tilde{x}\big) = \frac{\exp\big(\mathbf{w}_{\mathcal{M}(y)} \cdot \mathbf{h}_{[\text{MASK}]}\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(\mathbf{w}_{\mathcal{M}(y')} \cdot \mathbf{h}_{[\text{MASK}]}\big)}$$
where $\mathbf{h}_{[\text{MASK}]}$ is the hidden state at the [MASK] position and $\mathbf{w}_v$ is the MLM head weight of word $v$.
The model is then fine-tuned with a cross-entropy loss over the predicted label words:
$$\mathcal{L}_s = -\frac{1}{|\mathcal{X}|} \sum_{(x_i, y_i) \in \mathcal{X}} \log p(y_i \mid x_i)$$
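To make this concrete, below is a minimal sketch of prompt-based classification with a masked LM using the HuggingFace `transformers` library. The checkpoint (`bert-base-uncased`), the verbalizer words ("terrible"/"great"), and the exact template wording are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Verbalizer M: map each class to a single label word (illustrative choice).
verbalizer = {0: "terrible", 1: "great"}
label_token_ids = torch.tensor(
    [tokenizer.convert_tokens_to_ids(w) for w in verbalizer.values()]
)

def prompt_logits(sentence: str) -> torch.Tensor:
    """Append the suffix template and read MLM logits at the [MASK] position,
    restricted to the verbalizer words."""
    text = f"{sentence} It is {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    logits = model(**inputs).logits                    # (1, seq_len, vocab)
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return logits[0, mask_pos, label_token_ids]        # (num_classes,)

# Supervised loss L_s: cross-entropy between verbalizer logits and the gold label.
logits = prompt_logits("The movie was a delight.")
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
```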

Self-training loss

For an unlabeled sample $u_i$, the weakly and strongly augmented views are denoted $\alpha(u_i)$ and $\mathcal{A}(u_i)$ respectively. Strong augmentation masks tokens with the same 15% probability used by BERT. Both augmented views then pass through a dropout layer with rate 0.1. The pseudo label is computed in the same way as the prompt-based prediction of label words:
$$q_i = p\big(y \mid \alpha(u_i)\big)$$
To obtain the label corresponding to the distribution $q_i$, we take the arg max:
$$\hat{q}_i = \arg\max_y \, q_i(y)$$
The self-training loss can then be written as:
$$\mathcal{L}_{st} = \frac{1}{|\mathcal{U}|} \sum_{u_i \in \mathcal{U}} \mathbb{1}\big[\max(q_i) \ge \tau\big]\; H\big(\hat{q}_i,\; p(y \mid \mathcal{A}(u_i))\big)$$
Here $\tau$ is a confidence threshold: the indicator $\mathbb{1}[\cdot]$ equals 1 only when the maximum probability exceeds this value. $H$ denotes the cross-entropy between the pseudo label and the LM's prediction on the strongly augmented view.
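Below is a minimal sketch of this self-training step: a 15%-masking strong augmentation and a FixMatch-style consistency loss. The function names and the threshold value are my own illustrative choices; the class probabilities are assumed to come from the prompt-based prediction shown earlier.

```python
import torch
import torch.nn.functional as F

TAU = 0.95        # confidence threshold tau (illustrative value)
MASK_PROB = 0.15  # same token-masking probability as BERT's MLM

def strong_augment(input_ids: torch.Tensor, mask_token_id: int) -> torch.Tensor:
    """Strong augmentation A(u): randomly replace 15% of tokens with [MASK].
    (In practice special tokens and the template's own [MASK] are excluded.)"""
    ids = input_ids.clone()
    noise = torch.rand(ids.shape, device=ids.device) < MASK_PROB
    ids[noise] = mask_token_id
    return ids

def self_training_loss(q_weak: torch.Tensor, p_strong: torch.Tensor) -> torch.Tensor:
    """Consistency loss between views.
    q_weak:   (batch, C) class probabilities from the weakly augmented view
    p_strong: (batch, C) class probabilities from the strongly augmented view"""
    q = q_weak.detach()                            # pseudo labels carry no gradient
    pseudo = q.argmax(dim=-1)                      # hard pseudo label \hat{q}_i
    keep = (q.max(dim=-1).values >= TAU).float()   # keep only confident samples
    ce = F.nll_loss(torch.log(p_strong + 1e-12), pseudo, reduction="none")
    return (keep * ce).mean()
```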

Experiment

The method itself is quite simple, and most of the paper is devoted to experiments. Two kinds of tasks are evaluated, text classification and text-pair matching, with $N=6$ and $\mu=4$. The overall results are as follows:
[Table: overall few-shot results]
Experiments on the parameters $N$ and $\mu$:
[Figure: effect of varying $N$ and $\mu$]
Comparison of different data augmentation strategies:
[Figure: comparison of data augmentation strategies]
Results with another language model, DistilRoBERTa:
[Table: results with DistilRoBERTa]
Cross-dataset zero-shot study:
[Table: cross-dataset zero-shot results]

Copyright notice: this article was written by [Echo in May]; please include a link to the original when reposting: https://yzsam.com/2022/03/202203012252469160.html