2022-07-04 23:17:00 Zhiyuan community

DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning

Paper: https://arxiv.org/abs/2104.09124
Code (open source): https://github.com/Yuting-Gao/DisCo-pytorch



Self-supervised learning usually means that a model learns general representations from large-scale unlabeled data and then transfers them to related downstream tasks. Because the learned general representations can significantly improve downstream performance, self-supervised learning is widely used in many scenarios. Generally speaking, the larger the model capacity, the better self-supervised learning works [1,2]. Conversely, lightweight models (EfficientNet-B0, MobileNet-V3, EfficientNet-B1) benefit far less from self-supervised learning than relatively large-capacity models (ResNet50/101/152/50*2).

At present, the main way to improve lightweight models in self-supervised learning is distillation: transferring the knowledge of a larger-capacity model to the student model. SEED [2] is built on the MoCo-V2 framework [3,4]. It uses a large-capacity model as the Teacher and a lightweight model as the Student, shares the negative sample space (the queue) of MoCo-V2, and uses cross entropy to force the distribution of a positive sample against the same negative samples to be as similar as possible in the Student and Teacher spaces. CompRess [1] additionally tried letting the Teacher and the Student maintain their own negative sample spaces, and uses KL divergence to pull the two distributions together. Both methods transfer the Teacher's knowledge to the Student effectively and thereby improve the lightweight Student model (this article uses "Student" and "lightweight model" interchangeably).
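The shared-queue idea described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not the SEED or CompRess code: the function names, the temperature values, and the choice to prepend the Teacher's own embedding as the positive anchor are all hypothetical. The KL divergence variant differs from this cross entropy only by the Teacher's entropy, which is constant with respect to the Student.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale every row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def shared_queue_distillation_loss(student_emb, teacher_emb, queue,
                                   t_student=0.2, t_teacher=0.07):
    """Cross entropy between Teacher and Student similarity distributions.

    student_emb, teacher_emb: (batch, dim) embeddings of the same images.
    queue: (num_negatives, dim) shared negative samples.
    The temperatures are assumed values chosen for illustration.
    """
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    q = l2_normalize(queue)
    # Column 0 is the similarity to the Teacher embedding of the same image
    # (the positive); the remaining columns are similarities to the queue.
    logits_s = np.concatenate([np.sum(s * t, axis=1, keepdims=True), s @ q.T], axis=1)
    logits_t = np.concatenate([np.sum(t * t, axis=1, keepdims=True), t @ q.T], axis=1)
    p_teacher = np.exp(log_softmax(logits_t / t_teacher))
    log_p_student = log_softmax(logits_s / t_student)
    return float(-(p_teacher * log_p_student).sum(axis=1).mean())
```

With equal temperatures, the loss is smallest when the Student reproduces the Teacher's embeddings exactly, which is the sense in which the two distributions are forced to match.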

This paper proposes Distilled Contrastive Learning (DisCo), a simple and effective distillation-based self-supervised learning method for lightweight models. It significantly improves the Student, and some lightweight models come very close to the Teacher's performance. The method is based on the following observations:

  1. In distillation based on self-supervised learning, the last-layer representation contains both the global absolute position and the local relative position of different samples in the whole representation space, and the Teacher encodes this information better than the Student. Simply pulling the last-layer representations of the Teacher and the Student closer together may therefore be the best choice.
  2. In CompRess [1], the gap between the Teacher and Student sharing one negative sample queue (1q) and each keeping its own negative sample queue (2q) is within 1%. When transferred to the downstream datasets CUB200 and Car196, keeping separate queues can even significantly outperform sharing a queue. This indicates that the Student does not learn sufficiently useful knowledge from the shared negative sample space, so the Student does not need to rely on the Teacher's negative sample space.
  3. One benefit of abandoning the shared queue is that the overall framework no longer depends on MoCo-V2 and becomes more concise. The Teacher and Student models can be combined with other self-supervised / unsupervised representation learning methods that are more effective than MoCo-V2, further improving the final performance of the distilled lightweight model.

In current self-supervised methods, the low dimension of the MLP hidden layer may be a bottleneck for distillation performance. Increasing the hidden-layer dimension of this MLP during the self-supervised learning and distillation stages further improves the final distilled lightweight model, and it adds no extra cost at deployment. Increasing the hidden-layer dimension from 512 to 2048 improves ResNet-18 by a significant 3.5%.
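To make the cost argument concrete, here is a minimal NumPy sketch of a two-layer MLP head with a configurable hidden dimension. The class name, the 128-dimensional output, and the weight initialization are assumptions for illustration; only the 512 and 2048 hidden sizes and the ResNet-18 setting come from the text. ResNet-18 produces 512-dimensional features before the head.

```python
import numpy as np

class ProjectionHead:
    """Hypothetical two-layer MLP head: Linear -> ReLU -> Linear."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, features):
        # features: (batch, in_dim) backbone outputs.
        hidden = np.maximum(features @ self.w1 + self.b1, 0.0)
        return hidden @ self.w2 + self.b2

    def num_params(self):
        return sum(p.size for p in (self.w1, self.b1, self.w2, self.b2))

# Assumed sizes: 512-d backbone features, 128-d embedding.
narrow_head = ProjectionHead(in_dim=512, hidden_dim=512, out_dim=128)
wide_head = ProjectionHead(in_dim=512, hidden_dim=2048, out_dim=128)
print(narrow_head.num_params(), wide_head.num_params())
```

The wider head has roughly four times as many parameters, but the head is only needed while training and distilling. Downstream use keeps the backbone and discards the head, which is why the deployed model costs the same in both cases.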



This paper proposes a simple but effective framework, Distilled Contrastive Learning (DisCo). While the Student performs self-supervised learning, it also learns the representation of the same sample in the Teacher's representation space.

DisCo Framework

As shown in the figure above, data augmentation generates two views of each image. In addition to the Student's own self-supervised learning, a Teacher model pretrained with self-supervised learning is introduced. For the same view of the same sample, the final representations produced by the Student and by the fixed-parameter Teacher are required to be consistent. In the main experiments of this paper, the self-supervised learning is based on MoCo-V2 (contrastive learning), and the similarity between the Teacher's and the Student's output representations of the same sample is enforced through consistency regularization. This paper uses mean squared error so that the Student learns the distribution of each sample in the corresponding Teacher space.
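The combined objective described above can be sketched as a contrastive term plus a mean squared error consistency term. This is a rough illustration and not the released DisCo-pytorch code: the function names, the temperature of 0.2, the L2 normalization before the MSE, and the equal weighting of the two terms are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale every row to unit length.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(query, key, queue, temperature=0.2):
    """MoCo-style contrastive loss: the key is the positive, the queue holds negatives."""
    q, k, n = l2_normalize(query), l2_normalize(key), l2_normalize(queue)
    positive = np.sum(q * k, axis=1, keepdims=True)            # (batch, 1)
    logits = np.concatenate([positive, q @ n.T], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob_positive = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_positive.mean())

def consistency_mse(student_emb, teacher_emb):
    """Mean squared error between normalized Student and Teacher embeddings."""
    return float(np.mean((l2_normalize(student_emb) - l2_normalize(teacher_emb)) ** 2))

def disco_loss(student_v1, student_v2, key_v2, teacher_v1, teacher_v2, queue, weight=1.0):
    """Contrastive loss for the Student plus per-view consistency with the frozen Teacher.

    student_v1, student_v2: Student embeddings of the two augmented views.
    key_v2: embedding of view 2 from the Student's momentum (key) encoder.
    teacher_v1, teacher_v2: Teacher embeddings of the same two views (no gradient).
    """
    contrastive = info_nce(student_v1, key_v2, queue)
    distill = consistency_mse(student_v1, teacher_v1) + consistency_mse(student_v2, teacher_v2)
    return contrastive + weight * distill
```

Note that the consistency term never touches the negative queue: the Teacher only supplies a target embedding for each view, so the contrastive part could be swapped for a different self-supervised objective without changing the distillation term.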


