Deep Learning (Incremental Learning) -- ICCV 2021: SS-IL: Separated Softmax for Incremental Learning
Preface
This paper tackles catastrophic forgetting in continual learning from the perspective of class imbalance.
When only a small amount of old data is retained, the old and new data are imbalanced, so during training the model pays too much attention to the new data and neglects the old data, which leads to catastrophic forgetting.
This post briefly introduces the method proposed in the paper, describes some interesting experiments, and closes with my own views on the work.
Method
Motivation
The author first introduces two kinds of knowledge distillation losses, GKD and TKD; their differences are shown in the figure below.
Suppose the old model covers 50 classes and the new model covers 60, with each task containing 10 classes. GKD computes its distillation loss as follows:
- Apply softmax to the first 50 logits output by the new model.
- Apply softmax to the 50 logits output by the old model.
- Compute the cross entropy between the results of steps one and two, with the result of step two serving as the target distribution.
TKD computes its distillation loss as follows (a code sketch of both losses is given after this list):
- Split the first 50 logits output by the new model by task, applying softmax to each group of 10 logits, giving 5 groups.
- Split the 50 logits output by the old model the same way, applying softmax to each group of 10 logits, giving 5 groups.
- Compute the cross entropy group by group between the results of steps one and two, with the result of step two serving as the target distribution.
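To make the two losses concrete, here is a minimal PyTorch sketch under the setup above (my own illustration, not the authors' released code); the tensor names and the temperature parameter `T` are assumptions.

```python
import torch
import torch.nn.functional as F

def gkd_loss(new_logits, old_logits, n_old=50, T=2.0):
    """GKD: a single softmax over all old-class logits.

    new_logits: (B, 60) logits from the new model
    old_logits: (B, 50) logits from the old model
    """
    log_p_new = F.log_softmax(new_logits[:, :n_old] / T, dim=1)
    p_old = F.softmax(old_logits / T, dim=1)          # target distribution
    return -(p_old * log_p_new).sum(dim=1).mean()     # cross entropy

def tkd_loss(new_logits, old_logits, task_size=10, n_old=50, T=2.0):
    """TKD: a separate softmax per task, over each group of 10 logits."""
    loss = 0.0
    for start in range(0, n_old, task_size):
        sl = slice(start, start + task_size)
        log_p_new = F.log_softmax(new_logits[:, sl] / T, dim=1)
        p_old = F.softmax(old_logits[:, sl] / T, dim=1)
        loss = loss - (p_old * log_p_new).sum(dim=1).mean()
    return loss
```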
The author points out that GKD aggravates the model's classification bias, as shown in the figure below.
Look at the figure on the right. For example, Current Task = 3 shows the predictions of the model from Task 2 on the training data of Task 3: green indicates the proportion of samples classified into Task 2, and light pink the proportion classified into Task 1. Clearly the model tends to assign samples to the most recently learned task. If GKD is used for knowledge distillation, the model will therefore focus only on the most recently learned old task and neglect earlier tasks.
In the section *Bias caused by ordinary cross-entropy*, the author argues that cross entropy aggravates classification bias and gives a brief explanation from the gradient perspective. I don't quite agree with that account. With one-hot labels, cross entropy is equivalent to maximum likelihood estimation, and maximum likelihood treats categories with more training data as more likely to appear in real scenarios, so the model tends to assign test samples to the categories with more training data. This, in my view, is why cross entropy exhibits classification bias under class imbalance.
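To make the gradient argument concrete, here is the standard derivation (my own sketch, not quoted from the paper). With softmax probability $p_k$ for class $k$ and a one-hot label $y$, the gradient of the cross-entropy loss with respect to logit $z_k$ is

$$\frac{\partial L_{CE}}{\partial z_k} = p_k - y_k .$$

For a training sample from a new class, every old-class logit therefore receives a positive gradient $p_k > 0$ and is pushed down. Since new-class samples dominate an imbalanced batch, the old-class logits are suppressed far more often than they are reinforced.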
Whether it is GKD or plain cross entropy, both apply softmax over all outputs of the classifier, which the author believes may cause the classification bias (quoted below). The author therefore proposes Separated-Softmax.
- Above two observations suggest that the main reason for the prediction bias could be to compute the softmax probability by combining the old and new tasks altogether. Motivated by this, we propose Separated-Softmax for Incremental Learning (SS-IL) in the next section.
Separated-Softmax
The method retains some old data. Suppose there are now 50 old classes and 10 new classes. When an image belongs to an old class, softmax is computed over the first 50 outputs of the classifier; when an image belongs to a new class, softmax is computed over the last 10 outputs (a minimal sketch follows below). In addition, the author uses TKD for knowledge distillation. The author also controls the ratio of old and new classes within each batch, which amounts to re-sampling, but the article does not examine the impact of this trick.
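For concreteness, here is a minimal sketch of how the separated-softmax cross entropy could be computed (my own reconstruction from the description above, not the authors' code); the 50/10 class split and the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def separated_softmax_ce(logits, labels, n_old=50):
    """Cross entropy with softmax restricted to the label's own group.

    logits: (B, 60) classifier outputs (50 old classes + 10 new classes)
    labels: (B,) class indices in [0, 60)
    """
    old_mask = labels < n_old
    loss = logits.new_zeros(())
    if old_mask.any():
        # Old-class samples: softmax over the first 50 logits only.
        loss = loss + F.cross_entropy(logits[old_mask, :n_old],
                                      labels[old_mask], reduction="sum")
    if (~old_mask).any():
        # New-class samples: softmax over the last 10 logits only,
        # with labels shifted into [0, 10).
        loss = loss + F.cross_entropy(logits[~old_mask, n_old:],
                                      labels[~old_mask] - n_old,
                                      reduction="sum")
    return loss / labels.numel()
```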
Experiments

The accuracy in the initial stage should be essentially the same across methods. The author's explanation for the discrepancy is that End2End's hyperparameters are not well tuned and that iCaRL's NME classifier is not good enough. This explanation is not convincing: the initial stage is trained with plain cross entropy, so the hyperparameters should be tuned so that every method fits the initial data to roughly the same degree.
Ablation Experiment 
among L C E − S S L_{CE-SS} LCE−SS That is, the author proposed Separated-Softmax loss, It can be seen that the scheme proposed by the author can indeed slow down the classification preference
Reflections
The motivation and approach of this paper are rather strange. How can applying softmax over all classifier outputs be the cause of classification bias? Moreover, the author splits the classifier outputs into two parts by old and new classes and applies softmax to each part separately, which is somewhat counter-intuitive. For example, suppose there are four classes labeled 1, 2, 3, 4, where classes 1 and 2 are old and classes 3 and 4 are new. When we feed in an image of class 4, how can we guarantee that the logit for class 4 is larger than the logits for classes 1, 2, and 3? Training only guarantees that the logit of class 4 is larger than that of class 3.
Another contribution of the paper is the observation that TKD works better than GKD; judging from the confusion matrices, TKD indeed shows a weaker degree of classification bias than GKD.
Overall, the paper does have some highlights and can offer some inspiration, but I personally don't think it meets the bar for ICCV publication.
In recent years the quality of incremental-learning papers has been rather mediocre; there are even many top-venue papers that tailor the experimental setup to the characteristics of their method. Such papers offer little insight. I prefer papers that tackle incremental learning from a new perspective, even if their performance is not state of the art. What this field lacks is new perspectives, not combinations of tricks that chase benchmark numbers. I hope reviewers will be more tolerant of papers that offer new perspectives, instead of over-emphasizing performance.