[Contrastive Learning] Understanding the Behaviour of Contrastive Loss (CVPR '21)
2022-07-25 12:00:00 【chad_ lee】
Understanding the Behaviour of Contrastive Loss (CVPR’21)
The temperature coefficient $\tau$ in the contrastive loss is a key parameter, and most papers set $\tau$ to a small value. Starting from an analysis of the temperature parameter $\tau$, this article shows that:
- The contrastive loss automatically mines hard negative samples, which is why it can learn high-quality self-supervised representations. In particular, negatives that are already far away need not be pushed further; the loss mainly acts on negatives that are not yet far away (hard negatives), which makes the representation space more uniform (similar to the red-circle figure below).
- The temperature coefficient $\tau$ controls how aggressively hard negatives are mined: the smaller $\tau$ is, the more the loss focuses on hard negatives.
Hardness-Awareness
The widely used contrastive loss is InfoNCE:
$$\mathcal{L}\left(x_{i}\right)=-\log \left[\frac{\exp \left(s_{i, i} / \tau\right)}{\sum_{k \neq i} \exp \left(s_{i, k} / \tau\right)+\exp \left(s_{i, i} / \tau\right)}\right]$$
This loss requires the similarity $s_{i,i}$ between the $i$-th sample and its augmented (positive) view to be as large as possible, and the similarity $s_{i,k}$ with every other sample (the negatives) to be as small as possible. Many loss functions satisfy this requirement, however; for example, the simplest such function $\mathcal{L}_{\text{simple}}$:
$$\mathcal{L}_{\text{simple}}\left(x_{i}\right)=-s_{i, i}+\lambda \sum_{j \neq i} s_{i, j}$$
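To make the two objectives concrete, here is a minimal PyTorch sketch of both losses for a batch of embeddings; the tensor names (`z`, `z_pos`), the use of in-batch negatives, and the default values of `tau` and `lam` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, tau=0.1):
    """InfoNCE: -log( exp(s_ii / tau) / sum_k exp(s_ik / tau) ).

    z:     (N, d) anchor embeddings
    z_pos: (N, d) embeddings of the augmented (positive) views
    The other N-1 rows of z_pos serve as in-batch negatives for sample i.
    """
    z = F.normalize(z, dim=1)
    z_pos = F.normalize(z_pos, dim=1)
    s = z @ z_pos.t() / tau                          # (N, N) scaled similarities
    targets = torch.arange(z.size(0), device=z.device)
    # Cross-entropy with target i reproduces the -log softmax form above.
    return F.cross_entropy(s, targets)

def simple_loss(z, z_pos, lam=1.0):
    """L_simple = -s_ii + lam * sum_{j != i} s_ij, averaged over the batch."""
    z = F.normalize(z, dim=1)
    z_pos = F.normalize(z_pos, dim=1)
    s = z @ z_pos.t()                                # (N, N) similarities
    pos = s.diag()                                   # s_{i,i}
    neg_sum = s.sum(dim=1) - pos                     # sum over j != i of s_{i,j}
    return (-pos + lam * neg_sum).mean()
```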
But the simple loss trains much worse than the contrastive loss:
| Dataset | Contrastive Loss | Simple Loss |
|---|---|---|
| CIFAR-10 | 79.75 | 74 |
| CIFAR-100 | 51.82 | 49 |
| ImageNet-100 | 71.53 | 74.31 |
| SVHN | 92.55 | 94.99 |
This is because the simple loss penalizes every negative-sample similarity with the same weight: $\frac{\partial \mathcal{L}_{\text{simple}}}{\partial s_{i, k}}=\lambda$, i.e., the gradient of the loss with respect to every negative similarity is identical. The contrastive loss, in contrast, automatically penalizes negatives with higher similarity more strongly:
$$\text{Gradient w.r.t. the positive pair:}\quad \frac{\partial \mathcal{L}\left(x_{i}\right)}{\partial s_{i, i}}=-\frac{1}{\tau} \sum_{k \neq i} P_{i, k}$$

$$\text{Gradient w.r.t. a negative pair:}\quad \frac{\partial \mathcal{L}\left(x_{i}\right)}{\partial s_{i, j}}=\frac{1}{\tau} P_{i, j}$$
where $P_{i, j}=\frac{\exp \left(s_{i, j} / \tau\right)}{\sum_{k \neq i} \exp \left(s_{i, k} / \tau\right)+\exp \left(s_{i, i} / \tau\right)}$. For all negatives of a given anchor, the denominator of $P_{i, j}$ is identical, so the larger $s_{i, j}$ is, the larger the gradient on that negative, and the harder that negative is pushed away from the anchor (much like focal loss: the harder the example, the larger the gradient). This encourages all samples to spread out uniformly on a hypersphere.
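A quick autograd check of these gradients (a sketch with made-up similarity values for a single anchor) shows the hardness-aware weighting directly: each negative similarity receives gradient $P_{i,j}/\tau$, so the most similar negative dominates, whereas under the simple loss every negative would receive the same weight $\lambda$.

```python
import torch

tau = 0.1
s_pos = torch.tensor(0.9, requires_grad=True)               # s_{i,i}
s_neg = torch.tensor([0.8, 0.3, -0.2], requires_grad=True)  # made-up negatives

logits = torch.cat([s_pos.view(1), s_neg]) / tau
loss = -torch.log_softmax(logits, dim=0)[0]                  # InfoNCE for this anchor
loss.backward()

# Gradient on each negative is P_{i,j} / tau: the hardest negative (0.8)
# gets a gradient orders of magnitude larger than the easy ones.
print(s_neg.grad)
# Gradient on the positive is -(1/tau) * sum_j P_{i,j}.
print(s_pos.grad)
```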
To verify that the advantage of the contrastive loss really does come from mining hard negatives, the paper adds explicit hard negatives to the simple loss (selecting 4096 hard negatives for each sample), which improves its performance (a sketch of this selection follows the table below):
| Dataset | Contrastive Loss | Simple Loss + Hard |
|---|---|---|
| CIFAR-10 | 79.75 | 84.84 |
| CIFAR-100 | 51.82 | 55.71 |
| ImageNet-100 | 71.53 | 74.31 |
| SVHN | 92.55 | 94.99 |
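As a reference, here is a minimal sketch of what such hard-negative selection could look like; the memory-bank layout and the helper name `simple_loss_hard` are assumptions for illustration, with only `k = 4096` taken from the description above.

```python
import torch
import torch.nn.functional as F

def simple_loss_hard(z, z_pos, bank, k=4096, lam=1.0):
    """Simple loss whose negative term sums only over the k most similar
    (hardest) candidates from a memory bank, instead of all negatives."""
    z = F.normalize(z, dim=1)            # (N, d) anchors
    z_pos = F.normalize(z_pos, dim=1)    # (N, d) positive views
    bank = F.normalize(bank, dim=1)      # (M, d) candidate negatives, M >= k

    pos = (z * z_pos).sum(dim=1)         # s_{i,i}
    s_neg = z @ bank.t()                 # (N, M) anchor-to-bank similarities
    hard, _ = s_neg.topk(k, dim=1)       # keep the k highest similarities per anchor
    return (-pos + lam * hard.sum(dim=1)).mean()
```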
The temperature coefficient $\tau$ controls the degree of hard-negative mining
The smaller the temperature coefficient $\tau$, the more the loss focuses on hard negative samples. Specifically:
As $\tau$ approaches $0^{+}$, the contrastive loss degenerates into a loss that attends only to the hardest negative:
$$\lim _{\tau \rightarrow 0^{+}} \mathcal{L}\left(x_{i}\right)=\frac{1}{\tau} \max \left[s_{\max }-s_{i, i},\, 0\right], \qquad s_{\max }=\max _{k \neq i} s_{i, k}$$
This means the negatives are pushed away one by one, until each negative sits at the same distance from the anchor:

When $\tau$ approaches infinity, the contrastive loss almost degenerates into the simple loss: all negatives receive the same weight.
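Both limits can be seen numerically. The sketch below (with made-up similarities) prints each negative's share of the gradient, i.e. $P_{i,j}$ normalized over the negatives, at several temperatures: at small $\tau$ nearly all weight falls on the hardest negative, and at large $\tau$ the weights become almost uniform, as in the simple loss.

```python
import torch

s_pos = torch.tensor([0.9])
s_neg = torch.tensor([0.8, 0.5, 0.1, -0.3])   # made-up negative similarities

for tau in (0.05, 0.5, 5.0):
    logits = torch.cat([s_pos, s_neg]) / tau
    p = torch.softmax(logits, dim=0)
    weights = p[1:] / p[1:].sum()             # each negative's share of the gradient
    print(tau, [round(w, 3) for w in weights.tolist()])
# Small tau: almost all weight on the 0.8 negative; large tau: nearly uniform.
```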
So the smaller the temperature coefficient $\tau$, the more uniform the distribution of the learned features. But this is not purely a good thing, because potential positives (false negatives) are pushed away as well:
