[Out-of-distribution detection] Learning Confidence for Out-of-Distribution Detection in Neural Networks, arXiv '18

2022-06-22 06:55:00 【chad_ lee】

This paper is somewhat like the "Learning Loss" one: it has the flavor of "let an end-to-end DL system solve everything". I need a confidence score to judge whether a sample is OOD data, so I simply have the neural network output a confidence value for the current sample.

Although the paper was not published at a conference, it is highly cited.

Motivation

The authors motivate the model design with an example. Suppose students must answer a series of questions in an exam to earn points. They may ask for hints, but each hint carries a small penalty. In this setting, students should answer on their own the questions they are confident about and ask for hints only on the questions they are unsure about.

At the end of the exam, counting the hints a student used gives an estimate of their confidence on each question. Applying the same strategy to a neural network lets it learn confidence estimates as well.

Model architecture

A "confidence estimation branch" is added to any ordinary classification model. It is attached after the penultimate layer, in parallel with the softmax classification branch, so both branches receive the same input:
$p, c=f(x, \Theta) \quad p_{i}, c \in[0,1], \sum_{i=1}^{M} p_{i}=1$
Here $p$ denotes the class probabilities, obtained through a softmax function, and $c$ denotes the confidence score, obtained through a sigmoid function.
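The two-branch output described above can be sketched in plain Python. This is only an illustration, not the authors' implementation: the function names, the weights, and the use of a single linear layer per branch are assumptions made to keep the example small.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, bias, h):
    # weights holds one row per output unit.
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]

def two_branch_head(h, W_cls, b_cls, w_conf, b_conf):
    """h is the penultimate-layer feature vector shared by both branches."""
    p = softmax(linear(W_cls, b_cls, h))           # class probabilities, sum to 1
    c = sigmoid(linear([w_conf], [b_conf], h)[0])  # scalar confidence in (0, 1)
    return p, c

# Hypothetical features and weights, chosen only for illustration.
h = [0.5, -1.0, 2.0]
W_cls = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.3, 0.3, -0.2]]
b_cls = [0.0, 0.0, 0.0]
w_conf = [0.2, -0.3, 0.1]
b_conf = 0.0

p, c = two_branch_head(h, W_cls, b_cls, w_conf, b_conf)
```

Both branches read the same `h`, which matches the constraint that $p$ and $c$ come from one forward pass.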

To give the model "hints", the final prediction is obtained by interpolating between the original softmax probabilities and the ground-truth label $y$, with the degree of interpolation set by the network's confidence:
$p_{i}^{\prime}=c \cdot p_{i}+(1-c) y_{i}$
The process is shown in the right half of the figure above. During training, the adjusted probabilities are used to compute the loss as in an ordinary classification task:
$\mathcal{L}_{t}=-\sum_{i=1}^{M} \log \left(p_{i}^{\prime}\right) y_{i}$
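As a small worked example of the interpolation and the task loss, the sketch below uses a made-up three-class prediction where the true class only gets probability 0.2. The helper names and the `eps` guard against `log(0)` are assumptions for illustration.

```python
import math

def interpolate(p, y, c):
    # p'_i = c * p_i + (1 - c) * y_i
    return [c * pi + (1.0 - c) * yi for pi, yi in zip(p, y)]

def task_loss(p_adj, y, eps=1e-12):
    # L_t = -sum_i log(p'_i) * y_i ; eps guards against log(0).
    return -sum(yi * math.log(pi + eps) for pi, yi in zip(p_adj, y))

p = [0.7, 0.2, 0.1]  # softmax output; the prediction is wrong
y = [0.0, 1.0, 0.0]  # one-hot label, true class is index 1

no_hint = task_loss(interpolate(p, y, c=1.0), y)    # equals -log(0.2)
half_hint = task_loss(interpolate(p, y, c=0.5), y)  # equals -log(0.6)
full_hint = task_loss(interpolate(p, y, c=0.0), y)  # equals -log(1.0) = 0
```

The lower the confidence, the more of the label leaks into the prediction and the smaller $\mathcal{L}_{t}$ becomes, which is exactly why a penalty on low $c$ is needed.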
To prevent the model from lowering the loss by always choosing $c = 0$ (that is, always asking for the label), a penalty on $c$ is added that pushes $c$ to be as large as possible (the more confident the model, the better). The loss therefore also includes:
$\mathcal{L}_{c}=-\log (c)$
The final loss consists of these two terms, balanced by a weight $\lambda$:
$\mathcal{L}=\mathcal{L}_{t}+\lambda \mathcal{L}_{c}$
Training to lower this loss improves model performance, and the confidence $c$ can then be used to judge whether an input is an OOD sample.
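To see why the combined loss makes $c$ track how well a sample is classified, one can minimize $\mathcal{L}$ over $c$ for a single sample while holding the true-class probability fixed. This is only a thought experiment: in training, $c$ is produced by the network rather than optimized directly, and the value $\lambda = 0.1$ and the grid search are assumptions for illustration.

```python
import math

def total_loss(p_true, c, lam):
    # With a one-hot label only the true-class term of L_t survives:
    # L = -log(c * p_true + (1 - c)) - lam * log(c)
    return -math.log(c * p_true + (1.0 - c)) - lam * math.log(c)

def best_confidence(p_true, lam, steps=1000):
    # Grid search over c in (0, 1]; c = 0 is excluded because -log(0) is infinite.
    grid = [i / steps for i in range(1, steps + 1)]
    return min(grid, key=lambda c: total_loss(p_true, c, lam))

lam = 0.1  # illustrative weight
c_easy = best_confidence(p_true=0.95, lam=lam)  # well-classified sample
c_hard = best_confidence(p_true=0.10, lam=lam)  # poorly classified sample
```

For the well-classified sample the best choice is full confidence, while for the poorly classified one it pays to accept the $-\lambda \log c$ penalty in exchange for a hint, so the optimal $c$ is much smaller.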

Three details

(The idea is nice, but it does not work as is.)

1. The authors found that the model tends to output the same $c$ for every input. To counter this, they set a budget parameter $\beta$: $\lambda$ is increased when $\mathcal{L}_{c}>\beta$ and decreased when $\mathcal{L}_{c}<\beta$.
2. The authors found that the model becomes "lazy" on difficult data. (My understanding is that for hard samples the model first lowers its confidence to reduce the loss, instead of optimizing the first loss term to explore the decision boundary.) So within each training batch, half of the data uses the original loss function and the other half uses the new loss function.
3. Retain misclassified samples. I did not understand the authors' description of this point.
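The first two details can be sketched as follows. This is a minimal illustration under stated assumptions: the multiplicative step `factor=1.01` is not given in the text above, and "half of the batch" is approximated here by an independent coin flip per sample, so the split is half only in expectation. All function names are hypothetical.

```python
import random

def update_lambda(lam, conf_loss, beta, factor=1.01):
    # Steer L_c toward the budget beta; the step factor is an assumed value.
    if conf_loss > beta:
        return lam * factor  # hints are overused: make them more expensive
    if conf_loss < beta:
        return lam / factor  # hints are underused: make them cheaper
    return lam

def apply_hints_to_half(confidences, rng):
    # Keep c with probability 0.5 (hinted loss); otherwise force c = 1,
    # so that p' = p and the sample is trained with the original loss.
    return [c if rng.random() < 0.5 else 1.0 for c in confidences]

rng = random.Random(0)
batch_c = [0.3, 0.8, 0.5, 0.9]
effective_c = apply_hints_to_half(batch_c, rng)

lam = 0.1
lam_up = update_lambda(lam, conf_loss=0.5, beta=0.3)
lam_down = update_lambda(lam, conf_loss=0.1, beta=0.3)
```

Forcing $c = 1$ for part of the batch removes the hint for those samples, so the classifier still has to learn from them directly.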

OOD detection

Once the model is trained, it can be used for OOD detection. Concretely, for an input $x$ with predicted confidence $c$ and a chosen threshold $\delta$, the detector is:
$g(x ; \delta)=\left\{\begin{array}{ll} 1 & \text { if } c \leq \delta \\ 0 & \text { if } c>\delta \end{array}\right.$
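The thresholding rule is a one-liner in code. In the sketch below the confidence values and the threshold `delta = 0.5` are made up for illustration; in practice $\delta$ would have to be chosen on held-out data.

```python
def is_ood(c, delta):
    # Returns 1 (flag as OOD) if c <= delta, otherwise 0 (in-distribution).
    return 1 if c <= delta else 0

confidences = [0.95, 0.40, 0.72, 0.10]  # hypothetical confidence outputs
delta = 0.5                             # hypothetical threshold
flags = [is_ood(c, delta) for c in confidences]
```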

Input preprocessing

(An experimental trick)

This step is inspired by FGSM. FGSM adds a small perturbation to a sample in order to lower its softmax prediction, pushing the sample away from the correct class. Here the direction is reversed: a small perturbation is added that pushes the sample toward the correct class, making the model more "confident" ($c \to 1$). To compute the required perturbation, simply backpropagate the gradients of the confidence loss with respect to the inputs:
$\bar{x}=x-\epsilon \operatorname{sign}\left(\nabla_{x} \mathcal{L}_{c}\right)$
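The perturbation step can be illustrated with a toy confidence model whose gradient is available in closed form. The assumption here is that the confidence is a sigmoid over a single linear layer on the raw input, which is far simpler than a real network (where the gradient would come from backpropagation); the weights, input, and $\epsilon$ are made-up values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def confidence(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad_conf_loss(x, w, b):
    # L_c = -log(sigmoid(w.x + b))  =>  dL_c/dx_i = -(1 - c) * w_i
    c = confidence(x, w, b)
    return [-(1.0 - c) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def preprocess(x, w, b, epsilon):
    # x_bar = x - epsilon * sign(grad_x L_c)
    g = grad_conf_loss(x, w, b)
    return [xi - epsilon * sign(gi) for xi, gi in zip(x, g)]

w, bias = [0.8, -0.5, 0.3], -0.2  # hypothetical confidence-head parameters
x = [0.1, 0.4, -0.3]              # hypothetical input
x_bar = preprocess(x, w, bias, epsilon=0.05)

c_before = confidence(x, w, bias)
c_after = confidence(x_bar, w, bias)
```

Each input component moves by exactly $\epsilon$ in the direction that lowers $\mathcal{L}_{c}$, so the confidence on the perturbed input is higher than on the original one.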
I think this is an unrealistic trick. It is tuned on the ID training samples, but once the model goes online it is not feasible to process new data this way. And isn't newly arriving ID data also "far from" the ID data used for training?