MKD [Anomaly Detection: Knowledge Distillation]
2022-07-28 22:41:00 【It's too simple】
Preface
This post covers a CVPR paper from November 2020. Its main idea is knowledge distillation, with the innovation of adding multi-scale (intermediate-layer) distillation so that knowledge is transferred more completely.
Background
Earlier models distilled knowledge only from the last layer's output; this model also distills the outputs of intermediate layers. The deeper the layer, the richer the semantic information in its features, and using only the last layer can make the model converge toward irrelevant regions. Here, "knowledge" is understood as the value (magnitude) and direction of the intermediate-layer activation vectors. For the value, the Euclidean distance is used as the loss function (see "Interesting knowledge" below for an explanation), given in Eq. (1) of the original paper; this loss is applied before the activation function. For the direction, cosine similarity is used as the loss function, given in Eq. (3) of the paper. Using the Euclidean distance alone would let the activation function erase some of the vectors used to extract features; see the example following Eq. (2) in the paper for the explanation.
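As a rough illustration only, here is a minimal sketch of the two per-layer loss terms described above: a Euclidean (MSE) term on the pre-activation feature values and a cosine term on their direction. The function names and the weighting factor lambda_dir are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def value_loss(f_src: torch.Tensor, f_clone: torch.Tensor) -> torch.Tensor:
    """Euclidean "value" term between pre-activation feature maps,
    roughly in the spirit of Eq. (1): mean squared difference per layer."""
    return F.mse_loss(f_clone, f_src)

def direction_loss(f_src: torch.Tensor, f_clone: torch.Tensor) -> torch.Tensor:
    """Directional term, roughly in the spirit of Eq. (3):
    1 - cosine similarity between the flattened feature vectors."""
    v_src = f_src.flatten(start_dim=1)      # (batch, features)
    v_clone = f_clone.flatten(start_dim=1)
    cos = F.cosine_similarity(v_clone, v_src, dim=1)
    return (1.0 - cos).mean()

def distillation_loss(src_feats, clone_feats, lambda_dir=1.0):
    """Total loss summed over the chosen critical (hint) layers."""
    total = 0.0
    for f_s, f_c in zip(src_feats, clone_feats):
        total = total + value_loss(f_s, f_c) + lambda_dir * direction_loss(f_s, f_c)
    return total
```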
VGG networks perform well on classification and transfer tasks, so VGG is chosen here, and the last pooling layer of each convolution block is selected as a critical (hint) layer.

Model principle
During training, the VGG-16 source network is pre-trained on the large natural-image dataset ImageNet. Normal images are then fed into both the VGG source network and the cloner network (which has fewer channels), the feature maps at the critical intermediate layers are compared, and only the cloner network is updated; this completes training. At test time, an anomalous image is fed in and the intermediate-layer vectors of the source and cloner networks are compared. The total loss, the sum of the distance loss and the direction loss, is passed to a gradient-based method (to find the pixels that influence the loss most), combined with Gaussian filtering (to suppress noise) and morphological opening, to produce the segmentation/localization map.
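A minimal sketch of the test-time localization step just described, assuming the distillation_loss from the earlier snippet, frozen source and trained cloner networks, and a hypothetical extract_feats helper that returns the critical-layer feature maps; the gradient of the total loss with respect to the input plays the role of the simplest "Gradients" interpretability method mentioned later, followed by Gaussian filtering and morphological opening. Parameter names and values are assumptions for illustration.

```python
import torch
from scipy.ndimage import gaussian_filter, grey_opening

def localize_anomaly(image, source, cloner, extract_feats, sigma=4, open_size=5):
    """Produce a coarse anomaly-localization map for one input image.

    image: tensor of shape (1, 3, H, W).
    source / cloner: frozen source network and trained cloner network.
    extract_feats: hypothetical helper returning the critical-layer feature maps.
    """
    image = image.clone().requires_grad_(True)
    loss = distillation_loss(extract_feats(source, image),
                             extract_feats(cloner, image))
    loss.backward()

    # Pixel-wise importance: magnitude of the loss gradient, max over channels.
    grad_map = image.grad.detach().abs().max(dim=1).values.squeeze(0).cpu().numpy()

    # Gaussian filtering suppresses gradient noise; morphological opening
    # removes small isolated responses, leaving the anomalous region.
    grad_map = gaussian_filter(grad_map, sigma=sigma)
    grad_map = grey_opening(grad_map, size=(open_size, open_size))
    return grad_map
```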
Put simply, the source network was pre-trained on a large natural-image dataset and is not sensitive to anomalous data, while the cloner network has never seen anomalous images and therefore reacts strongly to them; the discrepancy between the two networks localizes the anomalous region. It is as if the teacher passed on only part of their knowledge to the student: when a variant appears, the teacher knows more and can adapt flexibly, while the student can only stare blankly.
Interesting knowledge
Using bias terms easily produces constant-valued functions: when the network is trained on highly similar normal images, this can prevent convergence and hurt performance (see reference [33] of the original paper for the explanation). The cloner network therefore uses no bias. With biases, a network that should produce similar outputs can simply set the weights of layer l to 0 and adjust the bias of layer l+1.
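A tiny toy illustration of that point (not from the paper): once layer l's weights are zeroed, layer l+1's output is a constant determined entirely by its bias, regardless of the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_l = nn.Linear(8, 8, bias=True)
layer_l1 = nn.Linear(8, 4, bias=True)

# Zero layer l's parameters: its output is identically zero ...
nn.init.zeros_(layer_l.weight)
nn.init.zeros_(layer_l.bias)

x1, x2 = torch.randn(1, 8), torch.randn(1, 8)
y1 = layer_l1(layer_l(x1))
y2 = layer_l1(layer_l(x2))

# ... so layer l+1 returns the same constant (its bias) for any input.
print(torch.allclose(y1, y2))  # True
```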
Euclidean distance (the actual distance between two points): in information or high-dimensional spaces, similarity is often measured by some relative distance or metric, i.e., the distance-based similarity between vectors.

Cosine similarity: judges similarity by the angle between vectors; the closer the cosine value is to 1, the closer the angle is to 0 degrees.
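For reference, the standard textbook definitions behind the two measures above (not copied from the paper):

```latex
% Euclidean distance between vectors x and y in R^n
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

% Cosine similarity: a value close to 1 means the angle is close to 0 degrees
\cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}
```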

Components of the method: the network architecture, the loss function, and the interpretability algorithm (interpreting the loss helps with segmentation and localization). During training, the loss is computed after forward propagation and back-propagated to update the cloner network. At test time, the loss is computed by forward propagation and the interpretability algorithm localizes the anomaly.
Comparison with the S-T network (reference [8] of the original paper): (1) Both use knowledge distillation, but MKD distills features into a cloner network, while S-T uses distillation during pre-training. (2) Both train a pair of networks, but MKD compares intermediate layers, while S-T compares only the last layer's output. (3) Both use a pre-trained network and exploit features learned from natural images. Shortcomings of S-T by comparison: (1) it imitates only the last layer, failing to make full use of the teacher network's knowledge, and complicates the model; (2) it relies on complementary techniques such as self-supervised learning, which increases cost; (3) it depends on the patch size, requiring the image to be split into patches of different sizes, which increases training cost.
Experiments
Ablation experiments
Different numbers of intermediate layers are compared; a cloner with fewer channels is compared against a cloner identical to the source network, where for anomaly images with obvious local defects the network with fewer channels performs better, while for other types of anomalies the results are close; using both loss terms is compared against using each loss term alone; and different interpretability algorithms (Gradients/SmoothGrad/GBP) combined with Gaussian filtering are compared.
Comparative experiments
Different datasets are compared on the same model, and different models are compared on the same dataset; notably, the network performs slightly better when data augmentation is used.