Transformer for Anomaly Detection - InTra: "Inpainting Transformer for Anomaly Detection"
2022-07-28 19:26:00 【I'm Mr. rhubarb】
Paper link
https://arxiv.org/pdf/2104.13897v1.pdf
Paper reading approach
First Impressions
Reconstruction-based anomaly detection methods such as GANs and autoencoders (AEs) share a drawback: they often reconstruct anomalous samples well too, which causes detection errors. Some recent methods therefore recast generative reconstruction as an inpainting problem for anomaly detection: parts of the image are covered and then recovered, which can also be viewed as a self-supervised approach.
For this kind of inpainting problem, capturing long-range semantic information from a larger region helps reconstruct the covered area. CNNs, however, are limited by their receptive field and are not good at capturing long-range information. Inspired by the recent surge of Transformers in vision, the authors adopt a Transformer architecture to solve this problem. As shown in figure (a), during training the image is cut into equal-sized patches, and a covered patch is inpainted using the other patches in a larger region. Figure (b) shows the reconstruction results and the anomaly score map obtained from the pixel-level error.
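To make the patch setup concrete, here is a minimal sketch (ours, not the authors' code) of cutting an image into equal-sized blocks with PyTorch; the image size and channel count are placeholder values.

```python
import torch
import torch.nn.functional as F

K = 16                                 # patch side length in pixels (paper's value)
img = torch.rand(1, 3, 256, 256)       # dummy image; size/channels are placeholders

# Cut the image into non-overlapping K x K patches:
# (1, 3, 256, 256) -> (1, C*K*K, N*M) -> (1, N*M, C*K*K), with N = M = 256 / 16 = 16.
patches = F.unfold(img, kernel_size=K, stride=K).transpose(1, 2)
print(patches.shape)                   # torch.Size([1, 256, 768])
```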

Notably, the authors train only on the small number of samples in the MVTec AD dataset itself and still achieve state-of-the-art results.
A Closer Look
2. Related Work
Current anomaly detection/segmentation methods fall mainly into two categories: reconstruction-based methods such as AE, GAN, and VAE, and embedding-based methods, which mostly extract discriminative features with an ImageNet-pretrained CNN for comparison.
The section also reviews related work on inpainting and Transformers.
3. Inpainting Transformer for Anomaly Detection
A Transformer is trained on the inpainting task. At test time, the image is reconstructed in the same inpainting manner, and the difference between the input image and its reconstruction yields the detection result.
3.1 Embedding Patches and Positions
As shown in figure (a) above, the method performs inpainting within a square window of side length L (in patches), rather than over the whole image as in ViT. Two position encoding schemes are used: local encoding, shown on the left of the figure below, and global encoding, shown on the right.

Why are both encoding schemes needed? Intuitively, for texture images (left) the global position of a patch within the image is irrelevant, while for object categories (right) it matters.
Similar to the setup in ViT, the position embeddings are D-dimensional; each image patch is also mapped to D dimensions, and the two are added together. Note that one patch P(t, u) in the window is covered (masked); the paper treats it analogously to the class token in ViT:

This finally yields an L×L sequence of D-dimensional tokens, ready to be fed into the subsequent Transformer.
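A minimal sketch of this embedding step as we read it (not the official implementation); the initialization details and interface are our assumptions.

```python
import torch
import torch.nn as nn

class InTraEmbedding(nn.Module):
    """Sketch of Sec. 3.1: project each K*K*C patch to D dims and add either
    a local (within-window) or global (whole-grid) position embedding; the
    covered patch P(t, u) is replaced by a learned token, analogous to ViT's
    class token. Details here are our assumptions."""
    def __init__(self, K=16, C=3, D=512, L=7, N=16, M=16, mode="global"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(K * K * C, D)                      # patch -> D
        self.pos_local = nn.Parameter(torch.randn(L * L, D) * 0.02)
        self.pos_global = nn.Parameter(torch.randn(N * M, D) * 0.02)
        self.mask_token = nn.Parameter(torch.zeros(D))           # covered patch

    def forward(self, patches, grid_idx, masked_slot):
        # patches: (B, L*L, K*K*C); grid_idx: (L*L,) flat positions of the
        # window's patches in the full N x M grid; masked_slot: index of the
        # covered patch within the window.
        x = self.proj(patches)
        x[:, masked_slot] = self.mask_token                      # hide the target
        pos = self.pos_local if self.mode == "local" else self.pos_global[grid_idx]
        return x + pos                                           # (B, L*L, D)
```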

3.2 Multihead Feature Self-Attention
In the original MSA module, q and k are kept at D dimensions. In the authors' task, however, the patches within a training image are often very similar, which makes the computed attention weights nearly uniform. The authors therefore slightly modify the Transformer's multi-head attention module: when computing q and k, an MLP performs a nonlinear dimensionality reduction (to D/2). They call this MFSA (multihead feature self-attention).
The MLP has a hidden dimension of 2D, i.e., the mapping is D → 2D → D/2.

This accelerates model convergence and improves accuracy, but it also increases the parameter count.
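A sketch of one MFSA layer under these assumptions (the activation function and head layout are not specified in the post, so GELU and a standard multi-head split are our choices):

```python
import torch
import torch.nn as nn

class MFSA(nn.Module):
    """Sketch of multihead feature self-attention (Sec. 3.2), as we read it:
    q and k pass through a small MLP (D -> 2D -> D/2) instead of a plain
    D -> D linear map, so near-identical patch features no longer produce
    almost-uniform attention weights."""
    def __init__(self, D=512, heads=8):
        super().__init__()
        assert D % (2 * heads) == 0
        self.h, self.dk, self.dv = heads, D // 2 // heads, D // heads
        def mlp():  # nonlinear reduction D -> 2D -> D/2
            return nn.Sequential(nn.Linear(D, 2 * D), nn.GELU(), nn.Linear(2 * D, D // 2))
        self.q_mlp, self.k_mlp = mlp(), mlp()
        self.v = nn.Linear(D, D)
        self.out = nn.Linear(D, D)

    def forward(self, x):                                  # x: (B, T, D)
        B, T, _ = x.shape
        q = self.q_mlp(x).view(B, T, self.h, self.dk).transpose(1, 2)
        k = self.k_mlp(x).view(B, T, self.h, self.dk).transpose(1, 2)
        v = self.v(x).view(B, T, self.h, self.dv).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)   # concat heads
        return self.out(y)
```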
3.3 Network Architecture
Finally, the overall network architecture is shown below. The left of the figure shows a single Transformer block; the input and output of each block are L²×D. The output of the last block is averaged over the sequence (giving a D-dimensional vector), then mapped to the inpainting result (K²·C).
Alternatively, the first output token of the last layer can be linearly mapped directly, which is similar to ViT.
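The output path can be sketched as follows (a minimal reading of Sec. 3.3; layer details are placeholders, not the official implementation):

```python
import torch.nn as nn

class InTraHead(nn.Module):
    """Sketch of the output path: the last block emits an L^2 x D sequence;
    averaging over the sequence gives a D vector, which is mapped to the
    K*K*C inpainted patch."""
    def __init__(self, D=512, K=16, C=3):
        super().__init__()
        self.to_patch = nn.Linear(D, K * K * C)
        self.K, self.C = K, C

    def forward(self, tokens):            # tokens: (B, L*L, D) from the last block
        pooled = tokens.mean(dim=1)       # average over the sequence -> (B, D)
        patch = self.to_patch(pooled)     # -> (B, K*K*C)
        return patch.view(-1, self.C, self.K, self.K)
```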

3.4 Training
Randomly select a window of size L, mask one patch inside the window, and feed all patches in the window into the Transformer to perform the inpainting task.
The loss function is a pixel-level L2 loss, combined with SSIM and GMS losses.
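A hedged sketch of such a combined loss: the equal weighting, the GMS constant, and the use of the third-party pytorch_msssim package for SSIM are all our assumptions, not details from the post.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # third-party package, assumed available

def grad_mag(x):
    """Gradient magnitude of a (B, C, H, W) image via Prewitt filters."""
    gray = x.mean(dim=1, keepdim=True)
    hx = (torch.tensor([[1., 0., -1.]] * 3) / 3.0).view(1, 1, 3, 3).to(x)
    gx = F.conv2d(gray, hx, padding=1)
    gy = F.conv2d(gray, hx.transpose(-2, -1), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def gms(x, y, c=0.0026):
    """Gradient magnitude similarity map; c is a stabilizing constant."""
    gx, gy = grad_mag(x), grad_mag(y)
    return (2 * gx * gy + c) / (gx ** 2 + gy ** 2 + c)

def intra_loss(pred, target):
    """Pixel-level L2 plus SSIM and GMS terms; equal weights are our assumption."""
    l2 = F.mse_loss(pred, target)
    l_ssim = 1.0 - ssim(pred, target, data_range=1.0)
    l_gms = (1.0 - gms(pred, target)).mean()
    return l2 + l_ssim + l_gms
```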
3.5 Inference and Anomaly Detection
First, a pixel-level anomaly score map is computed from the difference between the reconstructed image and the original; the largest value in the map is then taken as the image-level detection score.
Consistent with training, the test image is divided into N×M patches. For the patch at position (t, u), a surrounding window of size L is selected by the formula below, where (r, s) is the coordinate of the window's top-left corner.

The rule keeps the target patch as close to the center of the window as possible.
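Since the formula itself is shown only as an image above, here is our reading of that clamping rule as code:

```python
def window_origin(t, u, N, M, L=7):
    """Top-left corner (r, s) of the L x L window around patch (t, u),
    clamped to the N x M patch grid. This is our interpretation of
    'keep the patch as close to the window's center as possible'."""
    r = min(max(t - L // 2, 0), N - L)
    s = min(max(u - L // 2, 0), M - L)
    return r, s
```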
Finally, inpainting all N×M patches yields a reconstruction of the whole image. Notably, when computing the anomaly score map the authors do not use the L2 distance; instead they adopt a GMS-based method: gradient magnitude similarity is computed at the {1/2, 1/4} scales, followed by mean filtering and Gaussian filtering.

This yields anomaly maps m1 and m2 at the two scales; both are restored to the original image size, and their pixel-wise mean is taken as the anomaly score map.
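A sketch of this multi-scale GMS score map, reusing the same Prewitt-based gradient magnitude as in the loss sketch above; the mean-filter size is an assumption, and the Gaussian smoothing step described in the paper is omitted here.

```python
import torch
import torch.nn.functional as F

def grad_mag(x):
    """Gradient magnitude of a (B, C, H, W) image via Prewitt filters."""
    gray = x.mean(dim=1, keepdim=True)
    hx = (torch.tensor([[1., 0., -1.]] * 3) / 3.0).view(1, 1, 3, 3).to(x)
    gx = F.conv2d(gray, hx, padding=1)
    gy = F.conv2d(gray, hx.transpose(-2, -1), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def anomaly_map(orig, recon, scales=(2, 4), c=0.0026):
    """1 - GMS at 1/2 and 1/4 scale, mean-filtered, upsampled to full size,
    then averaged pixel-wise (m1 and m2 in the text).
    (The paper's Gaussian smoothing step is omitted in this sketch.)"""
    maps = []
    for s in scales:
        o, r = F.avg_pool2d(orig, s), F.avg_pool2d(recon, s)
        go, gr = grad_mag(o), grad_mag(r)
        m = 1.0 - (2 * go * gr + c) / (go ** 2 + gr ** 2 + c)
        m = F.avg_pool2d(m, 3, stride=1, padding=1)          # mean filter
        maps.append(F.interpolate(m, size=orig.shape[-2:],
                                  mode="bilinear", align_corners=False))
    return torch.stack(maps).mean(dim=0)                     # mean of m1, m2
```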
The anomaly score map is further refined: the pixel-wise mean anomaly score map computed over the training set T (which contains only normal samples) is subtracted from it, and the result is squared.

Finally, the largest pixel score in the score map is selected as the final image-level score.
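Putting the last two steps together, a minimal sketch of the final scoring as we read it:

```python
def image_score(score_map, train_mean_map):
    """Subtract the pixel-wise mean anomaly map computed over the all-normal
    training set T, square, then take the maximum pixel per image."""
    refined = (score_map - train_mean_map) ** 2      # (B, 1, H, W)
    return refined.flatten(1).max(dim=1).values      # one score per image
```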
4. Experiments
See the paper for the full results; some implementation details follow.
10% of the training images are randomly selected as a validation set (at most 20 images), used to monitor reconstruction quality. Within each epoch, 600 windows are randomly sampled from each image for training; random rotations and flips are used as augmentation.
The patch size is K=16 and the window size is L=7. Different image sizes are chosen for different MVTec categories, from {256×256, 320×320, 512×512}. The Transformer dimension D is set to 512. All resize operations use bilinear interpolation.
The optimizer is Adam. Transformer training takes relatively long, sometimes exceeding 500 epochs. Training stops when the validation loss shows no clear decrease for 50 epochs, and the best model is then selected for evaluation.
Looking Back
The main innovation of the paper is introducing the Transformer into the anomaly detection field via the inpainting task, with local and global position embedding schemes designed for different situations. It also tries a U-Net framework (though the ablation study shows it does not help all categories) and modifies MSA into MFSA, but these are minor modifications rather than major improvements.
Overall, inpainting is still reconstruction-based, and some noisy regions remain hard to reconstruct, for example regions containing text. Moreover, the method is patch-based, so each test sample requires many forward passes of the model to produce the final anomaly score map. Clear boundary artifacts can also be observed, as shown below:

Code
At the time of writing, no official code has been released for the paper.