
The 'best of both worlds' in recurrent + Transformer video restoration

2022-06-12 16:23:00 I love computer vision

From the official account: discovering the beauty of CV technology

This article shares the paper 『Recurrent Video Restoration Transformer with Guided Deformable Attention』, another masterpiece from Jingyun Liang following SwinIR and VRT. It applies a recurrent architecture within a Transformer structure (a direction this author has also been researching, but the big shot is just too fast) and moves from frame-level alignment to clip-level alignment. RVRT surpasses VRT on Vid4 and surpasses BasicVSR++ on REDS4!

The details are as follows :


  • Affiliations: ETH Zurich, Meta, University of Würzburg

  • Paper link: https://arxiv.org/pdf/2206.02146.pdf

  • Project link: https://github.com/JingyunLiang/RVRT

      01      

Highlights

There are two main approaches to video restoration:

  1. Parallel methods restore all frames at once; they benefit from temporal information fusion, but model size and memory consumption are large.

  2. Recurrent methods process frames one by one; parameters are shared across frames, so the model is small, but they lack long-term modeling ability and parallelism.

The paper proposes the Recurrent Video Restoration Transformer (RVRT) to combine the advantages of both: it processes local neighboring frames in parallel within a globally recurrent framework, achieving a good trade-off between model size, effectiveness, and efficiency. The main contributions are as follows:

  • RVRT divides the video into multiple clips and uses previously inferred clip features to estimate the features of the subsequent clip (see the sketch after this list). By reducing the length of the video sequence and passing information in a larger hidden state, it alleviates information loss and noise amplification in recurrent networks, and also allows the model to be partially parallelized.

  • Guided deformable attention (GDA) predicts multiple relevant locations from the whole inferred clip, then aggregates their features with an attention mechanism to align the clips.

  • It achieves SOTA on multiple benchmark datasets for video super-resolution, denoising, and deblurring.
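To make the clip mechanism concrete, here is a minimal PyTorch sketch (with made-up shapes, not the authors' code) of splitting a video feature of T frames into T/N clips, so the recurrence runs over T/N steps while the N frames inside each clip are processed in parallel:

```python
import torch

T, N, C, H, W = 8, 2, 64, 64, 64            # T must be divisible by N
video_feat = torch.randn(T, C, H, W)         # per-frame features

# (num_clips, N, C, H, W): the loop runs over num_clips = T/N steps,
# and the N frames inside each clip are processed together
clips = video_feat.view(T // N, N, C, H, W)
print(clips.shape)                           # torch.Size([4, 2, 64, 64, 64])
```

With N = 1 this degenerates to a standard frame-by-frame recurrent model; larger N shortens the recurrence and enlarges the hidden state passed between steps.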

      02      

Method

Overview

The framework is shown in the figure below. The model consists of three parts: shallow feature extraction, recurrent feature refinement, and frame reconstruction. Shallow feature extraction uses a convolution layer and several RSTB blocks from SwinIR to extract features from the low-quality (LQ) video. The recurrent feature refinement module then performs temporal modeling, using guided deformable attention for video alignment. Finally, several RSTB blocks generate the final features, and the HQ frames are reconstructed via pixel shuffle.

[Figure: overall RVRT architecture]
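The following is a high-level PyTorch sketch of that three-stage pipeline. The module names and shapes are assumptions for illustration, and the recurrent refinement stage is reduced to a placeholder; the real implementation (with RSTBs and GDA) is at https://github.com/JingyunLiang/RVRT:

```python
import torch
import torch.nn as nn

class RVRTSketch(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, scale=4):
        super().__init__()
        # 1) shallow feature extraction (a single conv here; conv + RSTBs in the paper)
        self.shallow = nn.Conv2d(in_ch, feat_ch, 3, padding=1)
        # 2) stand-in for the recurrent feature refinement stage (RFR + GDA)
        self.refine = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        # 3) reconstruction: conv to scale^2 * C channels, then pixel shuffle
        self.upconv = nn.Conv2d(feat_ch, in_ch * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, lq):                      # lq: (T, C, H, W) LQ frames
        feat = self.shallow(lq)                 # shallow per-frame features
        feat = feat + self.refine(feat)         # placeholder for recurrent refinement
        return self.shuffle(self.upconv(feat))  # (T, C, H*scale, W*scale)

hq = RVRTSketch()(torch.randn(4, 3, 32, 32))
print(hq.shape)                                 # torch.Size([4, 3, 128, 128])
```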

Recurrent feature refinement

The paper stacks $L$ recurrent feature refinement modules and refines the video features by exploiting the temporal correlation between different frames. Given the video feature of layer $\lambda-1$, it is first divided into clip features, each containing $N$ neighboring frame features $F_t^{\lambda-1} \in \mathbb{R}^{N \times H \times W \times C}$. The aligned clip feature is computed as:

$$\hat{F}_{t-1}^{\lambda} = \mathrm{GDA}\big(F_t^{\lambda-1},\, F_{t-1}^{\lambda},\, O_{t-1\to t}\big)$$

where $O_{t-1\to t}$ is the optical flow from clip $t-1$ to clip $t$. The current clip feature is then computed as:

$$F_t^{\lambda} = \mathrm{RFR}\big(F_t^{\lambda-1},\, \hat{F}_{t-1}^{\lambda}\big)$$

where $F_t^{0}$ is the output of shallow feature extraction and $\mathrm{RFR}(\cdot)$ is the recurrent feature refinement module. As shown on the right of the figure, it consists of a convolution layer for feature fusion and several MRSTBs, modified from SwinIR's RSTB, for feature refinement. MRSTB upgrades the original two-dimensional $h \times w$ attention window to a three-dimensional $N \times h \times w$ one, which lets every frame in a clip attend to itself and the other frames simultaneously for implicit feature aggregation. In addition, the video sequence is reversed to obtain backward information.

[Figure: the recurrent feature refinement module]
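A minimal sketch of how the two equations above chain together across clips, with hypothetical `gda` and `mrstb` callables standing in for guided deformable attention and the MRSTB blocks (toy stand-ins are provided so the sketch runs end to end):

```python
import torch

def rfr_layer(clips, gda, mrstb):
    """One refinement layer over a list of (N, C, H, W) clip features."""
    refined, prev = [], None
    for clip in clips:
        # align the previously refined clip to the current one (GDA in the paper);
        # the first clip has no history, so use zeros
        aligned = gda(clip, prev) if prev is not None else torch.zeros_like(clip)
        # fuse the current-layer input with the aligned history, then refine;
        # in the paper MRSTB attends over 3D (N, h, w) windows across the clip
        prev = mrstb(torch.cat([clip, aligned], dim=1))
        refined.append(prev)
    return refined            # run again on reversed(clips) for the backward pass

gda = lambda cur, prev: prev                     # identity "alignment" stand-in
mrstb = torch.nn.Conv2d(128, 64, 3, padding=1)   # fuses 2*C -> C channels

clips = [torch.randn(2, 64, 32, 32) for _ in range(4)]
out = rfr_layer(clips, gda, mrstb)
print(len(out), out[0].shape)                    # 4 torch.Size([2, 64, 32, 32])
```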

Guided deformable attention

Different from previous frame-level alignment, GDA aligns adjacent, relevant but misaligned video clips, as shown in the figure below. $\hat{F}_{t-1\to t,n}$ denotes the feature of clip $t-1$ aligned to the $n$-th frame of clip $t$. Inspired by BasicVSR++, optical flow is first used to obtain the pre-aligned feature $\bar{F}_{t-1\to t}$, and then the offsets $o$ (lowercase, as opposed to the flow $O$) are computed as:

$$o_{t-1\to t} = \mathrm{CNN}\big(\big[\bar{F}_{t-1\to t},\, F_t^{\lambda-1},\, O_{t-1\to t}\big]\big)$$

where the CNN consists of multiple convolution layers and ReLUs, and $M$ offsets are predicted for each optical flow of each frame. The optical flows are then updated as:

$$O_{t-1\to t}^{(m)} = O_{t-1\to t} + o_{t-1\to t}^{(m)}, \qquad m = 1, \dots, M$$

For simplicity, the paper defines $K$, $Q$, and $V$ as:

$$Q = F_t^{\lambda-1} W^{Q}, \qquad K = \tilde{F}_{t-1\to t} W^{K}, \qquad V = \tilde{F}_{t-1\to t} W^{V}$$

where $\tilde{F}_{t-1\to t}$ denotes the features of clip $t-1$ sampled at the $M$ updated flow locations. The features are first projected and then sampled with a sampling factor $s$ to reduce redundant computation. The aligned feature is then computed by the attention mechanism:

$$\hat{F}_{t-1\to t} = \mathrm{SoftMax}\!\left(\frac{QK^{\top}}{\sqrt{C}}\right) V$$

Finally, since the above operations only aggregate information spatially, the paper adds an MLP (two fully-connected layers and a GELU) in residual form for channel interaction. In addition, the channels can be divided into deformable groups that operate in parallel, and each deformable group can be further divided into multiple attention heads, with attention computed separately per head.

[Figure: guided deformable attention]
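Below is a simplified, single-clip-pair sketch of the GDA computation described above: flow-guided pre-alignment, an offset CNN that predicts $M$ residual offsets, sampling of the projected keys/values at the updated locations, and softmax aggregation. All module names and shapes, and the single-group, single-head simplification, are assumptions rather than the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp x (B, C, H, W) with optical flow (B, 2, H, W) via grid_sample."""
    B, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(x)        # (2, H, W), x-y order
    coords = grid.unsqueeze(0) + flow                        # absolute sample coords
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0      # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(x, grid_n, align_corners=True)

class GDASketch(nn.Module):
    def __init__(self, ch=64, M=4):
        super().__init__()
        self.M = M
        # offset network: predicts M residual offsets from the pre-aligned
        # feature, the current feature, and the optical flow
        self.offset_cnn = nn.Sequential(
            nn.Conv2d(2 * ch + 2, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * M, 3, padding=1))
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, cur, prev, flow):
        pre = flow_warp(prev, flow)                          # flow-guided pre-alignment
        # predict M offsets and update the flow: O^(m) = O + o^(m)
        off = self.offset_cnn(torch.cat([pre, cur, flow], dim=1))
        flows = [flow + off[:, 2 * m:2 * m + 2] for m in range(self.M)]
        # project first, then sample keys/values at the M candidate locations
        q = self.q(cur)                                      # (B, C, H, W)
        ks = torch.stack([flow_warp(self.k(prev), f) for f in flows], 2)
        vs = torch.stack([flow_warp(self.v(prev), f) for f in flows], 2)
        # attention over the M locations, independently at each pixel
        attn = (q.unsqueeze(2) * ks).sum(1, keepdim=True)    # (B, 1, M, H, W)
        attn = F.softmax(attn / q.shape[1] ** 0.5, dim=2)
        return (attn * vs).sum(2)                            # (B, C, H, W)

out = GDASketch()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
                  torch.zeros(1, 2, 32, 32))
print(out.shape)                                             # torch.Size([1, 64, 32, 32])
```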

It is worth noting that deformable convolution aggregates features with learned static weights, which can be seen as a special case of GDA in which a different projection matrix is used for each position and the resulting features are averaged. By comparison, GDA uses the same projection matrix for all positions but generates dynamic weights to aggregate them; with suitable choices of $M$ and the sampling factor, its parameter count and computational complexity are similar to those of deformable convolution.
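A toy illustration of that contrast, with made-up tensors: deformable convolution combines the $M$ sampled features with static weights learned at training time, while GDA computes dynamic weights on the fly from feature similarity:

```python
import torch

M, C = 4, 64
samples = torch.randn(M, C)      # features sampled at M predicted locations
query = torch.randn(C)           # query feature at the current position

# deformable-convolution-style: fixed weights, the same for every input
w_static = torch.randn(M)
dcn_out = (w_static.unsqueeze(1) * samples).sum(0)

# GDA-style: weights computed dynamically from query/key similarity
w_dyn = torch.softmax(samples @ query / C ** 0.5, dim=0)
gda_out = (w_dyn.unsqueeze(1) * samples).sum(0)
print(dcn_out.shape, gda_out.shape)   # torch.Size([64]) torch.Size([64])
```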


      03      

Experiments

Ablation studies

Ablation study of different video alignment techniques

[Table: ablation on video alignment techniques]

Ablation study of different GDA components

[Table: ablation on GDA components]

Quantitative assessment

RVRT reaches 29.54 dB on Vid4 (BD degradation) and 32.75 dB on REDS4 (BI degradation).

[Table: quantitative comparison on video super-resolution benchmarks]

In parameter count and runtime it beats VRT, though it still falls short of CNN architectures.

[Table: comparison of model size and runtime]

Deblurring and denoising

[Table: video deblurring and denoising results]

Qualitative assessment

The recovered details are visible to the naked eye.

[Figure: qualitative comparisons]

END


Copyright notice: this article was created by [I love computer vision]; please include the original link when reposting: https://yzsam.com/2022/163/202206121615376715.html