DAP: A 26 FPS Video Super-Resolution Model That Outputs 720p Video Online

2022-07-02 20:53:00 【I love computer vision】

Affiliations: ETH Zurich, KU Leuven, University of Würzburg

Paper: https://arxiv.org/pdf/2202.01731v1.pdf

Editor's note: Two VSR research directions are currently popular: real-world/blind VSR, and VSR combined with Transformers. The authors of this paper instead make progress on real-time online super-resolution, a starting point similar to that of IPRRN. DAP achieves quality comparable to EDVR while running about three times faster, and it processes 180p video online at 26 FPS.
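
As a rough sanity check on these numbers, the sketch below computes the per-frame time budget at 26 FPS and the output size for a 180p input. The 320×180 input size and the ×4 scale factor are assumptions inferred from the 180p input and 720p output mentioned in this article, and the helper names are purely illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def upscaled_size(width: int, height: int, scale: int) -> tuple[int, int]:
    """Output resolution after super-resolving by an integer scale factor."""
    return width * scale, height * scale


# Assumptions: a 320x180 ("180p", 16:9) input and a x4 scale factor.
budget = frame_budget_ms(26)
out_w, out_h = upscaled_size(320, 180, 4)
print(f"{budget:.1f} ms per frame, output {out_w}x{out_h}")
```

Under these assumptions, the whole pipeline has roughly 38 ms to produce each 1280×720 frame.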

Highlights

VSR applications impose strict constraints such as causality and real-time operation. This raises two challenges: information from future frames is unavailable, and the frame alignment and fusion module must be both efficient and effective. The paper proposes a recurrent VSR architecture based on a deformable attention pyramid (DAP).

DAP aligns and fuses information from the recurrent state into the current frame prediction. To avoid the computational cost of traditional attention-based methods, DAP attends only to a limited number of spatial locations, which it predicts dynamically. It outperforms EDVR-M on two benchmarks while running more than 3 times faster.

Method

Overview

According to the Nyquist-Shannon sampling theorem, a discretely sampled signal is band-limited. The task of a VSR algorithm is to recover high-frequency content above that band limit from low-resolution video. The recurrent algorithm in this paper targets fast runtime and handles alignment between frames by updating and extracting information held in the hidden state.

First, the encoder network encodes the input frame into multi-level feature maps, from fine to coarse. The deformable attention module then iteratively refines the computed offsets from coarse to fine. Next, the fusion module aggregates the hidden-state features according to the final offsets. Finally, the main processing unit, composed of multiple residual information distillation blocks, estimates the high-resolution frame and the next hidden state. The paper's overview figure illustrates this framework.
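
The causal, recurrent structure described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `run_online` and `toy_step` are hypothetical names, and `toy_step` merely blends the frame with the state as a stand-in for the encode, align, fuse, and estimate stages.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Frame = List[float]  # toy stand-in for an image
State = List[float]  # toy stand-in for the hidden state


def run_online(
    frames: Iterable[Frame],
    step: Callable[[Frame, State], Tuple[Frame, State]],
    init_state: State,
) -> Iterator[Frame]:
    """Causal recurrent loop: each output sees only the current frame and the state."""
    state = init_state
    for frame in frames:
        output, state = step(frame, state)
        yield output


def toy_step(frame: Frame, state: State) -> Tuple[Frame, State]:
    """Hypothetical step: blend frame and state (stands in for the real network)."""
    fused = [0.5 * f + 0.5 * s for f, s in zip(frame, state)]
    return fused, fused


frames = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
outputs = list(run_online(frames, toy_step, init_state=[0.0, 0.0]))
print(outputs)
```

Because the loop never looks ahead, changing a later frame cannot alter an earlier output, which is the causality constraint that online VSR has to respect.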

DAP

First, a U-Net-style encoder computes multi-level features from its inputs. At each level of the pyramid, k sampling positions are computed to serve as the key positions for the deformable attention module at the next level up. A convolution block computes the residual offsets from features produced by fusing the cross-attention between time steps t-1 and t. The offsets are refined repeatedly until the finest level is reached. In the architecture diagram, ⊗ denotes channel-wise concatenation and ⊕ denotes pixel-wise addition.

Multi-level encoder

Videos contain fast motion, so the paper designs a multi-level encoder to obtain multi-resolution features. Because each resolution offers a different spatial field of view, the levels capture different ranges of motion. The number of levels is L = 3 in this work. Separate processing chains compute these features for the inputs from different time steps.
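
To make the idea of a fine-to-coarse pyramid concrete, here is a minimal sketch with L = 3 levels. It is a simplification under stated assumptions: the learned convolution blocks are omitted entirely, and downsampling is done by 2×2 averaging, which coincides with bilinear interpolation at the centre of each 2×2 block. The function names are illustrative.

```python
from typing import List

Grid = List[List[float]]


def downsample2(x: Grid) -> Grid:
    """Halve the resolution by averaging each 2x2 block."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def build_pyramid(x: Grid, levels: int = 3) -> List[Grid]:
    """Return [finest, ..., coarsest]; the learned convolution blocks are omitted."""
    pyramid = [x]
    for _ in range(levels - 1):
        pyramid.append(downsample2(pyramid[-1]))
    return pyramid


frame = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(frame, levels=3)
print([(len(p), len(p[0])) for p in pyramid])
```

A shift of 4 pixels at the finest level corresponds to a shift of only 1 pixel two levels down, which is why the coarse levels are better suited to large motion.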


Each level applies a convolution block composed of 4 convolutions, and bilinear downsampling produces the coarser levels.

Deformable attention

To reduce the complexity of the attention module, the paper restricts the search for salient features to dynamically selected positions in the feature map, instead of exhaustively computing correlations over a large neighborhood or even the whole frame. Computing correlations only at these selected positions greatly reduces the computational workload. The query is the feature representation of the current frame, while the keys and values are obtained by sampling at the dynamically predicted spatial positions.

The upsampling used in this computation is bilinear.

Iterative refinement

At each pyramid level, the dense offsets are refined iteratively: a convolution block predicts a residual offset, which is added to the offset from the previous level. The offset prediction network uses 7×7 kernels so that the dense computation has a large receptive field.
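
The coarse-to-fine refinement can be sketched as follows. Several details are assumptions made for illustration: offsets are upsampled with nearest-neighbour repetition rather than bilinear interpolation to keep the code short, their magnitudes are doubled whenever the resolution doubles (a common convention in coarse-to-fine motion estimation), and `toy_residual` is a hypothetical stand-in for the learned convolution block.

```python
from typing import Callable, List, Tuple

Offset = Tuple[float, float]      # (dy, dx) in pixels at the current level
OffsetField = List[List[Offset]]  # one offset per pixel


def upsample_offsets(field: OffsetField) -> OffsetField:
    """Double the resolution (nearest neighbour) and double the offset magnitudes."""
    out: OffsetField = []
    for row in field:
        wide = [(2.0 * dy, 2.0 * dx) for dy, dx in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def refine(
    coarse: OffsetField,
    predict_residual: Callable[[OffsetField, int], OffsetField],
    levels: int = 3,
) -> OffsetField:
    """Start from the coarsest offsets and add a residual at every finer level."""
    field = coarse
    for level in range(levels - 2, -1, -1):  # levels=3 -> level 1, then level 0
        field = upsample_offsets(field)
        residual = predict_residual(field, level)
        field = [
            [(dy + ry, dx + rx) for (dy, dx), (ry, rx) in zip(frow, rrow)]
            for frow, rrow in zip(field, residual)
        ]
    return field


def toy_residual(field: OffsetField, level: int) -> OffsetField:
    """Hypothetical stand-in for the learned predictor: a constant correction."""
    return [[(0.25, -0.25) for _ in row] for row in field]


coarse = [[(1.0, 0.0)]]  # a single offset at the coarsest level
fine = refine(coarse, toy_residual, levels=3)
print(len(fine), len(fine[0]), fine[0][0])
```

Each level only needs to predict a small correction, since the bulk of the motion has already been estimated at the coarser level.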


Hidden state fusion

Finally, the top-level offsets are used to fuse the salient hidden-state features at time t. This fusion is computed by another deformable attention block.

In addition, the internal tensors are split into groups for sampling at all stages at runtime. The number of groups is selected according to the number of sampled keys/values, k = 4.

Experiments

Ablation experiments

Ablation experiments with different components and channel numbers ：

A core feature of state-of-the-art bidirectional methods is the ability to fuse information offline across the whole video, which naturally includes aggregation in reverse temporal order. The paper therefore studies the difference between forward and backward evaluation. Surprisingly, aggregation in reverse temporal order significantly improves performance.

The authors attribute this gain to the fact that forward camera motion is more common in video. When the camera moves toward objects, or objects move toward the camera, those objects appear at high resolution first if the video is played in reverse, which makes them easier to super-resolve. Being able to process a video in reverse may therefore improve VSR performance, which gives non-causal methods an advantage over online algorithms.