
2022-07-02 07:52:00 MezereonXP

Replacing Convolutions with Fully-Connected Layers – RepMLP

This time I will introduce a piece of work, "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", a representative paper in the recent wave of MLP-based architectures.

The GitHub repository is at https://github.com/DingXiaoH/RepMLP; readers with some spare energy are encouraged to run it and take a look at the code.

Let us first step back to prior work based on convolutional networks. Convolutional networks are effective partly because they capture spatial information: repeated convolutions extract spatial features and eventually cover essentially the whole image. If we instead "flatten" the image and train an MLP on it, that spatial feature information is lost.

The contributions of this paper are:

  • Applying the global capacity and positional perception of fully-connected (FC) layers to image recognition
  • A simple, platform-agnostic, differentiable algorithm that merges convolutions and BN into an FC layer
  • Thorough experimental analysis verifying the feasibility of RepMLP

The overall framework

RepMLP as a whole has two stages:

  • Training phase
  • Testing phase

The two stages are shown in the figure below:

[Figure: overall framework]

It looks a bit complicated, so let us look at the training phase on its own.

First, the global perceptron

[Figure: global perceptron]

It consists mainly of two paths:

  • Path 1: average pooling + BN + FC1 + ReLU + FC2
  • Path 2: partitioning the input into blocks

Denote the shape of the input tensor by $(N, C, H, W)$.

Path 1

For path 1, average pooling first converts the input into $(N, C, \frac{H}{h}, \frac{W}{w})$, which is equivalent to downscaling. The green part in the figure then indicates that this tensor is "flattened",

i.e. it becomes a tensor of shape $(N, \frac{CHW}{hw})$. After passing through the two FC layers the dimension is unchanged, because the whole FC stack amounts to left-multiplying by a square matrix.

Finally, the $(N, \frac{CHW}{hw})$ output is reshaped into an output of shape $(\frac{NHW}{hw}, C, 1, 1)$.
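
To make the shapes concrete, here is a minimal PyTorch sketch of path 1, assuming illustrative sizes and a plain square FC1/FC2 stack (the actual hyper-parameters in the paper and repository may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes; H % h == 0 and W % w == 0
N, C, H, W, h, w = 4, 8, 32, 32, 8, 8

x = torch.randn(N, C, H, W)

# Average pooling: (N, C, H, W) -> (N, C, H/h, W/w), then BN
pooled = F.avg_pool2d(x, kernel_size=(h, w))
pooled = nn.BatchNorm2d(C)(pooled)

# Flatten to (N, C*H*W/(h*w)), then FC1 -> ReLU -> FC2; the width is preserved
d = C * (H // h) * (W // w)
fc1, fc2 = nn.Linear(d, d), nn.Linear(d, d)
v = fc2(F.relu(fc1(pooled.flatten(1))))          # (N, C*H*W/(h*w))

# Reshape so that one value is attached to each channel of each (h, w) block
v = v.reshape(N * (H // h) * (W // w), C, 1, 1)
print(v.shape)                                   # torch.Size([64, 8, 1, 1])
```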

Path 2

For path 2, the input $(N, C, H, W)$ is directly split into $\frac{NHW}{hw}$ small blocks of size $(h, w)$, giving a tensor of shape $(\frac{NHW}{hw}, C, h, w)$.

Finally, the results of path 1 and path 2 are added. Their dimensions do not match exactly, but PyTorch broadcasts automatically, so for each $(h, w)$ block one value per channel is added to every pixel.

The output of this part has shape $(\frac{NHW}{hw}, C, h, w)$.
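
A corresponding sketch of the partitioning in path 2 and the broadcast addition (again with illustrative sizes; `v` stands in for the path-1 output):

```python
import torch

N, C, H, W, h, w = 4, 8, 32, 32, 8, 8

x = torch.randn(N, C, H, W)
v = torch.randn(N * (H // h) * (W // w), C, 1, 1)   # stands in for the path-1 output

# (N, C, H, W) -> (N, C, H/h, h, W/w, w) -> (N, H/h, W/w, C, h, w) -> (NHW/hw, C, h, w)
parts = x.reshape(N, C, H // h, h, W // w, w)
parts = parts.permute(0, 2, 4, 1, 3, 5).reshape(-1, C, h, w)

# (NHW/hw, C, h, w) + (NHW/hw, C, 1, 1): broadcasting adds one value per channel
# of each block to all of its pixels
out = parts + v
print(out.shape)                                     # torch.Size([64, 8, 8, 8])
```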

The result then enters the local perceptron and partition perceptron parts, as shown in the figure below:

[Figure: local and partition perceptrons]

The partition perceptron

First, the 4-dimensional tensor is flattened to 2 dimensions, i.e. $(\frac{NHW}{hw}, C, h, w)$ becomes $(\frac{NHW}{hw}, Chw)$.

FC3 then borrows the idea of grouped convolution (groupwise conv), where $g$ is the number of groups.

Originally FC3 would be an $(Ohw, Chw)$ matrix, but to reduce the number of parameters a groupwise FC is used.

Grouped convolution essentially partitions the channels into groups. Here is an example:

Suppose the input is a $(C, H, W)$ tensor and we want the output to be $(N, H', W')$.

Normally the shape of the convolution kernel would be $(N, C, K, K)$, where $K$ is the kernel size.

We partition the $C$ channels into $g$ groups, so each group contains $\frac{C}{g}$ channels.

Each group is convolved independently, so the shape of the convolution kernel shrinks to $(N, \frac{C}{g}, K, K)$.
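
A quick shape check of this grouped-convolution example, with illustrative channel counts:

```python
import torch
import torch.nn as nn

C_in, C_out, K, g = 8, 16, 3, 4

# Grouped conv: each of the g groups sees only C_in/g input channels
conv = nn.Conv2d(C_in, C_out, kernel_size=K, padding=1, groups=g)
print(conv.weight.shape)        # torch.Size([16, 2, 3, 3]) == (C_out, C_in/g, K, K)

y = conv(torch.randn(1, C_in, 32, 32))
print(y.shape)                  # torch.Size([1, 16, 32, 32])
```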

Here, the groupwise FC likewise splits the $Chw$ input features into groups, passes each group through its own FC, and (after reshaping) yields a tensor of shape $(\frac{NHW}{hw}, O, h, w)$.

After the BN layer the tensor shape remains unchanged.
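
One common way to realize such a groupwise FC is a 1x1 grouped convolution applied to the flattened blocks; the following is a minimal sketch under assumed sizes, not necessarily the exact layout used in the repository:

```python
import torch
import torch.nn as nn

B, C, O, h, w, g = 64, 8, 8, 8, 8, 4                  # B = N*H*W/(h*w)

parts = torch.randn(B, C, h, w)

# Flatten each block to a vector: (B, C, h, w) -> (B, C*h*w)
flat = parts.reshape(B, C * h * w)

# Groupwise FC3 as a 1x1 grouped conv: weight shape (O*h*w, C*h*w/g, 1, 1)
fc3 = nn.Conv2d(C * h * w, O * h * w, kernel_size=1, groups=g, bias=True)
out = fc3(flat.reshape(B, C * h * w, 1, 1)).reshape(B, O, h, w)

out = nn.BatchNorm2d(O)(out)                          # BN keeps the shape unchanged
print(out.shape)                                      # torch.Size([64, 8, 8, 8])
```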

The local perceptron

[Figure: local perceptron]

Similar in spirit to FPN, grouped convolutions are applied at several scales, yielding four tensors of shape $(\frac{NHW}{hw}, O, h, w)$.

Adding the outputs of the local perceptron and the partition perceptron (and reshaping) gives an output of shape $(N, O, H, W)$.
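
A minimal sketch of this idea, assuming the commonly used 1/3/5/7 kernel sizes and an arbitrary group count; the real RepMLP configuration may differ:

```python
import torch
import torch.nn as nn

B, C, O, h, w, g = 64, 8, 8, 8, 8, 4

parts = torch.randn(B, C, h, w)               # path-2 blocks (with the global vector added)
partition_out = torch.randn(B, O, h, w)       # stands in for the partition perceptron output

# Parallel grouped conv + BN branches at several kernel sizes (odd sizes keep h, w fixed)
branches = nn.ModuleList([
    nn.Sequential(nn.Conv2d(C, O, k, padding=k // 2, groups=g, bias=False),
                  nn.BatchNorm2d(O))
    for k in (1, 3, 5, 7)
])

# Sum all local branches into the partition-perceptron output
out = partition_out + sum(branch(parts) for branch in branches)
print(out.shape)                              # torch.Size([64, 8, 8, 8])
```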

At this point you might ask: aren't there still convolutions?

They exist only in the training stage. In the inference stage the convolutions are discarded, as shown in the figure below:

[Figure: inference-time structure]

In this way we replace the convolution operations with an MLP.
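
The key trick behind discarding the convolutions is that any convolution (after folding in its BN) can be rewritten as an FC weight matrix by feeding a reshaped identity matrix through it. Below is a simplified, ungrouped, bias-free sketch of that idea with illustrative sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, O, h, w, k = 4, 4, 8, 8, 3

conv = nn.Conv2d(C, O, kernel_size=k, padding=k // 2, bias=False)

# Feed the (C*h*w, C*h*w) identity through the conv, one basis "image" per row;
# row i of the result is the flattened response to basis vector e_i
eye = torch.eye(C * h * w).reshape(C * h * w, C, h, w)
with torch.no_grad():
    fc_weight = conv(eye).reshape(C * h * w, O * h * w).t()   # (O*h*w, C*h*w)

# The resulting FC matches the conv on any input block
x = torch.randn(1, C, h, w)
y_conv = conv(x).reshape(1, O * h * w)
y_fc = F.linear(x.reshape(1, C * h * w), fc_weight)
print(torch.allclose(y_conv, y_fc, atol=1e-5))                 # True
```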

Experimental analysis

First comes a series of ablation studies, tested on the CIFAR-10 dataset.

[Figure: CIFAR-10 ablation study]

Setting A keeps the BN and conv layers at inference time; the results are unchanged.

Settings D and E use a 9x9 convolution layer in place of FC3 and of the whole RepMLP, respectively.

Wide ConvNet doubles the number of channels of the original network structure.

The results show the importance of both the local and the global perceptron; at the same time, removing the convolution branches at inference has no effect on accuracy, which realizes the replacement by an MLP.

The authors then replaced some of the blocks of ResNet-50 and tested the result.

[Figure: replacing the c4 stage only]

Replacing only the penultimate residual stage increases the parameter count, but the accuracy also rises slightly.

If we go further and replace more of the convolutional parts:

[Figure: replacing more stages]

The number of parameters increases, and the accuracy again rises slightly.
