(CVPR 2019) Selective Kernel Networks
Paper: Selective Kernel Networks
A CVPR 2019 work from Nanjing University of Science and Technology
Paper link
Code: link
Abstract
In standard convolutional neural networks (CNNs), the receptive fields of the artificial neurons in each layer are designed to share the same size. It is well known in the neuroscience community that the receptive field size of neurons in the visual cortex is modulated by the stimulus, which has rarely been considered when constructing CNNs. We propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called the Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention guided by the information in these branches. Different attentions on these branches yield different sizes of the effective receptive fields of neurons in the fusion layer. Multiple SK units are stacked into a deep network termed Selective Kernel Networks (SKNets). On the ImageNet and CIFAR benchmarks, we empirically show that SKNet outperforms existing state-of-the-art architectures with lower model complexity. Detailed analyses show that the neurons in SKNet can capture target objects of different scales, which verifies the capability of neurons to adaptively adjust their receptive field size according to the input.
1. Introduction
The local receptive fields (RFs) of neurons in the primary visual cortex (V1) of cats [14], discovered in the last century, inspired the construction of convolutional neural networks (CNNs) [26], and they continue to inspire modern CNN architectures. For example, it is well known that in the visual cortex, the RF sizes of neurons in the same area (e.g., V1) are different, which enables the neurons to collect multi-scale spatial information at the same processing stage. This mechanism has been widely adopted in recent CNNs. A typical example is InceptionNets [42, 15, 43, 41], in which a simple concatenation is designed in the "inception" building block to aggregate multi-scale information from, e.g., $3 \times 3$, $5 \times 5$ and $7 \times 7$ convolution kernels.
However, when designing CNNs, some other RF properties of cortical neurons have not been emphasized, one of which is the adaptive change of RF size. Abundant experimental evidence suggests that the RF sizes of neurons in the visual cortex are not fixed but modulated by the stimulus. The classical RFs (CRFs) of neurons in the V1 area were discovered by Hubel and Wiesel [14], as determined by single oriented bars. Later, many studies (e.g., [30]) found that stimuli outside the CRF also affect the responses of neurons; such neurons are said to have non-classical RFs (nCRFs). In addition, the size of the nCRF is related to the contrast of the stimulus: the lower the contrast, the larger the effective nCRF size [37]. Surprisingly, by stimulating the nCRF for a period of time, the CRF of the neuron is also enlarged after these stimuli are removed [33]. All these experiments suggest that the RF sizes of neurons are not fixed but modulated by the stimulus [38]. Unfortunately, this property has received little attention in the construction of deep learning models. Models with multi-scale information in the same layer, such as InceptionNets, do have an inherent mechanism to adjust the RF sizes of neurons in the next convolutional layer according to the contents of the input, because the next convolutional layer linearly aggregates multi-scale information from different branches. But this linear aggregation may be insufficient to provide neurons with strong adaptation ability.
In this paper, we present a nonlinear approach to aggregate information from multiple kernels so as to realize adaptive RF sizes of neurons. We introduce "Selective Kernel" (SK) convolution, which consists of a triplet of operators: Split, Fuse and Select. The Split operator generates multiple paths with different kernel sizes, corresponding to different RF sizes of neurons. The Fuse operator combines and aggregates the information from the multiple paths to obtain a global and comprehensive representation for the selection weights. The Select operator aggregates the feature maps of the differently sized kernels according to the selection weights.
SK convolutions can be computationally lightweight, imposing only a slight increase in parameters and computational cost. We show that on the ImageNet 2012 dataset [35], SKNets outperform previous state-of-the-art models with similar model complexity. Based on SKNet-50, we find the best settings for SK convolution and show the contribution of each component. To demonstrate its general applicability, we also provide compelling results on the smaller datasets CIFAR-10 and CIFAR-100 [22], and successfully embed SK into small models (e.g., ShuffleNetV2 [27]).
To verify that the proposed model indeed has the ability to adjust the RF sizes of its neurons, we simulate the stimulus by enlarging the target object in natural images and shrinking the background, keeping the image size unchanged. It is found that most neurons collect more and more information from the larger-kernel path as the target object becomes larger. These results suggest that the neurons in the proposed SKNet have adaptive RF sizes, which may underlie the model's superior performance in object recognition.
2. Related Work
Multi-branch convolutional networks. Highway networks [39] introduce bypassing paths along with gating units. The two-branch architecture eases the difficulty of training networks with hundreds of layers. The idea is also used in ResNet [9, 10], but the bypassing path is a pure identity mapping. Besides the identity mapping, the shake-shake networks [7] and multi-residual networks [1] extend the major transformation with more identical paths. Deep neural decision forests [21] form a tree-structured multi-branch principle with learned splitting functions. FractalNets [25] and Multilevel ResNets [52] are designed in such a way that multiple paths can be expanded fractally and recursively. InceptionNets [42, 15, 43, 41] carefully configure each branch with customized kernel filters in order to aggregate more informative and multifarious features. Note that the proposed SKNets follow the idea of InceptionNets of configuring various filters for multiple branches, but differ in at least two important aspects: 1) the scheme of SKNets is much simpler, without heavy customized design; 2) an adaptive selection mechanism over these multiple branches is utilized to realize adaptive RF sizes of neurons.
Grouped/depthwise/dilated convolutions. Grouped convolutions have become popular due to their low computational cost. Denoting the group size by $G$, the number of parameters and the computational cost are divided by $G$ compared with ordinary convolution. They were first adopted in AlexNet [23] with the purpose of distributing the model over more GPU resources. Surprisingly, using grouped convolutions, ResNeXts [47] can also improve accuracy. This $G$ is called "cardinality", which characterizes the model along with depth and width.
Many compact models such as IGCV1 [53], IGCV2 [46] and IGCV3 [40] have been developed based on interleaved grouped convolutions. A special case of grouped convolution is depthwise convolution, where the number of groups equals the number of channels. Xception [3] and MobileNetV1 [11] introduce the depthwise separable convolution, which decomposes an ordinary convolution into a depthwise convolution and a pointwise convolution. The effectiveness of depthwise convolutions has been validated in subsequent works such as MobileNetV2 [36] and ShuffleNet [54, 27]. Beyond grouped/depthwise convolutions, dilated convolutions [50, 51] support exponential expansion of the RF without loss of coverage. For example, a $3 \times 3$ convolution with dilation 2 can approximately cover the RF of a $5 \times 5$ filter, while consuming less than half of the computation and memory. In SK convolutions, kernels of larger sizes (e.g., $>1$) are designed to be integrated with grouped/depthwise/dilated convolutions in order to avoid heavy overheads.
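To make the parameter/RF trade-off concrete, here is a minimal PyTorch sketch (ours, not from the paper) comparing an ordinary $5 \times 5$ convolution, a dilated $3 \times 3$ convolution, and a grouped $3 \times 3$ convolution; the channel count C = 64 and group size G = 32 are illustrative choices:

```python
# A minimal sketch (ours, not from the paper): comparing parameter counts
# and output shapes of an ordinary 5x5 conv, a dilated 3x3 conv, and a
# grouped 3x3 conv. C = 64 and G = 32 are illustrative choices.
import torch
import torch.nn as nn

C, G = 64, 32
x = torch.randn(1, C, 32, 32)

conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2)                  # 5x5 RF
conv3_d2 = nn.Conv2d(C, C, kernel_size=3, padding=2, dilation=2)   # ~5x5 RF
conv3_g = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=G)      # params / G

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# 102464 vs 36928 vs 1216 parameters; all preserve the 32x32 spatial size.
print(n_params(conv5), n_params(conv3_d2), n_params(conv3_g))
print(conv5(x).shape, conv3_d2(x).shape, conv3_g(x).shape)
```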
Attention mechanisms. Recently, the benefits of attention mechanisms have been shown across a range of tasks, from neural machine translation [2] in natural language processing to image captioning [49] in image understanding. Attention biases the allocation of the most informative feature expressions [16, 17, 24, 28, 31] and suppresses less useful ones. It has been widely used in recent applications such as person re-identification [4], image recovery [55], text summarization [34] and lip reading [48]. To boost the performance of image classification, Wang et al. [44] propose a trunk-and-mask attention between intermediate stages of a CNN; an hourglass module is introduced to achieve global emphasis across both spatial and channel dimensions. Furthermore, SENet [12] brings an effective, lightweight gating mechanism that self-recalibrates the feature map via channel-wise importances. Beyond channels, BAM [32] and CBAM [45] introduce spatial attention in a similar way. In contrast, our proposed SKNets are the first to explicitly focus on the adaptive RF sizes of neurons by introducing an attention mechanism.
Dynamic convolutions. Spatial Transformer Networks [18] learn a parametric transformation to warp the feature map, which is considered difficult to train. Dynamic Filter networks [20] can only adaptively modify the parameters of filters, without adjusting kernel sizes. Active Convolution [19] augments the sampling locations in convolution with offsets. These offsets are learned end-to-end but become static after training, whereas in SKNet the RF sizes of neurons can adaptively change during inference. Deformable Convolutional Networks [6] further make the location offsets dynamic, but they do not aggregate multi-scale information in the same way as SKNet.
3. Methods
3.1. Selective Kernel Convolution
To enable neurons to adaptively adjust their RF sizes, we propose an automatic selection operation, "Selective Kernel" (SK) convolution, among multiple kernels with different kernel sizes. Specifically, we implement the SK convolution via three operators, Split, Fuse and Select, as illustrated in Figure 1, which shows a two-branch case. In this example there are only two kernels of different sizes, but it is easy to extend to the multi-branch case.

Figure 1. Selective Kernel Convolution.
Split: For any given feature map $\mathbf{X} \in \mathbb{R}^{H' \times W' \times C'}$, by default we first conduct two transformations $\tilde{\mathcal{F}}: \mathbf{X} \rightarrow \tilde{\mathbf{U}} \in \mathbb{R}^{H \times W \times C}$ and $\widehat{\mathcal{F}}: \mathbf{X} \rightarrow \widehat{\mathbf{U}} \in \mathbb{R}^{H \times W \times C}$ with kernel sizes 3 and 5, respectively. Note that both $\tilde{\mathcal{F}}$ and $\widehat{\mathcal{F}}$ are composed of efficient grouped/depthwise convolutions, batch normalization [15] and ReLU [29] in sequence. For further efficiency, the conventional convolution with a $5 \times 5$ kernel is replaced with a dilated convolution with a $3 \times 3$ kernel and dilation size 2.
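As a concrete illustration of the Split operator, here is a minimal PyTorch sketch under our own assumptions: stride 1, equal input/output channels, and group size G = 32 (the conv-BN-ReLU composition follows the description above, but the exact layer configuration is ours):

```python
# A minimal sketch of the Split operator (ours): two conv-BN-ReLU branches,
# one plain 3x3 and one dilated 3x3 standing in for the 5x5 kernel. We
# assume stride 1, equal input/output channels, and group size G = 32.
import torch
import torch.nn as nn

def sk_branch(channels: int, dilation: int, groups: int = 32) -> nn.Module:
    # Padding equals the dilation so that H x W is preserved.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=dilation,
                  dilation=dilation, groups=groups, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

C = 64
x = torch.randn(2, C, 56, 56)
u_tilde = sk_branch(C, dilation=1)(x)  # F~: plain 3x3 kernel
u_hat = sk_branch(C, dilation=2)(x)    # F^: dilated 3x3, ~5x5 RF
print(u_tilde.shape, u_hat.shape)      # both torch.Size([2, 64, 56, 56])
```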
Fuse: As stated in the Introduction, our goal is to enable neurons to adaptively adjust their RF sizes according to the stimulus content. The basic idea is to use gates to control the information flows from the multiple branches that carry different scales of information into the neurons of the next layer. To achieve this goal, the gates need to integrate information from all branches. We first fuse the results from the multiple branches (two in Figure 1) via an element-wise summation:
$$\mathbf{U} = \tilde{\mathbf{U}} + \widehat{\mathbf{U}}.$$
Then we embed the global information by simply using global average pooling to generate channel-wise statistics $\mathbf{s} \in \mathbb{R}^{C}$. Specifically, the $c$-th element of $\mathbf{s}$ is calculated by shrinking $\mathbf{U}$ through the spatial dimensions $H \times W$:
$$s_c = \mathcal{F}_{gp}(\mathbf{U}_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{U}_c(i, j).$$
Furthermore, a compact feature $\mathbf{z} \in \mathbb{R}^{d \times 1}$ is created to enable guidance for the precise and adaptive selection. This is achieved by a simple fully connected (fc) layer, with a reduction of dimensionality for better efficiency:
$$\mathbf{z} = \mathcal{F}_{fc}(\mathbf{s}) = \delta(\mathcal{B}(\mathbf{W}\mathbf{s})),$$
where $\delta$ is the ReLU function [29], $\mathcal{B}$ denotes batch normalization [15], and $\mathbf{W} \in \mathbb{R}^{d \times C}$. To study the impact of $d$ on the efficiency of the model, we use a reduction ratio $r$ to control its value:
$$d = \max(C/r, L),$$
where $L$ denotes the minimal value of $d$ ($L = 32$ is a typical setting in our experiments).
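A minimal sketch (ours) of the Fuse step under the same assumptions: element-wise summation, global average pooling, then the fc layer $\mathbf{z} = \delta(\mathcal{B}(\mathbf{W}\mathbf{s}))$ with reduction ratio $r$; the random tensors merely stand in for the two branch outputs:

```python
# A minimal sketch of the Fuse operator (ours): element-wise sum, global
# average pooling, then z = ReLU(BN(W s)) with reduction ratio r. The two
# random tensors below merely stand in for the branch outputs U~ and U^.
import torch
import torch.nn as nn

C, r, L = 64, 16, 32
d = max(C // r, L)                    # d = max(C/r, L) -> 32 here

u_tilde = torch.randn(2, C, 56, 56)   # stand-in for U~
u_hat = torch.randn(2, C, 56, 56)     # stand-in for U^
u = u_tilde + u_hat                   # U = U~ + U^

s = u.mean(dim=(2, 3))                # global average pooling -> (2, C)

fc = nn.Sequential(nn.Linear(C, d, bias=False),
                   nn.BatchNorm1d(d),
                   nn.ReLU(inplace=True))
z = fc(s)                             # compact feature -> (2, d)
print(z.shape)                        # torch.Size([2, 32])
```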
Select: A soft attention across channels is used to adaptively select different spatial scales of information, guided by the compact feature descriptor $\mathbf{z}$. Specifically, a softmax operator is applied on the channel-wise digits:
$$a_c = \frac{e^{\mathbf{A}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}}, \quad b_c = \frac{e^{\mathbf{B}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}},$$
where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{C \times d}$, and $\mathbf{a}, \mathbf{b}$ denote the soft attention vectors for $\tilde{\mathbf{U}}$ and $\widehat{\mathbf{U}}$, respectively. Note that $\mathbf{A}_c \in \mathbb{R}^{1 \times d}$ is the $c$-th row of $\mathbf{A}$ and $a_c$ is the $c$-th element of $\mathbf{a}$, and likewise for $\mathbf{B}_c$ and $b_c$. In the two-branch case the matrix $\mathbf{B}$ is redundant, because $a_c + b_c = 1$. The final feature map $\mathbf{V}$ is obtained through the attention weights on the various kernels:
$$\mathbf{V}_c = a_c \cdot \tilde{\mathbf{U}}_c + b_c \cdot \widehat{\mathbf{U}}_c, \quad a_c + b_c = 1,$$
where $\mathbf{V} = [\mathbf{V}_1, \mathbf{V}_2, \ldots, \mathbf{V}_C]$ and $\mathbf{V}_c \in \mathbb{R}^{H \times W}$.
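Putting the three operators together, here is a minimal two-branch SKConv sketch (ours, not the authors' code; the paper embeds SK convolutions inside ResNeXt-style bottlenecks, which is omitted here). The hyper-parameters G = 32, r = 16, L = 32 mirror the typical settings quoted above, and the per-branch Linear layers play the roles of $\mathbf{A}$ and $\mathbf{B}$:

```python
# A minimal two-branch SKConv sketch (ours) combining Split, Fuse and
# Select. The per-branch Linear layers play the roles of A and B; their
# softmax-normalized outputs are the attention vectors a and b.
import torch
import torch.nn as nn

class SKConv(nn.Module):
    def __init__(self, channels: int, groups: int = 32,
                 r: int = 16, L: int = 32):
        super().__init__()
        d = max(channels // r, L)
        # Split: plain 3x3 branch and dilated-3x3 (~5x5) branch.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=dil, dilation=dil,
                          groups=groups, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
            for dil in (1, 2)])
        # Fuse: z = ReLU(BN(W s)).
        self.fc = nn.Sequential(nn.Linear(channels, d, bias=False),
                                nn.BatchNorm1d(d),
                                nn.ReLU(inplace=True))
        # Select: one logit vector per branch (A z and B z).
        self.attn = nn.ModuleList([nn.Linear(d, channels) for _ in range(2)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]     # U~, U^
        u = feats[0] + feats[1]                             # U = U~ + U^
        s = u.mean(dim=(2, 3))                              # F_gp -> (N, C)
        z = self.fc(s)                                      # (N, d)
        logits = torch.stack([a(z) for a in self.attn], 1)  # (N, 2, C)
        w = logits.softmax(dim=1)[..., None, None]          # a_c + b_c = 1
        return w[:, 0] * feats[0] + w[:, 1] * feats[1]      # V

x = torch.randn(2, 64, 56, 56)
print(SKConv(64)(x).shape)   # torch.Size([2, 64, 56, 56])
```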